Reply: Re: [commons-compress] branch master updated: Update my(Peter Lee) personal information in pom

2020-03-16 Thread peterlee

Thank you! :)

Lee

On 2020/3/17 1:22:36, "Stefan Bodewig"  wrote:


welcome :-)

Stefan

-
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org







Reply: Re: [ANNOUNCE] Welcome Peter Lee (peterlee) as Apache Commons Committer

2020-03-16 Thread peterlee

Thank you all!

It's a great honor to be a part of Apache.

Peter Lee


On 2020/3/17 2:42:01, "Woonsan Ko"  wrote:


Congrats and welcome, Peter!

Woonsan

On Mon, Mar 16, 2020 at 1:50 PM Gary Gregory  wrote:


 Hi All,

 Please welcome Peter Lee (peterlee) as our latest Apache Commons Committer!

 Gary









Re: [BloomFilters] changes to BloomFilter

2020-03-16 Thread Alex Herbert
Another item: ObjectsHashIterative is marked as Signedness.SIGNED.

The computation is done using 32-bit integers. So the long output can be
negative but the upper 32 bits are always either entirely 0 or entirely 1. I
think this is a candidate for converting to an unsigned long and allowing it to
be Signedness.UNSIGNED.
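
To illustrate the point about the upper 32 bits, here is a minimal sketch in plain Java. The class and method names are made up for the example and are not part of the Bloom filter code; it only shows how a 32-bit result widens to a long with and without sign extension.

```java
public class UnsignedWidenSketch {
    // Default int -> long conversion sign-extends: the upper 32 bits copy the sign bit.
    static long widenSigned(int hash32) {
        return hash32;
    }

    // Treating the same 32 bits as unsigned leaves the upper 32 bits zero.
    static long widenUnsigned(int hash32) {
        return Integer.toUnsignedLong(hash32);
    }

    public static void main(String[] args) {
        int hash32 = 0x80000001; // negative when read as a signed int
        System.out.println(Long.toHexString(widenSigned(hash32)));   // ffffffff80000001
        System.out.println(Long.toHexString(widenUnsigned(hash32))); // 80000001
    }
}
```

With the unsigned conversion the long is never negative, which is what marking the function as UNSIGNED would rely on.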

Alex






Re: [BloomFilters] changes to BloomFilter

2020-03-16 Thread Alex Herbert



> On 16 Mar 2020, at 18:58, Claude Warren  wrote:
> 
> First I think that the hasher.getBits( Shape ) should be renamed to
> iterator( Shape ).  It was poorly named at the start.

I can fix that.

> 
> By definition a Hasher knows how many items it is going to insert.
> 
> The Shape tells the hasher how many hash functions to apply to each item.

OK. This is my misunderstanding. There is a contract that the Hasher is
expected to fulfil but it is just not recorded in the javadoc. I can update the 
docs to indicate that:

"A Hasher represents items of arbitrary byte size as a byte representation of 
fixed size (a hash). The hash for each item is created using a hash function; 
use of different seeds allows generation of different hashes for the same item. 
The hashes can be dynamically converted into the bit index representation used 
by a Bloom filter. The shape of the Bloom filter defines the number of indexes 
per item and the range of the indexes. The hasher functions to generate the 
correct number of indexes in the range required by the Bloom filter for each 
item it represents.

Note that the process of generating hashes and mapping them to a Bloom filter
shape may create duplicate indexes. The hasher may generate fewer than the
required number of indexes per item if duplicates have been removed."
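
As a rough illustration of the note about duplicates, the sketch below derives k indexes from two 32-bit hash values by cyclic addition and maps each one to [0, m). The scheme and the names (`IndexSketch`, `indexes`, `hash1`, `hash2`) are assumptions for the example, not the library's implementation.

```java
import java.util.Arrays;

public class IndexSketch {
    // Produce k indexes in [0, m) from two hash values (cyclic: hash1 + i * hash2).
    static int[] indexes(int hash1, int hash2, int k, int m) {
        int[] result = new int[k];
        int h = hash1;
        for (int i = 0; i < k; i++) {
            result[i] = Math.floorMod(h, m); // floorMod keeps negative hashes in range
            h += hash2;
        }
        return result;
    }

    public static void main(String[] args) {
        // hash2 is a multiple of m, so every step lands on the same index.
        int[] idx = indexes(3, 10, 4, 10);
        System.out.println(Arrays.toString(idx));                  // [3, 3, 3, 3]
        System.out.println(Arrays.stream(idx).distinct().count()); // 1
    }
}
```

Here k = 4 indexes are requested but only one distinct bit would be set, so a hasher that removes duplicates would return fewer than k values for this item.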

> The Shape number of items is how many items are expected to be in the final
> Bloom filter, it is more the expected value not a hard limit.

Yes. As discussed before this is not actually required for a Bloom filter to 
function, it is required to maintain the intended purpose of the filter when it 
was constructed.

> 
> Keeping in mind the possibility of hash collisions, I don't see a way to
> check that the Hasher has respected the number of functions.

It would require encapsulating the Hasher functionality inside the Bloom 
filter. That would require passing the hash function and/or hasher to the Bloom 
filter on construction. The BloomFilter interface would then be changed to not 
accept a hasher in the contains and merge methods but the raw byte[] 
representation of an object. Or it could accept the Object itself if you 
provide a method to convert the object to bytes.

Encapsulating the conversion of objects to the hash then to the indexes is how 
the BloomFilter has been implemented in Guava. The implementation there is much 
simpler. A BloomFilter is typed to accept objects of type T. It has three 
methods:

put(T)
putAll(BloomFilter)
mightContain(T)

Underneath it uses a Funnel which you specify which converts T to one or 
more primitives/byte[]/String that are passed to a Sink. The Sink accepts data 
which is dynamically fed through a hash function.

Pros:

- Simple encapsulation of adding items to a filter
- Dynamic hashing without large byte[] intermediate buffers

Cons:

- No configuration of the hash function
- You lose type safety if you want to add different types of items. You have to 
use T = Object.



> 
> The static hasher for example will not return duplicates so it might appear
> that it has not respected the number of functions.  In addition there is no
> indication from the hasher how many items it contains.

Yes. So we state that the hasher represents one or more items.

> 
> The inputs to the hash.builder are byte buffers that are fed to the hash
> algorithm.  They are inputs to that algorithm.  So primitive types would
> simply convert from the primitive type to its byte buffer representation.
> Is that what you meant?

I was unclear on the purpose of the Hasher.Builder. It seemed incomplete. If 
the builder is to add items then it seems strange to have:

with(byte property)
with(String property)

It also seems strange to throw 'IllegalStateException if the Hasher is locked'
without explaining what this means. Is the builder intended to be concurrent?
What is 'locked'? Etc.

The byte could not possibly represent many meaningful objects. The string is 
trivially converted to UTF-8 bytes (as is done in the DynamicHasher). Both 
these methods could be added to the interface as default methods or preferably
dropped as they are so trivial.

I changed the documentation to remove the encoding as UTF-8 requirement from 
the with(String) method. It seems like an implementation detail and a 
Hasher.Builder implementation can decide how to convert the String. It is 
faster to use UTF-16 bytes for instance. I understand UTF-8 is the
cross-platform standard, but mandating it is too restrictive IMO. It would
be better as:

with(CharSequence, Charset)
withUnencoded(CharSequence)
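
A minimal sketch of what the two variants might do, assuming the unencoded form simply copies each 16-bit char as two bytes (high byte first). The class and method names are placeholders and not an agreed API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharBytesSketch {
    // Encoded form: the caller picks the charset explicitly.
    static byte[] encoded(CharSequence text, Charset charset) {
        return text.toString().getBytes(charset);
    }

    // Unencoded form: copy each 16-bit char as two bytes, high byte first.
    static byte[] unencoded(CharSequence text) {
        byte[] out = new byte[text.length() * 2];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out[2 * i] = (byte) (c >>> 8);
            out[2 * i + 1] = (byte) c;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(encoded("abc", StandardCharsets.UTF_8).length); // 3
        System.out.println(unencoded("abc").length);                       // 6
    }
}
```

The two forms hash to different values for the same text, so whichever is used has to be applied consistently when building and when querying a filter.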

I was interpreting the Hasher.Builder as a builder of a single byte[] for 
hashing where you would pass different primitive values or byte[] for the same 
Object you want to convert. This is essentially a ByteBuffer. But if it is to 
receive an entire object for each call then (a) it should be documented as 
such; (b) it should be simplified to just the byte[] method with 

Re: [BloomFilters] changes to BloomFilter

2020-03-16 Thread Claude Warren
First I think that the hasher.getBits( Shape ) should be renamed to
iterator( Shape ).  It was poorly named at the start.

By definition a Hasher knows how many items it is going to insert.

The Shape tells the hasher how many hash functions to apply to each item.
The Shape number of items is how many items are expected to be in the final
Bloom filter, it is more the expected value not a hard limit.

Keeping in mind the possibility of hash collisions, I don't see a way to
check that the Hasher has respected the number of functions.

The static hasher for example will not return duplicates so it might appear
that it has not respected the number of functions.  In addition there is no
indication from the hasher how many items it contains.

The inputs to the hash.builder are byte buffers that are fed to the hash
algorithm.  They are inputs to that algorithm.  So primitive types would
simply convert from the primitive type to its byte buffer representation.
Is that what you meant?

The hasher contract is that it will generate integers in the proper range
and use the proper number of hash functions for each item that was added to
the builder and that repeated calls to getBits(Shape) will return the same
values.

Did I misunderstand something?

Claude


On Mon, Mar 16, 2020 at 6:34 PM Alex Herbert 
wrote:

>
> On 16/03/2020 07:57, Claude Warren wrote:
> > I made a quick pass at changing getHasher() to iterator().
>
> A look at the feasibility or have you started work on this? If so then
> I'll not start work on it as well.
>
> I changed master to return a boolean for the merge operations in
> BloomFilter. So the outstanding changes are to drop getHasher() from the
> BloomFilter interface in favour of an iterator, spliterator and a
> forEachBit method.
>
> > I think we can get rid of HasherBloomFilter as its purpose was really to
> > create a Bloom filter for temporary usage and it doesn't seem to be
> > required if we have a hasher that can be created from a Shape and a
> > function that creates an Iterator.
>
> I agree.
>
> One change that could be made is to clarify the contract between a
> Hasher and a BloomFilter. At present the Hasher can operate without a
> defined contract in this method:
>
> PrimitiveIterator.OfInt getBits(Shape shape)
>
> It should validate that it can generate indexes for the shape. But it
> doesn't have to. It could return unlimited indexes and they could be
> outside the number of bits of the BloomFilter.
>
> There does not appear to be any control anywhere on the number of hash
> functions generated by the Hasher. I would expect this test in the
> AbstractBloomFilterTest to pass:
>
>     @Test
>     public void hasherMergeTest() {
>         int n = 1;
>         int m = 10;
>         HashFunctionIdentity h = new HashFunctionIdentityImpl("provider", "name",
>             Signedness.SIGNED, ProcessType.CYCLIC, 0L);
>         Hasher hasher = new Hasher() {
>             @Override
>             public boolean isEmpty() {
>                 return false;
>             }
>             @Override
>             public HashFunctionIdentity getHashFunctionIdentity() {
>                 return h;
>             }
>             @Override
>             public OfInt getBits(Shape shape) {
>                 // Do not respect the shape number of hash functions
>                 // but do respect the number of bits
>                 return IntStream.range(0, m).iterator();
>             }
>         };
>         for (int k = 1; k < 5; k++) {
>             Shape shape = new Shape(h, n, m, k);
>             BloomFilter bf = createEmptyFilter(shape);
>             bf.merge(hasher);
>             assertEquals("incorrect cardinality", k, bf.cardinality());
>         }
>     }
>
> It currently does not as all the BloomFilters will not respect the Shape
> with which they were created, i.e. they disregard the number of hash
> functions in the Shape. So does the Hasher.
>
> I think some of the control should be returned to the BloomFilter. The
> Hasher would be reduced to a simple generator of data for the
> BloomFilter, for example:
>
>  PrimitiveIterator.OfInt getBits(int m);
>  PrimitiveIterator.OfInt getBits(int k, int m);
>  PrimitiveIterator.OfLong getBits();
>
> The BloomFilter then accepts responsibility for converting the primitives
> to a suitable index and creating the correct number of hash functions
> (i.e. indexes).
>
> A merge operation with a BloomFilter then becomes:
>
> - check the Hasher is using the correct hash function identity
> - ask the Hasher for an iterator
> - set k bits in the filter using the iterator, mapping each to the range
> [0, m)
>
> The BloomFilter has then encapsulated its state and respects the Shape.
>
> The HashFunction will convert byte[] to a long.
>
> The Hasher exists to convert anything to a byte[] format.
>
> This perhaps needs the Hasher.Builder to be revised to include more
> methods that accept all the primitive data 

Re: [ANNOUNCE] Welcome Peter Lee (peterlee) as Apache Commons Committer

2020-03-16 Thread Woonsan Ko
Congrats and welcome, Peter!

Woonsan

On Mon, Mar 16, 2020 at 1:50 PM Gary Gregory  wrote:
>
> Hi All,
>
> Please welcome Peter Lee (peterlee) as our latest Apache Commons Committer!
>
> Gary




Re: [BloomFilters] changes to BloomFilter

2020-03-16 Thread Alex Herbert



On 16/03/2020 07:57, Claude Warren wrote:

I made a quick pass at changing getHasher() to iterator().


A look at the feasibility or have you started work on this? If so then 
I'll not start work on it as well.


I changed master to return a boolean for the merge operations in 
BloomFilter. So the outstanding changes are to drop getHasher() from the 
BloomFilter interface in favour of an iterator, spliterator and a 
forEachBit method.
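
A sketch of how those three access methods could sit together, assuming only the iterator has to be implemented and the other two are default methods built on it. `BitSource` is a hypothetical interface used for the example, not the BloomFilter interface itself.

```java
import java.util.PrimitiveIterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;
import java.util.stream.StreamSupport;

public class BitIterationSketch {
    // Hypothetical interface: only iterator() must be implemented.
    interface BitSource {
        PrimitiveIterator.OfInt iterator();

        // Efficient receipt of all indexes.
        default void forEachBit(IntConsumer consumer) {
            iterator().forEachRemaining(consumer);
        }

        // Default makes no promises (no size, no DISTINCT); an implementation
        // that knows more can override this with stronger characteristics.
        default Spliterator.OfInt spliterator() {
            return Spliterators.spliteratorUnknownSize(iterator(), 0);
        }
    }

    public static void main(String[] args) {
        BitSource source = () -> IntStream.of(1, 5, 9).iterator();
        int[] sum = {0};
        source.forEachBit(i -> sum[0] += i);
        System.out.println(sum[0]); // 15
        System.out.println(StreamSupport.intStream(source.spliterator(), false).count()); // 3
    }
}
```

An implementation that stores distinct indexes and knows their count could instead return `Spliterators.spliterator(iterator(), size, Spliterator.DISTINCT)`.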



I think we can get rid of HasherBloomFilter as its purpose was really to
create a Bloom filter for temporary usage and it doesn't seem to be
required if we have a hasher that can be created from a Shape and a
function that creates an Iterator.


I agree.

One change that could be made is to clarify the contract between a 
Hasher and a BloomFilter. At present the Hasher can operate without a 
defined contract in this method:


PrimitiveIterator.OfInt getBits(Shape shape)

It should validate that it can generate indexes for the shape. But it 
doesn't have to. It could return unlimited indexes and they could be 
outside the number of bits of the BloomFilter.


There does not appear to be any control anywhere on the number of hash 
functions generated by the Hasher. I would expect this test in the 
AbstractBloomFilterTest to pass:


    @Test
    public void hasherMergeTest() {
        int n = 1;
        int m = 10;
        HashFunctionIdentity h = new HashFunctionIdentityImpl("provider", "name",
            Signedness.SIGNED, ProcessType.CYCLIC, 0L);
        Hasher hasher = new Hasher() {
            @Override
            public boolean isEmpty() {
                return false;
            }
            @Override
            public HashFunctionIdentity getHashFunctionIdentity() {
                return h;
            }
            @Override
            public OfInt getBits(Shape shape) {
                // Do not respect the shape number of hash functions
                // but do respect the number of bits
                return IntStream.range(0, m).iterator();
            }
        };
        for (int k = 1; k < 5; k++) {
            Shape shape = new Shape(h, n, m, k);
            BloomFilter bf = createEmptyFilter(shape);
            bf.merge(hasher);
            assertEquals("incorrect cardinality", k, bf.cardinality());
        }
    }

It currently does not as all the BloomFilters will not respect the Shape 
with which they were created, i.e. they disregard the number of hash 
functions in the Shape. So does the Hasher.


I think some of the control should be returned to the BloomFilter. The 
Hasher would be reduced to a simple generator of data for the 
BloomFilter, for example:


    PrimitiveIterator.OfInt getBits(int m);
    PrimitiveIterator.OfInt getBits(int k, int m);
    PrimitiveIterator.OfLong getBits();

The BloomFilter then accepts responsibility for converting the primitives
to a suitable index and creating the correct number of hash functions 
(i.e. indexes).


A merge operation with a BloomFilter then becomes:

- check the Hasher is using the correct hash function identity
- ask the Hasher for an iterator
- set k bits in the filter using the iterator, mapping each to the range 
[0, m)


The BloomFilter has then encapsulated its state and respects the Shape.
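
The merge steps above can be sketched roughly as follows. This assumes a hasher that represents a single item and is identified by a plain name string; `MergeSketch`, `RawHasher` and `rangeHasher` are hypothetical names for the example and not the existing API.

```java
import java.util.BitSet;
import java.util.PrimitiveIterator;
import java.util.stream.LongStream;

public class MergeSketch {
    // Hypothetical stand-in for a Hasher reduced to a generator of raw values for one item.
    interface RawHasher {
        String identityName();
        PrimitiveIterator.OfLong getBits();
    }

    private final BitSet bits = new BitSet();
    private final String identityName;
    private final int m; // number of bits
    private final int k; // number of hash functions

    MergeSketch(String identityName, int m, int k) {
        this.identityName = identityName;
        this.m = m;
        this.k = k;
    }

    boolean merge(RawHasher hasher) {
        // 1. check the hasher is using the expected hash function identity
        if (!identityName.equals(hasher.identityName())) {
            throw new IllegalArgumentException("hash function identity mismatch");
        }
        // 2. ask the hasher for an iterator
        PrimitiveIterator.OfLong it = hasher.getBits();
        // 3. consume exactly k values, mapping each to the range [0, m)
        for (int i = 0; i < k; i++) {
            if (!it.hasNext()) {
                throw new IllegalStateException("hasher produced fewer than k values");
            }
            bits.set((int) Long.remainderUnsigned(it.nextLong(), m));
        }
        return true;
    }

    int cardinality() {
        return bits.cardinality();
    }

    // A hasher that ignores k entirely and offers the values 0..count-1.
    static RawHasher rangeHasher(String name, int count) {
        return new RawHasher() {
            @Override
            public String identityName() {
                return name;
            }
            @Override
            public PrimitiveIterator.OfLong getBits() {
                return LongStream.range(0, count).iterator();
            }
        };
    }

    public static void main(String[] args) {
        for (int k = 1; k < 5; k++) {
            MergeSketch filter = new MergeSketch("name", 10, k);
            filter.merge(rangeHasher("name", 10));
            System.out.println(k + " -> " + filter.cardinality());
        }
    }
}
```

Because the filter decides how many values to consume, the cardinality here equals k for each k from 1 to 4, which is the behaviour the test earlier in this message expects.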

The HashFunction will convert byte[] to a long.

The Hasher exists to convert anything to a byte[] format.

This perhaps needs the Hasher.Builder to be revised to include more 
methods that accept all the primitive data types. These are all 
converted to a single byte[] representation. Thus the Hasher.Builder is 
effectively a specification for a ByteBuffer. Once an object is 
decomposed into the byte[] it can be fed through the HashFunction with 
different seeds or using the cyclic method to create the iterator. The 
BloomFilter consumes the raw long output from the stream produced by the 
Hasher and sets k bits within the range m.
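
A small sketch of the "builder as ByteBuffer" idea, assuming big-endian encoding of primitives (the ByteBuffer default) and a single byte[] per object. `ByteBuilderSketch` and its `with` overloads are illustrative only and not a proposal for the exact method set.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ByteBuilderSketch {
    private ByteBuffer buffer = ByteBuffer.allocate(16);

    // Grow the backing buffer when the next value would not fit.
    private void ensure(int extra) {
        if (buffer.remaining() < extra) {
            int needed = buffer.position() + extra;
            ByteBuffer bigger = ByteBuffer.allocate(Math.max(buffer.capacity() * 2, needed));
            buffer.flip();
            bigger.put(buffer);
            buffer = bigger;
        }
    }

    ByteBuilderSketch with(int value) {
        ensure(Integer.BYTES);
        buffer.putInt(value);
        return this;
    }

    ByteBuilderSketch with(long value) {
        ensure(Long.BYTES);
        buffer.putLong(value);
        return this;
    }

    ByteBuilderSketch with(byte[] value) {
        ensure(value.length);
        buffer.put(value);
        return this;
    }

    // The single byte[] representation that would be fed to the hash function.
    byte[] build() {
        return Arrays.copyOf(buffer.array(), buffer.position());
    }

    public static void main(String[] args) {
        byte[] bytes = new ByteBuilderSketch().with(1).with(2L).with(new byte[] {9}).build();
        System.out.println(bytes.length);           // 13
        System.out.println(Arrays.toString(bytes)); // [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 9]
    }
}
```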


Alex






Re: [ANNOUNCE] Welcome Peter Lee (peterlee) as Apache Commons Committer

2020-03-16 Thread Rob Tompkins
Congrats Peter. Welcome!

-Rob

> On Mar 16, 2020, at 1:50 PM, Gary Gregory  wrote:
> 
> Hi All,
> 
> Please welcome Peter Lee (peterlee) as our latest Apache Commons Committer!
> 
> Gary





[ANNOUNCE] Welcome Peter Lee (peterlee) as Apache Commons Committer

2020-03-16 Thread Gary Gregory
Hi All,

Please welcome Peter Lee (peterlee) as our latest Apache Commons Committer!

Gary


Re: [commons-compress] branch master updated: Update my(Peter Lee) personal information in pom

2020-03-16 Thread Stefan Bodewig
welcome :-)

Stefan




Re: [BloomFilters] changes to BloomFilter

2020-03-16 Thread Claude Warren
I made a quick pass at changing getHasher() to iterator().

I think we can get rid of HasherBloomFilter as its purpose was really to
create a Bloom filter for temporary usage and it doesn't seem to be
required if we have a hasher that can be created from a Shape and a
function that creates an Iterator.

On Sun, Mar 15, 2020 at 6:08 PM Alex Herbert 
wrote:

> On Sun, 15 Mar 2020, 17:27 Claude Warren,  wrote:
>
> > We have spoken elsewhere about removing getHasher() and adding iterator().
> > In addition should we add forEachBit( IntConsumer )?
>
>
> I was thinking the same. So we provide an iterator allowing failfast on the
> first index that fails a criterion, e.g. for contains, and a foreach
> allowing efficient receipt of all indexes.
>
> The only thing missing is whether we add a spliterator which has in its API
> the ability to specify DISTINCT and the exact size of the number of
> indexes. The spliterator can be a default method using the iterator to
> create it. An implementation can provide one if it wants.
>
>
>
> >
> > --
> > I like: Like Like - The likeliest place on the web
> > 
> > LinkedIn: http://www.linkedin.com/in/claudewarren
> >
>


-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren