date:20200318

Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Claude Warren

bf.getBits().length * Long.BYTES may be as long as Math.ceil(
Shape.getNumberOfBits() / 8.0 ) or it may be shorter.

I am building byte buffers of a fixed length: the maximum size of any
valid bf.getBits().length * Long.BYTES.  For that I need to know
Shape.getNumberOfBytes().
The conversion is required for some Bloom filter indexing techniques.

And while serialization is outside the scope of the library, it is only
reasonable that we provide enough information to allow developers to
serialize/deserialize the data.  For example BloomFilter allows you to get
either the long[] representation or the list of bit indexes (via OfInt) and
there are ways to reconstruct a BloomFilter if you were to write that out
and read it back.
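
The sizing arithmetic discussed here can be sketched as below. This is only an illustration: the class ShapeSizes and both method names are hypothetical and are not part of the library. Only the Math.ceil( getNumberOfBits() / 8.0 ) expression comes from this thread.

```java
// Hypothetical helper for illustration; not part of commons-collections.
final class ShapeSizes {
    private ShapeSizes() {}

    /** Bytes needed to hold numberOfBits bits: Math.ceil(numberOfBits / 8.0). */
    static int numberOfBytes(int numberOfBits) {
        return (int) Math.ceil(numberOfBits / 8.0);
    }

    /** Bytes occupied by a long[] that is long enough to hold all numberOfBits bits. */
    static int fullLongArrayBytes(int numberOfBits) {
        int longs = (int) Math.ceil(numberOfBits / 64.0);
        return longs * Long.BYTES;
    }
}
```

For a shape of 72 bits, numberOfBytes gives 9 while a full-length long[] occupies 16 bytes, so the two sizes only agree when the number of bits is within 8 of a multiple of 64.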



On Wed, Mar 18, 2020 at 4:07 PM Alex Herbert 
wrote:

>
>
> > On 18 Mar 2020, at 14:39, Claude Warren  wrote:
> >
> >>> Shape Discussion:
> >>>
> >>> as for getNumberOfBytes() it should return the maximum number of bytes
> >>> returned by a getBits() call to a filter with this shape.  So yes, if
> > there
> >>> is a compressed internal representation, no it won't be that.  It is a
> >>> method on Shape so it should literally be Math.ceil( getNumberOfBits()
> /
> >>> 8.0 )
> >>>
> >>> Basically, if you want to create an array that will fit all the bits
> >>> returned by BloomFilter.iterator() you need an array of
> >>> Shape.getNumberOfBytes().  And that is actually what I use it for.
> >
> >> Then you are also mapping the index to a byte index and a bit within the
> > byte. So if you are doing these two actions then this is something that
> you
> > should control.
> >
> > BloomFilter.getBits returns a long[].  that long[] may be shorter than
> the
> > absolute number of bytes specified by Shape.  It also may be longer.
> >
> > If you want to create a copy of the byte[] you have to know how long it
> > should be.  The only way to determine that is from Shape, and currently
> > only if you do the Ceil() method noted above.  There is a convenience in
> > knowing how long (in bytes) the buffer can be.
>
> Copy of what byte[]?
>
> There is no method to create a byte[] for a BloomFilter. So no need for
> getNumberOfBytes().
>
> Are you talking about compressing the long[] to a byte[] by truncating the
> final long into 1-8 bytes?
>
> BloomFilter bf;
> long[] bits = bf.getBits();
> ByteBuffer bb = ByteBuffer.allocate(bits.length *
> Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
> Arrays.stream(bits).forEachOrdered(bb::putLong);
> byte[] bytes = bb.array();
> int expected = (int) Math.ceil(bf.getShape().getNumberOfBits() / 8.0);
> if (bytes.length != expected) {
> bytes = Arrays.copyOf(bytes, expected);
> }
>
> For a BloomFilter of any reasonable number of bits the storage saving will
> be small.
>
> Is this for serialisation? This is outside of the scope of the library.
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [math]Discussion: How to move out "EmptyClusterStrategy" from KMeansPlusPlusClusterer

2020-03-18 Thread Gilles Sadowski

Hi.

2020-03-18 15:10 UTC+01:00, chentao...@qq.com :
> Hi,
> I have created a PR to show my aim:
> https://github.com/apache/commons-math/pull/126/files

Am I correct that the implementations of "ClustersPointExtractor"
modify the argument of the "extract" method?
If so, that seems quite unsafe.  I would not expect this behaviour
in a public API.

Unless I missed some point, I'd ask again that the API be reviewed
*before* implementing several features (such as those "extractors")
on top of something that does not look right.
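
To make the concern concrete, here is a minimal sketch of an extractor that only reads its argument. Everything in it is an assumption for illustration: PointExtractor and MostPopularPointExtractor are hypothetical names, clusters are modelled as plain lists rather than Commons Math types, and the PR's real "extract" signature may differ.

```java
import java.util.Collection;
import java.util.List;

/** Hypothetical interface; the signature in the PR may differ. */
@FunctionalInterface
interface PointExtractor<T> {
    /** Returns a point chosen from the clusters without modifying them. */
    T extract(Collection<? extends List<T>> clusters);
}

/** Picks the first point of the most populated cluster; read-only on its argument. */
final class MostPopularPointExtractor<T> implements PointExtractor<T> {
    @Override
    public T extract(Collection<? extends List<T>> clusters) {
        List<T> largest = null;
        for (List<T> cluster : clusters) {
            if (largest == null || cluster.size() > largest.size()) {
                largest = cluster;
            }
        }
        if (largest == null || largest.isEmpty()) {
            throw new IllegalArgumentException("No points available");
        }
        // The point is returned, not removed; the caller decides what to do next.
        return largest.get(0);
    }
}
```

Because the method never writes to the collections it is given, it also works when the caller passes unmodifiable views.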

Best regards,
Gilles

>
>>Hello.
>>
>>On Wed, 11 Mar 2020 at 07:28, chentao...@qq.com wrote:
>>>
>>> Hi all,
>>> The "EmptyClusterStrategy" in KMeansPlusPlusClusterer can be reused by
>>> MiniBatchKMeansClusterer and other clustering algorithms.
>>> So I think the "EmptyClusterStrategy" should be moved out of
>>> KMeansPlusPlusClusterer (JIRA issue MATH-1525).
>>> I am not sure if my design is good or not.
>>
>>I can't say either; please provide more context/explanation
>>about the excerpts below.
>>
>>> I think there should be an interface:
>>>
>>> Solution 1: Explicitly indicate the usage by class name and function name.
>>> ```java
>>> @FunctionalInterface
>>> public interface ClusterBreeder {
>>>  T newCenterPoint((final
>>> Collection> clusters);
>>> }
>>
>>What is a "Breeder"?
>>This seems to further complicate the matter; what is a "center" (if there
>>can be old and new ones).
>
> I mean a method to create a new Cluster from existing clusters.
>
>>
>>Regards,
>>Gilles
>>
>>> ...
>>> // Implementations
>>> public LargestVarianceClusterPointBreeder implements ClusterBreeder{...}
>>> public MostPopularClusterPointBreeder implements ClusterBreeder{...}
>>> public FarthestPointBreeder implements ClusterBreeder{...}
>>> ...
>>> // Usage
>>> // KMeansPlusPlusClusterer.java
>>> public class KMeansPlusPlusClusterer extends
>>> Clusterer {
>>> ...
>>> private final ClusterBreeder clusterBreeder;
>>> public KMeansPlusPlusClusterer(final int k, final int maxIterations,
>>>final DistanceMeasure measure,
>>>final UniformRandomProvider random,
>>>final ClusterBreeder clusterBreeder) {
>>> ...
>>> this.clusterBreeder=clusterBreeder;
>>> }
>>> ...
>>> public List> cluster(final Collection points) {
>>> ...
>>> if (cluster.getPoints().isEmpty()) {
>>> if (clusterBreeder == null) {
>>> throw new
>>> ConvergenceException(LocalizedFormats.EMPTY_CLUSTER_IN_K_MEANS);
>>> } else {
>>> newCenter = clusterBreeder.newCenterPoint(clusters);
>>> }
>>> }
>>> ...
>>> }
>>> }
>>> ```
>>>
>>> Solution2: Declare a more generic interface:
>>> ```java
>>> @FunctionalInterface
>>> public interface ClustersPointFinder {
>>>  T find((final Collection>> extends Clusterable>> clusters);
>>> }
>>>
>>> ...
>>> // Implementations
>>> public LargestVarianceClusterPointFinder implements ClustersPointFinder
>>> {...}
>>> public MostPopularClusterPointFinder implements ClustersPointFinder {...}
>>> public FarthestPointFinder implements ClustersPointFinder {...}
>>> ```
>>>
>>> Thanks,
>>> -CT

-
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Alex Herbert




> On 18 Mar 2020, at 14:39, Claude Warren  wrote:
> 
>>> Shape Discussion:
>>> 
>>> as for getNumberOfBytes() it should return the maximum number of bytes
>>> returned by a getBits() call to a filter with this shape.  So yes, if
> there
>>> is a compressed internal representation, no it won't be that.  It is a
>>> method on Shape so it should literally be Math.ceil( getNumberOfBits() /
>>> 8.0 )
>>> 
>>> Basically, if you want to create an array that will fit all the bits
>>> returned by BloomFilter.iterator() you need an array of
>>> Shape.getNumberOfBytes().  And that is actually what I use it for.
> 
>> Then you are also mapping the index to a byte index and a bit within the
> byte. So if you are doing these two actions then this is something that you
> should control.
> 
> BloomFilter.getBits returns a long[].  that long[] may be shorter than the
> absolute number of bytes specified by Shape.  It also may be longer.
> 
> If you want to create a copy of the byte[] you have to know how long it
> should be.  The only way to determine that is from Shape, and currently
> only if you do the Ceil() method noted above.  There is a convenience in
> knowing how long (in bytes) the buffer can be.

Copy of what byte[]?

There is no method to create a byte[] for a BloomFilter. So no need for 
getNumberOfBytes().

Are you talking about compressing the long[] to a byte[] by truncating the 
final long into 1-8 bytes?

BloomFilter bf;
long[] bits = bf.getBits();
ByteBuffer bb = ByteBuffer.allocate(bits.length * Long.BYTES)
                          .order(ByteOrder.LITTLE_ENDIAN);
Arrays.stream(bits).forEachOrdered(bb::putLong);
byte[] bytes = bb.array();
int expected = (int) Math.ceil(bf.getShape().getNumberOfBits() / 8.0);
if (bytes.length != expected) {
    bytes = Arrays.copyOf(bytes, expected);
}

For a BloomFilter of any reasonable number of bits the storage saving will be 
small.

Is this for serialisation? This is outside of the scope of the library.
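
For reference, the truncation idea can be written as a self-contained round trip on a plain long[]. This is a sketch under assumptions: BitBytes, toBytes and toLongs are hypothetical names, no library types are used, and the little-endian layout is simply the one chosen in the snippet above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper working on a plain long[]; no library types are used.
final class BitBytes {
    private BitBytes() {}

    /** Packs the longs little-endian and sizes the result to ceil(numberOfBits / 8) bytes. */
    static byte[] toBytes(long[] bits, int numberOfBits) {
        ByteBuffer bb = ByteBuffer.allocate(bits.length * Long.BYTES)
                                  .order(ByteOrder.LITTLE_ENDIAN);
        for (long word : bits) {
            bb.putLong(word);
        }
        int expected = (int) Math.ceil(numberOfBits / 8.0);
        // copyOf truncates or zero-pads, so a shorter long[] is handled as well.
        return Arrays.copyOf(bb.array(), expected);
    }

    /** Reverses toBytes, zero-padding the final long. */
    static long[] toLongs(byte[] bytes) {
        int longs = (bytes.length + Long.BYTES - 1) / Long.BYTES;
        byte[] padded = Arrays.copyOf(bytes, longs * Long.BYTES);
        ByteBuffer bb = ByteBuffer.wrap(padded).order(ByteOrder.LITTLE_ENDIAN);
        long[] bits = new long[longs];
        for (int i = 0; i < longs; i++) {
            bits[i] = bb.getLong();
        }
        return bits;
    }
}
```

With 73 bits stored in two longs, toBytes returns 10 bytes instead of 16, which shows how small the saving is.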



Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Claude Warren

>> Shape Discussion:
>>
>> as for getNumberOfBytes() it should return the maximum number of bytes
>> returned by a getBits() call to a filter with this shape.  So yes, if there
>> is a compressed internal representation, no it won't be that.  It is a
>> method on Shape so it should literally be Math.ceil( getNumberOfBits() /
>> 8.0 )
>>
>> Basically, if you want to create an array that will fit all the bits
>> returned by BloomFilter.iterator() you need an array of
>> Shape.getNumberOfBytes().  And that is actually what I use it for.

> Then you are also mapping the index to a byte index and a bit within the
> byte. So if you are doing these two actions then this is something that you
> should control.

BloomFilter.getBits returns a long[].  That long[] may be shorter than the
absolute number of bytes specified by Shape.  It also may be longer.

If you want to create a copy of the byte[] you have to know how long it
should be.  The only way to determine that is from Shape, and currently
only by applying the Math.ceil() calculation noted above.  There is a
convenience in knowing how long (in bytes) the buffer can be.
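
The mapping from a bit index to a byte index and a bit within the byte could look like the sketch below. The class ByteBits and its methods are hypothetical, and the bit order within each byte (least significant bit first) is an assumption made for the example.

```java
// Hypothetical helper; names and bit order are illustrative only.
final class ByteBits {
    private ByteBits() {}

    /** Sets the given bit indexes in a buffer of ceil(numberOfBits / 8) bytes. */
    static byte[] fromIndexes(int numberOfBits, int... indexes) {
        byte[] buffer = new byte[(int) Math.ceil(numberOfBits / 8.0)];
        for (int index : indexes) {
            int byteIndex = index >> 3;   // index / 8
            int bitInByte = index & 7;    // index % 8
            buffer[byteIndex] |= (byte) (1 << bitInByte);
        }
        return buffer;
    }

    /** Tests whether the bit at the given index is set. */
    static boolean isSet(byte[] buffer, int index) {
        return (buffer[index >> 3] & (1 << (index & 7))) != 0;
    }
}
```

The buffer length is the only thing that has to come from the Shape; the rest is arithmetic on the index.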



On Wed, Mar 18, 2020 at 2:19 PM Claude Warren  wrote:

> We are getting to the point where there are a lot of options that
> determine which implementation is "best".  We could take a stab at creating
> a BloomFIlterFactory that takes a Shape as an argument and does a "finger
> in the air" guestimate of which implementation best fits.  Store values in
> long blocks or as integers in a list, that sort of thing.  Perhaps in a
> month or so when we really have some idea.
>
>
>
> On Wed, Mar 18, 2020 at 2:16 PM Claude Warren  wrote:
>
>> You don't need Iterator iterator() as we have forEachCount(
>> BitCountConsumer )
>>
>> I guess we need something like add( Iterator) or add(
>> Collection ) or add( Stream )
>>
>> It would be nice if we could have a BitCountProducer class that we could
>> just pass to an add() method.
>>
>>
>>
>>
>> On Wed, Mar 18, 2020 at 11:50 AM Alex Herbert 
>> wrote:
>>
>>>
>>>
>>> > On 18 Mar 2020, at 11:14, Claude Warren  wrote:
>>> >
>>> > On a slightly different note.  CountingBloomFilters have no way to
>>> perform
>>> > a reload.  All other bloom filters you can dump the bits and reload
>>> > (trivial) but if you preserve the counts from a bloom filter and want
>>> to
>>> > reload them you can't.  We need a constructor that takes the
>>> index,count
>>> > pairs somehow.
>>>
>>> Iterator ?
>>>
>>> Or foolproof:
>>>
>>> class IndexCount {
>>> final int index;
>>> final int count;
>>> // ...
>>> }
>>>
>>> Iterator
>>>
>>>
>>> The CountingBloomFilter already has a method forEachCount(…).
>>>
>>> I was reluctant to add some sort of iterator:
>>>
>>> Iterator iterator()
>>>
>>> But we could put in:
>>>
>>> Iterator iterator()
>>>
>>> It would be inefficient but at least it is fool-proof. The operation is
>>> unlikely to be used very often.
>>>
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>
>>>
>>
>> --
>> I like: Like Like - The likeliest place on the web
>> 
>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>
>
>
> --
> I like: Like Like - The likeliest place on the web
> 
> LinkedIn: http://www.linkedin.com/in/claudewarren
>


-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Claude Warren

We are getting to the point where there are a lot of options that determine
which implementation is "best".  We could take a stab at creating a
BloomFilterFactory that takes a Shape as an argument and does a "finger in
the air" guesstimate of which implementation best fits: store values in
long blocks or as integers in a list, that sort of thing.  Perhaps in a
month or so when we really have some idea.



On Wed, Mar 18, 2020 at 2:16 PM Claude Warren  wrote:

> You don't need Iterator iterator() as we have forEachCount(
> BitCountConsumer )
>
> I guess we need something like add( Iterator) or add(
> Collection ) or add( Stream )
>
> It would be nice if we could have a BitCountProducer class that we could
> just pass to an add() method.
>
>
>
>
> On Wed, Mar 18, 2020 at 11:50 AM Alex Herbert 
> wrote:
>
>>
>>
>> > On 18 Mar 2020, at 11:14, Claude Warren  wrote:
>> >
>> > On a slightly different note.  CountingBloomFilters have no way to
>> perform
>> > a reload.  All other bloom filters you can dump the bits and reload
>> > (trivial) but if you preserve the counts from a bloom filter and want to
>> > reload them you can't.  We need a constructor that takes the index,count
>> > pairs somehow.
>>
>> Iterator ?
>>
>> Or foolproof:
>>
>> class IndexCount {
>> final int index;
>> final int count;
>> // ...
>> }
>>
>> Iterator
>>
>>
>> The CountingBloomFilter already has a method forEachCount(…).
>>
>> I was reluctant to add some sort of iterator:
>>
>> Iterator iterator()
>>
>> But we could put in:
>>
>> Iterator iterator()
>>
>> It would be inefficient but at least it is fool-proof. The operation is
>> unlikely to be used very often.
>>
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>> For additional commands, e-mail: dev-h...@commons.apache.org
>>
>>
>
> --
> I like: Like Like - The likeliest place on the web
> 
> LinkedIn: http://www.linkedin.com/in/claudewarren
>


-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Claude Warren

You don't need Iterator<IndexCount> iterator() as we have forEachCount(
BitCountConsumer )

I guess we need something like add( Iterator<IndexCount> ) or add(
Collection<IndexCount> ) or add( Stream<IndexCount> )

It would be nice if we could have a BitCountProducer class that we could
just pass to an add() method.
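
A rough sketch of that idea follows. All of it is hypothetical: BitCountProducer does not exist yet, the BitCountConsumer declared here is a local stand-in whose accept(int, int) signature is assumed, and SimpleCounts is a toy array of counters rather than a real CountingBloomFilter.

```java
/** Local stand-in; the accept(index, count) signature is an assumption. */
@FunctionalInterface
interface BitCountConsumer {
    void accept(int index, int count);
}

/** Hypothetical: anything that can push its (index, count) pairs to a consumer. */
@FunctionalInterface
interface BitCountProducer {
    void forEachCount(BitCountConsumer consumer);
}

/** Toy counting store, used only to show add(BitCountProducer). */
final class SimpleCounts implements BitCountProducer {
    private final int[] counts;

    SimpleCounts(int numberOfBits) {
        counts = new int[numberOfBits];
    }

    void increment(int index) {
        counts[index]++;
    }

    int count(int index) {
        return counts[index];
    }

    /** Adds every count supplied by the producer to this store. */
    void add(BitCountProducer producer) {
        producer.forEachCount((index, count) -> counts[index] += count);
    }

    /** Reports only the non-zero counts. */
    @Override
    public void forEachCount(BitCountConsumer consumer) {
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] != 0) {
                consumer.accept(i, counts[i]);
            }
        }
    }
}
```

Since the producer is a functional interface, a reload from storage can be a lambda that replays saved pairs, and an existing counting filter can be passed directly.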




On Wed, Mar 18, 2020 at 11:50 AM Alex Herbert 
wrote:

>
>
> > On 18 Mar 2020, at 11:14, Claude Warren  wrote:
> >
> > On a slightly different note.  CountingBloomFilters have no way to
> perform
> > a reload.  All other bloom filters you can dump the bits and reload
> > (trivial) but if you preserve the counts from a bloom filter and want to
> > reload them you can't.  We need a constructor that takes the index,count
> > pairs somehow.
>
> Iterator ?
>
> Or foolproof:
>
> class IndexCount {
> final int index;
> final int count;
> // ...
> }
>
> Iterator
>
>
> The CountingBloomFilter already has a method forEachCount(…).
>
> I was reluctant to add some sort of iterator:
>
> Iterator iterator()
>
> But we could put in:
>
> Iterator iterator()
>
> It would be inefficient but at least it is fool-proof. The operation is
> unlikely to be used very often.
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [math]Discussion: How to move out "EmptyClusterStrategy" from KMeansPlusPlusClusterer

2020-03-18 Thread chentao...@qq.com

Hi, 
    I have created a PR to show my aim: 
https://github.com/apache/commons-math/pull/126/files 

>Hello.
>
>On Wed, 11 Mar 2020 at 07:28, chentao...@qq.com wrote:
>>
>> Hi all,
>> The "EmptyClusterStrategy" in KMeansPlusPlusClusterer can be reused by
>> MiniBatchKMeansClusterer and other clustering algorithms.
>> So I think the "EmptyClusterStrategy" should be moved out of
>> KMeansPlusPlusClusterer (JIRA issue MATH-1525).
>> I am not sure if my design is good or not.
>
>I can't say either; please provide more context/explanation
>about the excerpts below.
>
>> I think there should be an interface:
>>
>> Solution 1: Explicitly indicate the usage by class name and function name.
>> ```java
>> @FunctionalInterface
>> public interface ClusterBreeder {
>>  T newCenterPoint((final 
>>Collection> clusters);
>> }
>
>What is a "Breeder"?
>This seems to further complicate the matter; what is a "center" (if there
>can be old and new ones).

I mean a method to create a new Cluster from existing clusters.

>
>Regards,
>Gilles
>
>> ...
>> // Implementations
>> public LargestVarianceClusterPointBreeder implements ClusterBreeder{...}
>> public MostPopularClusterPointBreeder implements ClusterBreeder{...}
>> public FarthestPointBreeder implements ClusterBreeder{...}
>> ...
>> // Usage
>> // KMeansPlusPlusClusterer.java
>> public class KMeansPlusPlusClusterer extends 
>> Clusterer {
>> ...
>> private final ClusterBreeder clusterBreeder;
>> public KMeansPlusPlusClusterer(final int k, final int maxIterations,
>>    final DistanceMeasure measure,
>>    final UniformRandomProvider random,
>>    final ClusterBreeder clusterBreeder) {
>> ...
>> this.clusterBreeder=clusterBreeder;
>> }
>> ...
>> public List> cluster(final Collection points) {
>> ...
>> if (cluster.getPoints().isEmpty()) {
>> if (clusterBreeder == null) {
>> throw new 
>>ConvergenceException(LocalizedFormats.EMPTY_CLUSTER_IN_K_MEANS);
>> } else {
>> newCenter = clusterBreeder.newCenterPoint(clusters);
>> }
>> }
>> ...
>> }
>> }
>> ```
>>
>> Solution2: Declare a more generic interface:
>> ```java
>> @FunctionalInterface
>> public interface ClustersPointFinder {
>>  T find((final Collection>extends Clusterable>> clusters);
>> }
>>
>> ...
>> // Implementations
>> public LargestVarianceClusterPointFinder implements ClustersPointFinder {...}
>> public MostPopularClusterPointFinder implements ClustersPointFinder {...}
>> public FarthestPointFinder implements ClustersPointFinder {...}
>> ```
>>
>> Thanks,
>> -CT
>
>-
>To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>For additional commands, e-mail: dev-h...@commons.apache.org
>
>

Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Alex Herbert




> On 18 Mar 2020, at 11:14, Claude Warren  wrote:
> 
> On a slightly different note.  CountingBloomFilters have no way to perform
> a reload.  All other bloom filters you can dump the bits and reload
> (trivial) but if you preserve the counts from a bloom filter and want to
> reload them you can't.  We need a constructor that takes the index,count
> pairs somehow.

Iterator<int[]> ?

Or foolproof:

class IndexCount {
    final int index;
    final int count;
    // ...
}

Iterator<IndexCount>


The CountingBloomFilter already has a method forEachCount(…).

I was reluctant to add some sort of iterator:

Iterator<int[]> iterator()

But we could put in:

Iterator<IndexCount> iterator()

It would be inefficient but at least it is fool-proof. The operation is 
unlikely to be used very often.
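
As a sketch of the reload path: the IndexCount fields are the ones listed above, while the constructor, the CountReloader class and its reload method are assumptions added for the example. A plain int[] stands in for the filter's internal counts.

```java
import java.util.Iterator;

final class IndexCount {
    final int index;
    final int count;

    IndexCount(int index, int count) {
        this.index = index;
        this.count = count;
    }
}

/** Hypothetical reload helper; not the library's constructor. */
final class CountReloader {
    private CountReloader() {}

    /** Rebuilds a counts array from (index, count) pairs, validating each pair. */
    static int[] reload(int numberOfBits, Iterator<IndexCount> pairs) {
        int[] counts = new int[numberOfBits];
        while (pairs.hasNext()) {
            IndexCount ic = pairs.next();
            if (ic.index < 0 || ic.index >= numberOfBits || ic.count < 0) {
                throw new IllegalArgumentException(
                    "Bad pair: " + ic.index + "," + ic.count);
            }
            counts[ic.index] = ic.count;
        }
        return counts;
    }
}
```

The typed pair removes any ambiguity about which int is the index and which is the count, at the cost of one small object per entry.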




Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Alex Herbert

> On 17 Mar 2020, at 22:34, Claude Warren  wrote:
> 
> Builder discussion:
> 
> Let's go with
> 
>>> Builder with(CharSequence, Charset);
>>> Builder withUnencoded(CharSequence);

Added to master.

I already note that not having it mandate UTF-8 is annoying. I had to include 
StandardCharsets in a lot of places in the test code. So perhaps we add:

Builder withUtf8(CharSequence)
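
Something like the following sketch, where Builder is a hypothetical stand-in for the Hasher builder: with(byte[]) is assumed to be the only abstract method, with(CharSequence, Charset) follows the default method proposed earlier in the thread, and withUtf8 is the possible addition.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for the Hasher builder discussed in this thread. */
interface Builder {
    /** Adds a byte[] item; assumed here to be the only abstract method. */
    Builder with(byte[] item);

    /** Adds a character sequence using the given encoding. */
    default Builder with(CharSequence item, Charset charset) {
        return with(item.toString().getBytes(charset));
    }

    /** Convenience method: always encodes as UTF-8. */
    default Builder withUtf8(CharSequence item) {
        return with(item, StandardCharsets.UTF_8);
    }
}
```

Callers that always use UTF-8 would then not need to import StandardCharsets at all.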

> 
> Shape Discussion:
> 
> as for getNumberOfBytes() it should return the maximum number of bytes
> returned by a getBits() call to a filter with this shape.  So yes, if there
> is a compressed internal representation, no it won't be that.  It is a
> method on Shape so it should literally be Math.ceil( getNumberOfBits() /
> 8.0 )
> 
> Basically, if you want to create an array that will fit all the bits
> returned by BloomFilter.iterator() you need an array of
> Shape.getNumberOfBytes().  And that is actually what I use it for.

Then you are also mapping the index to a byte index and a bit within the byte. 
So if you are doing these two actions then this is something that you should 
control.

> 
> Bloom Filter Discussion:
> 
> I have not used probability() on a single filter, just on the Shape.
> However, I can see the value of it.  It seems to me that the difference
> between Shape.probability() and BloomFilter.probability() would give you an
> indication of how many collisions have already occurred.
> 
> In the SetOperations class there is an "estimateSize()" method.  It is the
> only single Bloom filter argument method in the class and I think it should
> be moved into BloomFilter so we have 2 new methods:
> 
> probability()
> estimateSize()
> 
> Counting Filter Discussion:
> 
> As for counting filters we could implement several
> int
> short
> byte

Yes. Each supports adding a maximum number of the same item. Since we do not 
know the use case leaving it at only int for now is easiest.

The alternative is duplicating the logic for each backing storage, removing the 
public constructor, making the class abstract and providing a factory 
constructor:

ArrayCountingBloomFilter bf = ArrayCountingBloomFilter.create(shape, int 
maximumDuplicateItems);

The actual instance returned is then based on the capacity required to store 
the duplicates.

However the counts at each index are random and may exceed the duplicates by 
chance. I don’t want to go the route of requiring probability computations in 
the factory constructor for likelihood of exceeding the capacity.

A simple approach would be to have a byte[] version when you expect not to add 
duplicates. This filter will simply function to allow removing items you 
previously added. The probabilities I previously listed show that a count of 
127 by random chance is < 1e-100 if the filter is reasonably big. We should at 
least provide a link to this computation. It requires a binomial distribution 
and collections does not currently depend on commons-math.

The int[] version should be used when you expect to be able to add duplicates 
and want to use the contains(Hasher, count) function.
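
For a feel of the magnitudes, the binomial term can be evaluated in log space without commons-math. This is an illustrative sketch and not the computation referred to above: the class and method names are hypothetical, and it assumes every index selection is uniform and independent, so the count at one index after a given number of selections is binomial with p = 1/numberOfBits.

```java
/** Illustrative only; not part of the library. */
final class CountOverflowEstimate {
    private CountOverflowEstimate() {}

    /**
     * Returns log10 of the binomial probability that exactly c out of
     * 'trials' independent index selections hit one given index, where
     * each selection hits it with probability 1 / numberOfBits.
     */
    static double log10ProbabilityOfCount(int trials, int numberOfBits, int c) {
        double p = 1.0 / numberOfBits;
        // ln C(trials, c) as a sum of logs to avoid overflow.
        double logChoose = 0;
        for (int i = 1; i <= c; i++) {
            logChoose += Math.log((double) (trials - c + i) / i);
        }
        double ln = logChoose + c * Math.log(p) + (trials - c) * Math.log1p(-p);
        return ln / Math.log(10);
    }
}
```

As an example with made-up parameters, 1000 items with 7 hash functions each (7000 selections) into 10000 bits gives a log10 probability for a count of exactly 127 that is far below -100. When the expected count per index is this much smaller than 127, the terms for larger counts shrink rapidly, so this single term dominates the tail.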

> 
> Currently they would all have to return int counts but realistically that
> is what would be used in code anyway.  Also, once primitives can be used in
> generics this will be easier.
> 
> As for contains( Hasher, int ), I think we would need to add contains(
> BloomFilter, int).  If I understand correctly, contains( BloomFilter, X )
> would test that a BloomFilter has been added X times or rather that there
> are enough counts in the right places to make it appear that BloomFilter
> has been added X times.  When used with a Hasher, remove the duplicates,
> and perform the same test.
> 
> I see no reason not to add them.

OK. 

> 
> On Tue, Mar 17, 2020 at 6:23 PM Alex Herbert
> wrote:
> 
>> 
>> 
>>> On 17 Mar 2020, at 17:06, Claude Warren  wrote:
>>> 
>>> On Tue, Mar 17, 2020 at 4:38 PM Alex Herbert
>>> wrote:
>>> 
>>>> 
>>>>> On 17 Mar 2020, at 15:41, Claude Warren  wrote:
>>>>> 
>>>>> I agree with the HashFunction changes.
>>>> 
>>>> OK, but which ones?
>>>> 
>>> 
>>> DOH! this one...
>>> 
>>>> 
>>>> Changing HashFunction to have two methods:
>>>> 
>>>> long hash(byte[])
>>>> long increment(int seed)
>> 
>> OK. I’ll update.
>> 
>>>> 
>>>>> I think Builder should have
>>>>> with(byte[])
>>>>> with(byte[], int offset, int len )
>>>> 
>>>> Not convinced here. The HashFunction requires a byte[] and cannot operate
>>>> on a range. This change should be made in conjunction with a similar change
>>>> to HashFunction. So should we update HashFunction to:
>>>> 
>>> Given the depth of the change let's just leave the with( byte[] )
>>> 
>>> 
>>>>> with(String)
>>>>> 
>>>>> I find that I use with(String) more than any other with() method.
>>>> 
>>>> That may be so but String.getBytes(Charset) is trivial to call for the
>>>> user. Then they get to decide on the encoding and not leave it to the
>>>> Hasher. I would use UTF-16 because it would be

Re: [BloomFilters] changes to BloomFilter

2020-03-18 Thread Claude Warren

On a slightly different note.  CountingBloomFilters have no way to perform
a reload.  All other bloom filters you can dump the bits and reload
(trivial) but if you preserve the counts from a bloom filter and want to
reload them you can't.  We need a constructor that takes the index,count
pairs somehow.

On Tue, Mar 17, 2020 at 10:34 PM Claude Warren  wrote:

> Builder discussion:
>
> Let's go with
>
> >> Builder with(CharSequence, Charset);
> >> Builder withUnencoded(CharSequence);
>
> Shape Discussion:
>
> as for getNumberOfBytes() it should return the maximum number of bytes
> returned by a getBits() call to a filter with this shape.  So yes, if there
> is a compressed internal representation, no it won't be that.  It is a
> method on Shape so it should literally be Math.ceil( getNumberOfBits() /
> 8.0 )
>
> Basically, if you want to create an array that will fit all the bits
> returned by BloomFilter.iterator() you need an array of
> Shape.getNumberOfBytes().  And that is actually what I use it for.
>
> Bloom Filter Discussion:
>
> I have not used probability() on a single filter, just on the Shape.
> However, I can see the value of it.  It seems to me that the difference
> between Shape.probability() and BloomFilter.probability() would give you an
> indication of how many collisions have already occurred.
>
> In the SetOperations class there is an "estimateSize()" method.  It is the
> only single Bloom filter argument method in the class and I think it should
> be moved into BloomFilter so we have 2 new methods:
>
> probability()
> estimateSize()
>
> Counting Filter Discussion:
>
> As for counting filters we could implement several
> int
> short
> byte
>
> Currently they would all have to return int counts but realistically that
> is what would be used in code anyway.  Also, once primitives can be used in
> generics this will be easier.
>
> As for contains( Hasher, int ), I think we would need to add contains(
> BloomFilter, int).  If I understand correctly, contains( BloomFilter, X )
> would test that a BloomFilter has been added X times or rather that there
> are enough counts in the right places to make it appear that BloomFilter
> has been added X times.  When used with a Hasher, remove the duplicates,
> and perform the same test.
>
> I see no reason not to add them.
>
> On Tue, Mar 17, 2020 at 6:23 PM Alex Herbert 
> wrote:
>
>>
>>
>> > On 17 Mar 2020, at 17:06, Claude Warren  wrote:
>> >
>> > On Tue, Mar 17, 2020 at 4:38 PM Alex Herbert 
>> > wrote:
>> >
>> >>
>> >>
>> >>> On 17 Mar 2020, at 15:41, Claude Warren  wrote:
>> >>>
>> >>> I agree with the HashFunction changes.
>> >>
>> >> OK, but which ones?
>> >>
>> >
>> > DOH! this one...
>> >
>> >>
>> >> Changing HashFunction to have two methods:
>> >>
>> >> long hash(byte[])
>> >> long increment(int seed)
>>
>> OK. I’ll update.
>>
>> >>
>> >>> I think Builder should have
>> >>> with(byte[])
>> >>> with(byte[], int offset, int len )
>> >>
>> >> Not convinced here. The HashFunction requires a byte[] and cannot
>> operate
>> >> on a range. This change should be made in conjunction with a similar
>> change
>> >> to HashFunction. So should we update HashFunction to:
>> >>
>> >>
>> > Given the depth of the change let's just leave the with( byte[] )
>> >
>> >
>> >>> with(String)
>> >>>
>> >>> I find that I use with(String) more than any other with() method.
>> >>
>> >> That may be so but String.getBytes(Charset) is trivial to call for the
>> >> user. Then they get to decide on the encoding and not leave it to the
>> >> Hasher. I would use UTF-16 because it would be fast. But UTF-8 is nice
>> as a
>> >> cross-language standard. Leave it out of the API for now, or add both:
>> >>
>> >> Builder with(CharSequence, Charset);
>> >> Builder withUnencoded(CharSequence);
>> >>
>> >
>> > CharSequence has no easy method to convert to a byte[]. While it could
>> be
>> > done, it looks to be more of a streaming interface.  Let's leave that
>> out.
>>
>> I was thinking:
>>
>> /**
>>  * Adds a character sequence item to the hasher using the specified encoding.
>>  *
>>  * @param item the item to add
>>  * @param charset the character set
>>  * @return a reference to this object
>>  */
>> default Builder with(CharSequence item, Charset charset) {
>>     return with(item.toString().getBytes(charset));
>> }
>>
>> /**
>>  * Adds a character sequence item to the hasher. Each 16-bit character is
>>  * converted to 2 bytes using little-endian order.
>>  *
>>  * @param item the item to add
>>  * @return a reference to this object
>>  */
>> default Builder withUnencoded(CharSequence item) {
>>     final int length = item.length();
>>     final byte[] bytes = new byte[length * 2];
>>     for (int i = 0; i < length; i++) {
>>         final char ch = item.charAt(i);
>>         bytes[i * 2] =

Re: Release Announcement: General Availability of Java 14 / JDK 14

2020-03-18 Thread Hasan Diwan

Congrats to all of you who were involved with this project! -- H


On Wed, 18 Mar 2020 at 01:47, Rory O'Donnell 
wrote:

>Hi Benedikt,
>
>
> **Release Announcement: General Availability of Java 14 / JDK 14 [1] * *
>
>   * JDK 14, the reference implementation of Java 14, is now Generally
> Available.
>   * GPL-licensed OpenJDK builds from Oracle are available here:
> https://jdk.java.net/14
>   * JDK 14 Release notes
> <
> https://www.oracle.com/technetwork/java/javase/14-relnote-issues-5809570.html
> >
>
>
>
> JDK 14  includes sixteen features [2]:
>
>305: Pattern Matching for instanceof (Preview)
>343: Packaging Tool (Incubator)
>345: NUMA-Aware Memory Allocation for G1
>349: JFR Event Streaming
>352: Non-Volatile Mapped Byte Buffers
>358: Helpful NullPointerExceptions
>359: Records (Preview)
>361: Switch Expressions (Standard)
>362: Deprecate the Solaris and SPARC Ports
>363: Remove the Concurrent Mark Sweep (CMS) Garbage Collector
>364: ZGC on macOS
>365: ZGC on Windows
>366: Deprecate the ParallelScavenge + SerialOld GC Combination
>367: Remove the Pack200 Tools and API
>368: Text Blocks (Second Preview)
>370: Foreign-Memory Access API (Incubator)
>
> Thanks to everyone who contributed to JDK 14, whether by creating
> features or enhancements, logging  bugs, or downloading and testing the
> early-access builds.
>
> OpenJDK 15 EA build 14 is now available at http://jdk.java.net/15 *
> *
>
>   * These early access, open source builds are provided under the GNU
> General Public License, version 2, with the Classpath Exception
> .
>   * Significant changes since the last availability email:
>   o Build 13 - JDK-8238555
> : Allow
> Initialization of SunPKCS11 with NSS when there are external
> FIPS modules in the NSSDB
>   o Build 10 - JDK-8237776
> : Shenandoah:
> Wrong result with Lucene test
>   + Reported by Apache Lucene.
>   o Build 9 - JDK-8222793
> : Javadoc tool
> ignores "-locale" param and uses default locale for all messages
> and texts
>   + Reported by Apache Lucene.
>
> Project Metropolis Early-Access Builds - Build 14-metropolis+1-17
>  (2020/3/5)
>
>   * These builds are intended for developers looking to test and provide
> feedback on using /Graal,/ in form of native library
> /(libjvmcicompiler.so)/, instead of C2 as HotSpot high optimizing
> JIT compiler.
>   * These early-access builds are provided under the GNU General Public
> License, version 2, with the Classpath Exception
> .
>   * Please send feedback via e-mail to metropolis-...@openjdk.java.net
> . To send e-mail to this
> address you must first subscribe to the mailing list
> .
>
>
> Regards,
> Rory
>
> [1] https://mail.openjdk.java.net/pipermail/jdk-dev/2020-March/004089.html
> [2] https://openjdk.java.net/projects/jdk/14
>
> --
> Rgds, Rory O'Donnell
> Quality Engineering Manager
> Oracle EMEA, Dublin, Ireland
>
>

-- 
OpenPGP:
https://sks-keyservers.net/pks/lookup?op=get=0xFEBAD7FFD041BBA1
If you wish to request my time, please do so using
bit.ly/hd1AppointmentRequest.
If you would like to get acquainted, go to bit.ly/hd1AppointmentRequest.

Sent from my mobile device

Release Announcement: General Availability of Java 14 / JDK 14

2020-03-18 Thread Rory O'Donnell


  Hi Benedikt,


Release Announcement: General Availability of Java 14 / JDK 14 [1]

 * JDK 14, the reference implementation of Java 14, is now Generally
   Available.
 * GPL-licensed OpenJDK builds from Oracle are available here:
   https://jdk.java.net/14
 * JDK 14 Release notes:
   https://www.oracle.com/technetwork/java/javase/14-relnote-issues-5809570.html




JDK 14  includes sixteen features [2]:

  305: Pattern Matching for instanceof (Preview)
  343: Packaging Tool (Incubator)
  345: NUMA-Aware Memory Allocation for G1
  349: JFR Event Streaming
  352: Non-Volatile Mapped Byte Buffers
  358: Helpful NullPointerExceptions
  359: Records (Preview)
  361: Switch Expressions (Standard)
  362: Deprecate the Solaris and SPARC Ports
  363: Remove the Concurrent Mark Sweep (CMS) Garbage Collector
  364: ZGC on macOS
  365: ZGC on Windows
  366: Deprecate the ParallelScavenge + SerialOld GC Combination
  367: Remove the Pack200 Tools and API
  368: Text Blocks (Second Preview)
  370: Foreign-Memory Access API (Incubator)

Thanks to everyone who contributed to JDK 14, whether by creating 
features or enhancements, logging  bugs, or downloading and testing the 
early-access builds.


OpenJDK 15 EA build 14 is now available at http://jdk.java.net/15

 * These early access, open source builds are provided under the GNU
   General Public License, version 2, with the Classpath Exception.
 * Significant changes since the last availability email:
     o Build 13 - JDK-8238555: Allow Initialization of SunPKCS11 with
       NSS when there are external FIPS modules in the NSSDB
     o Build 10 - JDK-8237776: Shenandoah: Wrong result with Lucene test
         + Reported by Apache Lucene.
     o Build 9 - JDK-8222793: Javadoc tool ignores "-locale" param and
       uses default locale for all messages and texts
         + Reported by Apache Lucene.

Project Metropolis Early-Access Builds - Build 14-metropolis+1-17 (2020/3/5)

 * These builds are intended for developers looking to test and provide
   feedback on using Graal, in form of native library
   (libjvmcicompiler.so), instead of C2 as HotSpot high optimizing
   JIT compiler.
 * These early-access builds are provided under the GNU General Public
   License, version 2, with the Classpath Exception.
 * Please send feedback via e-mail to metropolis-...@openjdk.java.net.
   To send e-mail to this address you must first subscribe to the
   mailing list.


Regards,
Rory

[1] https://mail.openjdk.java.net/pipermail/jdk-dev/2020-March/004089.html
[2] https://openjdk.java.net/projects/jdk/14

--
Rgds, Rory O'Donnell
Quality Engineering Manager
Oracle EMEA, Dublin, Ireland
