Re: [collections] Bloom filters

2020-03-01 Thread Claude Warren
for the remove(), add(), and subtract() methods I agree that void is not
correct and it should be boolean and be the same as the value you would get
from calling isValid().

You are correct the getCounts() should return an iterator of some type on
int[], I don't know why I thought long[].  I am happy with a plain
Iterator as the return.

Claude



On Mon, Mar 2, 2020 at 1:02 AM Alex Herbert 
wrote:

>
>
> > On 1 Mar 2020, at 15:39, Claude Warren  wrote:
> >
> > I think the CountingBloomFilter interface needs to extend BloomFilter.
>
> I said that but did not write it, sorry.
>
> >
> > I think I am confused.
> >
> > I would expect CountingBloomFilter to have
> >
> > interface CountingBloomFilter extends BloomFilter {
> >// these 2 methods are the reverse of merge()
> >void remove( BloomFilter );
> >void remove( Hasher );
>
> Fine. Same intention but different method names. But why void? It forces
> you to check if the remove was valid with a second call. On the plus side
> it matches the void merge(…) methods and in most cases a user would not
> care to check anyway. If they are controlling the filter then leave it to
> them to make sure they do not remove something they did not add.
>
> >
> >// these 2 methods are the addition/subtraction of counts
> >void add( CountingBloomFilter )
> >void subtract( CountingBloomFilter );
>
> Fine. But same comment as above with the void return.
>
> >
> >// 2 methods to retrieve data
> >Stream getCounts();
>
> I don’t like this use of long[]. In my previous post I argued that if you
> were to ever want to store more than max integer items in a filter then the
> Bloom filter would have more bit indices than max integer. So we never have
> to support long counts. A filter that exceeds max integer for a count is
> highly likely to be saturated and no use as a filter anyway.
>
> For most backing implementations the object type of the stream will be
> different so you will have to write a Spliterator implementation or wrap
> some iterator anyway. So why not return the Spliterator:
>
>Spliterator getCounts();
>
> Since the backing implementation will likely not store int[] pairs then
> this will have a lot of object creation and garbage collection overhead to
> go through the counts. This does not seem to be a big concern here if the
> purpose is the same as for the BloomFilter for long[] getBits(), i.e. to
> get a canonical representation for storage.
>
> Note: The Spliterator may not have a known size (number of non zero bits)
> at creation, for example if the counts are stored in a fixed size array.
> Thus efficient parallel traversal by binary tree splitting is limited by
> how evenly the counts are distributed. For a backing implementation using a
> collection then the size should be known. In this case a Spliterator would
> be of more use than a plain Iterator. You can convert one to the other
> using:
>
> java.util.Spliterators:
> public static Iterator iterator(Spliterator spliterator)
> public static  Spliterator spliterator(Iterator
> iterator,
>  long size,
>  int characteristics)
>
> So which to choose for the API?
>
> The Hasher currently uses an Iterator:
>
> PrimitiveIterator.OfInt getBits(Shape shape);
>
> In the case of a StaticHasher this could return a spliterator. But the
> DynamicHasher needs a reworking of the internal Iterator class. It could be
> a Spliterator to use the new IntConsumer API but in most (all) cases
> splitting would not be possible for dynamic hashing as the parts are
> produced in order. It is likely that they will be consumed sequentially too.
>
> I would suggest that Spliterator is the more modern implementation,
> despite not always being applicable to parallelisation in a stream.
> Currently the iterator from the Hasher is used in forEachRemaining() and
> while loop is approximately equal measure. The while loops are for a fast
> exit and would be uglier if rewritten for a
> Spliterator.tryAdvance(IntConsumer) syntax.
>
> There is a use of the IteratorChain in HasherBloomFilter that would need a
> rethink for spliterators.
>
> The path of least resistance is to use Iterator for the API of
> CountingBloomFilter to be consistent with Hasher’s use of Iterator.
>
> WDYT?
>
>
> >boolean isValid()
>
> Fine. Allows some level of feedback that the counts are corrupt.
>
> > }
> >
> > Claude
> >
> >
> > On Sun, Mar 1, 2020 at 2:48 PM Alex Herbert  >
> > wrote:
> >
> >>
> >>
> >>> On 1 Mar 2020, at 09:28, Claude Warren  wrote:
> >>>
> >>> The idea of a backing array is fine and the only problem I see with it
> is
> >>> in very large filters (on the order of 10^8 bits and larger) but
> document
> >>> the size calculation and let the developer worry about it.
> >>
> >> Let us look at the use case where we max out the array. Using the Bloom
> >> filter calculator:
> >>
> >> n = 

Re: [collections] Bloom filters

2020-03-01 Thread Alex Herbert


> On 1 Mar 2020, at 15:39, Claude Warren  wrote:
> 
> I think the CountingBloomFilter interface needs to extend BloomFilter.

I said that but did not write it, sorry.

> 
> I think I am confused.
> 
> I would expect CountingBloomFilter to have
> 
> interface CountingBloomFilter extends BloomFilter {
>// these 2 methods are the reverse of merge()
>void remove( BloomFilter );
>void remove( Hasher );

Fine. Same intention but different method names. But why void? It forces you to 
check if the remove was valid with a second call. On the plus side it matches 
the void merge(…) methods and in most cases a user would not care to check 
anyway. If they are controlling the filter then leave it to them to make sure 
they do not remove something they did not add.

> 
>// these 2 methods are the addition/subtraction of counts
>void add( CountingBloomFilter )
>void subtract( CountingBloomFilter );

Fine. But same comment as above with the void return.

> 
>// 2 methods to retrieve data
>Stream getCounts();

I don’t like this use of long[]. In my previous post I argued that if you were 
to ever want to store more than max integer items in a filter then the Bloom 
filter would have more bit indices than max integer. So we never have to 
support long counts. A filter that exceeds max integer for a count is highly 
likely to be saturated and no use as a filter anyway.

For most backing implementations the object type of the stream will be 
different so you will have to write a Spliterator implementation or wrap 
some iterator anyway. So why not return the Spliterator:

   Spliterator getCounts();

Since the backing implementation will likely not store int[] pairs then this 
will have a lot of object creation and garbage collection overhead to go 
through the counts. This does not seem to be a big concern here if the purpose 
is the same as for the BloomFilter for long[] getBits(), i.e. to get a 
canonical representation for storage.

Note: The Spliterator may not have a known size (number of non zero bits) at 
creation, for example if the counts are stored in a fixed size array. Thus 
efficient parallel traversal by binary tree splitting is limited by how evenly 
the counts are distributed. For a backing implementation using a collection 
then the size should be known. In this case a Spliterator would be of more use 
than a plain Iterator. You can convert one to the other using:

java.util.Spliterators:
public static Iterator iterator(Spliterator spliterator)
public static  Spliterator spliterator(Iterator iterator,
 long size,
 int characteristics)

So which to choose for the API?

The Hasher currently uses an Iterator:

PrimitiveIterator.OfInt getBits(Shape shape);

In the case of a StaticHasher this could return a spliterator. But the 
DynamicHasher needs a reworking of the internal Iterator class. It could be a 
Spliterator to use the new IntConsumer API but in most (all) cases splitting 
would not be possible for dynamic hashing as the parts are produced in order. 
It is likely that they will be consumed sequentially too.

I would suggest that Spliterator is the more modern implementation, despite not 
always being applicable to parallelisation in a stream. Currently the iterator 
from the Hasher is used in forEachRemaining() and while loop is approximately 
equal measure. The while loops are for a fast exit and would be uglier if 
rewritten for a Spliterator.tryAdvance(IntConsumer) syntax.

There is a use of the IteratorChain in HasherBloomFilter that would need a 
rethink for spliterators.

The path of least resistance is to use Iterator for the API of 
CountingBloomFilter to be consistent with Hasher’s use of Iterator.

WDYT?


>boolean isValid()

Fine. Allows some level of feedback that the counts are corrupt.

> }
> 
> Claude
> 
> 
> On Sun, Mar 1, 2020 at 2:48 PM Alex Herbert  >
> wrote:
> 
>> 
>> 
>>> On 1 Mar 2020, at 09:28, Claude Warren  wrote:
>>> 
>>> The idea of a backing array is fine and the only problem I see with it is
>>> in very large filters (on the order of 10^8 bits and larger) but document
>>> the size calculation and let the developer worry about it.
>> 
>> Let us look at the use case where we max out the array. Using the Bloom
>> filter calculator:
>> 
>> n = 149,363,281
>> p = 0.00125 (1 in 1000)
>> m = 2147483647 (256MiB)
>> k = 10
>> 
>> n = 74,681,641
>> p = 0.01 (1 in 50)
>> m = 2147483647 (256MiB)
>> k = 20
>> 
>> n = 49,787,761
>> p = 0.1 (1 in 24899)
>> m = 2147483647 (256MiB)
>> k = 30
>> 
>> So you will be able to put somewhere in the order of 10^8 or 10^7 items
>> into the filter. I would say that anyone putting more than that into the
>> filter has an unusual use case. The CountingBloomFilter can throw an
>> exception if m is too large and will throw an OutOfMemoryError if you
>> cannot 

Re: Java Modules Codec, Collections, IO and Lang

2020-03-01 Thread Romain Manni-Bucau
Until it exists .class were only in / so algorithm was often "scan
everything which is ending with a .class". mjar broke these algorithms :(.
That said, major bump does not solve the fact you will break people, or do
you plan to maintain all versions in parallel just for that?
jpms classifier really mitigates that since most users - all ;) - will not
care having a module-info and the few one wanting to do a demo of jpms can
just change their dependencies to replace them by the classified flavors in
the jpms image build module (only) and everyones will be happy reaching
their goals without having broken anyone.

Romain Manni-Bucau
@rmannibucau  |  Blog
 | Old Blog
 | Github  |
LinkedIn  | Book



Le dim. 1 mars 2020 à 18:39, John Patrick  a écrit :

> So spring and other scanners scan classes under /META-INF/versions/...
> i was unaware of that.
> So probably need a major version bump then, so it indicates a major
> change, so that consumers treat it as such and test it.
>
> On Sun, 1 Mar 2020 at 17:11, Romain Manni-Bucau 
> wrote:
> >
> > Le dim. 1 mars 2020 à 18:00, John Patrick  a
> écrit :
> >
> > > Romain, The 4 commons projects I'm referring to already have
> > > Automatic-Module-Name in the Manifest, if that is what you mean by
> > > "explicit module name in the manifest".
> > >
> > > I've been playing around with jdk 1.8, 9, 10, 11, 12, 13 and 14, for
> > > about 18 months and not seen the issue you described about needing
> > > module-info being in the jar root.
> > >
> > > Regarding jpms classifier, I don't think that will help transitive
> > > dependency, so A project depending upon collections which depends upon
> > > codec. So will collections depend upon codec or codec-jpms. From what
> > > I've got working I can add module-info without breaking pre java 11
> > > users.
> > >
> >
> > Test in j8 apps which didnt upgrade their scanner (spring, xbean,
> > reflections, asm and friends, all have version limited to java 8 and not
> > multirelease friendly), all are broken and it is one reason why adding
> > .class with bytecode version > java 8 is very risky while java 8 apps are
> > maintained (and we are still at ~50% of java 8 apps in prod today).
> > Best case it requires to upgrade these libs...so all the stack too which
> is
> > not always desired to upgrade a commons dep, worse case there is to
> upgrade
> > possible so dep is forked or dropped.
> > Indeed you dont see it in plain javase but commons being "commons", we
> must
> > consider this common case too IMHO.
> >
> >
> > > Peter, I'm trying to do as smaller change as possible to add
> > > module-info, and also only using core maven plugins
> > >
> > > Commons Developers, I'll start raising jira tickets per project, per
> step.
> > >
> > > John
> > >
> > >
> > >
> > > John
> > >
> > > On Sun, 1 Mar 2020 at 11:44, Peter Verhas  wrote:
> > > >
> > > > I used Jabel, and Maven profiles to generate jvm8 and also jvm11
> jars in
> > > > the Java::Geci project.
> > > >
> > > > I also wrote a document about what, why, and how I did it:
> > > >
> > > >
> > > > https://javax0.wordpress.com/2019/11/06/supporting-java-8/
> > > >
> > > > It may help.
> > > >
> > > > Peter
> > > >
> > > >
> > > > On Sun, 1 Mar 2020 at 12:38, Romain Manni-Bucau <
> rmannibu...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi John,
> > > > >
> > > > > Didnt check on java 11 ga but in early versions module-info was
> > > required to
> > > > > be at the root of the jar even if overriden for some version (
> > > > > http://openjdk.java.net/jeps/238) so it ends up in the fact that
> some
> > > java
> > > > > 8 apps will be broken by this addition and that only doing it in a
> mjar
> > > > > will lead to the same kind of issues + will be likely ignored for
> other
> > > > > apps.
> > > > >
> > > > > What about doing classified jar "jpms" at the same time than
> current
> > > > > default jar to propose jpms integration early (worse case)?
> > > > >
> > > > > Alternative to just have an explicit module name in the manifest
> but no
> > > > > module-info also sounds safer to me and still enables to use jpms.
> > > > >
> > > > > Hope it makes sense.
> > > > >
> > > > > Le dim. 1 mars 2020 à 12:08, John Patrick 
> a
> > > > > écrit :
> > > > >
> > > > > > I'm wanting to correctly use Collections and Lang as Java
> Modules.
> > > > > >
> > > > > > Looking at the dependencies, I'll need to tackle all four
> projects.
> > > > > >
> > > > > > I'm planning on doing the following as if I'm touching the
> projects I
> > > > > > might as well help out;
> > > > > > 1) spring clean .gitignore, each project has different list, some
> > > > > > documented others not
> > > > > > 2) update to hamcrest v2.2
> > > > > > 3) update to junit v5.6 

Re: Java Modules Codec, Collections, IO and Lang

2020-03-01 Thread John Patrick
So spring and other scanners scan classes under /META-INF/versions/...
i was unaware of that.
So probably need a major version bump then, so it indicates a major
change, so that consumers treat it as such and test it.

On Sun, 1 Mar 2020 at 17:11, Romain Manni-Bucau  wrote:
>
> Le dim. 1 mars 2020 à 18:00, John Patrick  a écrit :
>
> > Romain, The 4 commons projects I'm referring to already have
> > Automatic-Module-Name in the Manifest, if that is what you mean by
> > "explicit module name in the manifest".
> >
> > I've been playing around with jdk 1.8, 9, 10, 11, 12, 13 and 14, for
> > about 18 months and not seen the issue you described about needing
> > module-info being in the jar root.
> >
> > Regarding jpms classifier, I don't think that will help transitive
> > dependency, so A project depending upon collections which depends upon
> > codec. So will collections depend upon codec or codec-jpms. From what
> > I've got working I can add module-info without breaking pre java 11
> > users.
> >
>
> Test in j8 apps which didnt upgrade their scanner (spring, xbean,
> reflections, asm and friends, all have version limited to java 8 and not
> multirelease friendly), all are broken and it is one reason why adding
> .class with bytecode version > java 8 is very risky while java 8 apps are
> maintained (and we are still at ~50% of java 8 apps in prod today).
> Best case it requires to upgrade these libs...so all the stack too which is
> not always desired to upgrade a commons dep, worse case there is to upgrade
> possible so dep is forked or dropped.
> Indeed you dont see it in plain javase but commons being "commons", we must
> consider this common case too IMHO.
>
>
> > Peter, I'm trying to do as smaller change as possible to add
> > module-info, and also only using core maven plugins
> >
> > Commons Developers, I'll start raising jira tickets per project, per step.
> >
> > John
> >
> >
> >
> > John
> >
> > On Sun, 1 Mar 2020 at 11:44, Peter Verhas  wrote:
> > >
> > > I used Jabel, and Maven profiles to generate jvm8 and also jvm11 jars in
> > > the Java::Geci project.
> > >
> > > I also wrote a document about what, why, and how I did it:
> > >
> > >
> > > https://javax0.wordpress.com/2019/11/06/supporting-java-8/
> > >
> > > It may help.
> > >
> > > Peter
> > >
> > >
> > > On Sun, 1 Mar 2020 at 12:38, Romain Manni-Bucau 
> > > wrote:
> > >
> > > > Hi John,
> > > >
> > > > Didnt check on java 11 ga but in early versions module-info was
> > required to
> > > > be at the root of the jar even if overriden for some version (
> > > > http://openjdk.java.net/jeps/238) so it ends up in the fact that some
> > java
> > > > 8 apps will be broken by this addition and that only doing it in a mjar
> > > > will lead to the same kind of issues + will be likely ignored for other
> > > > apps.
> > > >
> > > > What about doing classified jar "jpms" at the same time than current
> > > > default jar to propose jpms integration early (worse case)?
> > > >
> > > > Alternative to just have an explicit module name in the manifest but no
> > > > module-info also sounds safer to me and still enables to use jpms.
> > > >
> > > > Hope it makes sense.
> > > >
> > > > Le dim. 1 mars 2020 à 12:08, John Patrick  a
> > > > écrit :
> > > >
> > > > > I'm wanting to correctly use Collections and Lang as Java Modules.
> > > > >
> > > > > Looking at the dependencies, I'll need to tackle all four projects.
> > > > >
> > > > > I'm planning on doing the following as if I'm touching the projects I
> > > > > might as well help out;
> > > > > 1) spring clean .gitignore, each project has different list, some
> > > > > documented others not
> > > > > 2) update to hamcrest v2.2
> > > > > 3) update to junit v5.6 (jupiter and vintage)
> > > > > 4) tackle low hanging fruit tests that can quickly be moved to
> > jupiter
> > > > > 5) update maven-dependencies to latest for compiler, jar, install
> > > > > 6) discuss how toolchain should be setup for build tools have access
> > > > > to jdk 1.8 and jdk 11, ideally the same pattern/approach for all
> > > > > commons projects
> > > > > 7) add src/main/java11/module-info.java
> > > > > 8) add toolchain plugin and update compiler plugin to make Multi
> > Release
> > > > > jar
> > > > >
> > > > > my questions are;
> > > > > a) how many jira tickets? one pre project? one per task/pr?
> > > > > b) prepare into new release i.e. commons-collection5
> > > > > c) should i do all my pom changes to a parent pom?
> > > > > d) toolchains developers, how should I document the requirement for
> > > > > developers want to build locally, give example which they need to
> > > > > complete
> > > > > e) toolchains cicd, how should i set that up? i'm use to jenkins and
> > > > > circle ci, not travis or gitflow?
> > > > >
> > > > > I'm doing this for other open source projects and work projects, but
> > > > > can't finish the open source or work projects until the open source
> > > > > dependency projects get fixed. Updating these 4 

Re: Java Modules Codec, Collections, IO and Lang

2020-03-01 Thread Romain Manni-Bucau
Le dim. 1 mars 2020 à 18:00, John Patrick  a écrit :

> Romain, The 4 commons projects I'm referring to already have
> Automatic-Module-Name in the Manifest, if that is what you mean by
> "explicit module name in the manifest".
>
> I've been playing around with jdk 1.8, 9, 10, 11, 12, 13 and 14, for
> about 18 months and not seen the issue you described about needing
> module-info being in the jar root.
>
> Regarding jpms classifier, I don't think that will help transitive
> dependency, so A project depending upon collections which depends upon
> codec. So will collections depend upon codec or codec-jpms. From what
> I've got working I can add module-info without breaking pre java 11
> users.
>

Test in j8 apps which didnt upgrade their scanner (spring, xbean,
reflections, asm and friends, all have version limited to java 8 and not
multirelease friendly), all are broken and it is one reason why adding
.class with bytecode version > java 8 is very risky while java 8 apps are
maintained (and we are still at ~50% of java 8 apps in prod today).
Best case it requires to upgrade these libs...so all the stack too which is
not always desired to upgrade a commons dep, worse case there is to upgrade
possible so dep is forked or dropped.
Indeed you dont see it in plain javase but commons being "commons", we must
consider this common case too IMHO.


> Peter, I'm trying to do as smaller change as possible to add
> module-info, and also only using core maven plugins
>
> Commons Developers, I'll start raising jira tickets per project, per step.
>
> John
>
>
>
> John
>
> On Sun, 1 Mar 2020 at 11:44, Peter Verhas  wrote:
> >
> > I used Jabel, and Maven profiles to generate jvm8 and also jvm11 jars in
> > the Java::Geci project.
> >
> > I also wrote a document about what, why, and how I did it:
> >
> >
> > https://javax0.wordpress.com/2019/11/06/supporting-java-8/
> >
> > It may help.
> >
> > Peter
> >
> >
> > On Sun, 1 Mar 2020 at 12:38, Romain Manni-Bucau 
> > wrote:
> >
> > > Hi John,
> > >
> > > Didnt check on java 11 ga but in early versions module-info was
> required to
> > > be at the root of the jar even if overriden for some version (
> > > http://openjdk.java.net/jeps/238) so it ends up in the fact that some
> java
> > > 8 apps will be broken by this addition and that only doing it in a mjar
> > > will lead to the same kind of issues + will be likely ignored for other
> > > apps.
> > >
> > > What about doing classified jar "jpms" at the same time than current
> > > default jar to propose jpms integration early (worse case)?
> > >
> > > Alternative to just have an explicit module name in the manifest but no
> > > module-info also sounds safer to me and still enables to use jpms.
> > >
> > > Hope it makes sense.
> > >
> > > Le dim. 1 mars 2020 à 12:08, John Patrick  a
> > > écrit :
> > >
> > > > I'm wanting to correctly use Collections and Lang as Java Modules.
> > > >
> > > > Looking at the dependencies, I'll need to tackle all four projects.
> > > >
> > > > I'm planning on doing the following as if I'm touching the projects I
> > > > might as well help out;
> > > > 1) spring clean .gitignore, each project has different list, some
> > > > documented others not
> > > > 2) update to hamcrest v2.2
> > > > 3) update to junit v5.6 (jupiter and vintage)
> > > > 4) tackle low hanging fruit tests that can quickly be moved to
> jupiter
> > > > 5) update maven-dependencies to latest for compiler, jar, install
> > > > 6) discuss how toolchain should be setup for build tools have access
> > > > to jdk 1.8 and jdk 11, ideally the same pattern/approach for all
> > > > commons projects
> > > > 7) add src/main/java11/module-info.java
> > > > 8) add toolchain plugin and update compiler plugin to make Multi
> Release
> > > > jar
> > > >
> > > > my questions are;
> > > > a) how many jira tickets? one pre project? one per task/pr?
> > > > b) prepare into new release i.e. commons-collection5
> > > > c) should i do all my pom changes to a parent pom?
> > > > d) toolchains developers, how should I document the requirement for
> > > > developers want to build locally, give example which they need to
> > > > complete
> > > > e) toolchains cicd, how should i set that up? i'm use to jenkins and
> > > > circle ci, not travis or gitflow?
> > > >
> > > > I'm doing this for other open source projects and work projects, but
> > > > can't finish the open source or work projects until the open source
> > > > dependency projects get fixed. Updating these 4 commons project will
> > > > unblock about 10 downstream pr's i'm working on.
> > > >
> > > > I don't want to break backwards compatibility so don't want to just
> > > > update to jdk 11, that is why I'm suggesting to use multi release
> > > > jars, so module-info.java is in META-INF/versions/11/module-info.jar.
> > > > I don't want to add new features, I just want the module-info so if
> > > > you are using a newer jdk you can use modules.
> > > >
> > > > John
> > > >
> > > > 

Re: Java Modules Codec, Collections, IO and Lang

2020-03-01 Thread John Patrick
Romain, The 4 commons projects I'm referring to already have
Automatic-Module-Name in the Manifest, if that is what you mean by
"explicit module name in the manifest".

I've been playing around with jdk 1.8, 9, 10, 11, 12, 13 and 14, for
about 18 months and not seen the issue you described about needing
module-info being in the jar root.

Regarding jpms classifier, I don't think that will help transitive
dependency, so A project depending upon collections which depends upon
codec. So will collections depend upon codec or codec-jpms. From what
I've got working I can add module-info without breaking pre java 11
users.

Peter, I'm trying to do as smaller change as possible to add
module-info, and also only using core maven plugins

Commons Developers, I'll start raising jira tickets per project, per step.

John



John

On Sun, 1 Mar 2020 at 11:44, Peter Verhas  wrote:
>
> I used Jabel, and Maven profiles to generate jvm8 and also jvm11 jars in
> the Java::Geci project.
>
> I also wrote a document about what, why, and how I did it:
>
>
> https://javax0.wordpress.com/2019/11/06/supporting-java-8/
>
> It may help.
>
> Peter
>
>
> On Sun, 1 Mar 2020 at 12:38, Romain Manni-Bucau 
> wrote:
>
> > Hi John,
> >
> > Didnt check on java 11 ga but in early versions module-info was required to
> > be at the root of the jar even if overriden for some version (
> > http://openjdk.java.net/jeps/238) so it ends up in the fact that some java
> > 8 apps will be broken by this addition and that only doing it in a mjar
> > will lead to the same kind of issues + will be likely ignored for other
> > apps.
> >
> > What about doing classified jar "jpms" at the same time than current
> > default jar to propose jpms integration early (worse case)?
> >
> > Alternative to just have an explicit module name in the manifest but no
> > module-info also sounds safer to me and still enables to use jpms.
> >
> > Hope it makes sense.
> >
> > Le dim. 1 mars 2020 à 12:08, John Patrick  a
> > écrit :
> >
> > > I'm wanting to correctly use Collections and Lang as Java Modules.
> > >
> > > Looking at the dependencies, I'll need to tackle all four projects.
> > >
> > > I'm planning on doing the following as if I'm touching the projects I
> > > might as well help out;
> > > 1) spring clean .gitignore, each project has different list, some
> > > documented others not
> > > 2) update to hamcrest v2.2
> > > 3) update to junit v5.6 (jupiter and vintage)
> > > 4) tackle low hanging fruit tests that can quickly be moved to jupiter
> > > 5) update maven-dependencies to latest for compiler, jar, install
> > > 6) discuss how toolchain should be setup for build tools have access
> > > to jdk 1.8 and jdk 11, ideally the same pattern/approach for all
> > > commons projects
> > > 7) add src/main/java11/module-info.java
> > > 8) add toolchain plugin and update compiler plugin to make Multi Release
> > > jar
> > >
> > > my questions are;
> > > a) how many jira tickets? one pre project? one per task/pr?
> > > b) prepare into new release i.e. commons-collection5
> > > c) should i do all my pom changes to a parent pom?
> > > d) toolchains developers, how should I document the requirement for
> > > developers want to build locally, give example which they need to
> > > complete
> > > e) toolchains cicd, how should i set that up? i'm use to jenkins and
> > > circle ci, not travis or gitflow?
> > >
> > > I'm doing this for other open source projects and work projects, but
> > > can't finish the open source or work projects until the open source
> > > dependency projects get fixed. Updating these 4 commons project will
> > > unblock about 10 downstream pr's i'm working on.
> > >
> > > I don't want to break backwards compatibility so don't want to just
> > > update to jdk 11, that is why I'm suggesting to use multi release
> > > jars, so module-info.java is in META-INF/versions/11/module-info.jar.
> > > I don't want to add new features, I just want the module-info so if
> > > you are using a newer jdk you can use modules.
> > >
> > > John
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > > For additional commands, e-mail: dev-h...@commons.apache.org
> > >
> > >
> >
> --
> Peter Verhas
> pe...@verhas.com
> t: +41791542095
> skype: verhas

-
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org



Re: [collections] Bloom filters

2020-03-01 Thread Claude Warren
I think the CountingBloomFilter interface needs to extend BloomFilter.

I think I am confused.

I would expect CountingBloomFilter to have

interface CountingBloomFilter extends BloomFilter {
// these 2 methods are the reverse of merge()
void remove( BloomFilter );
void remove( Hasher );

// these 2 methods are the addition/subtraction of counts
void add( CountingBloomFilter )
void subtract( CountingBloomFilter );

// 2 methods to retrieve data
Stream getCounts();
boolean isValid()
}

Claude


On Sun, Mar 1, 2020 at 2:48 PM Alex Herbert 
wrote:

>
>
> > On 1 Mar 2020, at 09:28, Claude Warren  wrote:
> >
> > The idea of a backing array is fine and the only problem I see with it is
> > in very large filters (on the order of 10^8 bits and larger) but document
> > the size calculation and let the developer worry about it.
>
> Let us look at the use case where we max out the array. Using the Bloom
> filter calculator:
>
> n = 149,363,281
> p = 0.00125 (1 in 1000)
> m = 2147483647 (256MiB)
> k = 10
>
> n = 74,681,641
> p = 0.01 (1 in 50)
> m = 2147483647 (256MiB)
> k = 20
>
> n = 49,787,761
> p = 0.1 (1 in 24899)
> m = 2147483647 (256MiB)
> k = 30
>
> So you will be able to put somewhere in the order of 10^8 or 10^7 items
> into the filter. I would say that anyone putting more than that into the
> filter has an unusual use case. The CountingBloomFilter can throw an
> exception if m is too large and will throw an OutOfMemoryError if you
> cannot allocate an array large enough.
>
> One clear point here is that you cannot use a short as a 16-bit count
> would easily overflow. So you have to use an integer array for the counts.
>
> A maximum length int[] is roughly 8GB.
>
> What would another implementation cost in terms of memory? The
> TreeMap was the most space efficient. In the previous
> e-mail the saturation of a Bloom filter bits was approximately 50% when at
> the intended capacity. So we have to estimate the size of a TreeMap
> containing Integer.MAX_VALUE/2 indices ~ 2^30. The memory test shows the
> TreeMap memory scales linearly with entries:
>
>  32768 /  65536 (0.500) : TreeMap  =  1834061 bytes
>  65536 / 131072 (0.500) : TreeMap  =  3669080 bytes
> 131072 / 262144 (0.500) : TreeMap  =  7339090 bytes
>
> So what is the memory for a TreeMap with 2^30 indices. I make it about:
>
> (2^30 / 131,072) * 7,339,090 bytes ~ 6e10 bytes = 55.99 GB
>
> I would say that this amount of RAM is unusual. It is definitely not as
> efficient as using an array. So very large counting Bloom filters are going
> to require some thought as to the hardware they run on. This may not be the
> case in 10 years time.
>
> I would say that we try an int[] backing array for the storage
> implementation and document it’s limitations. A different implementation
> could be provided in future if required.
>
> This could be done by making CountingBloomFilter an interface that extends
> BloomFilter with the methods:
>
> subtract(BloomFilter filter)
> subtract(Hasher filter)
>
> These will negate the effect of the corresponding merge(BloomFilter)
> operation.
>
> Do we also need access to the counts and add/subtract of another
> CountingBloomFilter?:
>
> add(CountingBloomFilter filter);
> subtract(CountingBloomFilter filter);
>
> Iterator getCounts();
> int getSize(); // Number of items added
>
> The CountingBloomFilter is then an interface that defines how to reverse
> the merge of some bits into the filter.
>
> My concern is the inefficiency of the creation of objects in any method
> that provides access to the counts (e.g. above using an iterator as for
> Hasher.getBits()). I presume this method would be to allow some type of
> storage/serialisation of the filter akin to the long[] getBits() method of
> BloomFilter. So it may be better done using a method:
>
> int getCount(int index);
>
> The caller can then use long[] getBits() to get the indices set in the
> filter and then for each non-zero bit index call getCount(index). Or just
> not include the method as the counts are only of concern when storing the
> filter. This functionality is cleaner pushed into an implementation.
>
> In a previous post we discussed whether to throw an exception on
> overflow/underflow or raise in an invalid flag. Using the invalid flag idea
> the interface would be:
>
> interface CountingBloomFilter {
> int add(CountingBloomFilter filter);
> int subtract(BloomFilter filter);
> int subtract(Hasher filter);
> int subtract(CountingBloomFilter filter);
> int getStatus();
>
> // Maybe
> int getSize();
> int getCount(int index);
> }
>
> The status would be negative if any operation overflowed/underflowed, or
> zero if OK. The current status is returned by the add/subtract methods.
>
> However I note that overflow may not be a concern. The number of items to
> add to a filter to create overflow would be using a filter with a number of
> bits that is 

Re: [collections] Bloom filters

2020-03-01 Thread Alex Herbert


> On 1 Mar 2020, at 09:28, Claude Warren  wrote:
> 
> The idea of a backing array is fine and the only problem I see with it is
> in very large filters (on the order of 10^8 bits and larger) but document
> the size calculation and let the developer worry about it.

Let us look at the use case where we max out the array. Using the Bloom filter 
calculator:

n = 149,363,281
p = 0.00125 (1 in 1000)
m = 2147483647 (256MiB)
k = 10

n = 74,681,641
p = 0.01 (1 in 50)
m = 2147483647 (256MiB)
k = 20

n = 49,787,761
p = 0.1 (1 in 24899)
m = 2147483647 (256MiB)
k = 30

So you will be able to put somewhere in the order of 10^8 or 10^7 items into 
the filter. I would say that anyone putting more than that into the filter has 
an unusual use case. The CountingBloomFilter can throw an exception if m is too 
large and will throw an OutOfMemoryError if you cannot allocate an array large 
enough. 

One clear point here is that you cannot use a short as a 16-bit count would 
easily overflow. So you have to use an integer array for the counts.

A maximum length int[] is roughly 8GB.

What would another implementation cost in terms of memory? The TreeMap was the most space efficient. In the previous e-mail the saturation of 
a Bloom filter bits was approximately 50% when at the intended capacity. So we 
have to estimate the size of a TreeMap containing Integer.MAX_VALUE/2 indices ~ 
2^30. The memory test shows the TreeMap memory scales linearly with entries:

 32768 /  65536 (0.500) : TreeMap  =  1834061 bytes
 65536 / 131072 (0.500) : TreeMap  =  3669080 bytes
131072 / 262144 (0.500) : TreeMap  =  7339090 bytes

So what is the memory for a TreeMap with 2^30 indices. I make it about:

(2^30 / 131,072) * 7,339,090 bytes ~ 6e10 bytes = 55.99 GB

I would say that this amount of RAM is unusual. It is definitely not as 
efficient as using an array. So very large counting Bloom filters are going to 
require some thought as to the hardware they run on. This may not be the case 
in 10 years time.

I would say that we try an int[] backing array for the storage implementation 
and document it’s limitations. A different implementation could be provided in 
future if required.

This could be done by making CountingBloomFilter an interface that extends 
BloomFilter with the methods:

subtract(BloomFilter filter)
subtract(Hasher filter)

These will negate the effect of the corresponding merge(BloomFilter) operation.

Do we also need access to the counts and add/subtract of another 
CountingBloomFilter?:

add(CountingBloomFilter filter);
subtract(CountingBloomFilter filter);

Iterator getCounts();
int getSize(); // Number of items added

The CountingBloomFilter is then an interface that defines how to reverse the 
merge of some bits into the filter.

My concern is the inefficiency of the creation of objects in any method that 
provides access to the counts (e.g. above using an iterator as for 
Hasher.getBits()). I presume this method would be to allow some type of 
storage/serialisation of the filter akin to the long[] getBits() method of 
BloomFilter. So it may be better done using a method:

int getCount(int index);

The caller can then use long[] getBits() to get the indices set in the filter 
and then for each non-zero bit index call getCount(index). Or just not include 
the method as the counts are only of concern when storing the filter. This 
functionality is cleaner pushed into an implementation.

In a previous post we discussed whether to throw an exception on 
overflow/underflow or raise in an invalid flag. Using the invalid flag idea the 
interface would be:

interface CountingBloomFilter {
int add(CountingBloomFilter filter);
int subtract(BloomFilter filter);
int subtract(Hasher filter);
int subtract(CountingBloomFilter filter);
int getStatus();

// Maybe
int getSize();
int getCount(int index);
}

The status would be negative if any operation overflowed/underflowed, or zero 
if OK. The current status is returned by the add/subtract methods.

However I note that overflow may not be a concern. The number of items to add 
to a filter to create overflow would be using a filter with a number of bits 
that is unrealistic to store in memory:

n = 2147483647
p = 0.00125 (1 in 1000)
m = 30875634182 (3.59GiB)
k = 10

If you want to add 2 billion items (and overflow an integer count) then your 
filter would be so big it would break the rest of the API that uses a 32-bit 
int for the bit index.

Thus only underflow is a realistic concern. This could be documented as handled 
in an implementation specific manner (i.e. throw or ignore). The API is then 
simplified to:

interface CountingBloomFilter {
boolean add(CountingBloomFilter filter);
boolean subtract(BloomFilter filter);
boolean subtract(Hasher filter);
boolean subtract(CountingBloomFilter filter);
int getStatus();

// Maybe
int getSize();
int getCount(int index);
}

The boolean is 

Re: Java Modules Codec, Collections, IO and Lang

2020-03-01 Thread Peter Verhas
I used Jabel, and Maven profiles to generate jvm8 and also jvm11 jars in
the Java::Geci project.

I also wrote a document about what, why, and how I did it:


https://javax0.wordpress.com/2019/11/06/supporting-java-8/

It may help.

Peter


On Sun, 1 Mar 2020 at 12:38, Romain Manni-Bucau 
wrote:

> Hi John,
>
> Didnt check on java 11 ga but in early versions module-info was required to
> be at the root of the jar even if overriden for some version (
> http://openjdk.java.net/jeps/238) so it ends up in the fact that some java
> 8 apps will be broken by this addition and that only doing it in a mjar
> will lead to the same kind of issues + will be likely ignored for other
> apps.
>
> What about doing classified jar "jpms" at the same time than current
> default jar to propose jpms integration early (worse case)?
>
> Alternative to just have an explicit module name in the manifest but no
> module-info also sounds safer to me and still enables to use jpms.
>
> Hope it makes sense.
>
> Le dim. 1 mars 2020 à 12:08, John Patrick  a
> écrit :
>
> > I'm wanting to correctly use Collections and Lang as Java Modules.
> >
> > Looking at the dependencies, I'll need to tackle all four projects.
> >
> > I'm planning on doing the following as if I'm touching the projects I
> > might as well help out;
> > 1) spring clean .gitignore, each project has different list, some
> > documented others not
> > 2) update to hamcrest v2.2
> > 3) update to junit v5.6 (jupiter and vintage)
> > 4) tackle low hanging fruit tests that can quickly be moved to jupiter
> > 5) update maven-dependencies to latest for compiler, jar, install
> > 6) discuss how toolchain should be setup for build tools have access
> > to jdk 1.8 and jdk 11, ideally the same pattern/approach for all
> > commons projects
> > 7) add src/main/java11/module-info.java
> > 8) add toolchain plugin and update compiler plugin to make Multi Release
> > jar
> >
> > my questions are;
> > a) how many jira tickets? one pre project? one per task/pr?
> > b) prepare into new release i.e. commons-collection5
> > c) should i do all my pom changes to a parent pom?
> > d) toolchains developers, how should I document the requirement for
> > developers want to build locally, give example which they need to
> > complete
> > e) toolchains cicd, how should i set that up? i'm use to jenkins and
> > circle ci, not travis or gitflow?
> >
> > I'm doing this for other open source projects and work projects, but
> > can't finish the open source or work projects until the open source
> > dependency projects get fixed. Updating these 4 commons project will
> > unblock about 10 downstream pr's i'm working on.
> >
> > I don't want to break backwards compatibility so don't want to just
> > update to jdk 11, that is why I'm suggesting to use multi release
> > jars, so module-info.java is in META-INF/versions/11/module-info.jar.
> > I don't want to add new features, I just want the module-info so if
> > you are using a newer jdk you can use modules.
> >
> > John
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > For additional commands, e-mail: dev-h...@commons.apache.org
> >
> >
>
-- 
Peter Verhas
pe...@verhas.com
t: +41791542095
skype: verhas


Re: Java Modules Codec, Collections, IO and Lang

2020-03-01 Thread Romain Manni-Bucau
Hi John,

Didnt check on java 11 ga but in early versions module-info was required to
be at the root of the jar even if overriden for some version (
http://openjdk.java.net/jeps/238) so it ends up in the fact that some java
8 apps will be broken by this addition and that only doing it in a mjar
will lead to the same kind of issues + will be likely ignored for other
apps.

What about doing classified jar "jpms" at the same time than current
default jar to propose jpms integration early (worse case)?

Alternative to just have an explicit module name in the manifest but no
module-info also sounds safer to me and still enables to use jpms.

Hope it makes sense.

Le dim. 1 mars 2020 à 12:08, John Patrick  a écrit :

> I'm wanting to correctly use Collections and Lang as Java Modules.
>
> Looking at the dependencies, I'll need to tackle all four projects.
>
> I'm planning on doing the following as if I'm touching the projects I
> might as well help out;
> 1) spring clean .gitignore, each project has different list, some
> documented others not
> 2) update to hamcrest v2.2
> 3) update to junit v5.6 (jupiter and vintage)
> 4) tackle low hanging fruit tests that can quickly be moved to jupiter
> 5) update maven-dependencies to latest for compiler, jar, install
> 6) discuss how toolchain should be setup for build tools have access
> to jdk 1.8 and jdk 11, ideally the same pattern/approach for all
> commons projects
> 7) add src/main/java11/module-info.java
> 8) add toolchain plugin and update compiler plugin to make Multi Release
> jar
>
> my questions are;
> a) how many jira tickets? one pre project? one per task/pr?
> b) prepare into new release i.e. commons-collection5
> c) should i do all my pom changes to a parent pom?
> d) toolchains developers, how should I document the requirement for
> developers want to build locally, give example which they need to
> complete
> e) toolchains cicd, how should i set that up? i'm use to jenkins and
> circle ci, not travis or gitflow?
>
> I'm doing this for other open source projects and work projects, but
> can't finish the open source or work projects until the open source
> dependency projects get fixed. Updating these 4 commons project will
> unblock about 10 downstream pr's i'm working on.
>
> I don't want to break backwards compatibility so don't want to just
> update to jdk 11, that is why I'm suggesting to use multi release
> jars, so module-info.java is in META-INF/versions/11/module-info.jar.
> I don't want to add new features, I just want the module-info so if
> you are using a newer jdk you can use modules.
>
> John
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>


Java Modules Codec, Collections, IO and Lang

2020-03-01 Thread John Patrick
I'm wanting to correctly use Collections and Lang as Java Modules.

Looking at the dependencies, I'll need to tackle all four projects.

I'm planning on doing the following as if I'm touching the projects I
might as well help out;
1) spring clean .gitignore, each project has different list, some
documented others not
2) update to hamcrest v2.2
3) update to junit v5.6 (jupiter and vintage)
4) tackle low hanging fruit tests that can quickly be moved to jupiter
5) update maven-dependencies to latest for compiler, jar, install
6) discuss how toolchain should be setup for build tools have access
to jdk 1.8 and jdk 11, ideally the same pattern/approach for all
commons projects
7) add src/main/java11/module-info.java
8) add toolchain plugin and update compiler plugin to make Multi Release jar

my questions are;
a) how many jira tickets? one pre project? one per task/pr?
b) prepare into new release i.e. commons-collection5
c) should i do all my pom changes to a parent pom?
d) toolchains developers, how should I document the requirement for
developers want to build locally, give example which they need to
complete
e) toolchains cicd, how should i set that up? i'm use to jenkins and
circle ci, not travis or gitflow?

I'm doing this for other open source projects and work projects, but
can't finish the open source or work projects until the open source
dependency projects get fixed. Updating these 4 commons project will
unblock about 10 downstream pr's i'm working on.

I don't want to break backwards compatibility so don't want to just
update to jdk 11, that is why I'm suggesting to use multi release
jars, so module-info.java is in META-INF/versions/11/module-info.jar.
I don't want to add new features, I just want the module-info so if
you are using a newer jdk you can use modules.

John

-
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org



Re: [collections] Bloom filters

2020-03-01 Thread Claude Warren
The idea of a backing array is fine and the only problem I see with it is
in very large filters (on the order of 10^8 bits and larger) but document
the size calculation and let the developer worry about it.

As for the merge question.  merge is a standard bloom filter operation.  It
is well defined in the literature.  Merging a bloom filter into a counting
bloom filter means incrementing the bit counts.  I  think that merge/remove
should continue to operate as though the parameter were a standard bloom
filter.

We had spoken of implementing and adding/deleting method pair that would
operate on CountingBloomFilters and would add/subtract the counts.  (e.g.
add(CountingBloomFilter) and subtract(CountingBloomFilter))

I disagree with your proposal for the merge(Hasher) implementation, and I
am not certain that an add(Hasher) makes sense.  First consider that the
Hasher returns the bits that are to be enabled in the Bloom filter so
collisions are expected.  In the normal course of events a Hasher is used
to create a normal Bloom filter where all the duplicates are removed.  That
filter is then merged into a CountingBloomFilter.  So in some sense the
Hasher and the normal Bloom filter are the same.  So I would expect the
merge of the Hasher and the merge of the normal Bloom filter created from
that hasher into a CountingBloomFilter to yield the same result.  If you
wanted to add an add(Hasher)/delete(Hasher) pair of functions to a
CountingBloomFilter you could implement with duplicate counting, but I am
not certain of the validity of such a count and I fear that it muddies the
waters with respect to what the CountingBloomFilter is counting.

Claude

On Sat, Feb 29, 2020 at 2:13 PM Alex Herbert 
wrote:

>
>
> > On 29 Feb 2020, at 07:46, Claude Warren  wrote:
> >
> > Alex would you take a look at pull request 131 [1].  it adds a new hasher
> > implementation and makes the HashFunctionValidator available for public
> use.
> >
> > https://github.com/apache/commons-collections/pull/131 <
> https://github.com/apache/commons-collections/pull/131>
>
> OK. I’ll take a look.
>
> I’ve been thinking about the counting Bloom filter and the backing
> storage. In summary:
>
> 1. The backing storage should be a fixed array.
> 2. Merging a Hasher should count duplicate indices, not flatten them all
> to a single count.
>
> For background I’ve used the standard formulas to estimate the number of
> indices that will be non-zero in a Bloom filter. The wikipedia page gives
> this formula for the expected number of bits set to 0 (E(q)) if you have
> inserted i elements into a filter of size m using k hash functions:
>
> E(q) = (1 - 1/m)^ki"
>
> So a rough guess of the number of indices (bits) used by a filter is
> 1-E(q).
>
> Here is a table of Bloom filters with different collision probabilities
> and the proportion of bits that will be set when 1%, 10%, 100% of the
> capacity of the filter has been met:
>
> n   p   m   k   I   E(q)bits
> 10001E-04   19171   13  10  0.9932  0.0068
> 10001E-04   19171   13  100 0.9344  0.0656
> 10001E-04   19171   13  10000.5076  0.4924
> 10001E-05   23963   17  10  0.9929  0.0071
> 10001E-05   23963   17  100 0.9315  0.0685
> 10001E-05   23963   17  10000.4919  0.5081
> 10001E-06   28756   20  10  0.9931  0.0069
> 10001E-06   28756   20  100 0.9328  0.0672
> 10001E-06   28756   20  10000.4988  0.5012
> 1   1E-06   287552  20  100 0.9931  0.0069
> 1   1E-06   287552  20  10000.9328  0.0672
> 1   1E-06   287552  20  1   0.4988  0.5012
>
> The point is that if you create a Bloom filter and fill it to 10% of the
> intended capacity the number of indices used will be about 6-7% of the
> filter bits.
>
> So how to store the counts? Currently the counting bloom filter uses a
> TreeMap. I tried:
>
> TreeMap
> HashMap
> TreeSet
> HashSet
> int[]
>
> The MutableCount is a custom object that stores the bit index and uses it
> for a hash code and then has a mutable integer count field. It allows the
> count to be incremented/decremented if the object is in the set:
>
> static final class MutableCount implements Comparable {
> final int key;
> int count;
> // etc
> }
>
> This is adapted from the Bag collection which stores an item count with
> a MutableInteger. Here the mutable count is part of the same object T
> inserted into the Set. So you can find the object, change the count and not
> have to put it back into the set. This is more efficient than changing the
> Integer stored in a Map.
>
> I’ve estimated memory usage using an idea based on this article from
> JavaWorld: Java Tip 130: Do you know your data size? [1].
>
> Basically you:
>
> - create an object and throw it away. All classes are then initialised.
> - Then you free memory (run garbage collection) and get the current memory
> usage
> - Then create