Hello everyone,

I experimented a bit with the Maven index extraction process and got some pretty good results (I think).

There might be a way to filter the index during extraction without noteworthy overhead, which allows the following:

 - "sliding window" time filters, e.g. drop all documents older than 2 years (aka: who uses old libraries?)

 - we can drop fields we don't need from the index. This is especially interesting for fields which don't compress well (looking at you, sha1 hash)
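
To make the idea concrete, here is a minimal sketch of such a filter in plain Java. It is not the code from the upstream PR and does not use the maven-indexer or Lucene APIs: records are modelled as simple maps, and the names IndexFilterSketch, filter, "lastModified", "sha1" and "id" are hypothetical, picked only for illustration. It also assumes that records without a timestamp are kept.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

class IndexFilterSketch {

    // Hypothetical field names, not the real maven-indexer field keys.
    static final String LAST_MODIFIED = "lastModified"; // epoch millis as a string
    static final String SHA1 = "sha1";

    /**
     * Applies both filters to one record during extraction.
     * Returns empty if the record is older than the cutoff (sliding window),
     * otherwise a copy of the record without the dropped fields.
     * Records without a timestamp are kept (an assumption of this sketch).
     */
    static Optional<Map<String, String>> filter(Map<String, String> record,
                                                Instant cutoff,
                                                Set<String> droppedFields) {
        String ts = record.get(LAST_MODIFIED);
        if (ts != null && Instant.ofEpochMilli(Long.parseLong(ts)).isBefore(cutoff)) {
            return Optional.empty(); // too old, never written to the index
        }
        Map<String, String> slim = new LinkedHashMap<>(record);
        slim.keySet().removeAll(droppedFields); // drop fields we don't need
        return Optional.of(slim);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Instant cutoff = now.minus(Duration.ofDays(2 * 365)); // roughly 2 years

        Map<String, String> oldRecord = new LinkedHashMap<>();
        oldRecord.put("id", "g:old:1.0");
        oldRecord.put(LAST_MODIFIED,
                String.valueOf(now.minus(Duration.ofDays(3 * 365)).toEpochMilli()));
        oldRecord.put(SHA1, "0123abcd");

        Map<String, String> recentRecord = new LinkedHashMap<>();
        recentRecord.put("id", "g:recent:2.0");
        recentRecord.put(LAST_MODIFIED,
                String.valueOf(now.minus(Duration.ofDays(30)).toEpochMilli()));
        recentRecord.put(SHA1, "4567ef01");

        System.out.println(filter(oldRecord, cutoff, Set.of(SHA1)));
        System.out.println(filter(recentRecord, cutoff, Set.of(SHA1)));
    }
}
```

The point of doing both steps in one pass is that a filtered-out record or field is simply never written, so nothing has to be removed from the index afterwards.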

some results for the time cutoff filter:

full: 5.6 GB
2y: 2.6 GB
1y: 1.4 GB

now if we throw away some fields we likely don't need we get this:

full: 2.8 GB
2y: 1.4 GB
1y: 0.8 GB

(This would obviously be configurable in the options. Someone who doesn't care about storage, like myself, would set it to the full index.)
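
To put the two size tables in proportion, the following small sketch expresses each measured size as a percentage of the unfiltered 5.6 GB index. The sizes are the ones listed above; the class and method names (IndexSizeSketch, percentOfFull) are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class IndexSizeSketch {

    /** Size of a filtered index as a percentage of the unfiltered one. */
    static double percentOfFull(double sizeGb, double fullGb) {
        return 100.0 * sizeGb / fullGb;
    }

    public static void main(String[] args) {
        double full = 5.6; // GB, unfiltered index with all fields

        // Measured sizes in GB, taken from the two tables above.
        Map<String, Double> sizes = new LinkedHashMap<>();
        sizes.put("full, all fields", 5.6);
        sizes.put("2y, all fields", 2.6);
        sizes.put("1y, all fields", 1.4);
        sizes.put("full, fewer fields", 2.8);
        sizes.put("2y, fewer fields", 1.4);
        sizes.put("1y, fewer fields", 0.8);

        for (Map.Entry<String, Double> e : sizes.entrySet()) {
            System.out.printf("%-20s %4.1f GB  %5.1f%% of full%n",
                    e.getKey(), e.getValue(), percentOfFull(e.getValue(), full));
        }
    }
}
```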

Lucene's storage uses immutable files, which means a remove operation at the wrong stage would have no effect on the size (it would only set a bit). This makes the extraction step the best place for filtering, since that is where the index is built. I am not really a Lucene expert, so I wouldn't rule out that there are more ways to shrink the index.

Some other features of maven-indexer 7+ we would get for free:

 - multi-threaded extraction (the filter is going to be hooked into this and is multi-threaded too, assuming it is accepted upstream).

 - Lucene 9.6 uses Panama on JDK 19+ for memory-mapped storage, which also makes it a bit faster (and apparently safer, according to the PR). From what I have read, the devs are already excited about the vector API :)

This brings the extraction time of the *full* central index down to about 6 minutes on my (aging) machine. The weekly delta updates after that are much faster.

This all depends, of course, on whether the changes are accepted upstream (and also on the JDK 8 problem, but we have other threads for that).


index related and already in master for NB 18:

https://github.com/apache/netbeans/pull/5655

https://github.com/apache/netbeans/pull/5646

blocked:

https://github.com/apache/netbeans/pull/4999

upstream in maven-indexer:

https://github.com/apache/maven-indexer/pull/302

another experiment:

https://github.com/apache/netbeans/pull/4971


best regards,

michael

