Hello everyone,
I experimented a bit with the maven index extraction process and got
some pretty good results (I think).
There might be a way to filter the index during extraction without
noteworthy overhead, which would allow the following:
- "sliding window" time filters, e.g. drop all documents older than 2
years (aka: who uses old libraries?)
- dropping fields we don't need from the index. This is especially
interesting for fields which don't compress well (looking at you, sha1
hash)
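
To make the two filter ideas concrete, here is a minimal sketch in plain
Java. It does not use the maven-indexer or Lucene APIs; the class name,
the field names ("lastModified", "sha1") and the map-based record are
all made up for this example. Keeping records that have no timestamp is
also just an assumption of the sketch.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class IndexFilterSketch {

    // Hypothetical field names, chosen for this example only.
    static final String LAST_MODIFIED = "lastModified"; // epoch millis, stored as a string
    static final String SHA1 = "sha1";

    /**
     * Returns the record without the dropped fields, or empty if the record
     * is older than the cutoff. Records without a timestamp are kept.
     */
    static Optional<Map<String, String>> filter(
            Map<String, String> doc, Instant cutoff, Set<String> droppedFields) {
        String ts = doc.get(LAST_MODIFIED);
        if (ts != null && Instant.ofEpochMilli(Long.parseLong(ts)).isBefore(cutoff)) {
            return Optional.empty(); // outside the sliding window, never written
        }
        Map<String, String> kept = new LinkedHashMap<>(doc);
        kept.keySet().removeAll(droppedFields); // drop fields before they reach the index
        return Optional.of(kept);
    }

    public static void main(String[] args) {
        // "2 years" approximated as 730 days for the sketch.
        Instant cutoff = Instant.now().minus(Duration.ofDays(730));

        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "org.example:lib:1.0");
        doc.put(LAST_MODIFIED, String.valueOf(Instant.now().toEpochMilli()));
        doc.put(SHA1, "da39a3ee5e6b4b0d3255bfef95601890afd80709");

        System.out.println(filter(doc, cutoff, Set.of(SHA1)));
    }
}
```

The point is only that both decisions are made per record, before
anything is written, so nothing has to be removed from the index later.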
Some results (index size) for the time cutoff filter:
full: 5.6 GB
2y: 2.6 GB
1y: 1.4 GB
Now, if we also throw away some fields we likely don't need, we get this:
full: 2.8 GB
2y: 1.4 GB
1y: 0.8 GB
(This would obviously be configurable in the options; someone who
doesn't care about storage, like myself, would set it to the full index.)
Lucene's storage uses immutable files, which means a remove operation at
the wrong stage would have no effect on the index size (it would only
set a bit). This makes the extraction step the best place for filtering,
since that is where the index is built. I am not really a Lucene expert,
so I wouldn't rule out that there are more ways to shrink the index.
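
A toy model of why a late remove doesn't shrink anything. This is plain
Java and not Lucene code; the Segment class and its methods are invented
for illustration, and the real on-disk format is far more involved.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class SegmentDeleteSketch {

    /** Toy stand-in for an immutable segment: docs are written once, deletes only flip a bit. */
    static final class Segment {
        private final List<String> docs;            // never modified after construction
        private final BitSet deleted = new BitSet(); // one "deleted" bit per document

        Segment(List<String> docs) {
            this.docs = List.copyOf(docs);
        }

        void markDeleted(int docId) {
            deleted.set(docId);
        }

        int storedDocs() {
            return docs.size();
        }

        int liveDocs() {
            return docs.size() - deleted.cardinality();
        }

        /** Builds a new segment containing only the live documents. */
        Segment rewrite() {
            List<String> live = new ArrayList<>();
            for (int i = 0; i < docs.size(); i++) {
                if (!deleted.get(i)) {
                    live.add(docs.get(i));
                }
            }
            return new Segment(live);
        }
    }

    public static void main(String[] args) {
        Segment segment = new Segment(List.of("a:1.0", "a:2.0", "b:1.0"));
        segment.markDeleted(0);

        // The deleted document still occupies space until the data is rewritten.
        System.out.println("stored=" + segment.storedDocs() + " live=" + segment.liveDocs());
        System.out.println("stored after rewrite=" + segment.rewrite().storedDocs());
    }
}
```

Filtering during extraction sidesteps this, because the unwanted
documents are never written in the first place.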
Some other features of maven-indexer 7+ we would get for free:
- multi-threaded extraction (the filter would be hooked into this and
be multi-threaded too, assuming it is accepted upstream).
- Lucene 9.6 uses Panama on JDK 19+ for memory-mapped storage, which
also makes it a bit faster (and apparently safer, according to the PR).
From what I have read, the devs are already excited about the vector
API :)
This brings the extraction time of the *full* central index down to
about 6 minutes on my (aging) machine. The weekly delta updates after
that are much faster.
This all depends, of course, on whether the changes are accepted upstream
(and also on the JDK 8 problem, but we have other threads for that).
index related and already in master for NB 18:
https://github.com/apache/netbeans/pull/5655
https://github.com/apache/netbeans/pull/5646
blocked:
https://github.com/apache/netbeans/pull/4999
upstream in maven-indexer:
https://github.com/apache/maven-indexer/pull/302
another experiment:
https://github.com/apache/netbeans/pull/4971
best regards,
michael