Re: maven indexing tweaks

Michael Bien Sat, 18 Mar 2023 19:42:14 -0700

On 18.03.23 14:41, Eirik Bakke wrote:

- "sliding window" time filters, e.g drop all documents older than 2 years 
(aka: who uses old libraries?)


Is "document" the same as "maven artifact" here?

yeah pretty much. A lucene document is somewhat comparable to a row in a db (i think). The experiments I ran so far filtered everything which had a modification date field, those should only be on documents which describe artifacts. I used a tool called Luke to inspect the index, which I am not an expert in. The raw data is also 14 GB so you can't quickly look at it and know what is in it or what takes up most of the space.


Here are the fields which once were in the data, most aren't used anymore:

https://maven.apache.org/maven-indexer-archives/maven-indexer-LATEST/indexer-core/

e.g. removing the sha1 field cut the lucene index almost in half, since those things don't compress very well or cause other overhead.

  Perhaps an additional condition could be added, "older than 1 year _and_ there are 
newer versions of this artifact in the cache".

the proposal upstream does not filter the "cache", that would be slow and would not have the desired effect of reduced on-disk footprint unless the index is rebuilt (since lucene storage uses immutable files). It filters during extraction of the raw data of the remote index before it is put into a lucene index which represents the remote repository (e.g. central but it works with any other too, apache or a company-internal one etc). (another advantage is that the filter would be a step in an already multi threaded extraction pipeline without extra steps)
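
A minimal sketch of what such an extraction-time filter could look like, to make the idea concrete. It is not the code from the upstream PR and does not use the maven-indexer API: records are modeled as plain maps, and the class name, method names and field names ("lastModified", "sha1") are assumptions made up for illustration. Records older than the cutoff are dropped, records without a modification date pass through, and the sha1 field is removed from everything that is kept.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical sketch: records are plain maps, not maven-indexer types.
class ExtractionFilterSketch {

    // Field names are assumptions for illustration only.
    static final String LAST_MODIFIED = "lastModified"; // epoch millis, stored as Long
    static final String SHA1 = "sha1";

    // Returns the record without its sha1 field, or empty if it is older than the cutoff.
    static Optional<Map<String, Object>> filter(Map<String, Object> record, Instant cutoff) {
        Object modified = record.get(LAST_MODIFIED);
        // Only records that carry a modification date are subject to the time filter.
        if (modified instanceof Long && Instant.ofEpochMilli((Long) modified).isBefore(cutoff)) {
            return Optional.empty();
        }
        Map<String, Object> slim = new HashMap<>(record);
        slim.remove(SHA1); // drop the field that compresses badly
        return Optional.of(slim);
    }

    // Applies the filter to a batch of records, as one step of an extraction pipeline.
    static List<Map<String, Object>> filterAll(List<Map<String, Object>> records, Instant cutoff) {
        return records.stream()
                .map(r -> filter(r, cutoff))
                .flatMap(Optional::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        // "sliding window" of 2 years, counted back from the time of extraction
        Instant cutoff = ZonedDateTime.ofInstant(now, ZoneOffset.UTC).minusYears(2).toInstant();

        Map<String, Object> oldLib = new HashMap<>();
        oldLib.put("artifact", "old-lib");
        oldLib.put(LAST_MODIFIED, ZonedDateTime.ofInstant(now, ZoneOffset.UTC).minusYears(5).toInstant().toEpochMilli());
        oldLib.put(SHA1, "0000000000000000000000000000000000000000");

        Map<String, Object> newLib = new HashMap<>();
        newLib.put("artifact", "new-lib");
        newLib.put(LAST_MODIFIED, now.toEpochMilli());
        newLib.put(SHA1, "1111111111111111111111111111111111111111");

        List<Map<String, Object>> records = new ArrayList<>();
        records.add(oldLib);
        records.add(newLib);

        System.out.println(filterAll(records, cutoff));
    }
}
```

Since the cutoff is computed once at extraction time, nothing is removed from an index that already exists; the filter only decides what gets written in the first place.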

So if you set the cutoff filter to 2 years, and use the same cache for 1 year, there are going to be 3 years of artifact metadata in your index.

Also: if a lib is in your local .m2 folder already (even snapshots you build), it is in a separate index for local repos - this one isn't filtered either. The index footprint there is also tiny (25 MB for a 4 GB .m2 folder in my case)


-mbien


-- Eirik

-----Original Message-----
From: Michael Bien <[email protected]>
Sent: Friday, March 17, 2023 8:26 PM
To: [email protected]; Antonio <[email protected]>
Subject: Re: maven indexing tweaks

On 17.03.23 22:38, Antonio wrote:

Hi,

These are impressive savings!

yeah I am pretty happy about the results too. Esp the removal of the
sha1 field had a great effect. Technically we do actually offer this as a query
through the public API, however, it doesn't appear that anything is using it - i
have to take another look just to be sure. Even if something does we could make
it an option in the settings.


Out of curiosity, we don't build the index incrementally using Maven's
IndexReader, do we? That's why we download the whole index, right?

first use will download the whole copy, weekly updates are incremental.
And yes it uses DefaultIndexReader (and the updater) of the maven-indexer 
project.

Which is the reason why we have to make some tweaks upstream to get more 
flexibility (and filtering). For example some time in future we might want to 
change where the temp extraction storage is, which maven-indexer uses, which is 
also part of the proposed PR upstream right now.

https://repo1.maven.org/maven2/.index/ has the compressed data for central, 
(apache etc have their own locations but those indices are smaller so you 
barely notice anything)

Currently the lucene index isn't moved into new NetBeans config from old 
caches. This is something we could take a look at too but things like this are 
super annoying to test + risky since someone will find a way to import an index 
from a 10 year old backup and report that something fails (just like users who 
try to import nb-javac from NB 12.x which breaks pretty much everything).

-mbien

Thanks,
Antonio


[1]

https://maven.apache.org/maven-indexer/indexer-reader/apidocs/org/apache/maven/index/reader/IndexReader.html


On 17/3/23 11:06, Michael Bien wrote:

Hello everyone,

I experimented a bit with the maven index extraction process and got
some pretty good results (I think).

There might be a way to filter the index during extraction without
noteworthy overhead, which allows the following:

   - "sliding window" time filters, e.g drop all documents older than
2 years (aka: who uses old libraries?)

   - we can drop fields we don't need from the index. Esp interesting
for fields which don't compress well (looking at you, sha1 hash)

some results for the time cutoff filter:

full: 5.6 GB
2y: 2.6 GB
1y: 1.4 GB

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

For further information about the NetBeans mailing lists, visit:
https://cwiki.apache.org/confluence/display/NETBEANS/Mailing+lists


