Re: [DISCUSS] maven remote repo indexer improvements

Michael Bien Tue, 18 Apr 2023 12:11:11 -0700

my thoughts so far on how I wanted to implement it:

- time cutoff filter would be optional, configurable in the usual maven/indexer options. Quick tests showed that full / 2 years / 1 year might be reasonable values.

- sha1 filter would be applied by default. This cuts index size in half, and currently no NB feature is running sha1 queries. This would also be a candidate for substitution via an online service; if we really don't need it, we deprecate the queries. Hashes compress badly.

- multi-threaded extraction would be optional, potentially enabled by default. This has a slight index size penalty due to merge overhead, so I don't want to enable this without filters. The second concern is molten notebooks: some are not made for sustained load and might prefer to run the task in the background for 15 mins instead of 6 mins with loud fans.
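
To make the two filters above concrete, here is a minimal sketch. It is not the indexer's actual code: `IndexRecord`, `filter_records`, and the `max_age` / `drop_sha1` parameters are hypothetical names, and it assumes the time cutoff is applied to a per-artifact last-modified timestamp.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class IndexRecord:
    # Hypothetical, simplified stand-in for one artifact entry in the remote index.
    coordinates: str
    last_modified: datetime
    sha1: Optional[str]


def filter_records(
    records: Iterable[IndexRecord],
    now: datetime,
    max_age: Optional[timedelta] = None,  # None means "full": no time cutoff
    drop_sha1: bool = True,               # sha1 filter, on by default
) -> Iterator[IndexRecord]:
    for record in records:
        # Time cutoff: skip artifacts older than the configured maximum age.
        if max_age is not None and now - record.last_modified > max_age:
            continue
        # sha1 filter: keep the record but strip the hash field.
        yield replace(record, sha1=None) if drop_sha1 else record


now = datetime(2023, 4, 18, tzinfo=timezone.utc)
records = [
    IndexRecord("org.example:old:1.0", now - timedelta(days=900), "aa" * 20),
    IndexRecord("org.example:new:1.0", now - timedelta(days=30), "bb" * 20),
]
one_year = list(filter_records(records, now, max_age=timedelta(days=365)))
full = list(filter_records(records, now, drop_sha1=False))
```

With a one-year cutoff only the recent artifact survives, and its hash is dropped; with no cutoff and the sha1 filter disabled, both records pass through unchanged.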

the first query planned to be augmented by an online service is class name search, since this data hasn't been in the index for years (nobody noticed though?).

and yeah, index updates should run faster. But this is at the bottom of the todo list - low-hanging fruit first :)



best regards,
michael


On 18.04.23 20:36, Matthias Bläsing wrote:

Hi,

On Tuesday, 18.04.2023 at 07:48 +0200, Jan Lahoda wrote:

I apologize for being contrarian, but since the index download
started for me (again) while on a bus with very poor internet
connection, I guess I should tell you my view.

no reason to apologize.

Unless I am mistaken, the index gz has currently roughly 1.9GB, and
it took several minutes to actually create the Lucene index from it,
consuming some more space and CPU.

To be honest, it never seemed very polite to me to download and
process so much without asking.

I guess alternatives that I would see would include (combination of
options possible):
- explicitly ask before downloading (possibly allowing the user to
select auto-download)

Yes, if people get notified that they'll get the full index locally,
then I'm ok with that. I see a problem if features silently give
outdated answers or don't work at all. Else we'll get "NetBeans
suggested version X, but Y is already on central, why is this not
current?".

- have the features that use the index do some query on a server, if
there isn't a downloaded index (or if it is stale/obsolete)

IMHO this highly depends on the speed of the API. If the latency is
high, the next bug will be "It takes ages until my POM tells me, that
it is outdated".

- given that https://github.com/apache/netbeans/pull/4999 produces a
smaller index, we could have a download location (server) at least
for maven central that would serve this optimized index. If I
understand it properly, the smallest index under that PR is 0.8GB,
and if it would compress reasonably well, it might be (say) 0.5GB
compressed - much better than 1.9GB, and no significant CPU usage
after the index is downloaded. (Even if it was 0.8GB, it is still
much better than 1.9GB+CPU churn.)

Truncating the index needs to be done carefully. NetBeans has a search
by SHA1 (or MD5?) feature. That will break if you remove that data
from the index. A similar situation will arise if arbitrary cutoffs
are done based on time. Consider a library that does some interesting
algorithm, that just works the same even after years. If we cut the
index at 6 months, for example, that artifact won't be found anymore.

There was also an argument on conserving the ASF resources in another
discussion recently. If I consider there would be (only) 10 000
installations of NetBeans, with the default setting to download the
index once a week, it is almost 20TB of data every week if I count
correctly. +the CPU cycles to convert the index on user's machines.
It seems there may be a way to conserve the ASF resources and provide
better experience to the users at the same time.
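
As a rough check of the estimate above, the arithmetic can be sketched as follows. The function name is made up for illustration; the installation count and index sizes are the figures quoted in this thread, and decimal units (1 TB = 1000 GB) are assumed.

```python
def weekly_transfer_tb(installations: int, index_size_gb: float) -> float:
    # Total data downloaded per week if every installation fetches the index once.
    return installations * index_size_gb / 1000  # GB -> TB, decimal units


current = weekly_transfer_tb(10_000, 1.9)     # current index download
optimized = weekly_transfer_tb(10_000, 0.8)   # smallest index under the PR
compressed = weekly_transfer_tb(10_000, 0.5)  # guessed compressed size

for label, tb in [("current", current), ("optimized", optimized), ("compressed", compressed)]:
    print(f"{label}: about {tb:.1f} TB per week")
```

For 10 000 installations this gives about 19 TB per week at 1.9GB, versus about 8 TB at 0.8GB and about 5 TB at 0.5GB.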

The download is from Sonatype's CDN. Given that they actively discourage
central mirrors, I do not have too much concern here. It is also not the
resources of the ASF.

Greetings

Matthias


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

For further information about the NetBeans mailing lists, visit:
https://cwiki.apache.org/confluence/display/NETBEANS/Mailing+lists


