Re: [DISCUSS] maven remote repo indexer improvements

Jakub Herkel Tue, 18 Apr 2023 12:47:41 -0700

I would like to ask whether there is (or will be) a configuration
option (via a property, for example) to set the directory in which
remote index processing takes place.
On my notebook, all processing happens in the /tmp directory (tmpfs,
8GB in size), and it always stops with an out-of-space exception. It
is a little bit awkward if every opening of pom.xml results in
downloading the remote index and processing it with no success.
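
For illustration only, a minimal Java sketch of what such an option
could look like. The property name `maven.indexer.workdir` is
hypothetical (it is not an existing NetBeans or maven-indexer option),
and the fallback assumes the processing directory is otherwise taken
from `java.io.tmpdir`, which this thread does not confirm. The 10 GiB
requirement in `main` is likewise just a placeholder value.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class IndexWorkDir {

    // Hypothetical property name, used here for illustration only.
    static final String WORK_DIR_PROPERTY = "maven.indexer.workdir";

    /** Returns the configured directory, or java.io.tmpdir if the property is unset or blank. */
    static Path resolveWorkDir() {
        String configured = System.getProperty(WORK_DIR_PROPERTY);
        String base = (configured == null || configured.isBlank())
                ? System.getProperty("java.io.tmpdir")
                : configured;
        return Paths.get(base);
    }

    /** True if the file store holding dir reports at least requiredBytes of usable space. */
    static boolean hasEnoughSpace(Path dir, long requiredBytes) throws IOException {
        Files.createDirectories(dir);
        return Files.getFileStore(dir).getUsableSpace() >= requiredBytes;
    }

    public static void main(String[] args) throws IOException {
        Path dir = resolveWorkDir();
        long required = 10L * 1024 * 1024 * 1024; // placeholder: 10 GiB
        System.out.println(dir + " has enough space: " + hasEnoughSpace(dir, required));
    }
}
```

Checking the usable space up front would at least let the download be
skipped instead of failing with an out-of-space exception each time.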


best regards

jakub

On Tue, Apr 18, 2023 at 9:11 PM Michael Bien wrote:
>
> my thoughts so far on how I wanted to implement it:
>
>   - time cutoff filter would be optional, configurable in the usual
> maven/indexer options. Quick tests showed that full/2years/1year might
> be reasonable values
>
>   - sha1 filter would be applied by default. This cuts the index size
> in half, and currently no NB feature is running sha1 queries. This
> would also be a candidate for substitution via an online service; if
> we really don't need it, we deprecate the queries. Hashes compress
> badly.
>
>   - multi threaded extraction would be optional, potentially enabled by
> default. This has a slight index size penalty due to merge overhead, so
> I don't want to enable this without filters. A second concern is molten
> notebooks: some are not made for sustained load and might prefer to run
> the task in the background for 15mins instead of 6mins with loud fans.
>
>
> the first query which is planned to be augmented by an online service
> is class name search, since this data hasn't been in the index for
> years (nobody noticed though?).
>
> and yeah index updates should run faster. But this is at the bottom of
> the todo list - low-hanging fruit first :)
>
>
> best regards,
> michael
>
>
> On 18.04.23 20:36, Matthias Bläsing wrote:
> > Hi,
> >
> > Am Dienstag, dem 18.04.2023 um 07:48 +0200 schrieb Jan Lahoda:
> >> I apologize for being contrarian, but since the index download
> >> started for me (again) while on a bus with very poor internet
> >> connection, I guess I should tell you my view.
> > no reason to apologize.
> >
> >> Unless I am mistaken, the index gz is currently roughly 1.9GB, and
> >> it takes several minutes to actually create the Lucene index from
> >> it, consuming some more space and CPU.
> >>
> >> To be honest, it never seemed very polite to me to download and
> >> process so much without asking.
> >>
> >> I guess alternatives that I would see would include (combination of
> >> options possible):
> >> - explicitly ask before downloading (possibly allowing the user to
> >> select auto-download)
> > Yes, if people get notified that they'll get the full index locally,
> > then I'm ok with that. I see a problem if features silently give
> > outdated answers or don't work at all. Otherwise we'll get "NetBeans
> > suggested version X, but Y is already on central, why is this not
> > current?".
> >
> >> - have the features that use the index do some query on a server, if
> >> there isn't a downloaded index (or if it is stale/obsolete)
> > IMHO this highly depends on the speed of the API. If the latency is
> > high, the next bug will be "It takes ages until my POM tells me that
> > it is outdated".
> >
> >> - given that https://github.com/apache/netbeans/pull/4999 produces a
> >> smaller index, we could have a download location (server) at least
> >> for maven central that would serve this optimized index. If I
> >> understand it properly, the smallest index under that PR is 0.8GB,
> >> and if it would compress reasonably well, it might be (say) 0.5GB
> >> compressed - much better than 1.9GB, and no significant CPU usage
> >> after the index is downloaded. (Even if it was 0.8GB, it is still
> >> much better than 1.9GB+CPU churn.)
> > Truncating the index needs to be done carefully. NetBeans has a
> > search by SHA1 (or MD5?) feature. That will break if you remove that
> > data from the index. A similar situation will arise if arbitrary
> > cutoffs are done based on time. Consider a library that implements
> > some interesting algorithm and just works the same even after years.
> > If we cut the index at 6 months, for example, that artifact won't be
> > found anymore.
> >
> >> There was also an argument on conserving the ASF resources in another
> >> discussion recently. If I consider there would be (only) 10 000
> >> installations of NetBeans, with the default setting to download the
> >> index once a week, it is almost 20TB of data every week if I count
> >> correctly (10 000 x 1.9GB = 19TB), plus the CPU cycles to convert the
> >> index on users' machines. It seems there may be a way to conserve the
> >> ASF resources and provide a better experience to the users at the
> >> same time.
> > The download is from Sonatype's CDN. Given that they actively
> > discourage central mirrors, I am not too concerned here. It is also
> > not the resources of the ASF.
> >
> > Greetings
> >
> > Matthias
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
> > For further information about the NetBeans mailing lists, visit:
> > https://cwiki.apache.org/confluence/display/NETBEANS/Mailing+lists
> >
> >
> >
>
>