Re: [DISCUSS] maven remote repo indexer improvements

Michael Bien Tue, 18 Apr 2023 13:45:35 -0700

Hi Jakub,

not sure, since tmp is generally the best place for temporary files. Other places would require periodic cleanups in case something happens. The temp folder is only used during extraction; the final index is in your cache folder.



(this PR would give us more control over some of the locations too:

https://github.com/apache/maven-indexer/pull/302)

if you can't or don't want to increase your temp folder size, you can tell the JVM to use something else.



put this into the netbeans_default_options entry of netbeans.conf:


-J-Djava.io.tmpdir=/tmp/myothertmp


please make sure this folder is empty!
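
To confirm that the override is picked up and that the new location has enough room, a small standalone check can be run with the same -Djava.io.tmpdir value. This is only a sketch: the class name TmpDirCheck is made up for illustration and is not part of NetBeans or maven-indexer. It relies only on the standard java.io.tmpdir system property and the java.nio.file API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of NetBeans or maven-indexer.
class TmpDirCheck {

    // Directory the running JVM uses for temporary files.
    static Path tmpDir() {
        return Paths.get(System.getProperty("java.io.tmpdir"));
    }

    // Usable bytes on the file store that backs the given directory.
    static long usableBytes(Path dir) {
        try {
            return Files.getFileStore(dir).getUsableSpace();
        } catch (IOException e) {
            // e.g. the directory does not exist
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path tmp = tmpDir();
        System.out.println("java.io.tmpdir = " + tmp);
        System.out.println("usable bytes   = " + usableBytes(tmp));
    }
}
```

On Java 11 or newer this can be started directly from the source file, e.g. `java -Djava.io.tmpdir=/tmp/myothertmp TmpDirCheck.java`. The folder has to exist already, otherwise the free-space lookup fails.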


-mbien


On 18.04.23 21:46, Jakub Herkel wrote:

I would like to ask whether there is (or will be) a configuration
option (via a property, for example) for the directory in which remote
index processing takes place.
I have a problem on my notebook: all processing happens in the /tmp
directory (tmpfs), which has a size of 8GB, and it always stops with an
out of space exception. And it is a little bit awkward if every opening
of pom.xml results in downloading the remote index and processing it
without success.

best regards

jakub

On Tue, Apr 18, 2023 at 9:11 PM Michael Bien <[email protected]> wrote:

my thoughts so far on how I wanted to implement it:

   - time cutoff filter would be optional, configurable in the usual
maven/indexer options. Quick tests showed that full/2years/1year might
be reasonable values

   - sha1 filter would be applied by default. This cuts index size in
half, and currently no NB feature is running sha1 queries. This would
also be a candidate for substitution via an online service - if we really
don't need it we deprecate the queries. Hashes compress badly.

   - multi threaded extraction would be optional, potentially enabled by
default. This has a slight index size penalty due to merge overhead, so
I don't want to enable this without filters. The second concern is molten
notebooks: some are not made for sustained load and might prefer to run
the task in the background for 15mins instead of 6mins with loud fans.
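
To make the optional time cutoff from the first item a bit more concrete, here is a rough sketch of the idea only. Everything in it is an assumption for illustration: the Entry type, its fields, and keepRecent are hypothetical and do not correspond to maven-indexer classes or to the code in the PR. It simply drops entries whose last-modified time is older than a configurable period.

```java
import java.time.Instant;
import java.time.Period;
import java.time.ZoneOffset;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustration only; not the maven-indexer API.
class CutoffFilterSketch {

    // Hypothetical stand-in for one index record.
    static final class Entry {
        final String coordinates;
        final Instant lastModified;

        Entry(String coordinates, Instant lastModified) {
            this.coordinates = coordinates;
            this.lastModified = lastModified;
        }
    }

    // Keeps entries modified at or after (now - maxAge).
    static List<Entry> keepRecent(List<Entry> entries, Instant now, Period maxAge) {
        // Instant cannot subtract years directly, so go through a UTC date-time.
        Instant cutoff = now.atZone(ZoneOffset.UTC).minus(maxAge).toInstant();
        return entries.stream()
                .filter(e -> !e.lastModified.isBefore(cutoff))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
                new Entry("org.example:old-lib:1.0", Instant.parse("2020-01-01T00:00:00Z")),
                new Entry("org.example:new-lib:2.0", Instant.parse("2022-12-01T00:00:00Z")));
        Instant now = Instant.parse("2023-04-18T00:00:00Z");
        for (Entry e : keepRecent(entries, now, Period.ofYears(1))) {
            System.out.println(e.coordinates);
        }
    }
}
```

With a one-year cutoff the 2020 artifact above is filtered out, which is the trade-off such a filter makes: a smaller index, but old artifacts are no longer found.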


the first query that is planned to be augmented by an online service is
class name search, since this data hasn't been in the index for years
(nobody noticed though?).

and yeah, index updates should run faster. But this is at the bottom of
the todo list - low-hanging fruit first :)


best regards,
michael


On 18.04.23 20:36, Matthias Bläsing wrote:

Hi,

On Tuesday, 18.04.2023 at 07:48 +0200, Jan Lahoda wrote:

I apologize for being contrarian, but since the index download
started for me (again) while on a bus with very poor internet
connection, I guess I should tell you my view.

no reason to apologize.

Unless I am mistaken, the index gz is currently roughly 1.9GB, and
it takes several minutes to actually create the Lucene index from it,
consuming some more space and CPU.

To be honest, it never seemed very polite to me to download and
process so much without asking.

I guess alternatives that I would see would include (combination of
options possible):
- explicitly ask before downloading (possibly allowing the user to
select auto-download)

Yes, if people get notified that they'll get the full index locally,
then I'm ok with that. I see a problem if features silently give
outdated answers or don't work at all. Else we'll get "NetBeans
suggested version X, but Y is already on central, why is this not
current?".

- have the features that use the index do some query on a server, if
there isn't a downloaded index (or if it is stale/obsolete)

IMHO this highly depends on the speed of the API. If the latency is
high, the next bug will be "It takes ages until my POM tells me that
it is outdated".

- given that https://github.com/apache/netbeans/pull/4999 produces a
smaller index, we could have a download location (server) at least
for maven central that would serve this optimized index. If I
understand it properly, the smallest index under that PR is 0.8GB,
and if it would compress reasonably well, it might be (say) 0.5GB
compressed - much better than 1.9GB, and no significant CPU usage
after the index is downloaded. (Even if it was 0.8GB, it is still
much better than 1.9GB+CPU churn.)

Truncating the index needs to be done carefully. NetBeans has a search
by SHA1 (or MD5?) feature. That will break if you remove that data
from the index. A similar situation will arise if arbitrary cutoffs
are done based on time. Consider a library that implements some
interesting algorithm that just works the same even after years. If we
cut the index at 6 months, for example, that artifact won't be found
anymore.

There was also an argument on conserving the ASF resources in another
discussion recently. If I consider there would be (only) 10 000
installations of NetBeans, with the default setting to download the
index once a week, it is almost 20TB of data every week if I count
correctly, plus the CPU cycles to convert the index on users' machines.
It seems there may be a way to conserve the ASF resources and provide
better experience to the users at the same time.
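
The estimate above can be checked with a few lines. This is just the arithmetic from the mail (10 000 installations, one download per week, decimal units), with 0.5GB being the guessed compressed size of the optimized index mentioned earlier. The class and method names are illustrative only.

```java
// Back-of-the-envelope check; names are made up for illustration.
class WeeklyTransferEstimate {

    // Total weekly transfer in TB (1 TB = 1000 GB) if every installation
    // downloads the index once per week.
    static double weeklyTerabytes(int installations, double indexSizeGb) {
        return installations * indexSizeGb / 1000.0;
    }

    public static void main(String[] args) {
        // Full index of roughly 1.9GB: about 19 TB per week.
        System.out.println(weeklyTerabytes(10_000, 1.9));
        // Hypothetical 0.5GB optimized index: about 5 TB per week.
        System.out.println(weeklyTerabytes(10_000, 0.5));
    }
}
```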

The download is from Sonatype's CDN. Given that they actively discourage
central mirrors, I don't have too much concern here. It is also not the
resources of the ASF.

Greetings

Matthias


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

For further information about the NetBeans mailing lists, visit:
https://cwiki.apache.org/confluence/display/NETBEANS/Mailing+lists

