On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
<michael.wech...@wyona.com> wrote:
>
> What exactly do you consider reasonable?

Let's begin a real discussion by being HONEST about the current
status. Please put political correctness and your own company's wishes
aside; we know it's not in a good state.

The current status is that the one guy who wrote the code can set a
multi-gigabyte RAM buffer and index a small dataset with 1024
dimensions in HOURS (I didn't ask what hardware).
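
For concreteness, this is roughly the kind of setup being described
(just a sketch against the Lucene 9.x APIs; the 8 GB buffer, field
name, and path are made up for illustration, not anyone's actual
configuration):

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.FSDirectory;

public class HugeBufferSketch {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
    // multi-gigabyte RAM buffer so (ideally) everything flushes as one
    // segment and HNSW graph merges are avoided; needs a matching -Xmx
    cfg.setRAMBufferSizeMB(8 * 1024);

    try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/vectors"));
         IndexWriter writer = new IndexWriter(dir, cfg)) {
      // one 1024-dimensional vector per document; in practice this
      // would loop over the whole dataset
      float[] vec = new float[1024];
      Document doc = new Document();
      doc.add(new KnnVectorField("vec", vec, VectorSimilarityFunction.EUCLIDEAN));
      writer.addDocument(doc);
    }
  }
}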

My concern is everyone else except that one guy; I want it to be
usable. Increasing dimensions just means an even bigger multi-gigabyte
RAM buffer and a bigger heap to avoid OOM on merge.
It is also a permanent backwards-compatibility decision: once we do
this we have to support it, and we can't just say "oops" and flip it
back.

It is unclear to me whether the multi-gigabyte RAM buffer is really
there to avoid merges because they are so slow and it would take DAYS
otherwise, or to avoid merges so it doesn't hit OOM.
Also, from personal experience, it takes trial and error (meaning
experiencing OOM on merge!!!) before you get those heap values right
for your dataset. That usually means starting over, which is
frustrating and wastes even more time.

Jim mentioned some ideas about the memory usage in IndexWriter, and
that seems like a good idea to me. Maybe the multi-gigabyte RAM buffer
can be avoided that way and performance improved by writing bigger
segments with Lucene's defaults. But this doesn't mean we can simply
ignore the horrors of what happens on merge: merging needs to scale so
that indexing really scales.
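
For reference, this is what "Lucene's defaults" amount to today (a
sketch assuming Lucene 9.x; the class and method wrappers below are
only illustrative scaffolding):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class DefaultsSketch {
  static IndexWriterConfig withDefaults() {
    // Out of the box: a 16 MB RAM buffer
    // (IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB) and TieredMergePolicy,
    // i.e. many smaller flushed segments plus background merges.
    // The catch for vectors: each merge has to build the HNSW graph for the
    // merged segment, which is exactly the slow, memory-hungry part.
    IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
    cfg.setMergePolicy(new TieredMergePolicy()); // already the default, spelled out here
    return cfg;
  }
}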

At the very least it shouldn't spike RAM on trivial amounts of data
and cause OOM, and it definitely shouldn't burn hours and hours of CPU
in O(n^2) fashion while indexing.
