On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner <michael.wech...@wyona.com> wrote:
>
> What exactly do you consider reasonable?
Let's begin a real discussion by being HONEST about the current status. Please put political correctness and your own company's wishes aside; we know it's not in a good state.

The current status is that the one guy who wrote the code can set a multi-gigabyte RAM buffer and index a small dataset with 1024 dimensions in HOURS (I didn't ask what hardware). My concern is everyone except that one guy: I want it to be usable. Increasing dimensions just means an even bigger multi-gigabyte RAM buffer and a bigger heap to avoid OOM on merge. It is also a permanent backwards-compatibility decision: once we do this we have to support it, and we can't just say "oops" and flip it back.

It is unclear to me whether the multi-gigabyte RAM buffer is really there to avoid merges because they are so slow that indexing would take DAYS otherwise, or to avoid merges so that they don't hit OOM. Also, from personal experience, it takes trial and error (which means experiencing OOM on merge!!!) before you get those heap values right for your dataset. That usually means starting over, which is frustrating and wastes more time.

Jim mentioned some ideas about the memory usage in IndexWriter, which seem like a good direction to me. Maybe the multi-gigabyte RAM buffer can be avoided that way, and performance improved, by writing bigger segments with Lucene's defaults. But that doesn't mean we can simply ignore the horrors of what happens on merge. Merging needs to scale so that indexing really scales. At a minimum it shouldn't spike RAM on trivial amounts of data and cause OOM, and it definitely shouldn't burn hours and hours of CPU in O(n^2) fashion when indexing.
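For concreteness, here is roughly what the "big RAM buffer" workaround looks like from the user's side. This is only a minimal sketch against a recent 9.x API (KnnFloatVectorField etc.); the path, field name, buffer size and doc count are made up, and the numbers in the comments are back-of-envelope only:

import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class BigBufferVectorIndexing {
  public static void main(String[] args) throws Exception {
    int dims = 1024;

    // The workaround under discussion: crank the indexing RAM buffer up to
    // several GB so flushes (and therefore merges) happen as rarely as possible.
    IndexWriterConfig cfg = new IndexWriterConfig().setRAMBufferSizeMB(4096);

    try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/vector-index"));
         IndexWriter writer = new IndexWriter(dir, cfg)) {
      for (int i = 0; i < 1_000_000; i++) {
        // 1024 floats * 4 bytes = 4 KB of raw vector data per doc, so ~4 GB for
        // 1M docs, before counting the HNSW graph that is built alongside it.
        float[] vector = new float[dims];
        // ... fill vector from your embedding source ...
        Document doc = new Document();
        doc.add(new KnnFloatVectorField("embedding", vector));
        writer.addDocument(doc);
      }
      writer.commit();
      // Every later merge has to read the per-segment vectors back and build the
      // HNSW graph for the merged segment in memory, which is exactly where the
      // heap-sizing trial and error (and the OOM) happens.
    }
  }
}

The point being: even if a huge buffer makes the flush side tolerable, it does nothing for the merge side, which is the part I'm worried about.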