Just addressing [1], I believe there is a simple workaround. Here's a unit test demonstrating it:
public void testExcessivelyLargeVector() throws Exception {
  // An anonymous FieldType that reports a 2048-dim vector type directly,
  // sidestepping the max-dimension check in FieldType's setter.
  IndexableFieldType vector2048 = new FieldType() {
    @Override
    public int vectorDimension() {
      return 2048;
    }

    @Override
    public VectorEncoding vectorEncoding() {
      return VectorEncoding.FLOAT32;
    }

    @Override
    public VectorSimilarityFunction vectorSimilarityFunction() {
      return VectorSimilarityFunction.EUCLIDEAN;
    }
  };
  // newDirectory(), newIndexWriterConfig() and codec come from Lucene's test framework.
  try (Directory dir = newDirectory();
      IndexWriter iw = new IndexWriter(dir, newIndexWriterConfig(null).setCodec(codec))) {
    Document doc = new Document();
    FieldType type = new FieldType(vector2048);
    doc.add(new KnnVectorField("vector2048", new float[2048], type));
    iw.addDocument(doc);
  }
}

On Wed, Apr 12, 2023 at 8:10 AM Alessandro Benedetti <a.benede...@sease.io> wrote:
> My attempt at listing here only a set of proposals to then vote on has unfortunately failed.
> I appreciate the discussion on better benchmarking HNSW, but my feeling is that this discussion is orthogonal to the limit discussion itself. Should we create a separate mail thread/GitHub issue for that?
> At the moment I see at least three lines of activity as an outcome of this (maybe too long) discussion:
> 1) [small task] There's a need from a good number of people to increase/remove the max limit, as an enabler, to get more users to Lucene and to ease adoption for Lucene-based systems (Apache Solr, Elasticsearch, OpenSearch).
> 2) [medium task] We all want more benchmarks for Lucene vector-based search, with a good variety of vector dimensions and encodings.
> 3) [big task?] Some people would like to improve vector-based search performance because it is currently not acceptable; it's not clear when and how.
> A question I have for point 1: does it really need to be a one-way door? Can't we reduce the max limit in the future if the implementation becomes coupled with certain dimension sizes?
> It's not ideal, I agree, but is back-compatibility more important than pragmatic benefits?
> I.e., right now there's no implementation coupled with the max limit -> we remove/increase the limit and get more users.
> Then with Lucene X.Y a clever committer introduces a super nice implementation improvement that unfortunately limits the max size to K. Can't we just document it as a breaking change for that release? At that point we won't support >K vectors, but for a reason.
> Do we have similar precedents in Lucene?
> On Wed, 12 Apr 2023, 08:36 Michael Wechner, <michael.wech...@wyona.com> wrote:
>> Thank you very much for your feedback!
>> In a previous post (April 7) you wrote you could make available the 47K ada-002 vectors, which would be great!
>> Would it make sense to set up a public GitHub repo, such that others could use or also contribute vectors?
>> Thanks
>> Michael Wechner
>> Am 12.04.23 um 04:51 schrieb Kent Fitch:
>> I only know some characteristics of the openAI ada-002 vectors, although they are very popular as embeddings/text-characterisations, as they allow more accurate/"human meaningful" semantic search results with fewer dimensions than their predecessors. I've evaluated a few different embedding models, including some BERT variants, CLIP ViT-L-14 (with 768 dims, which was quite good), openAI's ada-001 (1024 dims) and babbage-001 (2048 dims), and ada-002 is qualitatively the best, although that will certainly change!
>> In any case, ada-002 vectors have interesting characteristics that I think mean you could confidently create synthetic vectors which would be hard to distinguish from "real" vectors. I found this from looking at 47K ada-002 vectors generated across a full year (1994) of newspaper articles from the Canberra Times and 200K Wikipedia articles:
>> - there is no discernible/significant correlation between values in any pair of dimensions
>> - all but 5 of the 1536 dimensions have an almost identical distribution of values, shown in the central blob on these graphs (they show just a few of these 1531 dimensions with clumped values and the 5 "outlier" dimensions, but all 1531 non-outlier dims are in there, which makes for some easy quantisation from float to byte if you don't want to go the full kmeans/clustering/Lloyd's-algorithm approach):
>> https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
>> https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
>> https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
>> - the variance of the value of each dimension is characteristic:
>> https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228
>> This probably represents something significant about how the ada-002 embeddings are created, but I think it also means creating "realistic" values is possible. I did not use this information when testing recall & performance of Lucene's HNSW implementation on 192m documents; instead, I slightly dithered the values of a "real" set of 47K docs, stored other fields in each doc referencing the "base" document the dithers were made from, and used different dithering magnitudes so that I could test recall with different neighbour sizes ("M"), construction beam-widths and search beam-widths.
>> best regards
>> Kent Fitch
>> On Wed, Apr 12, 2023 at 5:08 AM Michael Wechner <michael.wech...@wyona.com> wrote:
>>> I understand what you mean, that it seems to be artificial, but I don't understand why this matters for testing performance and scalability of the indexing.
>>> Let's assume the limit of Lucene were 4 instead of 1024, and there were only open source models generating vectors with 4 dimensions, for example
>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814
>>> 0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106
>>> -0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>> and now I concatenate them into vectors with 8 dimensions
>>> 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
>>> -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
>>> and normalize them to length 1.
>>> Why should this be any different from a model which is acting like a black box generating vectors with 8 dimensions?
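>>> For illustration, a minimal sketch of that recipe in plain Java (untested; the method name is mine):
>>>
>>> // Concatenate two vectors and re-normalize the result to unit length.
>>> static float[] concatAndNormalize(float[] a, float[] b) {
>>>   float[] v = new float[a.length + b.length];
>>>   System.arraycopy(a, 0, v, 0, a.length);
>>>   System.arraycopy(b, 0, v, a.length, b.length);
>>>   double norm = 0;
>>>   for (float x : v) {
>>>     norm += x * x;
>>>   }
>>>   float scale = (float) (1.0 / Math.sqrt(norm));
>>>   for (int i = 0; i < v.length; i++) {
>>>     v[i] *= scale;
>>>   }
>>>   return v;
>>> }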
>>> Am 11.04.23 um 19:05 schrieb Michael Sokolov:
>>> >> What exactly do you consider real vector data? Vector data which is based on texts written by humans?
>>> > We have plenty of text; the problem is coming up with a realistic vector model that requires as many dimensions as people seem to be demanding. As I said above, after surveying huggingface I couldn't find any text-based model using more than 768 dimensions. So far we have some ideas for generating higher-dimensional data by dithering or concatenating existing data, but it seems artificial.
>>> > On Tue, Apr 11, 2023 at 9:31 AM Michael Wechner <michael.wech...@wyona.com> wrote:
>>> >> What exactly do you consider real vector data? Vector data which is based on texts written by humans?
>>> >> I am asking because I recently attended the following presentation by Anastassia Shaitarova (UZH Institute for Computational Linguistics, https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
>>> >> ----
>>> >> Can we Identify Machine-Generated Text? An Overview of Current Approaches
>>> >> by Anastassia Shaitarova (UZH Institute for Computational Linguistics)
>>> >> The detection of machine-generated text has become increasingly important due to the prevalence of automated content generation and its potential for misuse. In this talk, we will discuss the motivation for automatic detection of generated text. We will present the currently available methods, including feature-based classification as a “first line-of-defense.” We will provide an overview of the detection tools that have been made available so far and discuss their limitations. Finally, we will reflect on some open problems associated with the automatic discrimination of generated texts.
>>> >> ----
>>> >> and her conclusion was that it has become basically impossible to differentiate between text generated by humans and text generated by, for example, ChatGPT.
>>> >> Whereas others have a slightly different opinion; see for example https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
>>> >> But I would argue that real-world and synthetic data have become close enough that testing performance and scalability of indexing should be possible with synthetic data.
>>> >> I completely agree that we have to base our discussions and decisions on scientific methods, and that we have to make sure that Lucene performs and scales well and that we understand the limits and what is going on under the hood.
>>> >> Thanks
>>> >> Michael W
>>> >> Am 11.04.23 um 14:29 schrieb Michael McCandless:
>>> >> +1 to test on real vector data -- if you test on synthetic data you draw synthetic conclusions.
>>> >> Can someone post the theoretical performance (CPU and RAM required) of HNSW construction? Do we know/believe our HNSW implementation has achieved that theoretical big-O performance? Maybe we have some silly performance bug that's causing it not to?
>>> >> As I understand it, HNSW makes the tradeoff of costly construction for faster searching, which is typically the right tradeoff for search use cases. We do this in other parts of the Lucene index too.
>>> >> Lucene will do a logarithmic number of merges over time, i.e. each doc will be merged O(log(N)) times in its lifetime in the index. We need to multiply that by the cost of re-building the whole HNSW graph on each merge.
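>>> >> (To make that concrete with a rough, illustrative back-of-the-envelope using invented numbers: with a merge factor of 10, a doc is rewritten about log10(N) times, so at N = 100M docs each doc is merged ~8 times; if every merge rebuilds the HNSW graph for all vectors in the merged segments, the total graph-construction work over the index's lifetime is roughly 8x the cost of building the final index's graph once from scratch.)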
>>> >> BTW, other things in Lucene, like BKD/dimensional points, also rebuild the whole data structure on each merge, I think? But, as Rob pointed out, stored-fields merging does indeed do some sneaky tricks to avoid excessive block decompress/recompress on each merge.
>>> >>> As I understand it, vetoes must have technical merit. I'm not sure that this veto rises to "technical merit" on 2 counts:
>>> >> Actually I think Robert's veto stands on its technical merit already. Robert's take on technical matters very much resonates with me, even if he is sometimes prickly in how he expresses it ;)
>>> >> His point is that we, as a dev community, are not paying enough attention to the indexing performance of our KNN algo (HNSW) and implementation, and that it is reckless to increase/remove limits in that state. It is indeed a one-way-door decision, and one must confront such decisions with caution, especially for such widely used base infrastructure as Lucene. We don't even advertise today in our javadocs that you need XXX heap if you index vectors with dimension Y, fanout X, levels Z, etc.
>>> >> RAM used during merging is unaffected by dimensionality, but is affected by fanout, because the HNSW graph (not the raw vectors) is memory resident, I think? (See the rough sizing sketch in the P.S. below.) Maybe we could move it off-heap and let the OS manage the memory (and still document the RAM requirements)? Maybe merge RAM costs should be accounted for in IW's RAM buffer accounting? They are not today, and there are some other things that use non-trivial RAM, e.g. the doc mapping (to compress docid space when deletions are reclaimed).
>>> >> When we added KNN vector testing to Lucene's nightly benchmarks, the indexing time massively increased -- see annotations DH and DP here: https://home.apache.org/~mikemccand/lucenebench/indexing.html. Nightly benchmarks now start at 6 PM and don't finish until ~14.5 hours later. Of course, that is using a single thread for indexing (on a box that has 128 cores!) so we produce a deterministic index every night ...
>>> >> Stepping out (meta) a bit ... this discussion is precisely one of the awesome benefits of the (informed) veto. It means risky changes to the software, as determined by any single informed developer on the project, can force a healthy discussion about the problem at hand. Robert is legitimately concerned about a real issue, and so we should use our creative energies to characterize our HNSW implementation's performance, document it clearly for users, and uncover ways to improve it.
>>> >> Mike McCandless
>>> >> http://blog.mikemccandless.com
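>>> >> P.S. That rough sizing sketch (my own back-of-the-envelope, not measured, and the actual on-heap layout will differ): in typical HNSW implementations, layer 0 keeps up to 2*M neighbor ids per node and each higher layer up to M, stored as 4-byte ints, so the neighbor lists alone are on the order of N * 2M * 4 bytes, i.e. roughly 1.3 GB for 10M vectors at M=16, independent of dimension. That is consistent with merge RAM tracking fanout rather than dimensionality.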
>>> >> On Mon, Apr 10, 2023 at 6:41 PM Alessandro Benedetti <a.benede...@sease.io> wrote:
>>> >>> I think Gus's points are on target.
>>> >>> I recommend we move this forward in this way: we stop any discussion, everyone interested proposes an option with a motivation, and then we aggregate the options and maybe create a vote?
>>> >>> I am also on the same page about the fact that a veto should come with a clear and reasonable technical merit, which in my opinion has also not come yet.
>>> >>> I also apologise if any of my words sounded harsh or like personal attacks; I never meant them that way.
>>> >>> My proposed option:
>>> >>> 1) Remove the limit and potentially make it configurable.
>>> >>> Motivation: the system administrator can enforce a limit that their users need to respect, in line with whatever the admin decided is acceptable for them. The default can stay the current one.
>>> >>> That's my favourite at the moment, but I agree that this may potentially need to change in the future, as we may optimise the data structures for certain dimensions. I am a big fan of YAGNI (you aren't gonna need it), so I am OK with facing a different discussion if that happens in the future.
>>> >>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.h...@gmail.com> wrote:
>>> >>>> What I see so far:
>>> >>>> - Much positive support for raising the limit
>>> >>>> - Slightly less support for removing it or making it configurable
>>> >>>> - A single veto which argues that a (as yet undefined) performance standard must be met before raising the limit
>>> >>>> - Hot tempers (various) making this discussion difficult
>>> >>>> As I understand it, vetoes must have technical merit. I'm not sure that this veto rises to "technical merit" on 2 counts:
>>> >>>> - No standard for the performance is given, so it cannot be technically met. Without hard criteria it's a moving target.
>>> >>>> - It appears to encode a valuation of the user's time, and that valuation is really up to the user. Some users may consider 2 hours useless and not worth it, and others might happily wait 2 hours. This is not a technical decision; it's a business decision regarding the relative value of the time invested vs the value of the result. If I can cure cancer by indexing for a year, that might be worth it... (hyperbole of course).
>>> >>>> Things I would consider to have technical merit that I don't hear:
>>> >>>> - Impact on the speed of **other** indexing operations (devaluation of other functionality).
>>> >>>> - Actual scenarios that work when the limit is low and fail when the limit is high (new failure on the same data with the limit raised).
>>> >>>> One thing that might or might not have technical merit:
>>> >>>> - If someone feels there is a lack of documentation of the costs/performance implications of using large vectors, possibly including reproducible benchmarks establishing the scaling behavior (there seems to be disagreement on O(n) vs O(n^2)).
>>> >>>> The users *should* know what they are getting into, but if the cost is worth it to them, they should be able to pay it without forking the project. If this veto causes a fork, that's not good.
>>> >>>> On Sun, Apr 9, 2023 at 7:55 AM Michael Sokolov <msoko...@gmail.com> wrote:
>>> >>>>> We do have a dataset built from Wikipedia in luceneutil. It comes in 100- and 300-dimensional varieties, and we can easily enough generate large numbers of vector documents from the articles data. To go higher we could concatenate vectors from that, and I believe the performance numbers would be plausible.
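>>> >>>>> A rough sketch of the kind of harness that could drive such a comparison (untested; the class, field name and random unit vectors are mine; substitute the luceneutil Wikipedia vectors for real runs, and note that dims above the current limit need the bumped-limit branch proposed below):
>>> >>>>>
>>> >>>>> import java.io.IOException;
>>> >>>>> import java.nio.file.Path;
>>> >>>>> import java.util.Random;
>>> >>>>> import org.apache.lucene.document.Document;
>>> >>>>> import org.apache.lucene.document.KnnVectorField;
>>> >>>>> import org.apache.lucene.index.IndexWriter;
>>> >>>>> import org.apache.lucene.index.IndexWriterConfig;
>>> >>>>> import org.apache.lucene.index.VectorSimilarityFunction;
>>> >>>>> import org.apache.lucene.store.Directory;
>>> >>>>> import org.apache.lucene.store.FSDirectory;
>>> >>>>>
>>> >>>>> public class VectorIndexBench {
>>> >>>>>   // Index numDocs random unit vectors of the given dimension and print elapsed time.
>>> >>>>>   static void benchmarkIndexing(Path path, int dim, int numDocs) throws IOException {
>>> >>>>>     Random random = new Random(42);
>>> >>>>>     try (Directory dir = FSDirectory.open(path);
>>> >>>>>         IndexWriter iw = new IndexWriter(dir, new IndexWriterConfig())) {
>>> >>>>>       long start = System.nanoTime();
>>> >>>>>       for (int i = 0; i < numDocs; i++) {
>>> >>>>>         float[] v = new float[dim];
>>> >>>>>         double norm = 0;
>>> >>>>>         for (int j = 0; j < dim; j++) {
>>> >>>>>           v[j] = random.nextFloat() - 0.5f;
>>> >>>>>           norm += v[j] * v[j];
>>> >>>>>         }
>>> >>>>>         float scale = (float) (1.0 / Math.sqrt(norm));
>>> >>>>>         for (int j = 0; j < dim; j++) {
>>> >>>>>           v[j] *= scale;
>>> >>>>>         }
>>> >>>>>         Document doc = new Document();
>>> >>>>>         doc.add(new KnnVectorField("vector", v, VectorSimilarityFunction.DOT_PRODUCT));
>>> >>>>>         iw.addDocument(doc);
>>> >>>>>       }
>>> >>>>>       iw.forceMerge(1); // include merge cost, where the HNSW graph is rebuilt
>>> >>>>>       System.out.println(dim + " dims, " + numDocs + " docs: "
>>> >>>>>           + (System.nanoTime() - start) / 1_000_000 + " ms");
>>> >>>>>     }
>>> >>>>>   }
>>> >>>>>
>>> >>>>>   public static void main(String[] args) throws IOException {
>>> >>>>>     for (int dim : new int[] {512, 1024, 2048}) {
>>> >>>>>       benchmarkIndexing(Path.of("bench-" + dim), dim, 500_000);
>>> >>>>>     }
>>> >>>>>   }
>>> >>>>> }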
>>> >>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
>>> >>>>>> Can we set up a branch in which the limit is bumped to 2048, then have a realistic, free data set (Wikipedia sample or something) that has, say, 5 million docs and vectors created using public data (GloVe pre-trained embeddings or the like)? We could then run indexing on the same hardware with 512, 1024 and 2048 and see what the numbers, limits and behavior actually are.
>>> >>>>>> I can help in writing this, but not until after Easter.
>>> >>>>>> Dawid
>>> >>>>>> On Sat, Apr 8, 2023 at 11:29 PM Adrien Grand <jpou...@gmail.com> wrote:
>>> >>>>>>> As Dawid pointed out earlier on this thread, this is the rule for Apache projects: a single -1 vote on a code change is a veto and cannot be overridden. Furthermore, Robert is one of the people on this project who has worked the most on debugging subtle bugs, making Lucene more robust and improving our test framework, so I'm listening when he voices quality concerns.
>>> >>>>>>> The argument against removing/raising the limit that resonates with me the most is that it is a one-way door. As MikeS highlighted earlier on this thread, implementations may want to take advantage of the fact that there is a limit at some point too. This is why I don't want to remove the limit and would prefer a slight increase, such as 2048 as suggested in the original issue, which would enable most of the things that users who have been asking about raising the limit would like to do.
>>> >>>>>>> I agree that the merge-time memory usage and slow indexing rate are not great. But it's still possible to index multi-million vector datasets with a 4GB heap without hitting OOMEs regardless of the number of dimensions, and the feedback I'm seeing is that many users are still interested in indexing multi-million vector datasets despite the slow indexing rate. I wish we could do better, and vector indexing is certainly more expert than text indexing, but it still is usable in my opinion. I understand how giving Lucene more information about vectors prior to indexing (e.g. clustering information, as Jim pointed out) could help make merging faster and more memory-efficient, but I would really like to avoid making it a requirement for indexing vectors, as it also makes this feature much harder to use.
>>> >>>>>>> On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti <a.benede...@sease.io> wrote:
>>> >>>>>>>> I am very attentive to listening to opinions, but I am unconvinced here, and I am not sure that a single person's opinion should be allowed to be detrimental to such an important project.
>>> >>>>>>>> The limit, as far as I know, is literally just raising an exception. Removing it won't alter in any way the current performance for users in low-dimensional space. Removing it will just enable more users to use Lucene.
>>> >>>>>>>> If new users in certain situations are unhappy with the performance, they may contribute improvements. This is how you make progress.
>>> >>>>>>>> If it's a reputation thing, trust me that not allowing users to play with high-dimensional space will damage it equally.
>>> >>>>>>>> To me it's really a no-brainer. Removing the limit and enabling people to use high-dimensional vectors will take minutes. Improving the HNSW implementation can take months. Pick one to begin with...
>>> >>>>>>>> And there's no one paying me here, no company interest whatsoever; actually I pay people to contribute. I am just convinced it's a good idea.
>>> >>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcm...@gmail.com> wrote:
>>> >>>>>>>>> I disagree with your categorization. I put in plenty of work and experienced plenty of pain myself, writing tests and fighting these issues, after I saw that, two releases in a row, vector indexing fell over and hit integer overflows etc. on small datasets:
>>> >>>>>>>>> https://github.com/apache/lucene/pull/11905
>>> >>>>>>>>> Attacking me isn't helping the situation.
>>> >>>>>>>>> PS: when I said "the one guy who wrote the code" I didn't mean it in any kind of demeaning fashion, really. I meant to describe the current state of usability with respect to indexing a few million docs with high dimensions. You can scroll up the thread and see that at least one other committer on the project experienced similar pain to mine. Then, think about users who aren't committers trying to use the functionality!
>>> >>>>>>>>> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov <msoko...@gmail.com> wrote:
>>> >>>>>>>>>> What you said about increasing dimensions requiring a bigger RAM buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>>> >>>>>>>>>> You complain that HNSW sucks and doesn't scale, but when I show it scales linearly with dimension, you just ignore that and complain about something entirely different.
>>> >>>>>>>>>> You demand that people run all kinds of tests to prove you wrong, but when they do, you don't listen; you won't put in the work yourself, or you complain that it's too hard.
>>> >>>>>>>>>> Then you complain about people not meeting you halfway. Wow.
>>> >>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcm...@gmail.com> wrote:
>>> >>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner <michael.wech...@wyona.com> wrote:
>>> >>>>>>>>>>>> What exactly do you consider reasonable?
>>> >>>>>>>>>>> Let's begin a real discussion by being HONEST about the current status. Please put political correctness and your own company's wishes aside; we know it's not in a good state.
>>> >>>>>>>>>>> The current status is that the one guy who wrote the code can set a multi-gigabyte RAM buffer and index a small dataset with 1024 dimensions in HOURS (I didn't ask what hardware).
>>> >>>>>>>>>>> My concern is everyone else except the one guy; I want it to be usable. Increasing dimensions just means an even bigger multi-gigabyte RAM buffer and a bigger heap to avoid OOM on merge.
>>> >>>>>>>>>>> It is also a permanent backwards-compatibility decision: we have to support it once we do this, and we can't just say "oops" and flip it back.
>>> >>>>>>>>>>> It is unclear to me whether the multi-gigabyte RAM buffer is really to avoid merges because they are so slow and it would take DAYS otherwise, or to avoid merges so it doesn't hit OOM. Also, from personal experience, it takes trial and error (meaning experiencing OOM on merge!!!) before you get those heap values correct for your dataset. This usually means starting over, which is frustrating and wastes more time.
>>> >>>>>>>>>>> Jim mentioned some ideas about the memory usage in IndexWriter; they seem like good ideas to me. Maybe the multi-gigabyte RAM buffer can be avoided this way, and performance improved by writing bigger segments with Lucene's defaults. But this doesn't mean we can simply ignore the horrors of what happens on merge. Merging needs to scale so that indexing really scales.
>>> >>>>>>>>>>> At least it shouldn't spike RAM on trivial data amounts and cause OOM, and it definitely shouldn't burn hours and hours of CPU in O(n^2) fashion when indexing.
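>>> >>>>>>>>>>> For concreteness, the kind of tuning I'm talking about is along these lines (a sketch only; the "right" numbers depend entirely on your dataset and heap, which is exactly the problem):
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> // Push IndexWriter's flush threshold way up so vectors land in few, large
>>> >>>>>>>>>>> // segments, trading heap for fewer HNSW graph rebuilds on merge.
>>> >>>>>>>>>>> IndexWriterConfig iwc = new IndexWriterConfig()
>>> >>>>>>>>>>>     .setRAMBufferSizeMB(2048) // multi-gigabyte buffer
>>> >>>>>>>>>>>     .setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH); // flush by RAM only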
>>> >>>>>>> --
>>> >>>>>>> Adrien
>>> >>>> --
>>> >>>> http://www.needhamsoftware.com (work)
>>> >>>> http://www.the111shift.com (play)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org