Re: RFC: N-2 compatibility for file formats
Thanks for all the feedback, I opened https://issues.apache.org/jira/browse/LUCENE-9669 to address this further. On Wed, Jan 13, 2021 at 2:58 PM Adrien Grand wrote: > > +1 this strikes me as a good balance between increasing backward > compatibility guarantees and still keeping room for innovation. > > David, actually I would like to advocate in favor of still disallowing > opening N-2 indices by default, as they might not match Lucene's current > expectations (e.g. using a different encoding for norms due to LUCENE-7730), > and using Lucene's current analyzers/similarities/queries might trigger > surprising behavior. My preference would be to expose the ability to open N-2 > indices behind an expert API/flag that documents limitations with N-2 indices. > > Mike, I wondered about this question too. As you pointed out, I think that we > will generally be ok given that the N-2 compatibility layer will very likely > be the same as the N-1 compatibility layer that we need to develop anyway. I > tried to think of examples where that wouldn't work but couldn't find any > (which doesn't mean that there are none, but hopefully it would be rare). > > > > On Mon, Jan 11, 2021 at 4:57 PM Michael McCandless > wrote: >> >> +1, I like the idea in general. >> >> We will have to work out the details in practice as we come across "index >> breaking" changes, and where/how to draw the line of "best effort". But I >> think this is an improvement for our users over the hard check we now have >> for "only N-1", and likely not so much development effort? >> >> I think where it might get interesting is if we want to make a Codec API >> change, maybe to optimize an interesting use case, and then we must do some >> development to fix the N-2 BWC codec (as well as the N-1 BWC codec that we already >> must fix for such an example, today). >> >> Some users seem to keep their indices alive for a very long time! 
>> >> Mike McCandless >> >> http://blog.mikemccandless.com >> >> >> On Sat, Jan 9, 2021 at 6:13 AM Simon Willnauer >> wrote: >>> >>> I can provide some examples of BWC issues and what we would do if it >>> happened in the future: >>> >>> - negative offsets: in this case it would be best effort to add a >>> wrapper around the older formats to check if the offsets go backwards >>> on the read side and throw an exception to prevent consumers making >>> the assumption that offsets go forward only from failing or going OOM >>> etc. >>> - norms encoding: in this case it would be best effort in the older >>> norms formats to convert to the newer encodings. >>> - the removal of numeric fields queries would not fall under the >>> promises we make with compatibility of N-2 and it would be the >>> responsibility of the user to keep the code around that understands >>> the value of a field. >>> >>> I hope this clarifies some of the aspects? >>> >>> we would only do all this for the reading end, for writing we would >>> reject indices that are older than N-1 >>> >>> simon >>> >>> >>> On Thu, Jan 7, 2021 at 8:04 PM jim ferenczi wrote: >>> > >>> > The proposal is only about keeping the ability to read file-format up to >>> > N-2. Everything that is done on top of the file format is not guaranteed >>> > and should be supported on a best-effort basis. >>> > That's an important aspect if we don't want to block innovation. So in >>> > practice that means that queries that require some specific file format >>> > or analyzers that change behaviors in major versions would not be part of >>> > the extended guarantee. >>> > >>> > >>> > Le mer. 6 janv. 
2021 à 21:53, Yonik Seeley a écrit : >>> >> >>> >> On Wed, Jan 6, 2021 at 4:40 AM Simon Willnauer >>> >> wrote: >>> >>> >>> >>> You can open a reader on an index created by >>> >>> version N-2, but you cannot open an IndexWriter on it >>> >> >>> >> >>> >> +1 >>> >> There should definitely be more consideration given to back compat in >>> >> general... it's caused a ton of pain to users over time. >>> >> >>> >> -Yonik >>> >> >>> >> >>> >>> - >>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: dev-h...@lucene.apache.org >>> > > > -- > Adrien - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
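The "expert API/flag" Adrien suggests for opening N-2 indices could be sketched as a simple version gate. Everything below is a made-up illustration (the class, method names, and the boolean flag are not actual Lucene APIs): reading N-2 is allowed only behind an explicit opt-in, while writing always requires N-1 or newer.

```java
// Hypothetical sketch, not a real Lucene API: a gate that only lets N-2
// indices be opened for reading behind an explicit expert flag, and never
// lets them be opened for writing.
class VersionGate {
    static final int CURRENT_MAJOR = 9; // major version of the running code (assumed)

    // Reading: N and N-1 by default; N-2 only when the expert flag is passed.
    static boolean canOpenForReading(int indexMajor, boolean expertAllowNMinus2) {
        int oldest = expertAllowNMinus2 ? CURRENT_MAJOR - 2 : CURRENT_MAJOR - 1;
        return indexMajor >= oldest && indexMajor <= CURRENT_MAJOR;
    }

    // Writing: stricter -- indices older than N-1 are always rejected.
    static boolean canOpenForWriting(int indexMajor) {
        return indexMajor >= CURRENT_MAJOR - 1 && indexMajor <= CURRENT_MAJOR;
    }
}
```

The asymmetry between the two methods is the whole proposal in miniature: the read path gets the extra, opt-in N-2 window, the write path does not.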
Re: Add maxFields Option to IndexWriter
I personally have a pretty positive experience with what I call soft limits. At Elastic we use them all over the place to catch issues when a user likely misconfigures something or if there is likely an issue on the user's end. I think we could have an option on the IW that allows limiting the number of fields. We can even extract a general limits object with total num docs etc. if we want. We can still set stuff to unlimited by default. WDYT Sent from a mobile device > On 14. Jan 2021, at 06:36, David Smiley wrote: > > > I don't like the idea of IndexWriter limiting field names, but I do like the > idea of un-deprecating that method, which appeared to have a trivial > implementation. Try commenting on the issue of its deprecation, which has > various watchers, to get their attention. > > ~ David Smiley > Apache Lucene/Solr Search Developer > http://www.linkedin.com/in/davidwsmiley > > >> On Wed, Jan 13, 2021 at 5:02 PM Oren Ovadia >> wrote: >> Hi All, >> >> I work on Lucene at MongoDB. >> >> I would like to limit the number of fields in an index to prevent tenants >> from causing a mapping explosion. >> >> Since IndexWriter.getFieldNames has been deprecated, there is no way to do >> this without using a reader (which comes with a set of problems regarding >> flush/commit rates). >> >> Would love to add to Lucene the ability to have IndexWriters limiting the >> number of fields. Curious to hear your thoughts. >> >> Thanks, >> Oren >>
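The soft-limit idea from this thread can be sketched without touching IndexWriter internals at all. The class and method names below are hypothetical; this just shows the bookkeeping: track distinct field names and fail fast once a configurable cap is exceeded, with "unlimited" as the default.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only -- not a Lucene class: track the distinct field
// names seen so far and reject any field that would push the count over a
// configurable cap. Integer.MAX_VALUE stands in for "unlimited by default".
class FieldCountSoftLimit {
    private final int maxFields;
    private final Set<String> seenFields = new HashSet<>();

    FieldCountSoftLimit(int maxFields) {
        this.maxFields = maxFields;
    }

    // Call once per field of each incoming document, before indexing it.
    void checkField(String fieldName) {
        if (seenFields.add(fieldName) && seenFields.size() > maxFields) {
            throw new IllegalArgumentException(
                "number of fields [" + seenFields.size()
                    + "] exceeds the configured limit [" + maxFields + "]");
        }
    }

    int fieldCount() {
        return seenFields.size();
    }
}
```

In a real integration the check would live wherever fields are registered, so a misbehaving tenant gets a clear error instead of a silent mapping explosion.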
Re: RFC: N-2 compatibility for file formats
I can provide some examples of BWC issues and what we would do if it happened in the future: - negative offsets: in this case it would be best effort to add a wrapper around the older formats to check on the read side whether the offsets go backwards and throw an exception, to prevent consumers that assume offsets only go forward from failing, going OOM, etc. - norms encoding: in this case it would be best effort in the older norms formats to convert to the newer encodings. - the removal of numeric field queries would not fall under the promises we make with N-2 compatibility, and it would be the responsibility of the user to keep around the code that understands the value of a field. I hope this clarifies some of the aspects? We would only do all this for the reading end; for writing we would reject indices that are older than N-1. simon On Thu, Jan 7, 2021 at 8:04 PM jim ferenczi wrote: > > The proposal is only about keeping the ability to read the file format up to N-2. > Everything that is done on top of the file format is not guaranteed and > should be supported on a best-effort basis. > That's an important aspect if we don't want to block innovation. So in > practice that means that queries that require some specific file format or > analyzers that change behavior in major versions would not be part of the > extended guarantee. > > > On Wed, Jan 6, 2021 at 9:53 PM, Yonik Seeley wrote: >> >> On Wed, Jan 6, 2021 at 4:40 AM Simon Willnauer >> wrote: >>> >>> You can open a reader on an index created by >>> version N-2, but you cannot open an IndexWriter on it >> >> >> +1 >> There should definitely be more consideration given to back compat in >> general... it's caused a ton of pain to users over time. >> >> -Yonik >> >>
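The negative-offsets wrapper Simon describes boils down to a read-side sanity check. Here is a minimal, self-contained sketch of that check (OffsetGuard is a made-up name, not a Lucene class): validate each token's offsets before handing them to consumers that assume offsets only move forward.

```java
// Purely illustrative sketch of the "wrapper on the read side" idea for
// backward/negative offsets coming out of an old format: fail loudly with an
// exception instead of letting a forward-only consumer crash or go OOM later.
class OffsetGuard {
    // starts[i]/ends[i] are the start/end character offsets of token i,
    // in the order the tokens are read from the old format.
    static void checkOffsets(int[] starts, int[] ends) {
        int lastStart = -1;
        for (int i = 0; i < starts.length; i++) {
            if (starts[i] < 0 || ends[i] < starts[i]) {
                throw new IllegalStateException("malformed offsets at token " + i);
            }
            if (starts[i] < lastStart) {
                throw new IllegalStateException("offsets went backwards at token " + i
                    + ": " + starts[i] + " < " + lastStart);
            }
            lastStart = starts[i];
        }
    }
}
```

In the real thing this logic would sit inside a wrapper around the old postings/term-vectors reader, but the invariant being enforced is exactly this one.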
Re: additional term meta data
John, can you explain what the use case for such a new API is? I don't see a user of the API in your code. Is there a query you can optimize with this, or what is the reasoning behind this change? I personally think it's quite invasive to add this information, and there must be a good reason to add this to the TermsEnum. I also don't think we should have an option on the field for this if we add it, but if we don't do that it's quite a heavy change, so I am on the fence about whether we should even consider this. I wonder if you can use the TermsEnum#getAttributeSource() API instead and add this as a dedicated attribute which is present if the info is stored. That way you can build your own PostingsFormat that does store this information. simon On Wed, Jan 6, 2021 at 8:06 PM John Wang wrote: > > Thank you, Martin! > > You can apply the patch to the 8.7 build by just ignoring the changes to > Lucene90xxx. Appreciate the help and guidance! > > -John > > > On Wed, Jan 6, 2021 at 10:36 AM Martin Gainty wrote: >> >> appears you are targeting 9.0 for your code >> lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90FieldInfosFormat.java >> (Lucene90FieldInfosFormat.java is not contained in either the 8.4 or 8.7 distros) >> >> >> someone had the bright idea to nuke the ant 8.x build.xml without consulting >> anyone >> not a fan of ant, but the execution model of gradle is woefully inflexible in >> comparison to maven >> >> >> I will try with the 9.0 distro to get >> codecs/lucene90/Lucene90FieldInfosFormat and recompile, and hopefully your >> TestLucene84PostingsFormat will run w/o fail or error >> >> Thx >> martin- >> >> >> From: John Wang >> Sent: Wednesday, January 6, 2021 10:15 AM >> To: dev@lucene.apache.org >> Subject: Re: additional term meta data >> >> Hey Martin: >> >> There is a test case in the PR we created on our own fork: >> https://github.com/dashbase/lucene-solr/pull/1, which also contains some >> example code on how to access it in the PR description. 
>> >> Here is the link to the beginning of the tests: >> https://github.com/dashbase/lucene-solr/blob/posting-last-docid/lucene/core/src/test/org/apache/lucene/codecs/lucene84/TestLucene84PostingsFormat.java#L142 >> >> I am not sure which version this should be applied to; currently it was >> based on master as of a few days ago. We intend to patch 8.7 for our own >> environment. >> >> Any advice or feedback is much appreciated. >> >> Thank you! >> >> -John >> >> On Wed, Jan 6, 2021 at 3:28 AM Martin Gainty wrote: >> >> how to access first and last? >> which version will you be merging >> >> >> From: John Wang >> Sent: Tuesday, January 5, 2021 8:19 PM >> To: dev@lucene.apache.org >> Subject: additional term meta data >> >> Hi folks: >> >> We'd like to propose a feature to add additional per-term metadata to the term >> dictionary. >> >> Currently, the TermsEnum API returns docFreq as its only meta-data. We >> needed a way to quickly get the first and last doc id in the postings >> without having to scan through the entire postings list. >> >> We have created a PR on our own fork and we would like to contribute this >> back to the community. Please let us know if this is something that's useful >> and/or fits Lucene's roadmap, we would be happy to submit a patch. >> >> https://github.com/dashbase/lucene-solr/pull/1 >> >> Thank you >> >> -John
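The metadata John proposes boils down to remembering, per term, the first and last doc ID seen while postings are written, so a reader can return both without scanning the postings list. A toy, Lucene-free sketch of that bookkeeping (all names here are illustrative, not from the actual PR):

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the proposed per-term metadata: record the first and last
// doc ID for each term as postings are built. In a real PostingsFormat this
// pair would be written next to docFreq in the term dictionary; here a map
// stands in for that storage.
class TermDocRanges {
    private final Map<String, int[]> ranges = new HashMap<>(); // term -> {firstDocId, lastDocId}

    // Called for every (term, docId) pair during indexing, in increasing docId order.
    void observe(String term, int docId) {
        int[] r = ranges.get(term);
        if (r == null) {
            ranges.put(term, new int[] {docId, docId});
        } else {
            r[1] = docId; // docIds arrive in order, so this is always the latest
        }
    }

    int firstDocId(String term) { return ranges.get(term)[0]; }
    int lastDocId(String term)  { return ranges.get(term)[1]; }
}
```

Simon's suggestion in this thread amounts to exposing exactly this pair through a custom attribute on the TermsEnum, backed by a custom PostingsFormat, rather than changing the core API.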
RFC: N-2 compatibility for file formats
Hello all, Currently Lucene supports reading and writing indices that have been created with the current or previous (N-1) version of Lucene. Lucene refuses to open an index created by N-2 or earlier versions. I would like to propose that Lucene adds support for opening indices created by version N-2 in read-only mode. Here's what I have in mind: - Read-only support. You can open a reader on an index created by version N-2, but you cannot open an IndexWriter on it, meaning that you cannot delete, update, or add documents to, or force-merge, N-2 indices. - File-format compatibility only. File-format compatibility enables reading the content of old indices, but not more. Everything that is done on top of file formats, like analysis or the encoding of length normalization factors, is not guaranteed and only supported on a best-effort basis. The reason I came up with these limitations is that I wanted to keep the scope minimal in order to retain Lucene's ability to move forward. If there is consensus to move forward with this, I would like to target Lucene 9.0 with this change. Simon
Re: Deterministic index construction
you can do something similar to this today by exploiting the add/updateDocuments(Iterable doc) API. All docs in this iterable will be sent to the same segment in order. If you have multiple threads you can feed a defined number of docs per iterable (stream them to be memory efficient) and then let them go at the same time. this way you have thread affinity (we had this in the early days of DWPT, I'd be reluctant to make it configurable again). then with a custom merge policy you should be able to get the exact same number of segments without remerging etc. some sync overhead on top but it's doable I think. simon On Wed, Dec 23, 2020 at 10:30 PM David Smiley wrote: > > I like Mike McCandless's suggestion of controlling which DWPT (and thus > segment) an incoming document goes to. I've thought of this before for a > different use case grouping documents into segments by the underlying "type" > of the document. This could make sense for a use-case that queries by > document type, and you don't want to create an index per document type (maybe > because the index is too small to warrant it). It could even be used in a > kind of soft / hint kind of way -- not an absolute strict separation. For > example, say if some subset of DWPTs are known to hold docs of a given type, > then add incoming docs of that type to any of those and not the others. But > if none exist then just add to any DWPT. I also thought of this sort of > thing at the MergePolicy level, but at that point, any mixing of doc types > has already occurred and MP can't separate them, it can only combine, though > it can try to reduce introducing too much mixing. It would be nice if it > were possible to atomically merge some documents in a segment but not the > whole segment, thus still leaving the segment in place but with the extracted > documents marked deleted. This is similar to "shard splitting" (index > splitting) but to do so atomically/transactionally. 
> > ~ David Smiley > Apache Lucene/Solr Search Developer > http://www.linkedin.com/in/davidwsmiley > > > On Sun, Dec 20, 2020 at 10:24 AM Michael McCandless > wrote: >> >> I think the addIndexes approach could work as Haoyu describes! One >> IndexWriter per segment in the original source index, using >> FilterIndexReader to ... mark all documents NOT in the target segment as >> deleted? >> >> For the final step, you could use addIndexes(Directory[]) which more or less >> does a simple file copy of the incoming segment's files. >> >> But this is a whole extra and costly sounding step, that might undo the wall >> clock speedup from the concurrent indexing in the first pass. Maybe it is >> still faster net/net than what luceneutil benchmarks, which is >> single-threaded-everything (single indexing thread, SerialMergeScheduler, >> LogDocMergePolicy)? >> >> The first option Haoyu listed sounds interesting too! Could we somehow >> build a new index, concurrently, but force certain docs to go to certain >> in-memory segments (DWPT)? Today the routing of incoming indexing threads to >> DWPT is sort of random, but there is indeed a dedicated internal class that >> decides that: DocumentsWriterPerThreadPool. And, here is a fun PR that >> Adrien is working on to improve how threads are scheduled onto in-memory >> segments, to try to create larger initially flushed segments and less merge >> pressure as a result: https://github.com/apache/lucene-solr/pull/1912 >> >> If we could carefully guide threads to the right DWPT during indexing the >> 2nd time, and then use a custom MergePolicy that is also careful to only >> merge segments that "belong" together, and the index is sorted, I think you >> would get the same segment geometry in the end, and the exact same documents in >> each segment? This'd likely be nearly as fast as freely building an index >> concurrently! 
It'd be a nice addition to luceneutil benchmarks too, since >> now it takes crazy long to build the deterministic index. >> >> Mike McCandless >> >> http://blog.mikemccandless.com >> >> >> On Sat, Dec 19, 2020 at 2:50 PM Haoyu Zhai wrote: >>> >>> Hi Adrien >>> I think Mike's comment is correct, we already have index sorted but we want >>> to reconstruct a index with exact same number of segments and each segment >>> contains exact same documents. >>> >>> Mike >>> AddIndexes could take CodecReader as input [1], which allows us to pass in >>> a customized FilteredIndexReader I think? Then it knows which docs to take. >>> And then suppose original index has N segments, we could open N IndexWriter >>> concurrently and rebuilt those N segments, and at last somehow merge them >>> back to a whole index. (I am not quite sure about whether we could achieve >>> the last step easily, but that sounds not so hard?) >>> >>> [1] >>> https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/index/IndexWriter.html#addIndexes-org.apache.lucene.index.CodecReader...- >>> >>> Michael Sokolov 于2020年12月1
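Simon's suggestion at the top of this thread rests on one deterministic step: split the doc stream into fixed batches up front, so each batch maps to one segment regardless of which thread happens to index it (in Lucene, one addDocuments(Iterable) call per batch). The partitioning itself can be sketched in plain Java, with no Lucene dependency:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the deterministic-construction idea from this thread: partition
// doc IDs into consecutive fixed-size batches before indexing starts. Each
// batch would then be fed to IndexWriter via one addDocuments(Iterable) call,
// so its docs land in the same segment in order, independent of thread timing.
class DeterministicBatches {
    // Partitions docIds [0, numDocs) into consecutive batches of batchSize.
    static List<List<Integer>> partition(int numDocs, int batchSize) {
        List<List<Integer>> batches = new ArrayList<>();
        for (int start = 0; start < numDocs; start += batchSize) {
            List<Integer> batch = new ArrayList<>();
            for (int d = start; d < Math.min(start + batchSize, numDocs); d++) {
                batch.add(d);
            }
            batches.add(batch);
        }
        return batches; // same input always yields the same batches
    }
}
```

Combined with a merge policy that only merges "belonging" segments (or none at all), the same input should produce the same segment geometry on every run.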
Re: Old programmers do fade away
Eric, thanks so much for your open and true words! You will always be part of this community if you subscribed to the lists or not. (you can't escape :D) Thanks for your contributions, this is a team effort and you are a part of it. enjoy the welding!! simon On Wed, Dec 30, 2020 at 3:09 PM Erick Erickson wrote: > > 40 years is enough. OK, it's only been 39 1/2 years. Dear Lord, has it really > been that long? Programming's been fun, I've gotten to solve puzzles every > day. The art and science of programming has changed over that time. Let me > tell you about the joys of debugging with a Z80 stack emulator that required > that you to look on the stack for variables and trace function calls by > knowing how to follow frame pointers. Oh the tedium! Oh the (lack of) speed! > Not to mention that 64K of memory was all you had to work with. I had a > co-worker who could predict the number of bytes by which the program would > shrink based on extracting common code to functions. The "good old > days"...weren't... > > I'd been thinking that I'd treat Lucene/Solr as a hobby, doing occasional > work on it when I was bored over long winter nights. I've discovered, though, > that I've been increasingly reluctant to crack open the code. I guess that > after this much time, I'm ready to hang up my spurs. One major factor is the > realization that there's so much going on with Lucene/Solr that simply being > aware of the changes, much less trying to really understand them, isn't > something I can do casually. > > I bought a welder and find myself more interested in playing with that than > programming. Wait until you see the squirrel-proof garden enclosure I'm > building with it. If my initial plan doesn't work, next up is an electric > fence along the top. The laser-sighted automatic machine gun emplacement will > take more planning...Ahhh, probably won't be able to get a permit from the > township for that though. Do you think the police would notice? 
Perhaps I > should add that the local police station is two blocks away and in the line > of fire. But an infrared laser powerful enough to "pre-cook" them wouldn't be > as obvious would it? > > Why am I so fixated on squirrels? One of the joys of gardening is fresh > tomatoes rather than those red things they sell in the store. The squirrels > ATE EVERY ONE OF MY TOMATOES WHILE THEY WERE STILL GREEN LAST YEAR! And the > melons. In the words of B. Bunny: "Of course you realize this means war" > (https://www.youtube.com/watch?v=4XNr-BQgpd0)... > > Then there's working in the garden and landscaping, the desk I want to build > for my wife, travel as soon as I can, maybe seeing if some sailboats need > crew...you get the idea. > > It's been a privilege to work with this group, you're some of the best and > brightest. Many thanks to all who've generously given me their time and > guidance. It's been a constant source of amazement to me how willing people > are to take time out of their own life and work to help me when I've had > questions. I owe a lot of people beers ;) > > I'll be stopping my list subscriptions, Slack channels (dm me if you need > something), un-assigning any JIRAs and that kind of thing over the next > while. If anyone's interested in taking over the BadApple report, let me know > and I can put the code up somewhere. It takes about 10 minutes to do each > week. I won't disappear entirely, things like the code-reformatting effort > are nicely self-contained for instance and something I can do casually. > > My e-mail address if you need to get in touch with me is: > "erick.erick...@gmail.com". There's a correlation between gmail addresses > that are just a name with no numbers and a person's age... A co-worker came > over to my desk in pre-historical times and said "there's this new mail > service you might want to sign up for"... Like I said, 40 years is enough. 
> > Best to all, > Erick > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org >
Re: [VOTE] Lucene logo contest, third time's a charm
Thank you ryan for pushing on this, being persistent and getting the vote out. On Tue, Sep 8, 2020 at 5:55 PM Ryan Ernst wrote: > This vote is now closed. The results are as follows: > > Binding Results > A1: 12 (55%) > D: 6 (27%) > A2: 4 (18%) > > All Results > A1: 16 (55%) > D: 7 (24%) > A2: 5 (17%) > B5d: 1 (3%) > > A1 is our winner! I will make the necessary changes. > > Thank you to Dustin Haver, Stamatis Zampetakis, Baris Kazar and all who > voted! > > On Tue, Sep 1, 2020 at 1:21 PM Ryan Ernst wrote: > > > Dear Lucene and Solr developers! > > > > Sorry for the multiple threads. This should be the last one. > > > > In February a contest was started to design a new logo for Lucene > > [jira-issue]. The initial attempt [first-vote] to call a vote resulted in > > some confusion on the rules, as well the request for one additional > > submission. The second attempt [second-vote] yesterday had incorrect > links > > for one of the submissions. I would like to call a new vote, now with > more > > explicit instructions on how to vote, and corrected links. > > > > *Please read the following rules carefully* before submitting your vote. > > > > *Who can vote?* > > > > Anyone is welcome to cast a vote in support of their favorite > > submission(s). Note that only PMC member's votes are binding. If you are > a > > PMC member, please indicate with your vote that the vote is binding, to > > ease collection of votes. In tallying the votes, I will attempt to verify > > only those marked as binding. > > > > > > *How do I vote?* > > Votes can be cast simply by replying to this email. It is a ranked-choice > > vote [rank-choice-voting]. Multiple selections may be made, where the > order > > of preference must be specified. If an entry gets more than half the > votes, > > it is the winner. Otherwise, the entry with the lowest number of votes is > > removed, and the votes are retallied, taking into account the next > > preferred entry for those whose first entry was removed. 
This process > > repeats until there is a winner. > > > > The entries are broken up by variants, since some entries have multiple > > color or style variations. The entry identifiers are first a capital > > letter, followed by a variation id (described with each entry below), if > > applicable. As an example, if you prefer variant 1 of entry A, followed > by > > variant 2 of entry A, variant 3 of entry C, entry D, and lastly variant > 4e > > of entry B, the following should be in your reply: > > > > (binding) > > vote: A1, A2, C3, D, B4e > > > > *Entries* > > > > The entries are as follows: > > > > A*.* Submitted by Dustin Haver. This entry has two variants, A1 and A2. > > > > [A1] > > > https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png > > [A2] > > https://issues.apache.org/jira/secure/attachment/12997172/LuceneLogo.png > > > > B. Submitted by Stamatis Zampetakis. This has several variants. Within > the > > linked entry there are 7 patterns and 7 color palettes. Any vote for B > > should contain the pattern number followed by the lowercase letter of the > > color palette. For example, B3e or B1a. > > > > [B] > > > https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf > > > > C. Submitted by Baris Kazar. This entry has 8 variants. 
> > > > [C1] > > > https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo1_full.pdf > > [C2] > > > https://issues.apache.org/jira/secure/attachment/13006393/lucene_logo2_full.pdf > > [C3] > > > https://issues.apache.org/jira/secure/attachment/13006394/lucene_logo3_full.pdf > > [C4] > > > https://issues.apache.org/jira/secure/attachment/13006395/lucene_logo4_full.pdf > > [C5] > > > https://issues.apache.org/jira/secure/attachment/13006396/lucene_logo5_full.pdf > > [C6] > > > https://issues.apache.org/jira/secure/attachment/13006397/lucene_logo6_full.pdf > > [C7] > > > https://issues.apache.org/jira/secure/attachment/13006398/lucene_logo7_full.pdf > > [C8] > > > https://issues.apache.org/jira/secure/attachment/13006399/lucene_logo8_full.pdf > > > > D. The current Lucene logo. > > > > [D] > > https://lucene.apache.org/theme/images/lucene/lucene_logo_green_300.png > > > > Please vote for one of the above choices. This vote will close about one > > week from today, Mon, Sept 7, 2020 at 11:59PM. > > > > Thanks! > > > > [jira-issue] https://issues.apache.org/jira/browse/LUCENE-9221 > > [first-vote] > > > http://mail-archives.apache.org/mod_mbox/lucene-dev/202006.mbox/%3cCA+DiXd74Mz4H6o9SmUNLUuHQc6Q1-9mzUR7xfxR03ntGwo=d...@mail.gmail.com%3e > > [second-vote] > > > http://mail-archives.apache.org/mod_mbox/lucene-dev/202009.mbox/%3cCA+DiXd7eBrQu5+aJQ3jKaUtUTJUqaG2U6o+kUZfNe-m=smn...@mail.gmail.com%3e > > [rank-choice-voting] https://en.wikipedia.org/wiki/Instant-runoff_voting > > >
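The ranked-choice rule Ryan describes (strict majority wins; otherwise drop the lowest entry and retally using each ballot's next surviving preference) is instant-runoff voting, and it is small enough to sketch directly. This is an illustrative implementation of the rule as stated in the email, not the script actually used to tally the vote:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the instant-runoff tally described in the vote email: count each
// ballot toward its highest-ranked surviving entry; if no entry holds a strict
// majority, eliminate the entry with the fewest votes and retally.
class InstantRunoff {
    static String winner(List<List<String>> ballots) {
        Set<String> eliminated = new HashSet<>();
        while (true) {
            Map<String, Integer> tally = new HashMap<>();
            int active = 0; // ballots that still name a surviving entry
            for (List<String> ballot : ballots) {
                for (String choice : ballot) {
                    if (!eliminated.contains(choice)) {
                        tally.merge(choice, 1, Integer::sum);
                        active++;
                        break; // only the top surviving preference counts
                    }
                }
            }
            String lowest = null;
            for (Map.Entry<String, Integer> e : tally.entrySet()) {
                if (e.getValue() * 2 > active) {
                    return e.getKey(); // strict majority -> winner
                }
                if (lowest == null || e.getValue() < tally.get(lowest)) {
                    lowest = e.getKey();
                }
            }
            eliminated.add(lowest); // no majority: drop the lowest and retally
        }
    }
}
```

In the actual vote no elimination round was needed: A1 held a strict majority of first preferences (12 of 22 binding, 16 of 29 overall) straight away.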
Re: [VOTE] Lucene logo contest, third time's a charm
A1, A2, D (binding) On Thu, Sep 3, 2020 at 7:09 AM Noble Paul wrote: > A1, A2, D binding > > On Thu, Sep 3, 2020 at 7:22 AM Jason Gerlowski > wrote: > > > > A1, A2, D (binding) > > > > On Wed, Sep 2, 2020 at 10:47 AM Michael McCandless > > wrote: > > > > > > A2, A1, C5, D (binding) > > > > > > Thank you to everyone for working so hard to make such cool looking > possible future Lucene logos! And to Ryan for the challenging job of > calling this VOTE :) > > > > > > Mike McCandless > > > > > > http://blog.mikemccandless.com > > > > > > > > > On Tue, Sep 1, 2020 at 4:21 PM Ryan Ernst wrote: > > >> > > >> Dear Lucene and Solr developers! > > >> > > >> Sorry for the multiple threads. This should be the last one. > > >> > > >> In February a contest was started to design a new logo for Lucene > [jira-issue]. The initial attempt [first-vote] to call a vote resulted in > some confusion on the rules, as well the request for one additional > submission. The second attempt [second-vote] yesterday had incorrect links > for one of the submissions. I would like to call a new vote, now with more > explicit instructions on how to vote, and corrected links. > > >> > > >> Please read the following rules carefully before submitting your vote. > > >> > > >> Who can vote? > > >> > > >> Anyone is welcome to cast a vote in support of their favorite > submission(s). Note that only PMC member's votes are binding. If you are a > PMC member, please indicate with your vote that the vote is binding, to > ease collection of votes. In tallying the votes, I will attempt to verify > only those marked as binding. > > >> > > >> How do I vote? > > >> > > >> Votes can be cast simply by replying to this email. It is a > ranked-choice vote [rank-choice-voting]. Multiple selections may be made, > where the order of preference must be specified. If an entry gets more than > half the votes, it is the winner. 
Otherwise, the entry with the lowest > number of votes is removed, and the votes are retallied, taking into > account the next preferred entry for those whose first entry was removed. > This process repeats until there is a winner. > > >> > > >> The entries are broken up by variants, since some entries have > multiple color or style variations. The entry identifiers are first a > capital letter, followed by a variation id (described with each entry > below), if applicable. As an example, if you prefer variant 1 of entry A, > followed by variant 2 of entry A, variant 3 of entry C, entry D, and lastly > variant 4e of entry B, the following should be in your reply: > > >> > > >> (binding) > > >> vote: A1, A2, C3, D, B4e > > >> > > >> Entries > > >> > > >> The entries are as follows: > > >> > > >> A. Submitted by Dustin Haver. This entry has two variants, A1 and A2. > > >> > > >> [A1] > https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png > > >> [A2] > https://issues.apache.org/jira/secure/attachment/12997172/LuceneLogo.png > > >> > > >> B. Submitted by Stamatis Zampetakis. This has several variants. > Within the linked entry there are 7 patterns and 7 color palettes. Any vote > for B should contain the pattern number followed by the lowercase letter of > the color palette. For example, B3e or B1a. > > >> > > >> [B] > https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf > > >> > > >> C. Submitted by Baris Kazar. This entry has 8 variants. 
> > >> > > >> [C1] > https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo1_full.pdf > > >> [C2] > https://issues.apache.org/jira/secure/attachment/13006393/lucene_logo2_full.pdf > > >> [C3] > https://issues.apache.org/jira/secure/attachment/13006394/lucene_logo3_full.pdf > > >> [C4] > https://issues.apache.org/jira/secure/attachment/13006395/lucene_logo4_full.pdf > > >> [C5] > https://issues.apache.org/jira/secure/attachment/13006396/lucene_logo5_full.pdf > > >> [C6] > https://issues.apache.org/jira/secure/attachment/13006397/lucene_logo6_full.pdf > > >> [C7] > https://issues.apache.org/jira/secure/attachment/13006398/lucene_logo7_full.pdf > > >> [C8] > https://issues.apache.org/jira/secure/attachment/13006399/lucene_logo8_full.pdf > > >> > > >> D. The current Lucene logo. > > >> > > >> [D] > https://lucene.apache.org/theme/images/lucene/lucene_logo_green_300.png > > >> > > >> Please vote for one of the above choices. This vote will close about > one week from today, Mon, Sept 7, 2020 at 11:59PM. > > >> > > >> Thanks! > > >> > > >> [jira-issue] https://issues.apache.org/jira/browse/LUCENE-9221 > > >> [first-vote] > http://mail-archives.apache.org/mod_mbox/lucene-dev/202006.mbox/%3cCA+DiXd74Mz4H6o9SmUNLUuHQc6Q1-9mzUR7xfxR03ntGwo=d...@mail.gmail.com%3e > > >> [second-vote] > http://mail-archives.apache.org/mod_mbox/lucene-dev/202009.mbox/%3cCA+DiXd7eBrQu5+aJQ3jKaUtUTJUqaG2U6o+kUZfNe-m=smn...@mail.gmail.com%3e > > >> [rank-choice-voting] > https://en.wikipedia.org/wiki/Instant-runoff_voting > > > > ---
Re: [VOTE] Release Lucene/Solr 8.6.2 RC1
+1 binding release looks good to me On Thu, Aug 27, 2020 at 3:58 PM Atri Sharma wrote: > > +1 (binding) > > SUCCESS! [1:14:17.24939] > > On Thu, 27 Aug 2020 at 18:41, Michael Sokolov wrote: >> >> SUCCESS! [0:56:28.589654] >> >> >> >> +1 >> >> >> >> On Wed, Aug 26, 2020 at 12:41 PM Nhat Nguyen >> >> wrote: >> >> > >> >> > +1 >> >> > >> >> > SUCCESS! [0:52:44.607871] >> >> > >> >> > On Wed, Aug 26, 2020 at 12:12 PM Tomoko Uchida >> > wrote: >> >> >> >> >> >> +1 (non-binding) >> >> >> SUCCESS! [0:51:55.207272] >> >> >> >> >> >> >> >> >> 2020年8月26日(水) 22:42 Ignacio Vera : >> >> >>> >> >> >>> Please vote for release candidate 1 for Lucene/Solr 8.6.2 >> >> >>> >> >> >>> >> >> >>> The artifacts can be downloaded from: >> >> >>> >> >> >>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.6.2-RC1-rev016993b65e393b58246d54e8ddda9f56a453eb0e >> >> >>> >> >> >>> >> >> >>> You can run the smoke tester directly with this command: >> >> >>> >> >> >>> >> >> >>> python3 -u dev-tools/scripts/smokeTestRelease.py \ >> >> >>> >> >> >>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.6.2-RC1-rev016993b65e393b58246d54e8ddda9f56a453eb0e >> >> >>> >> >> >>> >> >> >>> The vote will be open for at least 72 hours i.e. until 2020-08-29 15:00 >> >>> UTC. >> >> >>> >> >> >>> >> >> >>> [ ] +1 approve >> >> >>> >> >> >>> [ ] +0 no opinion >> >> >>> >> >> >>> [ ] -1 disapprove (and reason why) >> >> >>> >> >> >>> >> >> >>> Here is my +1 >> >> >>> >> >> >>> >> >> >>> SUCCESS! [1:14:00.656250] >> >> >> >> - >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> >> For additional commands, e-mail: dev-h...@lucene.apache.org >> >> >> > > > -- > Regards, > > Atri > Apache Concerted - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Lucene/Solr 8.6.2 bugfix release
I'd actually like to build the RC earlier than the end of the week. Unless somebody objects I'd like to build one tonight or tomorrow. simon On Tue, Aug 25, 2020 at 7:52 AM Ishan Chattopadhyaya wrote: > > Thanks Simon and Ignacio! > > On Tue, 25 Aug, 2020, 11:21 am Simon Willnauer, > wrote: >> >> +1 thank you! I was about to write the same email. Lets sync on the RM >> I can certainly help... I need to go and find my code signing key >> first :) >> >> simon >> >> On Tue, Aug 25, 2020 at 7:49 AM Ignacio Vera wrote: >> > >> > Hi, >> > >> > I propose a 8.6.2 bugfix release and I volunteer as RM. The motivation for >> > this release is LUCENE-9478 where Simon addressed a serious memory leak in >> > DWPTDeleteQueue. >> > >> > If there are no objections I am planning to build the first RC by the end >> > of this week. >> > >> > Ignacio >> > >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: dev-h...@lucene.apache.org >> - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Lucene/Solr 8.6.2 bugfix release
+1 thank you! I was about to write the same email. Let's sync on the RM; I can certainly help... I need to go and find my code signing key first :) simon On Tue, Aug 25, 2020 at 7:49 AM Ignacio Vera wrote: > > Hi, > > I propose a 8.6.2 bugfix release and I volunteer as RM. The motivation for > this release is LUCENE-9478 where Simon addressed a serious memory leak in > DWPTDeleteQueue. > > If there are no objections I am planning to build the first RC by the end of > this week. > > Ignacio > - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Lots of failures lately for lucene.index.TestBackwardsCompatibility.testAllVersionsTested
I think it’s fixed now. The 7.7.3 version was missing. Simon > On 16. May 2020, at 22:45, Erick Erickson wrote: > > Unfortunately the seed doesn’t reproduce, and I tried beasting it without > getting any fails in 700 iterations (and counting). > > Here’s one example, I see three others in the last couple of hours. > > I’ve done zero investigation into where these are coming from, but I did > notice there started being a lot of them starting 2-3 (?) days ago. > > Build: https://jenkins.thetaphi.de/job/Lucene-Solr-8.x-Windows/1134/ > Java: 64bit/jdk-11.0.6 -XX:+UseCompressedOops -XX:+UseParallelGC > > 6 tests failed. > FAILED: > org.apache.lucene.index.TestBackwardsCompatibility.testAllVersionsTested > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [VOTE] Solr to become a top-level Apache project (TLP)
I agree this is not a code-change-category vote. It’s a majority vote. -1s are not vetoes. Simon > On 12. May 2020, at 21:17, Atri Sharma wrote: > > > I would argue against that — this is more of a project level decision with no > changes to the core code base per se — more of restructuring of it. Sort of > how a sub project becomes a TLP. > >> On Wed, 13 May 2020 at 00:38, Ishan Chattopadhyaya >> wrote: >> This is in the code modification category, since code will be modified as >> result of this proposal. >> >>> On Wed, 13 May, 2020, 12:27 am Shawn Heisey, wrote: >>> On 5/12/2020 1:36 AM, Dawid Weiss wrote: >>> > According to an earlier [DISCUSS] thread on the dev list [2], I am >>> > calling for a vote on the proposal to make Solr a top-level Apache >>> > project (TLP) and separate Lucene and Solr development into two >>> > independent entities. >>> >>> +1 (pmc) >>> >>> We should clarify exactly what kind of vote this is. If it is in the >>> "code modification" category, then a single -1 vote would be enough to >>> defeat the proposal. There are already some -1 votes. >>> >>> Thanks, >>> Shawn >>> >>> - >>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: dev-h...@lucene.apache.org >>> > -- > Regards, > > Atri > Apache Concerted
Re: [VOTE] Solr to become a top-level Apache project (TLP)
+1 binding Sent from a mobile device > On 12. May 2020, at 13:33, Jason Gerlowski wrote: > > -1 (binding) > >> On Tue, May 12, 2020 at 7:31 AM Alan Woodward wrote: >> >> +1 (binding) >> >> Alan Woodward >> On 12 May 2020, at 12:06, Jan Høydahl wrote: >>> >>> +1 (binding) >>> >>> Jan Høydahl >>> 12. mai 2020 kl. 09:36 skrev Dawid Weiss : Dear Lucene and Solr developers! According to an earlier [DISCUSS] thread on the dev list [2], I am calling for a vote on the proposal to make Solr a top-level Apache project (TLP) and separate Lucene and Solr development into two independent entities. To quickly recap the reasons and consequences of such a move: it seems like the reasons for the initial merge of Lucene and Solr, around 10 years ago, have been achieved. Both projects are in good shape and exhibit signs of independence already (mailing lists, committers, patch flow). There are many technical considerations that would make development much easier if we move Solr out into its own TLP. We discussed this issue [2] and both PMC members and committers had a chance to review all the pros and cons and express their views. The discussion showed that there are clearly different opinions on the matter - some people are in favor, some are neutral, others are against or not seeing the point of additional labor. Realistically, I don't think reaching 100% level consensus is going to be possible -- we are a diverse bunch with different opinions and personalities. I firmly believe this is the right direction hence the decision to put it under the voting process. Should something take a wrong turn in the future (as some folks worry it may), all blame is on me. Therefore, the proposal is to separate Solr from under Lucene TLP, and make it a TLP on its own. The initial structure of the new PMC, committer base, git repositories and other managerial aspects can be worked out during the process if the decision passes. 
Please indicate one of the following (see [1] for guidelines): [ ] +1 - yes, I vote for the proposal [ ] -1 - no, I vote against the proposal Please note that anyone in the Lucene+Solr community is invited to express their opinion, though only Lucene+Solr committers cast binding votes (indicate non-binding votes in your reply, please). The vote will be active for a week to give everyone a chance to read and cast a vote. Dawid [1] https://www.apache.org/foundation/voting.html [2] https://lists.apache.org/thread.html/rfae2440264f6f874e91545b2030c98e7b7e3854ddf090f7747d338df%40%3Cdev.lucene.apache.org%3E - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org >>> >>> >>> - >>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: dev-h...@lucene.apache.org >>> >> >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: dev-h...@lucene.apache.org >> > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [DISCUSS] Lucene-Solr split (Solr promoted to TLP)
On Sun, May 10, 2020 at 3:41 PM Bram Van Dam wrote: > > On 10/05/2020 08:20, David Smiley wrote: > > An idea just occurred to me that may help make a split nicer for Solr > > than it is today. Solr could use a branch of the Lucene project that's > > used for the Solr project. > > Maybe I'm alone in this, but (better) Lucene compatibility is one of the > reasons why our company chose Solr over ElasticSearch. I thought about this for a while and I do wonder if you could elaborate on what makes Solr have better compatibility with Lucene. That's certainly something elasticsearch would want to catch up on since it sounds like a clear benefit for users. Maybe I just misunderstood what you meant and hence couldn't make much sense of it. simon > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [DISCUSS] Lucene-Solr split (Solr promoted to TLP)
I can speak from experience that working with a snapshot is much cleaner than working with submodules. We have been doing this in elasticsearch for a very long time now and our process here works just fine. It has a bunch of advantages over a direct / source dependency like solr has right now. I recall that someone else already mentioned some of them, like working on a somewhat more stable codebase, and doing refactorings and integration when there are people dedicated to it who have enough time to do it properly. Regarding the effort of a split, I think that not doing something because it's a lot of work will just cause a ton of issues down the road. Doing the right thing is a lot of work, that's for sure, but we can start working on this in baby steps and we can all help. We can do this gradually: start with the website and lists, then the build system, etc., or start with the build first and do the website last. It's ok to apply progress over perfection here. We all want this to be done properly and we are all here to help, at least I am. simon On Wed, May 6, 2020 at 10:51 AM Ishan Chattopadhyaya wrote: > > Except the logistics of enacting the split, I see no valid reason of keeping > the projects together. Git submodule is the magic that we have to ease any > potential discomfort. However, the effort needed to split feels absolutely > massive, so I'm not sure if it is worth the hassle. > > On Wed, 6 May, 2020, 1:31 pm Dawid Weiss, wrote: >> >> > If you go to lucene.apache.org, you'll see three things: Lucene Core >> > (Lucene with all it's modules), Solr and PyLucene. That's what I mean. >> >> Hmm... Maybe I'm dim but that's essentially what I want to do. Look: >> >> 1. Lucene Core (Lucene with all it's modules) >> 2. Solr >> 3. PyLucene >> >> The thing is: (1) is already a TLP - that's just Lucene. My call is to >> make (2) a TLP. (3) I can't tell much about because I don't know >> PyLucene as well as I do Solr and Lucene... 
But it seems to me that >> PyLucene fits much better under "Lucene" umbrella, even the name >> suggests that. >> >> >> >> Dawid >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: dev-h...@lucene.apache.org >> - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8369) Remove the spatial module as it is obsolete
[ https://issues.apache.org/jira/browse/LUCENE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906502#comment-16906502 ] Simon Willnauer commented on LUCENE-8369: - +1 for option 1 above as well. Thanks [~nknize] > Remove the spatial module as it is obsolete > --- > > Key: LUCENE-8369 > URL: https://issues.apache.org/jira/browse/LUCENE-8369 > Project: Lucene - Core > Issue Type: Task > Components: modules/spatial >Reporter: David Smiley >Assignee: David Smiley >Priority: Major > Attachments: LUCENE-8369.patch > > > The "spatial" module is at this juncture nearly empty with only a couple > utilities that aren't used by anything in the entire codebase -- > GeoRelationUtils, and MortonEncoder. Perhaps it should have been removed > earlier in LUCENE-7664 which was the removal of GeoPointField which was > essentially why the module existed. Better late than never. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8369) Remove the spatial module as it is obsolete
[ https://issues.apache.org/jira/browse/LUCENE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902133#comment-16902133 ] Simon Willnauer commented on LUCENE-8369: - I don't think we should sacrifice the existence of LatLong point searching in core for the sake of code visibility. I think we should keep it in core and open up visibility to enable code-reuse in the modules and use _@lucene.internal_ in order to mark classes as internal and prevent users from complaining when the API changes. It's not ideal but progress. Can we separate the discussion of getting rid of the spatial module from graduating the various shapes from sandbox to wherever? I think keeping a module for 2 classes doesn't make sense. We can move those two classes to core too or even get rid of them altogether. I don't think it should influence the discussion of whether something else should be graduated. One other option would be to move all non-core spatial classes from sandbox to spatial as long as they don't add any additional dependency. That would be an intermediate step. We can still graduate from there then. > Remove the spatial module as it is obsolete > --- > > Key: LUCENE-8369 > URL: https://issues.apache.org/jira/browse/LUCENE-8369 > Project: Lucene - Core > Issue Type: Task > Components: modules/spatial >Reporter: David Smiley >Assignee: David Smiley >Priority: Major > Attachments: LUCENE-8369.patch > > > The "spatial" module is at this juncture nearly empty with only a couple > utilities that aren't used by anything in the entire codebase -- > GeoRelationUtils, and MortonEncoder. Perhaps it should have been removed > earlier in LUCENE-7664 which was the removal of GeoPointField which was > essentially why the module existed. Better late than never. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-8887) CLONE - Add setting for moving FST offheap/onheap
[ https://issues.apache.org/jira/browse/LUCENE-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-8887. - Resolution: Duplicate This seems to have been opened accidentally. > CLONE - Add setting for moving FST offheap/onheap > - > > Key: LUCENE-8887 > URL: https://issues.apache.org/jira/browse/LUCENE-8887 > Project: Lucene - Core > Issue Type: New Feature > Components: core/FSTs, core/store >Reporter: LuYunCheng > Assignee: Simon Willnauer >Priority: Minor > Fix For: master (9.0), 8.1 > > Attachments: offheap_generic_settings.patch, offheap_settings.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > While LUCENE-8635, adds support for loading FST offheap using mmap, users do > not have the flexibility to specify fields for which FST needs to be > offheap. This allows users to tune heap usage as per their workload. > Ideal way will be to add an attribute to FieldInfo, where we have > put/getAttribute. Then FieldReader can inspect the FieldInfo and pass the > appropriate On/OffHeapStore when creating its FST. It can support special > keywords like ALL/NONE. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8865) Use incoming thread for execution if IndexSearcher has an executor
[ https://issues.apache.org/jira/browse/LUCENE-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872690#comment-16872690 ] Simon Willnauer commented on LUCENE-8865: - [~hypothesisx86] I didn't run any benchmarks. Maybe [~mikemccand] can provide info on whether there are improvements. > Use incoming thread for execution if IndexSearcher has an executor > --- > > Key: LUCENE-8865 > URL: https://issues.apache.org/jira/browse/LUCENE-8865 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Simon Willnauer >Priority: Major > Fix For: master (9.0), 8.2 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > Today we don't utilize the incoming thread for a search when IndexSearcher > has an executor. This thread is only idling but can be used to execute a search > once all other collectors are dispatched. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers
[ https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868555#comment-16868555 ] Simon Willnauer commented on LUCENE-8857: - A couple of comments: * can you open a PR and associate it with this issue? Patches are so hard to review without context and the ability to comment * for the second case in IndexSearcher should we also tie-break by doc? * Can we replace the verbose comparators with _Comparator.comparingInt(d -> d.shardIndex);_ and _Comparator.comparingInt(d -> d.doc);_ respectively? * Any chance we can select the tie-breaker based on whether one of the TopDocs has a shardIndex != -1 and assert that all of them have it or not? Another option would be to have only one comparator and first tie-break on shardIndex and then on doc; since we don't set the shard index it should be fine, as they are all -1? WDYT? > Refactor TopDocs#Merge To Take In Custom Tie Breakers > - > > Key: LUCENE-8857 > URL: https://issues.apache.org/jira/browse/LUCENE-8857 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Atri Sharma >Priority: Major > Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch, > LUCENE-8857.patch, LUCENE-8857.patch > > > In LUCENE-8829, the idea of having lambdas passed in to the API to allow > finer control over the process was discussed. > This JIRA tracks adding a parameter to the API which allows passing in > lambdas to define custom tie breakers, thus allowing users to do custom > algorithms when required. > CC: [~jpountz] [~simonw] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-8865) Use incoming thread for execution if IndexSearcher has an executor
[ https://issues.apache.org/jira/browse/LUCENE-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-8865. - Resolution: Fixed Fix Version/s: 8.2 master (9.0) > Use incoming thread for execution if IndexSearcher has an executor > --- > > Key: LUCENE-8865 > URL: https://issues.apache.org/jira/browse/LUCENE-8865 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Simon Willnauer >Priority: Major > Fix For: master (9.0), 8.2 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Today we don't utilize the incoming thread for a search when IndexSearcher > has an executor. This thread is only idling but can be used to execute a search > once all other collectors are dispatched. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers
[ https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866575#comment-16866575 ] Simon Willnauer commented on LUCENE-8857: - Why don't we just use the comparator and have a default and a doc one? like this: {code} Comparator<ScoreDoc> defaultComparator = Comparator.comparingInt(d -> d.shardIndex); Comparator<ScoreDoc> docComparator = Comparator.comparingInt(d -> d.doc); {code} > Refactor TopDocs#Merge To Take In Custom Tie Breakers > - > > Key: LUCENE-8857 > URL: https://issues.apache.org/jira/browse/LUCENE-8857 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Atri Sharma >Priority: Major > Attachments: LUCENE-8857.patch, LUCENE-8857.patch, LUCENE-8857.patch > > > In LUCENE-8829, the idea of having lambdas passed in to the API to allow > finer control over the process was discussed. > This JIRA tracks adding a parameter to the API which allows passing in > lambdas to define custom tie breakers, thus allowing users to do custom > algorithms when required. > CC: [~jpountz] [~simonw] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
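The two comparators suggested in the comment above can also be composed so that a single tie-breaker covers both cases: when no shard index was set (all -1), the shard key compares equal and the doc id decides. A minimal sketch of that idea, using a stand-in class in place of Lucene's ScoreDoc (the field names `doc` and `shardIndex` match ScoreDoc; everything else here is illustrative, not Lucene's actual merge code):

```java
import java.util.Comparator;

public class TieBreakers {
    // Stand-in for Lucene's ScoreDoc: just the two fields used for tie-breaking.
    static final class Doc {
        final int doc;
        final int shardIndex;
        Doc(int doc, int shardIndex) { this.doc = doc; this.shardIndex = shardIndex; }
    }

    // The two comparators from the comment above.
    static final Comparator<Doc> BY_SHARD = Comparator.comparingInt(d -> d.shardIndex);
    static final Comparator<Doc> BY_DOC = Comparator.comparingInt(d -> d.doc);

    // Composite tie-breaker: shard index first, then doc id. If the shard
    // index was never set (all -1), the first key is always a tie and the
    // doc id decides, so one comparator handles both scenarios.
    static final Comparator<Doc> BY_SHARD_THEN_DOC = BY_SHARD.thenComparing(BY_DOC);

    public static void main(String[] args) {
        Doc a = new Doc(3, -1);
        Doc b = new Doc(7, -1);
        // Shard indices tie at -1, so the lower doc id sorts first.
        System.out.println(BY_SHARD_THEN_DOC.compare(a, b) < 0); // true
    }
}
```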
[jira] [Resolved] (LUCENE-8853) FileSwitchDirectory is broken if temp outputs are used
[ https://issues.apache.org/jira/browse/LUCENE-8853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-8853. - Resolution: Fixed Fix Version/s: 8.2 master (9.0) > FileSwitchDirectory is broken if temp outputs are used > -- > > Key: LUCENE-8853 > URL: https://issues.apache.org/jira/browse/LUCENE-8853 > Project: Lucene - Core > Issue Type: Bug > Reporter: Simon Willnauer >Priority: Major > Fix For: master (9.0), 8.2 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > FileSwitchDirectory basically doesn't work if tmp output are used for files > that are explicitly mapped with extensions. here is a failing test: > {code} > 16:49:40[junit4] Suite: > org.apache.lucene.search.suggest.analyzing.BlendedInfixSuggesterTest > 16:49:40[junit4] 2> NOTE: reproduce with: ant test > -Dtestcase=BlendedInfixSuggesterTest > -Dtests.method=testBlendedSort_fieldWeightZero_shouldRankSuggestionsByPositionMatch > -Dtests.seed=16D8C93DC8FE5192 -Dtests.slow=true -Dtests.badapples=true > -Dtests.locale=pt-LU -Dtests.timezone=US/Michigan -Dtests.asserts=true > -Dtests.file.encoding=ISO-8859-1 > 16:49:40[junit4] ERROR 0.05s J1 | > BlendedInfixSuggesterTest.testBlendedSort_fieldWeightZero_shouldRankSuggestionsByPositionMatch > <<< > 16:49:40[junit4]> Throwable #1: > java.nio.file.AtomicMoveNotSupportedException: _0.fdx__0.tmp -> _0.fdx: > source and dest are in different directories > 16:49:40[junit4]> at > __randomizedtesting.SeedInfo.seed([16D8C93DC8FE5192:20E180A9490374CE]:0) > 16:49:40[junit4]> at > org.apache.lucene.store.FileSwitchDirectory.rename(FileSwitchDirectory.java:201) > 16:49:40[junit4]> at > org.apache.lucene.store.MockDirectoryWrapper.rename(MockDirectoryWrapper.java:231) > 16:49:40[junit4]> at > org.apache.lucene.store.LockValidatingDirectoryWrapper.rename(LockValidatingDirectoryWrapper.java:56) > 16:49:40[junit4]> at > org.apache.lucene.store.TrackingDirectoryWrapper.rename(TrackingDirectoryWrapper.java:64) > 16:49:40[junit4]> 
at > org.apache.lucene.store.FilterDirectory.rename(FilterDirectory.java:89) > 16:49:40[junit4]> at > org.apache.lucene.index.SortingStoredFieldsConsumer.flush(SortingStoredFieldsConsumer.java:56) > 16:49:40[junit4]> at > org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:152) > 16:49:40[junit4]> at > org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:468) > 16:49:40[junit4]> at > org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:555) > 16:49:40[junit4]> at > org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:722) > 16:49:40[junit4]> at > org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3199) > 16:49:40[junit4]> at > org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3444) > 16:49:40[junit4]> at > org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3409) > 16:49:40[junit4]> at > org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester.commit(AnalyzingInfixSuggester.java:345) > 16:49:40[junit4]> at > org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester.build(AnalyzingInfixSuggester.java:315) > 16:49:40[junit4]> at > org.apache.lucene.search.suggest.analyzing.BlendedInfixSuggesterTest.getBlendedInfixSuggester(BlendedInfixSuggesterTest.java:125) > 16:49:40[junit4]> at > org.apache.lucene.search.suggest.analyzing.BlendedInfixSuggesterTest.testBlendedSort_fieldWeightZero_shouldRankSuggestionsByPositionMatch(BlendedInfixSuggesterTest.java:79) > 16:49:40[junit4]> at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 16:49:40[junit4]> at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 16:49:40[junit4]> at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 16:49:40[junit4]> at > java.base/java.lang.reflect.Method.invoke(Method.java:566) > 16:49:40[junit4]> 
at > java.base/java.lang.Thread.run(Thread.java:834) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
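For context on why the rename in the trace above fails: FileSwitchDirectory routes each file to one of two delegate directories keyed by file extension. A temp output such as `_0.fdx__0.tmp` has the extension `tmp` while its rename target `_0.fdx` has the extension `fdx`, so the two names can map to different delegates and the rename crosses directory boundaries. A simplified sketch of that routing key (an illustration of the failure mode, not Lucene's exact getExtension implementation):

```java
public class ExtensionRouting {
    // Simplified extension extraction, modeled loosely on the routing key an
    // extension-switching directory would use (illustrative, not Lucene's code).
    static String extension(String name) {
        int i = name.lastIndexOf('.');
        return i == -1 ? "" : name.substring(i + 1);
    }

    public static void main(String[] args) {
        // The temp output and its rename target yield different keys, so an
        // extension-keyed FileSwitchDirectory sends them to different
        // delegates and the atomic rename cannot succeed.
        System.out.println(extension("_0.fdx__0.tmp")); // tmp
        System.out.println(extension("_0.fdx"));        // fdx
    }
}
```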
[jira] [Commented] (LUCENE-8857) Refactor TopDocs#Merge To Take In Custom Tie Breakers
[ https://issues.apache.org/jira/browse/LUCENE-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866253#comment-16866253 ] Simon Willnauer commented on LUCENE-8857: - From my perspective we should simplify this even more and remove _TieBreakingParameters_. TopDocs can use _Comparator_ and default to the shard index if it's not supplied. That should be sufficient? > Refactor TopDocs#Merge To Take In Custom Tie Breakers > - > > Key: LUCENE-8857 > URL: https://issues.apache.org/jira/browse/LUCENE-8857 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Atri Sharma >Priority: Major > Attachments: LUCENE-8857.patch, LUCENE-8857.patch > > > In LUCENE-8829, the idea of having lambdas passed in to the API to allow > finer control over the process was discussed. > This JIRA tracks adding a parameter to the API which allows passing in > lambdas to define custom tie breakers, thus allowing users to do custom > algorithms when required. > CC: [~jpountz] [~simonw] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-8865) Use incoming thread for execution if IndexSearcher has an executor
Simon Willnauer created LUCENE-8865: --- Summary: Use incoming thread for execution if IndexSearcher has an executor Key: LUCENE-8865 URL: https://issues.apache.org/jira/browse/LUCENE-8865 Project: Lucene - Core Issue Type: Improvement Reporter: Simon Willnauer Today we don't utilize the incoming thread for a search when IndexSearcher has an executor. This thread is only idling but can be used to execute a search once all other collectors are dispatched. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
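The idea in this issue can be sketched outside Lucene with plain Callables: submit all slices but one to the executor, and run the final slice on the calling thread instead of letting it sit idle while it waits on the futures. This is a hypothetical illustration of the scheduling pattern, not Lucene's IndexSearcher code, and it assumes a non-empty slice list:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CallerThreadSearch {
    // Dispatch all slices except the last to the executor; the incoming
    // (calling) thread executes the last slice itself before collecting
    // the other results. Assumes `slices` is non-empty.
    static List<Integer> search(List<Callable<Integer>> slices, ExecutorService executor)
            throws Exception {
        List<Future<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < slices.size() - 1; i++) {
            futures.add(executor.submit(slices.get(i)));
        }
        List<Integer> results = new ArrayList<>();
        // The caller does real work here instead of only blocking on futures.
        results.add(slices.get(slices.size() - 1).call());
        for (Future<Integer> f : futures) {
            results.add(f.get());
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Callable<Integer>> slices = List.of(() -> 1, () -> 2, () -> 3);
        // The last slice's result comes first because the caller ran it.
        System.out.println(search(slices, pool)); // [3, 1, 2]
        pool.shutdown();
    }
}
```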
[jira] [Commented] (LUCENE-8829) TopDocs#Merge is Tightly Coupled To Number Of Collectors Involved
[ https://issues.apache.org/jira/browse/LUCENE-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863067#comment-16863067 ] Simon Willnauer commented on LUCENE-8829: - {quote} Simon Willnauer That is a fun idea, although it would still need a function to instruct TopDocs#merge whether to set the shard indices or not. {quote} I am not sure we have to. Can't a user initialize it ahead of time if necessary? I think if it's necessary to have this we can just iterate over it and set it from the outside. That should also be possible, no? > TopDocs#Merge is Tightly Coupled To Number Of Collectors Involved > - > > Key: LUCENE-8829 > URL: https://issues.apache.org/jira/browse/LUCENE-8829 > Project: Lucene - Core > Issue Type: Bug >Reporter: Atri Sharma >Priority: Major > Attachments: LUCENE-8829.patch, LUCENE-8829.patch, LUCENE-8829.patch, > LUCENE-8829.patch > > > While investigating LUCENE-8819, I understood that TopDocs#merge's order of > results are indirectly dependent on the number of collectors involved in the > merge. This is troubling because 1) The number of collectors involved in a > merge are cost based and directly dependent on the number of slices created > for the parallel searcher case. 2) TopN hits code path will invoke merge with > a single Collector, so essentially, doing the same TopN query with single > threaded and parallel threaded searcher will invoke different order of > results, which is a bad invariant that breaks. > > The reason why this happens is because of the subtle way TopDocs#merge sets > shardIndex in the ScoreDoc population during populating the priority queue > used for merging. ShardIndex is essentially set to the ordinal of the > collector which generates the hit. This means that the shardIndex is > dependent on the number of collectors, even for the same set of hits. > > In case of no sort order specified, shardIndex is used for tie breaking when > scores are equal. 
This translates to different orders for same hits with > different shardIndices. > > I propose that we remove shardIndex from the default tie breaking mechanism > and replace it with docID. DocID order is the de facto that is expected > during collection, so it might make sense to use the same factor during tie > breaking when scores are the same. > > CC: [~ivera] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-8829) TopDocs#Merge is Tightly Coupled To Number Of Collectors Involved
[ https://issues.apache.org/jira/browse/LUCENE-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861848#comment-16861848 ] Simon Willnauer edited comment on LUCENE-8829 at 6/12/19 8:56 AM: -- I'd remove the _setShardIndex_ parameter altogether and don't set it was (Author: simonw): I'd remove the _ setShardIndex_ parameter altogether and don't set it > TopDocs#Merge is Tightly Coupled To Number Of Collectors Involved > - > > Key: LUCENE-8829 > URL: https://issues.apache.org/jira/browse/LUCENE-8829 > Project: Lucene - Core > Issue Type: Bug >Reporter: Atri Sharma >Priority: Major > Attachments: LUCENE-8829.patch, LUCENE-8829.patch, LUCENE-8829.patch, > LUCENE-8829.patch > > > While investigating LUCENE-8819, I understood that TopDocs#merge's order of > results are indirectly dependent on the number of collectors involved in the > merge. This is troubling because 1) The number of collectors involved in a > merge are cost based and directly dependent on the number of slices created > for the parallel searcher case. 2) TopN hits code path will invoke merge with > a single Collector, so essentially, doing the same TopN query with single > threaded and parallel threaded searcher will invoke different order of > results, which is a bad invariant that breaks. > > The reason why this happens is because of the subtle way TopDocs#merge sets > shardIndex in the ScoreDoc population during populating the priority queue > used for merging. ShardIndex is essentially set to the ordinal of the > collector which generates the hit. This means that the shardIndex is > dependent on the number of collectors, even for the same set of hits. > > In case of no sort order specified, shardIndex is used for tie breaking when > scores are equal. This translates to different orders for same hits with > different shardIndices. > > I propose that we remove shardIndex from the default tie breaking mechanism > and replace it with docID. 
DocID order is the de facto that is expected > during collection, so it might make sense to use the same factor during tie > breaking when scores are the same. > > CC: [~ivera] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8829) TopDocs#Merge is Tightly Coupled To Number Of Collectors Involved
[ https://issues.apache.org/jira/browse/LUCENE-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861848#comment-16861848 ] Simon Willnauer commented on LUCENE-8829: - I'd remove the _setShardIndex_ parameter altogether and don't set it
[jira] [Commented] (LUCENE-8829) TopDocs#Merge is Tightly Coupled To Number Of Collectors Involved
[ https://issues.apache.org/jira/browse/LUCENE-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861821#comment-16861821 ] Simon Willnauer commented on LUCENE-8829: - I do wonder if we can simplify this API now that we have FunctionalInterfaces. If we change _TopDocs#merge_ to take a _ToIntFunction_ we should be able to have a default of _ScoreDoc::doc_, and users that want to use the shard index can use _ScoreDoc::shardIndex_. That should also simplify our code, I guess. Yet, I haven't checked if it works across the board; it's just an idea.
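Simon's FunctionalInterface idea can be sketched as follows. The `merge` signature and the `ScoreDoc` stand-in are hypothetical illustrations of the proposal, not Lucene's actual API:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.function.ToIntFunction;

public class MergeKeyDemo {
    // Minimal stand-in for org.apache.lucene.search.ScoreDoc.
    static final class ScoreDoc {
        final int doc;
        final float score;
        final int shardIndex;
        ScoreDoc(int doc, float score, int shardIndex) {
            this.doc = doc; this.score = score; this.shardIndex = shardIndex;
        }
        int doc() { return doc; }
        int shardIndex() { return shardIndex; }
    }

    // Hypothetical merge that lets the caller choose the tie-break key:
    // docID by default, shardIndex as an explicit opt-in for the sharded case.
    static ScoreDoc[] merge(ScoreDoc[] hits, ToIntFunction<ScoreDoc> tieBreak) {
        ScoreDoc[] out = hits.clone();
        Arrays.sort(out, Comparator.<ScoreDoc>comparingDouble(sd -> sd.score)
                                   .reversed()
                                   .thenComparingInt(tieBreak));
        return out;
    }

    public static void main(String[] args) {
        ScoreDoc[] hits = {
            new ScoreDoc(9, 1.0f, 0), new ScoreDoc(4, 1.0f, 2), new ScoreDoc(1, 1.0f, 1)};
        ScoreDoc[] byDoc = merge(hits, ScoreDoc::doc);          // default tie-break
        ScoreDoc[] byShard = merge(hits, ScoreDoc::shardIndex); // sharded opt-in
        System.out.println(byDoc[0].doc + "," + byDoc[1].doc + "," + byDoc[2].doc);     // 1,4,9
        System.out.println(byShard[0].doc + "," + byShard[1].doc + "," + byShard[2].doc); // 9,1,4
    }
}
```

The method-reference defaults make the common case trivial while keeping the cross-shard behavior available without a boolean flag.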
[jira] [Commented] (LUCENE-8853) FileSwitchDirectory is broken if temp outputs are used
[ https://issues.apache.org/jira/browse/LUCENE-8853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861785#comment-16861785 ] Simon Willnauer commented on LUCENE-8853: - I attached a PR, but I am not really happy with it; yet it's my best bet. I am wondering if we should start a discussion about removing FileSwitchDirectory. It's hard to get right and there are many situations where it can break. I do wonder what its use case is other than opening a file with NIO vs. MMAP, as Elasticsearch uses it. If that's the main purpose, we can build a better version of it. /cc [~rcmuir] > FileSwitchDirectory is broken if temp outputs are used > -- > > Key: LUCENE-8853 > URL: https://issues.apache.org/jira/browse/LUCENE-8853 > Project: Lucene - Core > Issue Type: Bug > Reporter: Simon Willnauer > Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > FileSwitchDirectory basically doesn't work if temp outputs are used for files that are explicitly mapped with extensions.
here is a failing test: > {code} > 16:49:40[junit4] Suite: > org.apache.lucene.search.suggest.analyzing.BlendedInfixSuggesterTest > 16:49:40[junit4] 2> NOTE: reproduce with: ant test > -Dtestcase=BlendedInfixSuggesterTest > -Dtests.method=testBlendedSort_fieldWeightZero_shouldRankSuggestionsByPositionMatch > -Dtests.seed=16D8C93DC8FE5192 -Dtests.slow=true -Dtests.badapples=true > -Dtests.locale=pt-LU -Dtests.timezone=US/Michigan -Dtests.asserts=true > -Dtests.file.encoding=ISO-8859-1 > 16:49:40[junit4] ERROR 0.05s J1 | > BlendedInfixSuggesterTest.testBlendedSort_fieldWeightZero_shouldRankSuggestionsByPositionMatch > <<< > 16:49:40[junit4]> Throwable #1: > java.nio.file.AtomicMoveNotSupportedException: _0.fdx__0.tmp -> _0.fdx: > source and dest are in different directories > 16:49:40[junit4]> at > __randomizedtesting.SeedInfo.seed([16D8C93DC8FE5192:20E180A9490374CE]:0) > 16:49:40[junit4]> at > org.apache.lucene.store.FileSwitchDirectory.rename(FileSwitchDirectory.java:201) > 16:49:40[junit4]> at > org.apache.lucene.store.MockDirectoryWrapper.rename(MockDirectoryWrapper.java:231) > 16:49:40[junit4]> at > org.apache.lucene.store.LockValidatingDirectoryWrapper.rename(LockValidatingDirectoryWrapper.java:56) > 16:49:40[junit4]> at > org.apache.lucene.store.TrackingDirectoryWrapper.rename(TrackingDirectoryWrapper.java:64) > 16:49:40[junit4]> at > org.apache.lucene.store.FilterDirectory.rename(FilterDirectory.java:89) > 16:49:40[junit4]> at > org.apache.lucene.index.SortingStoredFieldsConsumer.flush(SortingStoredFieldsConsumer.java:56) > 16:49:40[junit4]> at > org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:152) > 16:49:40[junit4]> at > org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:468) > 16:49:40[junit4]> at > org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:555) > 16:49:40[junit4]> at > org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:722) > 
16:49:40[junit4]> at > org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3199) > 16:49:40[junit4]> at > org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3444) > 16:49:40[junit4]> at > org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3409) > 16:49:40[junit4]> at > org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester.commit(AnalyzingInfixSuggester.java:345) > 16:49:40[junit4]> at > org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester.build(AnalyzingInfixSuggester.java:315) > 16:49:40[junit4]> at > org.apache.lucene.search.suggest.analyzing.BlendedInfixSuggesterTest.getBlendedInfixSuggester(BlendedInfixSuggesterTest.java:125) > 16:49:40[junit4]> at > org.apache.lucene.search.suggest.analyzing.BlendedInfixSuggesterTest.testBlendedSort_fieldWeightZero_shouldRankSuggestionsByPositionMatch(BlendedInfixSuggesterTest.java:79) > 16:49:40[junit4]> at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 16:49:40[junit4]> at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 16:49:40[junit4]> at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 16:49:40[junit4]> at > java.base/java.lang.reflect.Method.invoke(Method.java:566) > 16:49:40[junit4]> at > java.base/java.lang.Thread.run(Thread.java:834) > {code}
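The root cause visible in the trace above is that extension-based routing sends a temp output and its final name to different directories, so the rename is no longer a move within a single directory and the atomic move fails. A minimal sketch of that routing decision (the extension set and helper below are illustrative assumptions, not FileSwitchDirectory's exact code):

```java
import java.util.Set;

public class SwitchRoutingDemo {
    // Illustrative: extensions mapped to the "primary" delegate, roughly how
    // FileSwitchDirectory routes files by suffix.
    static final Set<String> PRIMARY_EXTENSIONS = Set.of("fdx", "fdt");

    // Everything after the last '.', mirroring suffix-based routing.
    static String extension(String fileName) {
        int i = fileName.lastIndexOf('.');
        return i == -1 ? "" : fileName.substring(i + 1);
    }

    static String route(String fileName) {
        return PRIMARY_EXTENSIONS.contains(extension(fileName)) ? "primary" : "secondary";
    }

    public static void main(String[] args) {
        // The temp output carries a ".tmp" suffix, so it lands in a different
        // delegate than its rename target -- rename(_0.fdx__0.tmp -> _0.fdx)
        // then crosses directories and cannot be atomic.
        System.out.println(route("_0.fdx__0.tmp")); // secondary
        System.out.println(route("_0.fdx"));        // primary
    }
}
```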
[jira] [Created] (LUCENE-8853) FileSwitchDirectory is broken if temp outputs are used
Simon Willnauer created LUCENE-8853: --- Summary: FileSwitchDirectory is broken if temp outputs are used Key: LUCENE-8853 URL: https://issues.apache.org/jira/browse/LUCENE-8853 Project: Lucene - Core Issue Type: Bug Reporter: Simon Willnauer FileSwitchDirectory basically doesn't work if temp outputs are used for files that are explicitly mapped with extensions.
[jira] [Resolved] (LUCENE-8835) Respect file extension when listing files form FileSwitchDirectory
[ https://issues.apache.org/jira/browse/LUCENE-8835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-8835. - Resolution: Fixed Assignee: Simon Willnauer Fix Version/s: 8.2, master (9.0) > Respect file extension when listing files form FileSwitchDirectory > -- > > Key: LUCENE-8835 > URL: https://issues.apache.org/jira/browse/LUCENE-8835 > Project: Lucene - Core > Issue Type: Bug > Reporter: Simon Willnauer > Assignee: Simon Willnauer > Priority: Major > Fix For: master (9.0), 8.2 > > Time Spent: 50m > Remaining Estimate: 0h > > FileSwitchDirectory splits file actions between 2 directories based on file extensions. The extensions are respected on write operations like delete or create but ignored when we list the contents of the directories. Until now we only deduplicated the contents on Directory#listAll, which can cause inconsistencies and hard-to-debug errors due to double deletions in IndexWriter if a file is pending delete in one of the directories but still shows up in the directory listing from the other directory. This case can happen if both directories point to the same underlying FS directory, which is a common use case to split between mmap and niofs.
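The fix described above can be sketched as filtering each delegate's listing by the extensions it owns, instead of merely deduplicating the union. The directory stand-ins below are illustrative assumptions, not Lucene's Directory API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class ListAllDemo {
    // Illustrative extension set owned by the primary delegate.
    static final Set<String> PRIMARY_EXTENSIONS = Set.of("dvd", "tip");

    static String extension(String name) {
        int i = name.lastIndexOf('.');
        return i == -1 ? "" : name.substring(i + 1);
    }

    // Both delegates may point at the same FS directory, so each sees every
    // file. Filtering by owned extension yields exactly one authoritative
    // listing per file, which avoids double deletions in IndexWriter.
    static List<String> listAll(List<String> primaryFiles, List<String> secondaryFiles) {
        List<String> out = new ArrayList<>();
        for (String f : primaryFiles) {
            if (PRIMARY_EXTENSIONS.contains(extension(f))) out.add(f);
        }
        for (String f : secondaryFiles) {
            if (!PRIMARY_EXTENSIONS.contains(extension(f))) out.add(f);
        }
        return out;
    }

    public static void main(String[] args) {
        // Same underlying FS directory: both delegates list the same files,
        // but each file is reported once, by its owning delegate only.
        List<String> files = List.of("_0.dvd", "_0.fdt");
        System.out.println(listAll(files, files)); // [_0.dvd, _0.fdt]
    }
}
```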
[jira] [Commented] (LUCENE-8833) Allow subclasses of MMapDirecory to preload individual IndexInputs
[ https://issues.apache.org/jira/browse/LUCENE-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858441#comment-16858441 ] Simon Willnauer commented on LUCENE-8833: - I do like the idea of #warm, but the footprint is much bigger since it's a public API. I mean, for my specific use case I'd subclass mmap anyway, and it would be easier that way. FileSwitchDirectory is quite heavy and isn't really built for what I want to do. I'd basically need an IndexInput factory that I can plug into a directory, that can alternate between NIOFS and mmap etc. and conditionally preload the mmap. Either way, I can work with both; I just think this change is the minimum viable change. Lemme know if you are ok moving forward. > Allow subclasses of MMapDirecory to preload individual IndexInputs > -- > > Key: LUCENE-8833 > URL: https://issues.apache.org/jira/browse/LUCENE-8833 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Simon Willnauer > Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > I think it's useful for subclasses to select the preload flag on a per-index-input basis rather than all or nothing. Here is a patch that has an overloaded protected openInput method.
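The per-input hook described in the issue can be sketched as a protected decision method that subclasses override. The class and method names below are hypothetical illustrations of the proposal (the real patch is attached to the issue), not MMapDirectory's actual API:

```java
public class PreloadDemo {
    // Hypothetical sketch of a per-IndexInput preload hook: the base class
    // consults a protected method instead of one global preload flag.
    static class SketchMMapDirectory {
        // Subclasses override this to decide per file.
        protected boolean preload(String name) {
            return false; // base behavior: never preload
        }
        final String openInput(String name) {
            // Stand-in for the real openInput(name, context): we just report
            // which path the per-file decision took.
            return preload(name) ? "mmap+preload:" + name : "mmap:" + name;
        }
    }

    // Example policy: preload doc values, leave everything else lazy.
    static class PreloadDocValues extends SketchMMapDirectory {
        @Override
        protected boolean preload(String name) {
            return name.endsWith(".dvd");
        }
    }

    public static void main(String[] args) {
        SketchMMapDirectory dir = new PreloadDocValues();
        System.out.println(dir.openInput("_0.dvd")); // mmap+preload:_0.dvd
        System.out.println(dir.openInput("_0.tip")); // mmap:_0.tip
    }
}
```

A protected hook keeps the public surface unchanged, which is the "minimum viable change" argument made in the comment above.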
[jira] [Commented] (LUCENE-8833) Allow subclasses of MMapDirecory to preload individual IndexInputs
[ https://issues.apache.org/jira/browse/LUCENE-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16857525#comment-16857525 ] Simon Willnauer commented on LUCENE-8833: - > what would the iocontext provide to base the preload decision on? just curious. Sure, the one I had in mind as an example is merge. I am not sure it makes a big difference; I was just wondering if there are other signals than the file extension. I opened LUCENE-8835 to fix the file-listing issue FileSwitchDirectory has.
[jira] [Created] (LUCENE-8835) Respect file extension when listing files form FileSwitchDirectory
Simon Willnauer created LUCENE-8835: --- Summary: Respect file extension when listing files form FileSwitchDirectory Key: LUCENE-8835 URL: https://issues.apache.org/jira/browse/LUCENE-8835 Project: Lucene - Core Issue Type: Bug Reporter: Simon Willnauer
[jira] [Commented] (LUCENE-8833) Allow subclasses of MMapDirecory to preload individual IndexInputs
[ https://issues.apache.org/jira/browse/LUCENE-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856781#comment-16856781 ] Simon Willnauer commented on LUCENE-8833: - You are correct, that's what Elasticsearch does. Yet, FileSwitchDirectory had many issues in the past and still has (I am working on one issue related to [this|https://github.com/elastic/elasticsearch/pull/37140] and will open another issue soon). Especially with the push of pending deletes down to FSDirectory, things became more tricky for FileSwitchDirectory. That said, I think these issues should be fixed and I will work on them; this was more of a trigger to look closer. I also wanted to make the preload decision based on the IOContext down the road, which FileSwitchDirectory would not be capable of doing in this context. I hope this makes sense?
[jira] [Created] (LUCENE-8833) Allow subclasses of MMapDirecory to preload individual IndexInputs
Simon Willnauer created LUCENE-8833: --- Summary: Allow subclasses of MMapDirecory to preload individual IndexInputs Key: LUCENE-8833 URL: https://issues.apache.org/jira/browse/LUCENE-8833 Project: Lucene - Core Issue Type: Improvement Reporter: Simon Willnauer
[jira] [Commented] (LUCENE-8809) Refresh and rollback concurrently can leave segment states unclosed
[ https://issues.apache.org/jira/browse/LUCENE-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856364#comment-16856364 ] Simon Willnauer commented on LUCENE-8809: - [~dnhatn] can we close this issue? > Refresh and rollback concurrently can leave segment states unclosed > --- > > Key: LUCENE-8809 > URL: https://issues.apache.org/jira/browse/LUCENE-8809 > Project: Lucene - Core > Issue Type: Bug > Components: core/index > Affects Versions: 7.7, 8.1, 8.2 > Reporter: Nhat Nguyen > Assignee: Nhat Nguyen > Priority: Major > Fix For: 7.7.2, master (9.0), 8.2, 8.1.2 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > A [failed test|https://github.com/elastic/elasticsearch/issues/30290] from Elasticsearch shows that refresh and rollback running concurrently can leave segment states unclosed, leaking the refCount of some SegmentReaders.
[jira] [Resolved] (LUCENE-8813) testIndexTooManyDocs fails
[ https://issues.apache.org/jira/browse/LUCENE-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-8813. - Resolution: Fixed Fix Version/s: 8.2 master (9.0) > testIndexTooManyDocs fails > -- > > Key: LUCENE-8813 > URL: https://issues.apache.org/jira/browse/LUCENE-8813 > Project: Lucene - Core > Issue Type: Test > Components: core/index >Reporter: Nhat Nguyen >Priority: Major > Fix For: master (9.0), 8.2 > > Time Spent: 2.5h > Remaining Estimate: 0h > > testIndexTooManyDocs fails on [Elastic > CI|https://elasticsearch-ci.elastic.co/job/apache+lucene-solr+branch_8x/6402/console]. > This failure does not reproduce locally for me. > {noformat} > [junit4] Suite: org.apache.lucene.index.TestIndexTooManyDocs >[junit4] 2> KTN 23, 2019 4:09:37 PM > com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler > uncaughtException >[junit4] 2> WARNING: Uncaught exception in thread: > Thread[Thread-612,5,TGRP-TestIndexTooManyDocs] >[junit4] 2> java.lang.AssertionError: only modifications from the > current flushing queue are permitted while doing a full flush >[junit4] 2> at > __randomizedtesting.SeedInfo.seed([1F16B1DA7056AA52]:0) >[junit4] 2> at > org.apache.lucene.index.DocumentsWriter.assertTicketQueueModification(DocumentsWriter.java:683) >[junit4] 2> at > org.apache.lucene.index.DocumentsWriter.applyAllDeletes(DocumentsWriter.java:187) >[junit4] 2> at > org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:411) >[junit4] 2> at > org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:514) >[junit4] 2> at > org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594) >[junit4] 2> at > org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586) >[junit4] 2> at > org.apache.lucene.index.TestIndexTooManyDocs.lambda$testIndexTooManyDocs$0(TestIndexTooManyDocs.java:70) >[junit4] 2> at java.base/java.lang.Thread.run(Thread.java:834) 
>[junit4] 2> >[junit4] 2> KTN 23, 2019 6:09:36 PM > com.carrotsearch.randomizedtesting.ThreadLeakControl$2 evaluate >[junit4] 2> WARNING: Suite execution timed out: > org.apache.lucene.index.TestIndexTooManyDocs >[junit4] 2>1) Thread[id=669, > name=SUITE-TestIndexTooManyDocs-seed#[1F16B1DA7056AA52], state=RUNNABLE, > group=TGRP-TestIndexTooManyDocs] >[junit4] 2> at > java.base/java.lang.Thread.getStackTrace(Thread.java:1606) >[junit4] 2> at > com.carrotsearch.randomizedtesting.ThreadLeakControl$4.run(ThreadLeakControl.java:696) >[junit4] 2> at > com.carrotsearch.randomizedtesting.ThreadLeakControl$4.run(ThreadLeakControl.java:693) >[junit4] 2> at > java.base/java.security.AccessController.doPrivileged(Native Method) >[junit4] 2> at > com.carrotsearch.randomizedtesting.ThreadLeakControl.getStackTrace(ThreadLeakControl.java:693) >[junit4] 2> at > com.carrotsearch.randomizedtesting.ThreadLeakControl.getThreadsWithTraces(ThreadLeakControl.java:709) >[junit4] 2> at > com.carrotsearch.randomizedtesting.ThreadLeakControl.formatThreadStacksFull(ThreadLeakControl.java:689) >[junit4] 2> at > com.carrotsearch.randomizedtesting.ThreadLeakControl.access$1000(ThreadLeakControl.java:65) >[junit4] 2> at > com.carrotsearch.randomizedtesting.ThreadLeakControl$2.evaluate(ThreadLeakControl.java:415) >[junit4] 2> at > com.carrotsearch.randomizedtesting.RandomizedRunner.runSuite(RandomizedRunner.java:708) >[junit4] 2> at > com.carrotsearch.randomizedtesting.RandomizedRunner.access$200(RandomizedRunner.java:138) >[junit4] 2> at > com.carrotsearch.randomizedtesting.RandomizedRunner$2.run(RandomizedRunner.java:629) >[junit4] 2>2) Thread[id=671, name=Thread-606, state=BLOCKED, > group=TGRP-TestIndexTooManyDocs] >[junit4] 2> at > app//org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:4945) >[junit4] 2> at > app//org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:293) >[junit4] 2> at > 
app//org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:
[jira] [Commented] (LUCENE-8813) testIndexTooManyDocs fails
[ https://issues.apache.org/jira/browse/LUCENE-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849506#comment-16849506 ] Simon Willnauer commented on LUCENE-8813: - I looked at this and I think the issue here is that we are executing 2 flushes very quickly after one another while, at the same time, a single thread has already released its DWPT before the first flush but has not tried to apply deletes before the second flush is done. In this case the assertion doesn't hold anymore. The window is super small, which is likely why we never tripped this before. I don't think we have a correctness issue here, but I will still try to improve the way we assert/apply deletes. > testIndexTooManyDocs fails > -- > > Key: LUCENE-8813 > URL: https://issues.apache.org/jira/browse/LUCENE-8813 > Project: Lucene - Core > Issue Type: Test > Components: core/index > Reporter: Nhat Nguyen > Priority: Major > > testIndexTooManyDocs fails on [Elastic CI|https://elasticsearch-ci.elastic.co/job/apache+lucene-solr+branch_8x/6402/console]. > This failure does not reproduce locally for me.
[jira] [Commented] (LUCENE-8757) Better Segment To Thread Mapping Algorithm
[ https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16843726#comment-16843726 ] Simon Willnauer commented on LUCENE-8757: - [~atris] instead of asserting the order, can we just sort the slices in the LeafSlice ctor? This should prevent any issues down the road, and it's cheap enough IMO. > Better Segment To Thread Mapping Algorithm > -- > > Key: LUCENE-8757 > URL: https://issues.apache.org/jira/browse/LUCENE-8757 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Atri Sharma > Assignee: Simon Willnauer > Priority: Major > Attachments: LUCENE-8757.patch, LUCENE-8757.patch, LUCENE-8757.patch, > LUCENE-8757.patch > > > The current segment-to-thread allocation algorithm always allocates one thread per segment. This is detrimental to performance in case of skew in segment sizes, since small segments also get their dedicated thread. This can lead to performance degradation due to context-switching overheads. > > A better algorithm which is cognizant of size skew would have better performance for realistic scenarios.
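Sorting inside the LeafSlice constructor, as suggested in the comment above, can be sketched as ordering a slice's leaves by their docBase. The LeafContext stand-in is illustrative: Lucene's LeafReaderContext does carry a docBase, but the constructor shown here is a sketch of the idea, not Lucene's code:

```java
import java.util.Arrays;
import java.util.Comparator;

public class LeafSliceDemo {
    // Minimal stand-in for LeafReaderContext: only the docBase matters here.
    static final class LeafContext {
        final int docBase;
        LeafContext(int docBase) { this.docBase = docBase; }
    }

    // Sketch of a LeafSlice that sorts its leaves on construction, so the
    // collection order within a slice never depends on allocation order.
    static final class LeafSlice {
        final LeafContext[] leaves;
        LeafSlice(LeafContext... leaves) {
            this.leaves = leaves.clone();
            Arrays.sort(this.leaves, Comparator.comparingInt((LeafContext l) -> l.docBase));
        }
    }

    public static void main(String[] args) {
        // Leaves handed over in arbitrary order by the slicing algorithm:
        LeafSlice slice = new LeafSlice(
            new LeafContext(200), new LeafContext(0), new LeafContext(100));
        StringBuilder sb = new StringBuilder();
        for (LeafContext l : slice.leaves) sb.append(l.docBase).append(' ');
        System.out.println(sb.toString().trim()); // 0 100 200
    }
}
```

Sorting in the constructor is cheaper and safer than asserting: every caller gets the invariant for free instead of tripping an assertion later.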
Re: [jira] [Commented] (LUCENE-8757) Better Segment To Thread Mapping Algorithm
I think this should be done inside IndexSearcher. It's a general problem, no? > On 13. May 2019, at 10:25, Adrien Grand (JIRA) wrote: > > > [ https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838363#comment-16838363 ] > > Adrien Grand commented on LUCENE-8757: > -- > > Yes. Top-docs collectors are expected to tie-break by doc ID in case documents compare equal. Things like TopDocs#merge compare doc IDs explicitly for that purpose, but Collector#collect implementations just rely on the fact that documents are collected in order to ignore documents that compare equal to the current k-th best hit. So we need to sort segments within a slice by docBase in order to get the same top hits regardless of how slices have been constructed.
[jira] [Assigned] (LUCENE-8757) Better Segment To Thread Mapping Algorithm
[ https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer reassigned LUCENE-8757:

Assignee: Simon Willnauer
[jira] [Commented] (LUCENE-8757) Better Segment To Thread Mapping Algorithm
[ https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837615#comment-16837615 ]

Simon Willnauer commented on LUCENE-8757:

LGTM. I will try to commit this in the coming days.
[jira] [Commented] (LUCENE-8757) Better Segment To Thread Mapping Algorithm
[ https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837003#comment-16837003 ]

Simon Willnauer commented on LUCENE-8757:

{quote}
I think there is an important justification for the 2nd criterion (number of segments in each work unit / slice): if you have an index with some large segments and a long tail of small segments (which easily happens if your machine has substantial CPU concurrency and you use multiple threads), then, since there is a fixed cost for visiting each segment, putting too many small segments into one work unit multiplies those fixed costs, and that one work unit can become too slow even though it is not actually going to visit too many documents. I think we should keep it?
{quote}

Fair enough, let's add it back.
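The two criteria being discussed can be sketched as follows. This is an illustrative simplification, not the committed patch: the `Segment` record and the two threshold constants are hypothetical here, and the real code operates on `LeafReaderContext` instances.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a segment; only its doc count matters here.
record Segment(String name, int maxDoc) {}

class SliceGroupingSketch {
    // Illustrative thresholds; the actual defaults were still under
    // discussion in this thread.
    static final int MAX_DOCS_PER_SLICE = 250_000;
    static final int MAX_SEGMENTS_PER_SLICE = 5;

    // Group segments (largest first) into slices, closing a slice once it
    // holds MAX_DOCS_PER_SLICE documents or MAX_SEGMENTS_PER_SLICE segments.
    // The second limit is the point Mike raises: without it, a long tail of
    // tiny segments piles fixed per-segment costs into a single work unit.
    static List<List<Segment>> slices(List<Segment> segments) {
        List<Segment> sorted = new ArrayList<>(segments);
        sorted.sort(Comparator.comparingInt(Segment::maxDoc).reversed());
        List<List<Segment>> groups = new ArrayList<>();
        List<Segment> group = null;
        long docs = 0;
        for (Segment s : sorted) {
            if (group == null
                    || docs + s.maxDoc() > MAX_DOCS_PER_SLICE
                    || group.size() >= MAX_SEGMENTS_PER_SLICE) {
                group = new ArrayList<>();
                groups.add(group);
                docs = 0;
            }
            group.add(s);
            docs += s.maxDoc();
        }
        return groups;
    }
}
```

A segment larger than the doc threshold naturally ends up alone in its slice, while small segments are packed together up to both limits.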
[jira] [Commented] (LUCENE-8785) TestIndexWriterDelete.testDeleteAllNoDeadlock failure
[ https://issues.apache.org/jira/browse/LUCENE-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835894#comment-16835894 ]

Simon Willnauer commented on LUCENE-8785:

{quote}
Please feel free to commit this to the release branch. In case of a re-spin, I'll pick this change up.
{quote}

[~ichattopadhyaya] done. Thanks.

> TestIndexWriterDelete.testDeleteAllNoDeadlock failure
>
> Key: LUCENE-8785
> URL: https://issues.apache.org/jira/browse/LUCENE-8785
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Affects Versions: 7.6
> Environment: OpenJDK 1.8.0_202
> Reporter: Michael McCandless
> Assignee: Simon Willnauer
> Priority: Minor
> Fix For: 7.7.2, master (9.0), 8.2, 8.1.1
> Time Spent: 40m
> Remaining Estimate: 0h
>
> I was running Lucene's core tests on an {{i3.16xlarge}} EC2 instance (64 cores), and hit this random yet spooky failure:
> {noformat}
> [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestIndexWriterDelete -Dtests.method=testDeleteAllNoDeadLock -Dtests.seed=952BE262BA547C1 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=ar-YE -Dtests.timezone=Europe/Lisbon -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
> [junit4] ERROR 0.16s J3 | TestIndexWriterDelete.testDeleteAllNoDeadLock <<<
> [junit4] > Throwable #1: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=36, name=Thread-2, state=RUNNABLE, group=TGRP-TestIndexWriterDelete]
> [junit4] > at __randomizedtesting.SeedInfo.seed([952BE262BA547C1:3A4B5138AB66FD97]:0)
> [junit4] > Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: field number 0 is already mapped to field name "null", not "content"
> [junit4] > at __randomizedtesting.SeedInfo.seed([952BE262BA547C1]:0)
> [junit4] > at org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:332)
> [junit4] > Caused by: java.lang.IllegalArgumentException: field number 0 is already mapped to field name "null", not "content"
> [junit4] > at org.apache.lucene.index.FieldInfos$FieldNumbers.verifyConsistent(FieldInfos.java:310)
> [junit4] > at org.apache.lucene.index.FieldInfos$Builder.getOrAdd(FieldInfos.java:415)
> [junit4] > at org.apache.lucene.index.DefaultIndexingChain.getOrAddField(DefaultIndexingChain.java:650)
> [junit4] > at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:428)
> [junit4] > at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
> [junit4] > at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
> [junit4] > at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
> [junit4] > at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
> [junit4] > at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)
> [junit4] > at org.apache.lucene.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:159)
> [junit4] > at org.apache.lucene.index.TestIndexWriterDelete$1.run(TestIndexWriterDelete.java:326){noformat}
> It does *not* reproduce unfortunately ... but maybe there is some subtle thread safety issue in this code ... this is a hairy part of Lucene ;)
[jira] [Commented] (LUCENE-8757) Better Segment To Thread Mapping Algorithm
[ https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835481#comment-16835481 ]

Simon Willnauer commented on LUCENE-8757:

Thanks for the additional iteration. Now that we have simplified this, can we remove the sorting? I don't necessarily see how the sort makes things simpler. If we see a segment > threshold we can just add it as a group? I thought you did that already, hence my comment about the assertion. WDYT?

I also want to suggest beefing up testing a bit with a randomized version like this:

{code}
diff --git a/lucene/test-framework/src/java/org/apache/lucene/util/LuceneTestCase.java b/lucene/test-framework/src/java/org/apache/lucene/util/LuceneTestCase.java
index 7c63a817adb..76ccca64ee7 100644
--- a/lucene/test-framework/src/java/org/apache/lucene/util/LuceneTestCase.java
+++ b/lucene/test-framework/src/java/org/apache/lucene/util/LuceneTestCase.java
@@ -1933,6 +1933,14 @@ public abstract class LuceneTestCase extends Assert {
       ret = random.nextBoolean()
           ? new AssertingIndexSearcher(random, r, ex)
           : new AssertingIndexSearcher(random, r.getContext(), ex);
+    } else if (random.nextBoolean()) {
+      int maxDocPerSlice = 1 + random.nextInt(10);
+      ret = new IndexSearcher(r, ex) {
+        @Override
+        protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
+          return slices(leaves, maxDocPerSlice);
+        }
+      };
     } else {
       ret = random.nextBoolean()
           ? new IndexSearcher(r, ex)
{code}
[jira] [Commented] (LUCENE-7840) BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 1 MUST/FILTER clause and 0==minShouldMatch
[ https://issues.apache.org/jira/browse/LUCENE-7840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835473#comment-16835473 ]

Simon Willnauer commented on LUCENE-7840:

LGTM.

> BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 1 MUST/FILTER clause and 0==minShouldMatch
>
> Key: LUCENE-7840
> URL: https://issues.apache.org/jira/browse/LUCENE-7840
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Hoss Man
> Priority: Major
> Attachments: LUCENE-7840.patch, LUCENE-7840.patch, LUCENE-7840.patch
>
> I haven't thought this through completely, let alone written up a patch / test case, but IIUC...
>
> We should be able to optimize {{BooleanQuery rewriteNoScoring()}} so that (after converting MUST clauses to FILTER clauses) we can check for the common case of {{0==getMinimumNumberShouldMatch()}} and throw away any SHOULD clauses as long as there is at least one FILTER clause.
[jira] [Resolved] (LUCENE-8785) TestIndexWriterDelete.testDeleteAllNoDeadlock failure
[ https://issues.apache.org/jira/browse/LUCENE-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer resolved LUCENE-8785.

Resolution: Fixed
[jira] [Updated] (LUCENE-8785) TestIndexWriterDelete.testDeleteAllNoDeadlock failure
[ https://issues.apache.org/jira/browse/LUCENE-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-8785:

Fix Version/s: (was: 8.0.1)
               (was: 8.1)
               (was: 7.7.1)
               8.2
               7.7.2
               8.1.1
[jira] [Commented] (LUCENE-7840) BooleanQuery.rewriteNoScoring - optimize away any SHOULD clauses if at least 1 MUST/FILTER clause and 0==minShouldMatch
[ https://issues.apache.org/jira/browse/LUCENE-7840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834778#comment-16834778 ]

Simon Willnauer commented on LUCENE-7840:

I think there are some style issues in this patch, like here, where _else_ should be on the previous line:

{code:java}
+      }
+    }
+    else {
+      newQuery.add(clause);
+    }
{code}

The other question is whether we should use a switch instead of if / else? Otherwise it's looking fine.
[jira] [Commented] (LUCENE-8757) Better Segment To Thread Mapping Algorithm
[ https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834767#comment-16834767 ]

Simon Willnauer commented on LUCENE-8757:

[~atris] I think the assertion in this part doesn't hold:

{code}
+    for (LeafReaderContext ctx : sortedLeaves) {
+      if (ctx.reader().maxDoc() > maxDocsPerSlice) {
+        assert group == null;
+        List<LeafReaderContext> singleSegmentSlice = new ArrayList<>();
{code}

If the previous segment was smallish then _group_ is non-null? I think you should test these cases; maybe add a random test and randomize the order of the segments. This:

{code}
+        List<LeafReaderContext> singleSegmentSlice = new ArrayList<>();
+        singleSegmentSlice.add(ctx);
+        groupedLeaves.add(singleSegmentSlice);
{code}

can and should be replaced by:

{code}
groupedLeaves.add(Collections.singletonList(ctx));
{code}

Otherwise it looks good.
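One way the loop could avoid the broken `assert group == null` is to flush any open group of small segments before giving an oversized segment its own singleton slice. The sketch below is a hypothetical simplification of the patch under review (plain `int` doc counts stand in for `LeafReaderContext.reader().maxDoc()`):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class OversizedSegmentSketch {
    // Group segment doc counts into slices: an oversized segment must not
    // assume no group is open -- it first closes any open group of small
    // segments, then becomes a singleton slice of its own.
    static List<List<Integer>> group(List<Integer> maxDocs, int maxDocsPerSlice) {
        List<List<Integer>> grouped = new ArrayList<>();
        List<Integer> open = null;
        long docs = 0;
        for (int maxDoc : maxDocs) {
            if (maxDoc > maxDocsPerSlice) {
                if (open != null) {            // previous small segments: flush first
                    grouped.add(open);
                    open = null;
                    docs = 0;
                }
                grouped.add(Collections.singletonList(maxDoc));
            } else {
                if (open == null) {
                    open = new ArrayList<>();
                }
                open.add(maxDoc);
                docs += maxDoc;
                if (docs >= maxDocsPerSlice) { // slice is full: flush
                    grouped.add(open);
                    open = null;
                    docs = 0;
                }
            }
        }
        if (open != null) {                    // flush the trailing partial slice
            grouped.add(open);
        }
        return grouped;
    }
}
```

With this shape, a small segment followed by an oversized one produces two slices instead of tripping an assertion.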
[jira] [Commented] (LUCENE-8757) Better Segment To Thread Mapping Algorithm
[ https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834525#comment-16834525 ]

Simon Willnauer commented on LUCENE-8757:

[~atris] actually I thought about these defaults again and I am starting to think it's an ok default. The reason is that we try to prevent having dedicated threads for smallish segments, so we group them together. I still wonder if we need two parameters: wouldn't it be enough to just say that we group things together until we have at least 250k docs per thread to be searched? Is it really necessary to have another parameter that limits the number of segments per slice? I think a single parameter would be great and simpler. WDYT?
[jira] [Updated] (LUCENE-8785) TestIndexWriterDelete.testDeleteAllNoDeadlock failure
[ https://issues.apache.org/jira/browse/LUCENE-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-8785:

Fix Version/s: 7.7.1
               master (9.0)
               8.1
               8.0.1
[jira] [Assigned] (LUCENE-8785) TestIndexWriterDelete.testDeleteAllNoDeadlock failure
[ https://issues.apache.org/jira/browse/LUCENE-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer reassigned LUCENE-8785:

Assignee: Simon Willnauer
[jira] [Commented] (LUCENE-8785) TestIndexWriterDelete.testDeleteAllNoDeadlock failure
[ https://issues.apache.org/jira/browse/LUCENE-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834467#comment-16834467 ]

Simon Willnauer commented on LUCENE-8785:

{quote}
If there is another thread coming in after we locked the existent threadstates we just issue a new one. Yuck
{quote}

I looked at the code again and we actually lock the threadstates for this purpose; I implemented this in LUCENE-8639. The issue here is in fact a race condition, since we request the number of active threadstates before we lock new ones. It's a classic one-line fix. I referenced a PR for this. [~mikemccand] would you take a look?
[jira] [Commented] (LUCENE-8757) Better Segment To Thread Mapping Algorithm
[ https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1683#comment-1683 ] Simon Willnauer commented on LUCENE-8757: - > Would it make sense to push this patch, and then let users consume it and > provide feedback while we iterate on the more sophisticated version? We could > even have both of the methods available as options to users, potentially I don't think we should push this if we already know we wanna do something different. That said, I am not convinced the numbers are good defaults. At the same time I don't have any numbers here. Do you have anything to back these defaults up? > Better Segment To Thread Mapping Algorithm > -- > > Key: LUCENE-8757 > URL: https://issues.apache.org/jira/browse/LUCENE-8757 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Atri Sharma >Priority: Major > Attachments: LUCENE-8757.patch > > > The current segments to threads allocation algorithm always allocates one > thread per segment. This is detrimental to performance in case of skew in > segment sizes since small segments also get their dedicated thread. This can > lead to performance degradation due to context switching overheads. > > A better algorithm which is cognizant of size skew would have better > performance for realistic scenarios
[jira] [Commented] (LUCENE-8757) Better Segment To Thread Mapping Algorithm
[ https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832343#comment-16832343 ] Simon Willnauer commented on LUCENE-8757: - Thanks [~atris], can you bring back the javadocs for {code:java} protected LeafSlice[] slices(List<LeafReaderContext> leaves){code} Please don't reassign an argument like here: {code:java} leaves = new ArrayList<>(leaves); {code} The rest of the patch looks OK to me, yet I am not so sure about the defaults. I do wonder if we should look at this from a different perspective. Rather than using hard numbers, can we try to evenly balance the total number of documents across N threads and make N the variable? [~mikemccand] WDYT?
[jira] [Commented] (LUCENE-8785) TestIndexWriterDelete.testDeleteAllNoDeadlock failure
[ https://issues.apache.org/jira/browse/LUCENE-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832336#comment-16832336 ] Simon Willnauer commented on LUCENE-8785: - {quote} I realize neither ES nor Solr expose deleteAll but I don't think that's a valid argument to remove it from Lucene. {quote} huh, I don't think that's a valid argument either, I just re-read my comments - sorry if you felt I was alluding to es or solr here. My argument is that if you want to do that you should construct a new IndexWriter instead of calling deleteAll(). Given this comment on the javadocs: {noformat} Essentially a call to {@link #deleteAll()} is equivalent to creating a new {@link IndexWriter} with {@link OpenMode#CREATE} {noformat} I want to understand why, in such a rather edgy case, a user can't do exactly this. There is no race, no confusion, it's very simple from a semantics perspective. Currently there are two ways and one is confusing. I think we should move towards removing the second way. {quote}And for some reason the index is reset once per week, but the devs want to allow searching of the old index while the new index is (slowly) built up. But if something goes badly wrong, they need to be able to rollback (the deleteAll and all subsequently added docs) to the last commit and try again later. If instead it succeeds, then a refresh/commit will switch to the new index atomically. {quote} Well, there are tons of ways to do that, no? I mean you can have 2 directories? Yes it causes some engineering effort but the semantics would be cleaner even for the app that does what you explain.
[jira] [Commented] (LUCENE-8785) TestIndexWriterDelete.testDeleteAllNoDeadlock failure
[ https://issues.apache.org/jira/browse/LUCENE-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831635#comment-16831635 ] Simon Willnauer commented on LUCENE-8785: - > But at the point we call clear() haven't we already blocked all indexing > threads? no, it might look like we do that but we don't. We block and lock all threads up to that point in time. If there is another thread coming in after we locked the existent threadstates we just issue a new one. > I also dislike deleteAll() and you're right a user could deleteByQuery using > MatchAllDocsQuery; can we make that close-ish as efficient as deleteAll() is > today? I think we can just do what deleteAll() does today except not dropping the schema on the floor? > Though indeed that would preserve the schema, while deleteAll() let's you > delete docs, delete schema, all under transaction (the change is not visible > until commit). I want to understand the use case for this. I can see how somebody wants to drop all docs but basically dropping all IW state on the floor is difficult in my eyes.
[jira] [Commented] (LUCENE-8785) TestIndexWriterDelete.testDeleteAllNoDeadlock failure
[ https://issues.apache.org/jira/browse/LUCENE-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831612#comment-16831612 ] Simon Willnauer commented on LUCENE-8785: - [~mikemccand] I think this is caused by the fact that we simply call _clear()_ during _IW#deleteAll()_. If this happens concurrently to a document being indexed, this assertion can trip. I personally always disliked the complexity of _IW#deleteAll_ and from my perspective we should remove this method entirely and ask users to open a new IW if they want to drop all the information including the _schema_. We can still fast-path a _MatchAllQuery_ through something like this as we do today (which is a problem IMO since it drops all fields map info, which it shouldn't?). IMO if you want a fresh index, start from scratch, but to delete all docs go and run a DeleteByQuery and keep the schema.
[jira] [Commented] (LUCENE-8776) Start offset going backwards has a legitimate purpose
[ https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831604#comment-16831604 ] Simon Willnauer commented on LUCENE-8776: - [~venkat11] I do understand your frustration. Believe me, we don't take changes like this easily. One person's bug is another person's feature, and as we grow and mature, strong guarantees are essential for a vast majority of users, for future developments, faster iterations and more performant code. There might not be a tradeoff from your perspective; from the maintainers' perspective there is. Now we can debate if a major version bump is _enough_ time to migrate or not; our policy is that we can make BWC and behavioral changes like this in a major release. In fact, we don't do it in minors, to provide you the time you need and to ease upgrades to minors. We will and have built features on top of this guarantee and in order to manage expectations I am pretty sure we won't go back and allow negative offsets. I think your best option, whether you like it or not, is to work towards a fix for your issue with either the tools you have now or improve Lucene, for instance with the suggestion from [~mgibney] regarding indexing more information. Please don't get mad at me, I am just trying to manage expectations. > Start offset going backwards has a legitimate purpose > - > > Key: LUCENE-8776 > URL: https://issues.apache.org/jira/browse/LUCENE-8776 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 7.6 >Reporter: Ram Venkat >Priority: Major > > Here is the use case where startOffset can go backwards: > Say there is a line "Organic light-emitting-diode glows", and I want to run > span queries and highlight them properly. > During index time, light-emitting-diode is split into three words, which > allows me to search for 'light', 'emitting' and 'diode' individually.
The > three words occupy adjacent positions in the index, as 'light' adjacent to > 'emitting' and 'light' at a distance of two words from 'diode' need to match > this word. So, the order of words after splitting are: Organic, light, > emitting, diode, glows. > But, I also want to search for 'organic' being adjacent to > 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. > The way I solved this was to also generate 'light-emitting-diode' at two > positions: (a) In the same position as 'light' and (b) in the same position > as 'glows', like below: > ||organic||light||emitting||diode||glows|| > | |light-emitting-diode| |light-emitting-diode| | > |0|1|2|3|4| > The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets > are obviously the same. This works beautifully in Lucene 5.x in both > searching and highlighting with span queries. > But when I try this in Lucene 7.6, it hits the condition "Offsets must not go > backwards" at DefaultIndexingChain:818. This IllegalArgumentException is > being thrown without any comments on why this check is needed. As I explained > above, startOffset going backwards is perfectly valid, to deal with word > splitting and span operations on these specialized use cases. On the other > hand, it is not clear what value is added by this check and which highlighter > code is affected by offsets going backwards. This same check is done at > BaseTokenStreamTestCase:245. > I see others talk about how this check found bugs in WordDelimiter etc. but > it also prevents legitimate use cases. Can this check be removed? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
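The check the reporter is hitting is simply that startOffset must never decrease across the token stream. A stdlib-only sketch of that invariant over the token layout described above (positions and offsets are illustrative values, and this is a stand-in, not Lucene's DefaultIndexingChain code):

```java
import java.util.List;

public class OffsetCheck {
    // A minimal token: term text, position, and character offsets.
    record Token(String term, int position, int startOffset, int endOffset) {}

    // The invariant enforced at indexing time since Lucene 7: startOffset must
    // never go backwards. Returns the first offending term, or null if well-formed.
    static String firstBackwardsOffset(List<Token> tokens) {
        int last = -1;
        for (Token t : tokens) {
            if (t.startOffset() < last) {
                return t.term();
            }
            last = t.startOffset();
        }
        return null;
    }

    public static void main(String[] args) {
        // "Organic light-emitting-diode glows" with the compound token injected
        // at positions 1 and 3, as in the issue (offsets are illustrative).
        List<Token> tokens = List.of(
            new Token("organic", 0, 0, 7),
            new Token("light", 1, 8, 13),
            new Token("light-emitting-diode", 1, 8, 28),
            new Token("emitting", 2, 14, 22),
            new Token("diode", 3, 23, 28),
            new Token("light-emitting-diode", 3, 8, 28), // startOffset jumps back to 8
            new Token("glows", 4, 29, 34));
        System.out.println("backwards offset at: " + firstBackwardsOffset(tokens));
    }
}
```

The second copy of 'light-emitting-diode' jumps back from offset 23 to 8, which is exactly the pattern the IllegalArgumentException rejects.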
[jira] [Commented] (LUCENE-8757) Better Segment To Thread Mapping Algorithm
[ https://issues.apache.org/jira/browse/LUCENE-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831591#comment-16831591 ] Simon Willnauer commented on LUCENE-8757: - Hey Atri, thanks for putting up this patch, here is some additional feedback: - can we stick with a protected non-static method on IndexSearcher? Subclasses should be able to override your impl. I think it's ok to have a static method like this: {code:java} public static LeafSlice[] slices(List<LeafReaderContext> leaves, int maxDocsPerSlice, int maxSegPerSlice){code} that you can call from the protected method with your defaults? - you might want to change your sort to something like this: {code:java} Collections.sort(leaves, Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc()))); {code} - I think the _Leaves_ class is unnecessary, we can just use _List<LeafReaderContext>_ instead?
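The budget-based slicing being discussed (pack the largest segments first, close a slice once it reaches a document or segment budget) can be sketched with plain collections. The names, the Segment record, and the values below are assumptions for illustration, not Lucene's actual LeafSlice code:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SliceSketch {
    // Stand-in for a segment reader: just a name and its document count.
    record Segment(String name, int maxDoc) {}

    // Greedily pack segments, largest first, closing a slice once it would
    // exceed maxDocsPerSlice documents or already holds maxSegsPerSlice segments.
    static List<List<Segment>> slices(List<Segment> leaves,
                                      int maxDocsPerSlice, int maxSegsPerSlice) {
        List<Segment> sorted = new ArrayList<>(leaves); // copy instead of reassigning the argument
        sorted.sort(Comparator.comparingInt(Segment::maxDoc).reversed());
        List<List<Segment>> slices = new ArrayList<>();
        List<Segment> current = new ArrayList<>();
        int docs = 0;
        for (Segment leaf : sorted) {
            if (!current.isEmpty()
                    && (docs + leaf.maxDoc() > maxDocsPerSlice || current.size() >= maxSegsPerSlice)) {
                slices.add(current);
                current = new ArrayList<>();
                docs = 0;
            }
            current.add(leaf);
            docs += leaf.maxDoc();
        }
        if (!current.isEmpty()) {
            slices.add(current);
        }
        return slices;
    }

    public static void main(String[] args) {
        List<Segment> leaves = List.of(
                new Segment("_a", 250_000), new Segment("_b", 1_000),
                new Segment("_c", 500), new Segment("_d", 240_000));
        List<List<Segment>> result = slices(leaves, 250_000, 5);
        System.out.println(result.size() + " slices"); // _a alone; _d, _b, _c share one
    }
}
```

Large segments fill slices on their own while many small segments share one, which avoids dedicating a search thread to every tiny segment.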
[jira] [Resolved] (LUCENE-8671) Add setting for moving FST offheap/onheap
[ https://issues.apache.org/jira/browse/LUCENE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-8671. - Resolution: Fixed Assignee: Simon Willnauer Fix Version/s: master (9.0) 8.1 > Add setting for moving FST offheap/onheap > - > > Key: LUCENE-8671 > URL: https://issues.apache.org/jira/browse/LUCENE-8671 > Project: Lucene - Core > Issue Type: New Feature > Components: core/FSTs, core/store >Reporter: Ankit Jain > Assignee: Simon Willnauer >Priority: Minor > Fix For: 8.1, master (9.0) > > Attachments: offheap_generic_settings.patch, offheap_settings.patch > > Original Estimate: 24h > Time Spent: 5h > Remaining Estimate: 19h > > While LUCENE-8635 adds support for loading FST offheap using mmap, users do > not have the flexibility to specify fields for which FST needs to be > offheap. This allows users to tune heap usage as per their workload. > Ideal way will be to add an attribute to FieldInfo, where we have > put/getAttribute. Then FieldReader can inspect the FieldInfo and pass the > appropriate On/OffHeapStore when creating its FST. It can support special > keywords like ALL/NONE.
[jira] [Resolved] (LUCENE-8754) SegmentInfo#toString can cause ConcurrentModificationException
[ https://issues.apache.org/jira/browse/LUCENE-8754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-8754. - Resolution: Fixed Fix Version/s: master (9.0) 8.1 > SegmentInfo#toString can cause ConcurrentModificationException > -- > > Key: LUCENE-8754 > URL: https://issues.apache.org/jira/browse/LUCENE-8754 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Simon Willnauer >Priority: Major > Fix For: 8.1, master (9.0) > > Time Spent: 3h > Remaining Estimate: 0h > > A recent change increased the likelihood for this issue to show up but it can > already happen before since we are using the attributes map in the > StoredFieldsFormat for quite some time. I found this issue due to a test > failure on our CI: > {noformat} > 13:11:56[junit4] Suite: org.apache.lucene.index.TestIndexSorting > 13:11:56[junit4] 2> apr 05, 2019 8:11:53 AM > com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler > uncaughtException > 13:11:56[junit4] 2> WARNING: Uncaught exception in thread: > Thread[Thread-507,5,TGRP-TestIndexSorting] > 13:11:56[junit4] 2> java.util.ConcurrentModificationException > 13:11:56[junit4] 2> at > __randomizedtesting.SeedInfo.seed([7C25B308F180203B]:0) > 13:11:56[junit4] 2> at > java.util.HashMap$HashIterator.nextNode(HashMap.java:1442) > 13:11:56[junit4] 2> at > java.util.HashMap$EntryIterator.next(HashMap.java:1476) > 13:11:56[junit4] 2> at > java.util.HashMap$EntryIterator.next(HashMap.java:1474) > 13:11:56[junit4] 2> at > java.util.AbstractMap.toString(AbstractMap.java:554) > 13:11:56[junit4] 2> at > org.apache.lucene.index.SegmentInfo.toString(SegmentInfo.java:222) > 13:11:56[junit4] 2> at > org.apache.lucene.index.SegmentCommitInfo.toString(SegmentCommitInfo.java:345) > 13:11:56[junit4] 2> at > org.apache.lucene.index.SegmentCommitInfo.toString(SegmentCommitInfo.java:364) > 13:11:56[junit4] 2> at java.lang.String.valueOf(String.java:2994) > 13:11:56[junit4] 2> at > 
java.lang.StringBuilder.append(StringBuilder.java:131) > 13:11:56[junit4] 2> at > java.util.AbstractMap.toString(AbstractMap.java:557) > 13:11:56[junit4] 2> at > java.util.Collections$UnmodifiableMap.toString(Collections.java:1493) > 13:11:56[junit4] 2> at java.lang.String.valueOf(String.java:2994) > 13:11:56[junit4] 2> at > java.lang.StringBuilder.append(StringBuilder.java:131) > 13:11:56[junit4] 2> at > org.apache.lucene.index.TieredMergePolicy.findForcedMerges(TieredMergePolicy.java:628) > 13:11:56[junit4] 2> at > org.apache.lucene.index.IndexWriter.updatePendingMerges(IndexWriter.java:2181) > 13:11:56[junit4] 2> at > org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2154) > 13:11:56[junit4] 2> at > org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1988) > 13:11:56[junit4] 2> at > org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1939) > 13:11:56[junit4] 2> at > org.apache.lucene.index.TestIndexSorting$UpdateRunnable.run(TestIndexSorting.java:1851) > 13:11:56[junit4] 2> at java.lang.Thread.run(Thread.java:748) > 13:11:56[junit4] 2> > 13:11:56[junit4] 2> NOTE: reproduce with: ant test > -Dtestcase=TestIndexSorting -Dtests.method=testConcurrentUpdates > -Dtests.seed=7C25B308F180203B -Dtests.slow=true -Dtest > {noformat} > The issue is that we update the attributes map (also we similarly do the same > for diagnostics but it's not necessarily causing the issue since the > diagnostics map is never modified) during the merge process but access it in > the merge policy when looking at running merges and there we call toString on > SegmentCommitInfo which happens without any synchronization. This is > technically unsafe publication but IW is a mess along those lines and real > fixes would require significant changes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
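The stack trace shows HashMap's fail-fast iterator tripping while SegmentInfo#toString walks the attributes map as a merge thread updates it. The hazard can be reproduced deterministically and single-threaded with the stdlib (stand-in code, not Lucene's):

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class CmeDemo {
    // Returns true if mutating a HashMap while an iterator is open trips the
    // fail-fast modCount check, the same failure class as SegmentInfo#toString
    // racing a merge thread that updates the attributes map.
    static boolean reproduceCme() {
        Map<String, String> attributes = new HashMap<>();
        attributes.put("mode", "BEST_SPEED");
        Iterator<Map.Entry<String, String>> it = attributes.entrySet().iterator();
        it.next();                        // toString() starts walking the map...
        attributes.put("other", "value"); // ...another thread updates an attribute...
        try {
            it.next();                    // ...and the iterator fails fast
            return false;
        } catch (ConcurrentModificationException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("CME reproduced: " + reproduceCme());
    }
}
```

HashMap's iterator checks its expected modCount before anything else in next(), so any structural modification between next() calls is guaranteed to throw, regardless of bucket layout.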
Re: Welcome Tomoko Uchida as Lucene/Solr committer
Awesome! Welcome to this group as a committer! It’s always special to grow a committer base! Simon > On 9. Apr 2019, at 06:00, Tomás Fernández Löbbe wrote: > > Welcome! > >> On Mon, Apr 8, 2019 at 5:27 PM Christian Moen wrote: >> Congratulations, Tomoko-san! >> >>> On Tue, Apr 9, 2019 at 12:20 AM Uwe Schindler wrote: >>> Hi all, >>> >>> Please join me in welcoming Tomoko Uchida as the latest Lucene/Solr >>> committer! >>> >>> She has been working on https://issues.apache.org/jira/browse/LUCENE-2562 >>> for several years with awesome progress and finally we got the fantastic >>> Luke as a branch on ASF JIRA: >>> https://gitbox.apache.org/repos/asf?p=lucene-solr.git;a=shortlog;h=refs/heads/jira/lucene-2562-luke-swing-3 >>> Looking forward to the first release of Apache Lucene 8.1 with Luke bundled >>> in the distribution. I will take care of merging it to master and 8.x >>> branches together with her once she got the ASF account. >>> >>> Tomoko also helped with the Japanese and Korean Analyzers. >>> >>> Congratulations and Welcome, Tomoko! Tomoko, it's traditional for you to >>> introduce yourself with a brief bio. >>> >>> Uwe & Robert (who nominated Tomoko) >>> >>> - >>> Uwe Schindler >>> Achterdiek 19, D-28357 Bremen >>> https://www.thetaphi.de >>> eMail: u...@thetaphi.de >>> >>> >>> >>> - >>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: dev-h...@lucene.apache.org >>>
[jira] [Created] (LUCENE-8754) SegmentInfo#toString can cause ConcurrentModificationException
Simon Willnauer created LUCENE-8754: --- Summary: SegmentInfo#toString can cause ConcurrentModificationException Key: LUCENE-8754 URL: https://issues.apache.org/jira/browse/LUCENE-8754 Project: Lucene - Core Issue Type: Improvement Reporter: Simon Willnauer A recent change increased the likelihood for this issue to show up but it can already happen before since we are using the attributes map in the StoredFieldsFormat for quite some time. I found this issue due to a test failure on our CI: {noformat} 13:11:56[junit4] Suite: org.apache.lucene.index.TestIndexSorting 13:11:56[junit4] 2> apr 05, 2019 8:11:53 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException 13:11:56[junit4] 2> WARNING: Uncaught exception in thread: Thread[Thread-507,5,TGRP-TestIndexSorting] 13:11:56[junit4] 2> java.util.ConcurrentModificationException 13:11:56[junit4] 2> at __randomizedtesting.SeedInfo.seed([7C25B308F180203B]:0) 13:11:56[junit4] 2> at java.util.HashMap$HashIterator.nextNode(HashMap.java:1442) 13:11:56[junit4] 2> at java.util.HashMap$EntryIterator.next(HashMap.java:1476) 13:11:56[junit4] 2> at java.util.HashMap$EntryIterator.next(HashMap.java:1474) 13:11:56[junit4] 2> at java.util.AbstractMap.toString(AbstractMap.java:554) 13:11:56[junit4] 2> at org.apache.lucene.index.SegmentInfo.toString(SegmentInfo.java:222) 13:11:56[junit4] 2> at org.apache.lucene.index.SegmentCommitInfo.toString(SegmentCommitInfo.java:345) 13:11:56[junit4] 2> at org.apache.lucene.index.SegmentCommitInfo.toString(SegmentCommitInfo.java:364) 13:11:56[junit4] 2> at java.lang.String.valueOf(String.java:2994) 13:11:56[junit4] 2> at java.lang.StringBuilder.append(StringBuilder.java:131) 13:11:56[junit4] 2> at java.util.AbstractMap.toString(AbstractMap.java:557) 13:11:56[junit4] 2> at java.util.Collections$UnmodifiableMap.toString(Collections.java:1493) 13:11:56[junit4] 2> at java.lang.String.valueOf(String.java:2994) 13:11:56[junit4] 2> at 
java.lang.StringBuilder.append(StringBuilder.java:131) 13:11:56[junit4] 2> at org.apache.lucene.index.TieredMergePolicy.findForcedMerges(TieredMergePolicy.java:628) 13:11:56[junit4] 2> at org.apache.lucene.index.IndexWriter.updatePendingMerges(IndexWriter.java:2181) 13:11:56[junit4] 2> at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2154) 13:11:56[junit4] 2> at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1988) 13:11:56[junit4] 2> at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1939) 13:11:56[junit4] 2> at org.apache.lucene.index.TestIndexSorting$UpdateRunnable.run(TestIndexSorting.java:1851) 13:11:56[junit4] 2> at java.lang.Thread.run(Thread.java:748) 13:11:56[junit4] 2> 13:11:56[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestIndexSorting -Dtests.method=testConcurrentUpdates -Dtests.seed=7C25B308F180203B -Dtests.slow=true -Dtest {noformat} The issue is that we update the attributes map (also we similarly do the same for diagnostics but it's not necessarily causing the issue since the diagnostics map is never modified) during the merge process but access it in the merge policy when looking at running merges and there we call toString on SegmentCommitInfo which happens without any synchronization. This is technically unsafe publication but IW is a mess along those lines and real fixes would require significant changes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
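The failure mode described above — `AbstractMap.toString` iterating the attributes map while another thread mutates it mid-merge — can be reproduced with a plain `HashMap`. The sketch below uses stand-in names, not Lucene's actual classes: the iterator fails fast on a structural modification, and rendering a point-in-time copy avoids the exception (in truly concurrent code the copy itself would still need synchronization, so this only illustrates the mechanism, not necessarily Lucene's eventual fix).

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class AttributesCmeSketch {
    // Structurally modifying a HashMap while iterating it (as
    // AbstractMap.toString does under the hood) fails fast with
    // ConcurrentModificationException. Returns true if it was thrown.
    static boolean mutateDuringIteration() {
        Map<String, String> attributes = new HashMap<>();
        attributes.put("k1", "v1");
        attributes.put("k2", "v2");
        try {
            for (Map.Entry<String, String> e : attributes.entrySet()) {
                attributes.put("k3", "v3"); // new key: structural modification mid-iteration
            }
            return false;
        } catch (ConcurrentModificationException expected) {
            return true;
        }
    }

    // One way out: render a point-in-time copy instead of the live map.
    static String safeToString(Map<String, String> attributes) {
        return new HashMap<>(attributes).toString();
    }

    public static void main(String[] args) {
        System.out.println(mutateDuringIteration()); // true
        System.out.println(safeToString(Map.of("k1", "v1"))); // {k1=v1}
    }
}
```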
[jira] [Commented] (LUCENE-8735) FileAlreadyExistsException after opening old commit
[ https://issues.apache.org/jira/browse/LUCENE-8735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801820#comment-16801820 ] Simon Willnauer commented on LUCENE-8735: - thanks henning > FileAlreadyExistsException after opening old commit > --- > > Key: LUCENE-8735 > URL: https://issues.apache.org/jira/browse/LUCENE-8735 > Project: Lucene - Core > Issue Type: Bug > Components: core/store >Affects Versions: 8.0 >Reporter: Henning Andersen >Assignee: Simon Willnauer >Priority: Major > Fix For: 7.7.1, 7.7.2, 8.0.1, 8.1, master (9.0) > > Time Spent: 40m > Remaining Estimate: 0h > > FilterDirectory.getPendingDeletes() does not delegate calls. This in turn > means that IndexFileDeleter does not consider those as relevant files. > When opening an IndexWriter for an older commit, excess files are attempted > deleted. If an IndexReader exists using one of the newer commits, the excess > files may fail to delete (at least on windows or when using the mocking > WindowsFS). > If then closing and opening the IndexWriter, the information on the pending > deletes are gone if a FilterDirectory derivate is used. At the same time, the > pending deletes are filtered out of listAll. This leads to a risk of hitting > an existing file name, causing a FileAlreadyExistsException. > This issue likely only exists on windows. > Will create pull request with fix. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-8735) FileAlreadyExistsException after opening old commit
[ https://issues.apache.org/jira/browse/LUCENE-8735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-8735. - Resolution: Fixed Assignee: Simon Willnauer Fix Version/s: 7.7.1 8.1 8.0.1 7.7.2 > FileAlreadyExistsException after opening old commit > --- > > Key: LUCENE-8735 > URL: https://issues.apache.org/jira/browse/LUCENE-8735 > Project: Lucene - Core > Issue Type: Bug > Components: core/store >Affects Versions: 8.0 >Reporter: Henning Andersen >Assignee: Simon Willnauer >Priority: Major > Fix For: 7.7.2, 8.0.1, 8.1, master (9.0), 7.7.1 > > Time Spent: 40m > Remaining Estimate: 0h > > FilterDirectory.getPendingDeletes() does not delegate calls. This in turn > means that IndexFileDeleter does not consider those as relevant files. > When opening an IndexWriter for an older commit, excess files are attempted > deleted. If an IndexReader exists using one of the newer commits, the excess > files may fail to delete (at least on windows or when using the mocking > WindowsFS). > If then closing and opening the IndexWriter, the information on the pending > deletes are gone if a FilterDirectory derivate is used. At the same time, the > pending deletes are filtered out of listAll. This leads to a risk of hitting > an existing file name, causing a FileAlreadyExistsException. > This issue likely only exists on windows. > Will create pull request with fix. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
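The bug class behind LUCENE-8735 — a forwarding wrapper that forgets to delegate one method and silently falls back to the base-class default — can be sketched with simplified stand-ins (the class names below are hypothetical; the real fix was delegating getPendingDeletes() in Lucene's FilterDirectory):

```java
import java.util.Collections;
import java.util.Set;

public class FilterDelegationDemo {
    // Base class with a default implementation, like Directory's.
    static class Dir {
        Set<String> getPendingDeletes() { return Collections.emptySet(); }
    }

    // A directory that actually tracks pending deletes.
    static class TrackingDir extends Dir {
        @Override Set<String> getPendingDeletes() { return Set.of("_0.cfs"); }
    }

    // Bug: the filter does not override getPendingDeletes(), so it inherits
    // the empty default and silently hides the wrapped directory's state.
    static class BrokenFilterDir extends Dir {
        final Dir in;
        BrokenFilterDir(Dir in) { this.in = in; }
    }

    // Fix: delegate the call to the wrapped directory.
    static class FixedFilterDir extends BrokenFilterDir {
        FixedFilterDir(Dir in) { super(in); }
        @Override Set<String> getPendingDeletes() { return in.getPendingDeletes(); }
    }

    public static void main(String[] args) {
        Dir inner = new TrackingDir();
        System.out.println(new BrokenFilterDir(inner).getPendingDeletes()); // []
        System.out.println(new FixedFilterDir(inner).getPendingDeletes());  // [_0.cfs]
    }
}
```

This is exactly why a consumer like IndexFileDeleter, which only sees the wrapper, stopped considering pending deletes as relevant files.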
Re: [JENKINS] Lucene-Solr-master-Windows (32bit/jdk1.8.0_172) - Build # 7812 - Still Unstable!
I pushed a fix for this, sorry for the noise. test-bug On Thu, Mar 21, 2019 at 9:38 AM Dawid Weiss wrote: > Ping. Jenkins builds fail on an assertion related to the recent > changes in fst off-heap? > > D. > > On Thu, Mar 21, 2019 at 6:46 AM Policeman Jenkins Server > wrote: > > > > Build: https://jenkins.thetaphi.de/job/Lucene-Solr-master-Windows/7812/ > > Java: 32bit/jdk1.8.0_172 -client -XX:+UseG1GC > > > > 5 tests failed. > > FAILED: > org.apache.lucene.codecs.lucene50.TestBlockPostingsFormat.testFstOffHeap > > > > Error Message: > > > > > > Stack Trace: > > java.lang.AssertionError > > at > __randomizedtesting.SeedInfo.seed([4086033C7FFFE0F2:5FC7DE43004D80CC]:0) > > at org.junit.Assert.fail(Assert.java:86) > > at org.junit.Assert.assertTrue(Assert.java:41) > > at org.junit.Assert.assertTrue(Assert.java:52) > > at > org.apache.lucene.codecs.lucene50.TestBlockPostingsFormat.testFstOffHeap(TestBlockPostingsFormat.java:90) > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:498) > > at > com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750) > > at > com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938) > > at > com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974) > > at > com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988) > > at > org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49) > > at > org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45) > > at > org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48) > > at > 
org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64) > > at > org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47) > > at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > > at > com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368) > > at > com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817) > > at > com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468) > > at > com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947) > > at > com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832) > > at > com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883) > > at > com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894) > > at > org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45) > > at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > > at > org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41) > > at > com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) > > at > com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) > > at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > > at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > > at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > > at > 
org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53) > > at > org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47) > > at > org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64) > > at > org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54) > > at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > > at > com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368) > > at java.lang.Thread.run(Thread.java:748) > > > > > > FAILED: > org.apache.lucene.codecs.lucene50.TestBlockPostingsFormat.testFstOffHeap > > > > Error Message: > > > > >
Re: [JENKINS] Lucene-Solr-8.x-Linux (32bit/jdk1.8.0_172) - Build # 288 - Still Unstable!
I pushed a fix for this, sorry for the noise On Thu, Mar 21, 2019 at 10:27 AM Policeman Jenkins Server < jenk...@thetaphi.de> wrote: > Build: https://jenkins.thetaphi.de/job/Lucene-Solr-8.x-Linux/288/ > Java: 32bit/jdk1.8.0_172 -client -XX:+UseSerialGC > > 6 tests failed. > FAILED: > org.apache.lucene.codecs.lucene50.TestBlockPostingsFormat.testFstOffHeap > > Error Message: > > > Stack Trace: > java.lang.AssertionError > at > __randomizedtesting.SeedInfo.seed([418BE33A6217D2DD:5ECA3E451DA5B2E3]:0) > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertTrue(Assert.java:52) > at > org.apache.lucene.codecs.lucene50.TestBlockPostingsFormat.testFstOffHeap(TestBlockPostingsFormat.java:90) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988) > at > org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49) > at > org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45) > at > org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48) > at > org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64) > at > org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47) > at > 
com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at > com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368) > at > com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817) > at > com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468) > at > com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894) > at > org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45) > at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at > org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41) > at > com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) > at > com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) > at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at > org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53) > at > org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47) > at > org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64) > at > 
org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54) > at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at > com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368) > at java.lang.Thread.run(Thread.java:748) > > > FAILED: > org.apache.lucene.codecs.lucene50.TestBlockPostingsFormat.testFstOffHeap > > Error Message: > > > Stack Trace: > java.lang.AssertionError > at > __randomizedtesting.SeedInfo.seed([418BE33A6217D2DD:5ECA3E451DA5B2E3]:0) > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.ass
Re: [VOTE] Master/9.0 to require Java 11
+1 - Java 8 EOLed last year - moving on in 2020 is reasonable and it's our responsibility to move with the platform we are running on. simon On Wed, Mar 20, 2019 at 9:27 AM Jan Høydahl wrote: > +1 > > -- > Jan Høydahl, search solution architect > Cominvent AS - www.cominvent.com > > 19. mar. 2019 kl. 19:22 skrev Adrien Grand : > > Hello, > > Now that Lucene/Solr 8.0 has shipped I'd like us to consider requiring > Java 11 for 9.0, currently the master branch. We had 18 months between > 7.0 and 8.0, so if we assume a similar interval between 8.0 and 9.0 > that would mean releasing 9.0 about 2 years after Java 11, which > sounds like a conservative requirement to me. > > What do you think? > > Here is my +1. > > -- > Adrien > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > > For additional commands, e-mail: dev-h...@lucene.apache.org > > > >
[jira] [Resolved] (LUCENE-8700) Enable concurrent flushing when no indexing is in progress
[ https://issues.apache.org/jira/browse/LUCENE-8700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-8700. - Resolution: Invalid We settled on the PR that IndexWriter#flushNextBuffer is sufficient for this use case. I opened a new PR for the test improvements here: https://github.com/apache/lucene-solr/pull/607 > Enable concurrent flushing when no indexing is in progress > -- > > Key: LUCENE-8700 > URL: https://issues.apache.org/jira/browse/LUCENE-8700 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Mike Sokolov >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > As discussed on mailing list, this is for adding a IndexWriter.yield() method > that callers can use to enable concurrent flushing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8692) IndexWriter.getTragicException() may not reflect all corrupting exceptions (notably: NoSuchFileException)
[ https://issues.apache.org/jira/browse/LUCENE-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790895#comment-16790895 ] Simon Willnauer commented on LUCENE-8692: - > rollback gives you a way to close IndexWriter without doing a commit, which > seems useful. If you removed that, what would users do instead? Can't we extend close to close without a commit? I mean we can keep rollback but be more strict about exceptions during the commit and friends? > IndexWriter.getTragicException() may not reflect all corrupting exceptions > (notably: NoSuchFileException) > - > > Key: LUCENE-8692 > URL: https://issues.apache.org/jira/browse/LUCENE-8692 > Project: Lucene - Core > Issue Type: Bug >Reporter: Hoss Man >Priority: Major > Attachments: LUCENE-8692.patch, LUCENE-8692.patch, LUCENE-8692.patch, > LUCENE-8692_test.patch > > > Backstory... > Solr has a "LeaderTragicEventTest" which uses MockDirectoryWrapper's > {{corruptFiles}} to introduce corruption into the "leader" node's index and > then assert that this solr node gives up it's leadership of the shard and > another replica takes over. > This can currently fail sporadically (but usually reproducibly - see > SOLR-13237) due to the leader not giving up it's leadership even after the > corruption causes an update/commit to fail. Solr's leadership code makes this > decision after encountering an exception from the IndexWriter based on wether > {{IndexWriter.getTragicException()}} is (non-)null. > > While investigating this, I created an isolated Lucene-Core equivilent test > that demonstrates the same basic situation: > * Gradually cause corruption on an index untill (otherwise) valid execution > of IW.add() + IW.commit() calls throw an exception to the IW client. > * assert that if an exception is thrown to the IW client, > {{getTragicException()}} is now non-null.
> It's fairly easy to make my new test fail reproducibly – in every situation > I've seen the underlying exception is a {{NoSuchFileException}} (ie: the > randomly introduced corruption was to delete some file). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8692) IndexWriter.getTragicException() may not reflect all corrupting exceptions (notably: NoSuchFileException)
[ https://issues.apache.org/jira/browse/LUCENE-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16789320#comment-16789320 ] Simon Willnauer commented on LUCENE-8692: - {quote} It definitely seems like there should be something we can/should do to better recognize situations like this as "unrecoverable" and be more strict in dealing with low level exceptions during things like commit – but I'm out definitely out of my depth in understanding/suggesting what that might look like. {quote} I agree with you here. I personally question the purpose of rollback, since in all the cases I have seen, a missing rollback would simply mean data loss. If somebody continues after a failed commit / prepareCommit / reopen they will end up with inconsistency and/or data loss. I can't think of a reason why you would want to do it. I am curious what [~mikemccand] [~jpountz] [~rcmuir] think about that. If we deprecated and removed rollback() we could be more aggressive when it gets to tragic events and prevent users from continuing after such an exception by closing the writer automatically. > IndexWriter.getTragicException() may not reflect all corrupting exceptions > (notably: NoSuchFileException) > - > > Key: LUCENE-8692 > URL: https://issues.apache.org/jira/browse/LUCENE-8692 > Project: Lucene - Core > Issue Type: Bug >Reporter: Hoss Man >Priority: Major > Attachments: LUCENE-8692.patch, LUCENE-8692.patch, LUCENE-8692.patch, > LUCENE-8692_test.patch > > > Backstory... > Solr has a "LeaderTragicEventTest" which uses MockDirectoryWrapper's > {{corruptFiles}} to introduce corruption into the "leader" node's index and > then assert that this solr node gives up it's leadership of the shard and > another replica takes over. > This can currently fail sporadically (but usually reproducibly - see > SOLR-13237) due to the leader not giving up it's leadership even after the > corruption causes an update/commit to fail.
Solr's leadership code makes this > decision after encountering an exception from the IndexWriter based on wether > {{IndexWriter.getTragicException()}} is (non-)null. > > While investigating this, I created an isolated Lucene-Core equivilent test > that demonstrates the same basic situation: > * Gradually cause corruption on an index untill (otherwise) valid execution > of IW.add() + IW.commit() calls throw an exception to the IW client. > * assert that if an exception is thrown to the IW client, > {{getTragicException()}} is now non-null. > It's fairly easy to make my new test fail reproducibly – in every situation > I've seen the underlying exception is a {{NoSuchFileException}} (ie: the > randomly introduced corruption was to delete some file). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8671) Add setting for moving FST offheap/onheap
[ https://issues.apache.org/jira/browse/LUCENE-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785703#comment-16785703 ] Simon Willnauer commented on LUCENE-8671: - I don't think we should add a setter to FieldInfo. This is a code-private thing and should be treated this way. This looks like we need to have a way to pass more info down when we open new SegmentReaders. I wonder if we can accept a simple Map on {noformat} public static DirectoryReader open(final IndexWriter writer, boolean applyAllDeletes, boolean writeAllDeletes) throws IOException {noformat} We can then pass it down to the relevant parts and make it part of `SegmentReaderState`? This map can also be passed via IndexWriterConfig for the NRT case. That way we can pass stuff per DirectoryReader open which is what we want I guess. > Add setting for moving FST offheap/onheap > - > > Key: LUCENE-8671 > URL: https://issues.apache.org/jira/browse/LUCENE-8671 > Project: Lucene - Core > Issue Type: New Feature > Components: core/FSTs, core/store >Reporter: Ankit Jain >Priority: Minor > Attachments: offheap_generic_settings.patch, offheap_settings.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > While LUCENE-8635, adds support for loading FST offheap using mmap, users do > not have the flexibility to specify fields for which FST needs to be > offheap. This allows users to tune heap usage as per their workload. > Ideal way will be to add an attribute to FieldInfo, where we have > put/getAttribute. Then FieldReader can inspect the FieldInfo and pass the > appropriate On/OffHeapStore when creating its FST. It can support special > keywords like ALL/NONE. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
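The suggestion above — threading a caller-supplied attributes map from the DirectoryReader.open(...) call down to the per-segment reader state — could look roughly like the sketch below. The names (OpenOptionsSketch, the readerAttributes field, the "fst.offheap" key) are illustrative assumptions, not Lucene's actual API; the point is only the shape of the plumbing.

```java
import java.util.Map;

public class OpenOptionsSketch {
    // Stand-in for the per-segment state the email calls `SegmentReaderState`:
    // it carries an immutable snapshot of the per-open attributes so codec
    // components (e.g. a postings format deciding on/off-heap FST loading)
    // can consult it.
    static class SegmentReaderState {
        final Map<String, String> readerAttributes;
        SegmentReaderState(Map<String, String> readerAttributes) {
            this.readerAttributes = Map.copyOf(readerAttributes); // defensive, read-only copy
        }
    }

    // Stand-in for an open(...) entry point that accepts the map and passes
    // it down; for the NRT case the same map could come from the writer config.
    static SegmentReaderState open(Map<String, String> readerAttributes) {
        return new SegmentReaderState(readerAttributes);
    }

    public static void main(String[] args) {
        SegmentReaderState state = open(Map.of("fst.offheap", "ALL"));
        System.out.println(state.readerAttributes.get("fst.offheap")); // ALL
    }
}
```

Because the map travels with each open call rather than living on FieldInfo, the choice stays per-DirectoryReader instead of becoming index metadata, which is what the comment argues for.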
[jira] [Commented] (LUCENE-8692) IndexWriter.getTragicException() may not reflect all corrupting exceptions (notably: NoSuchFileException)
[ https://issues.apache.org/jira/browse/LUCENE-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785625#comment-16785625 ] Simon Willnauer commented on LUCENE-8692: - {quote} For now I've updated the patch to take the simplest possible approach to checking for MergeAbortedException {quote} +1 {quote} Well, to flip your question around: is there an example of a Throwable you can think of bubbling up out of IndexWriter.startCommit() that should NOT be considered fatal? {quote} I think we need to be careful here. From my perspective there are 3 types of exceptions here: * unrecoverable exceptions, aka VirtualMachineErrors * exceptions that happen during indexing and are not recoverable (these are handled in DocumentsWriter) * exceptions that cause data loss or inconsistencies (we didn't handle those as fatal yet, at least not consistently); we only catch VirtualMachineError. Those are in particular: * getReader() * deleteAll() * addIndexes() * flushNextBuffer() * prepareCommitInternal() * doFlush() * startCommit() Those methods might cause documents to go missing etc., but we did not treat them as fatal or tragic events since a user could always call rollback() to go back to the last known safe-point / previous commit. Now we can debate if we want to change this and we can; in fact I am all for making it even more strict, especially since it's inconsistent with what we do if addDocument fails with an aborting exception. If we do that we need to see if rollback still has a purpose and maybe remove it? Now, speaking of maybeMerge, I don't see why we need to close the index writer with a tragic event; there is no data loss nor an inconsistency. From that logic I don't think we need to handle these exceptions in such a drastic way? {quote} I don't use github for lucene development – I track all contributions as patches in the official issue tracker for the project as recommended by our official guidelines : ) ... 
but i'll go ahead and create a jira/LUCENE-8692 branch if that will help you review. {quote} Bummer, I am not sure branches help. Working like it's still 1999 is a pain we should fix our guidelines. > IndexWriter.getTragicException() nay not reflect all corrupting exceptions > (notably: NoSuchFileException) > - > > Key: LUCENE-8692 > URL: https://issues.apache.org/jira/browse/LUCENE-8692 > Project: Lucene - Core > Issue Type: Bug >Reporter: Hoss Man >Priority: Major > Attachments: LUCENE-8692.patch, LUCENE-8692.patch, LUCENE-8692.patch, > LUCENE-8692_test.patch > > > Backstory... > Solr has a "LeaderTragicEventTest" which uses MockDirectoryWrapper's > {{corruptFiles}} to introduce corruption into the "leader" node's index and > then assert that this solr node gives up it's leadership of the shard and > another replica takes over. > This can currently fail sporadically (but usually reproducibly - > seeSOLR-13237) due to the leader not giving up it's leadership even after the > corruption causes an update/commit to fail. Solr's leadership code makes > this decision after encountering an exception from the IndexWriter based on > wether {{IndexWriter.getTragicException()}} is (non-)null. > > While investigating this, I created an isolated Lucene-Core equivilent test > that demonstrates the same basic situation: > * Gradually cause corruption on an index untill (otherwise) valid execution > of IW.add() + IW.commit() calls throw an exception to the IW client. > * assert that if an exception is thrown to the IW client, > {{getTragicException()}} is now non-null. > It's fairly easy to make my new test fail reproducibly -- in every situation > I've seen the underlying exception is a {{NoSuchFileException}} (ie: the > randomly introduced corruption was to delete some file). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8692) IndexWriter.getTragicException() may not reflect all corrupting exceptions (notably: NoSuchFileException)
[ https://issues.apache.org/jira/browse/LUCENE-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784438#comment-16784438 ] Simon Willnauer commented on LUCENE-8692: - {noformat} I think there is an issue with the patch with MergeAbortedExeption indeed given that registerMerge might throw such an exception. Maybe we should move this try block to registerMerge instead where we know which OneMerge is being registered (and is also where the exception is thrown when estimating the size of the merge). {noformat} +1 {code:java} -} catch (VirtualMachineError tragedy) { +} catch (Throwable tragedy) { tragicEvent(tragedy, "startCommit"); {code} I am not sure why we need to treat every exception as fatal in this case? I also wonder if we could move this to a PR on github, iterations would be simpler and comments too. I can't tell which patch is relevant which one isn't. > IndexWriter.getTragicException() nay not reflect all corrupting exceptions > (notably: NoSuchFileException) > - > > Key: LUCENE-8692 > URL: https://issues.apache.org/jira/browse/LUCENE-8692 > Project: Lucene - Core > Issue Type: Bug >Reporter: Hoss Man >Priority: Major > Attachments: LUCENE-8692.patch, LUCENE-8692.patch, > LUCENE-8692_test.patch > > > Backstory... > Solr has a "LeaderTragicEventTest" which uses MockDirectoryWrapper's > {{corruptFiles}} to introduce corruption into the "leader" node's index and > then assert that this solr node gives up it's leadership of the shard and > another replica takes over. > This can currently fail sporadically (but usually reproducibly - > seeSOLR-13237) due to the leader not giving up it's leadership even after the > corruption causes an update/commit to fail. Solr's leadership code makes > this decision after encountering an exception from the IndexWriter based on > wether {{IndexWriter.getTragicException()}} is (non-)null. 
> > While investigating this, I created an isolated Lucene-Core equivilent test > that demonstrates the same basic situation: > * Gradually cause corruption on an index untill (otherwise) valid execution > of IW.add() + IW.commit() calls throw an exception to the IW client. > * assert that if an exception is thrown to the IW client, > {{getTragicException()}} is now non-null. > It's fairly easy to make my new test fail reproducibly -- in every situation > I've seen the underlying exception is a {{NoSuchFileException}} (ie: the > randomly introduced corruption was to delete some file). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3041) Support Query Visting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773011#comment-16773011 ] Simon Willnauer commented on LUCENE-3041: - [~romseygeek] any chance you can open a PR for this. Patches are so hard to review and comment on > Support Query Visting / Walking > --- > > Key: LUCENE-3041 > URL: https://issues.apache.org/jira/browse/LUCENE-3041 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: 4.0-ALPHA >Reporter: Chris Male >Assignee: Simon Willnauer >Priority: Minor > Fix For: 4.9, 6.0 > > Attachments: LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, > LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch > > > Out of the discussion in LUCENE-2868, it could be useful to add a generic > Query Visitor / Walker that could be used for more advanced rewriting, > optimizations or anything that requires state to be stored as each Query is > visited. > We could keep the interface very simple: > {code} > public interface QueryVisitor { > Query visit(Query query); > } > {code} > and then use a reflection based visitor like Earwin suggested, which would > allow implementators to provide visit methods for just Querys that they are > interested in. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
[ https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771013#comment-16771013 ] Simon Willnauer commented on LUCENE-8292: - [~dsmiley] I coordinated this with [~romseygeek] given that we had to respin for https://issues.apache.org/jira/browse/SOLR-13126 anyhow. > Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods > -- > > Key: LUCENE-8292 > URL: https://issues.apache.org/jira/browse/LUCENE-8292 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 7.2.1 >Reporter: Bruno Roustant >Priority: Major > Fix For: trunk, 8.0, 8.x, master (9.0) > > Attachments: > 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, > LUCENE-8292.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many > methods. > It misses some seekExact() methods, thus it is not possible to the delegate > to override these methods to have specific behavior (unlike the TermsEnum API > which allows that). > The fix is straightforward: simply override these seekExact() methods and > delegate. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [jira] [Commented] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
I spoke to Alan about this before pushing and we have an unresolved solr blocker too > On 15. Feb 2019, at 22:56, David Smiley (JIRA) wrote: > > >[ > https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769780#comment-16769780 > ] > > David Smiley commented on LUCENE-8292: > -- > > Thanks Simon. I didn't think this could get in to 8.x at the last second or > I would have volunteered. FYI [~romseygeek] so you're aware. > >> Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods >> -- >> >>Key: LUCENE-8292 >>URL: https://issues.apache.org/jira/browse/LUCENE-8292 >>Project: Lucene - Core >> Issue Type: Bug >> Components: core/index >> Affects Versions: 7.2.1 >> Reporter: Bruno Roustant >> Priority: Major >>Fix For: trunk, 8.0, 8.x, master (9.0) >> >>Attachments: >> 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, >> LUCENE-8292.patch >> >> Time Spent: 0.5h >> Remaining Estimate: 0h >> >> FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many >> methods. >> It misses some seekExact() methods, thus it is not possible to the delegate >> to override these methods to have specific behavior (unlike the TermsEnum >> API which allows that). >> The fix is straightforward: simply override these seekExact() methods and >> delegate. > > > > -- > This message was sent by Atlassian JIRA > (v7.6.3#76005) > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
[ https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-8292. - Resolution: Fixed Fix Version/s: master (9.0) 8.x 8.0 > Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods > -- > > Key: LUCENE-8292 > URL: https://issues.apache.org/jira/browse/LUCENE-8292 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 7.2.1 >Reporter: Bruno Roustant >Priority: Major > Fix For: trunk, 8.0, 8.x, master (9.0) > > Attachments: > 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, > LUCENE-8292.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many > methods. > It misses some seekExact() methods, thus it is not possible to the delegate > to override these methods to have specific behavior (unlike the TermsEnum API > which allows that). > The fix is straightforward: simply override these seekExact() methods and > delegate. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
[ https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769324#comment-16769324 ] Simon Willnauer commented on LUCENE-8292: - I opened a PR here https://github.com/apache/lucene-solr/pull/574 > Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods > -- > > Key: LUCENE-8292 > URL: https://issues.apache.org/jira/browse/LUCENE-8292 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 7.2.1 >Reporter: Bruno Roustant >Priority: Major > Fix For: trunk > > Attachments: > 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, > LUCENE-8292.patch > > Time Spent: 10m > Remaining Estimate: 0h > > FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many > methods. > It misses some seekExact() methods, thus it is not possible to the delegate > to override these methods to have specific behavior (unlike the TermsEnum API > which allows that). > The fix is straightforward: simply override these seekExact() methods and > delegate. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
[ https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767061#comment-16767061 ] Simon Willnauer commented on LUCENE-8292: - I do see both points here. [~dsmiley] I hate how trappy this is and [~jpountz] I completely agree with you. My suggestion here would be to have a TermsEnum class with all methods abstract and a BaseTermsEnum that adds the default impls. FilterTermsEnum then subclasses TermsEnum and does the right thing. Other classes that don't need to override stuff like seekExact and seek(BytesRef, TermState) / TermState termState() can simply subclass BaseTermsEnum and we don't have to duplicate code all over the place. I don't think we need to do this in other places where we have the same pattern, but in this case the traps are significant and we can fix it with a simple class in-between? > Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods > -- > > Key: LUCENE-8292 > URL: https://issues.apache.org/jira/browse/LUCENE-8292 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 7.2.1 >Reporter: Bruno Roustant >Priority: Major > Fix For: trunk > > Attachments: > 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, > LUCENE-8292.patch > > > FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many > methods. > It misses some seekExact() methods, thus it is not possible for the delegate > to override these methods to have specific behavior (unlike the TermsEnum API > which allows that). > The fix is straightforward: simply override these seekExact() methods and > delegate. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
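The class split suggested in the comment above can be sketched as follows. This is a hedged, self-contained illustration with simplified signatures, not actual Lucene code: the base class keeps seekExact abstract so a delegating filter cannot silently inherit a slow seekCeil-based default, while BaseTermsEnum is the opt-in home for that default:

```java
// Simplified stand-ins (String instead of BytesRef, no IOException).
abstract class TermsEnum {
  enum SeekStatus { FOUND, NOT_FOUND }
  abstract SeekStatus seekCeil(String term);
  // Abstract on purpose: no trappy default implementation here.
  abstract boolean seekExact(String term);
}

abstract class BaseTermsEnum extends TermsEnum {
  // Convenience default for implementations without a faster exact-seek path.
  @Override
  boolean seekExact(String term) {
    return seekCeil(term) == SeekStatus.FOUND;
  }
}

class FilterTermsEnum extends TermsEnum {
  final TermsEnum in;
  FilterTermsEnum(TermsEnum in) { this.in = in; }
  @Override SeekStatus seekCeil(String term) { return in.seekCeil(term); }
  // Extending TermsEnum (not BaseTermsEnum) forces the filter to delegate
  // explicitly; forgetting it is now a compile error rather than a perf trap.
  @Override boolean seekExact(String term) { return in.seekExact(term); }
}
```

With this split, a wrapper can no longer accidentally turn a fast delegate seekExact into a full seekCeil, which is exactly the trap LUCENE-8662 hit.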
[jira] [Commented] (LUCENE-8662) Change TermsEnum.seekExact(BytesRef) to abstract + delegate seekExact(BytesRef) in FilterLeafReader.FilterTermsEnum
[ https://issues.apache.org/jira/browse/LUCENE-8662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763762#comment-16763762 ] Simon Willnauer commented on LUCENE-8662: - [~tomasflobbe] yes I think this should go into 8.0 - feel free to pull it in, I will do it next week once I am back at the keyboard. > Change TermsEnum.seekExact(BytesRef) to abstract + delegate > seekExact(BytesRef) in FilterLeafReader.FilterTermsEnum > --- > > Key: LUCENE-8662 > URL: https://issues.apache.org/jira/browse/LUCENE-8662 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: 5.5.5, 6.6.5, 7.6, 8.0 >Reporter: jefferyyuan >Priority: Major > Labels: query > Fix For: 8.0, 7.7 > > Attachments: output of test program.txt > > Time Spent: 50m > Remaining Estimate: 0h > > Recently in our production, we found that Solr uses a lot of memory(more than > 10g) during recovery or commit for a small index (3.5gb) > The stack trace is: > > {code:java} > Thread 0x4d4b115c0 > at org.apache.lucene.store.DataInput.readVInt()I (DataInput.java:125) > at org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock()V > (SegmentTermsEnumFrame.java:157) > at > org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.scanToTermNonLeaf(Lorg/apache/lucene/util/BytesRef;Z)Lorg/apache/lucene/index/TermsEnum$SeekStatus; > (SegmentTermsEnumFrame.java:786) > at > org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.scanToTerm(Lorg/apache/lucene/util/BytesRef;Z)Lorg/apache/lucene/index/TermsEnum$SeekStatus; > (SegmentTermsEnumFrame.java:538) > at > org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekCeil(Lorg/apache/lucene/util/BytesRef;)Lorg/apache/lucene/index/TermsEnum$SeekStatus; > (SegmentTermsEnum.java:757) > at > org.apache.lucene.index.FilterLeafReader$FilterTermsEnum.seekCeil(Lorg/apache/lucene/util/BytesRef;)Lorg/apache/lucene/index/TermsEnum$SeekStatus; > (FilterLeafReader.java:185) > at > 
org.apache.lucene.index.TermsEnum.seekExact(Lorg/apache/lucene/util/BytesRef;)Z > (TermsEnum.java:74) > at > org.apache.solr.search.SolrIndexSearcher.lookupId(Lorg/apache/lucene/util/BytesRef;)J > (SolrIndexSearcher.java:823) > at > org.apache.solr.update.VersionInfo.getVersionFromIndex(Lorg/apache/lucene/util/BytesRef;)Ljava/lang/Long; > (VersionInfo.java:204) > at > org.apache.solr.update.UpdateLog.lookupVersion(Lorg/apache/lucene/util/BytesRef;)Ljava/lang/Long; > (UpdateLog.java:786) > at > org.apache.solr.update.VersionInfo.lookupVersion(Lorg/apache/lucene/util/BytesRef;)Ljava/lang/Long; > (VersionInfo.java:194) > at > org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(Lorg/apache/solr/update/AddUpdateCommand;)Z > (DistributedUpdateProcessor.java:1051) > {code} > We reproduced the problem locally with the following code using Lucene code. > {code:java} > public static void main(String[] args) throws IOException { > FSDirectory index = FSDirectory.open(Paths.get("the-index")); > try (IndexReader reader = new > ExitableDirectoryReader(DirectoryReader.open(index), > new QueryTimeoutImpl(1000 * 60 * 5))) { > String id = "the-id"; > BytesRef text = new BytesRef(id); > for (LeafReaderContext lf : reader.leaves()) { > TermsEnum te = lf.reader().terms("id").iterator(); > System.out.println(te.seekExact(text)); > } > } > } > {code} > > I added System.out.println("ord: " + ord); in > codecs.blocktree.SegmentTermsEnum.getFrame(int). > Please check the attached output of test program.txt. > > We found out the root cause: > we didn't implement seekExact(BytesRef) method in > FilterLeafReader.FilterTerms, so it uses the base class > TermsEnum.seekExact(BytesRef) implementation which is very inefficient in > this case. 
> {code:java} > public boolean seekExact(BytesRef text) throws IOException { > return seekCeil(text) == SeekStatus.FOUND; > } > {code} > The fix is simple, just override seekExact(BytesRef) method in > FilterLeafReader.FilterTerms > {code:java} > @Override > public boolean seekExact(BytesRef text) throws IOException { > return in.seekExact(text); > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-8664) Add equals/hashcode to TotalHits
[ https://issues.apache.org/jira/browse/LUCENE-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-8664. - Resolution: Fixed Fix Version/s: master (9.0) 8.0 > Add equals/hashcode to TotalHits > > > Key: LUCENE-8664 > URL: https://issues.apache.org/jira/browse/LUCENE-8664 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Luca Cavanna >Priority: Minor > Fix For: 8.0, master (9.0) > > Time Spent: 10m > Remaining Estimate: 0h > > I think it would be convenient to add equals/hashcode methods to the > TotalHits class. I opened a PR here: > [https://github.com/apache/lucene-solr/pull/552] . -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8664) Add equals/hashcode to TotalHits
[ https://issues.apache.org/jira/browse/LUCENE-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756032#comment-16756032 ] Simon Willnauer commented on LUCENE-8664: - pushed - thanks [~lucacavanna] > Add equals/hashcode to TotalHits > > > Key: LUCENE-8664 > URL: https://issues.apache.org/jira/browse/LUCENE-8664 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Luca Cavanna >Priority: Minor > Fix For: 8.0, master (9.0) > > Time Spent: 10m > Remaining Estimate: 0h > > I think it would be convenient to add equals/hashcode methods to the > TotalHits class. I opened a PR here: > [https://github.com/apache/lucene-solr/pull/552] . -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [DISCUSS] Opening old indices for reading
thanks folks, these are all good points. I created a first cut of what I had in mind [1]. It's relatively simple and from a Java visibility perspective the only change that a user can take advantage of is this [2] and this [3] respectively. This would allow opening indices back to Lucene 7.0 given that the codecs and postings formats are available. From a documentation perspective I added [4]. This is a pure read-only change and doesn't allow opening these indices for writing. You can't merge them, nor would you be able to open an IndexWriter on top of them. I still need to add support for CheckIndex but that's basically it. lemme know what you think, simon [1] https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752 [2] https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752#diff-e0352098b027d6f41a17c068ad8d7ef0R689 [3] https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752#diff-e3ccf9ee90355b10f2dd22ce2da6c73cR306 [4] https://github.com/apache/lucene-solr/commit/0c4c885214ef30627a01e320f9c861dc2521b752#diff-1bedf4d0d52ff88ef8a16a6788ad7684R86 On Fri, Jan 25, 2019 at 3:14 PM Michael McCandless wrote: > > Another example is that long ago Lucene allowed pos=-1 to be indexed and it caused > all sorts of problems. We also stopped allowing positions close to > Integer.MAX_VALUE (https://issues.apache.org/jira/browse/LUCENE-6382). Yet > another is allowing negative vInts which are possible but horribly > inefficient (https://issues.apache.org/jira/browse/LUCENE-3738). > > We do need to be free to fix these problems and then know after N+2 releases > that no index can have the issue.
> > I like the idea of providing "expert" / best effort / limited way of carrying > forward such ancient indices, but I think the huge challenge for someone > using that tool on an important index will be enumerating the list of issues > that might "matter" (the 3 Adrien listed + the 3 I listed above is a start > for this list) and taking appropriate steps to "correct" the index if so. > E.g. on a norms encoding change, somehow these expert tools must decode norms > the old way, encode them the new way, and then rewrite the norms files. Or > if the index has pos=-1, changing that to pos=0. Or if it has negative > vInts, ... etc. > > Or maybe the "special" DirectoryReader only reads stored fields? And so you > would enumerate your _source and reindex into the latest format ... > > > Something like https://issues.apache.org/jira/browse/LUCENE-8277 would > > help make it harder to introduce corrupt data in an index. > > +1 > > Every time we catch something like "don't allow pos = -1 into the index" we > need somehow remember to go and add the check also in addIndices. > > Mike McCandless > > http://blog.mikemccandless.com > > > On Fri, Jan 25, 2019 at 3:52 AM Adrien Grand wrote: >> >> Agreed with Michael that setting expectations is going to be >> important. The thing that I would like to make sure is that we would >> never refrain from moving Lucene forward because of this feature. In >> particular, lucene-core should be free to make assumptions that are >> valid for N and N-1 indices without worrying about the fact that we >> have this super-expert feature that allows opening older indices. 
Here >> are some assumptions that I have in mind which have not always been >> true: >> - norms might be encoded in a different way (this changed in 7) >> - all index files have a checksum (only true since Lucene 5) >> - offsets are always going forward (only enforced since Lucene 7) >> >> This means that carrying indices over by just merging them with the >> new version to move them to a new codec won't work all the time. For >> instance if your index has backward offsets and new codecs assume that >> offsets are going forward, then merging might fail or corrupt offsets >> - I'd like to make sure that we would not consider this a bug. >> >> Erick, I don't think this feature would be suitable for "robust index >> upgrades". To me it is really a best effort and shouldn't be trusted >> too much. >> >> I think some users will be tempted to wrap old readers to make them >> look good and then add them back to an index using addIndexes? >> Something like https://issues.apache.org/jira/browse/LUCENE-8277 would >> help make it harder to introduce corrupt data in an index. >> >> On Wed, Jan 23, 2019 at 3:11 PM Simon Willnauer >> wrote: >> > >> > Hey folks, >> > >> > tl;dr; I want to be able
[jira] [Commented] (LUCENE-8664) Add equals/hashcode to TotalHits
[ https://issues.apache.org/jira/browse/LUCENE-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754987#comment-16754987 ] Simon Willnauer commented on LUCENE-8664: - [~lucacavanna] what's the usecase for this? Why are you trying to put this into a map or something? Can you explain this a bit further? > Add equals/hashcode to TotalHits > > > Key: LUCENE-8664 > URL: https://issues.apache.org/jira/browse/LUCENE-8664 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Luca Cavanna >Priority: Minor > > I think it would be convenient to add equals/hashcode methods to the > TotalHits class. I opened a PR here: > [https://github.com/apache/lucene-solr/pull/552] . -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8662) Override seekExact(BytesRef) in FilterLeafReader.FilterTermsEnum
[ https://issues.apache.org/jira/browse/LUCENE-8662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754984#comment-16754984 ] Simon Willnauer commented on LUCENE-8662: - {noformat} If we think that it's a trap, we should remove the default impl and make it abstract (in 8.0). {noformat} I agree with this. I think it can be trappy and such an expert API shouldn't be. Let's make it abstract? > Override seekExact(BytesRef) in FilterLeafReader.FilterTermsEnum > > > Key: LUCENE-8662 > URL: https://issues.apache.org/jira/browse/LUCENE-8662 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: 5.5.5, 6.6.5, 7.6, 8.0 >Reporter: jefferyyuan >Priority: Major > Labels: query > Fix For: 8.0, 7.7 > > Attachments: output of test program.txt > > Time Spent: 10m > Remaining Estimate: 0h > > Recently in our production, we found that Solr uses a lot of memory (more than > 10g) during recovery or commit for a small index (3.5gb) > The stack trace is: > > {code:java} > Thread 0x4d4b115c0 > at org.apache.lucene.store.DataInput.readVInt()I (DataInput.java:125) > at org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock()V > (SegmentTermsEnumFrame.java:157) > at > org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.scanToTermNonLeaf(Lorg/apache/lucene/util/BytesRef;Z)Lorg/apache/lucene/index/TermsEnum$SeekStatus; > (SegmentTermsEnumFrame.java:786) > at > org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.scanToTerm(Lorg/apache/lucene/util/BytesRef;Z)Lorg/apache/lucene/index/TermsEnum$SeekStatus; > (SegmentTermsEnumFrame.java:538) > at > org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekCeil(Lorg/apache/lucene/util/BytesRef;)Lorg/apache/lucene/index/TermsEnum$SeekStatus; > (SegmentTermsEnum.java:757) > at > org.apache.lucene.index.FilterLeafReader$FilterTermsEnum.seekCeil(Lorg/apache/lucene/util/BytesRef;)Lorg/apache/lucene/index/TermsEnum$SeekStatus; > (FilterLeafReader.java:185) > at 
> org.apache.lucene.index.TermsEnum.seekExact(Lorg/apache/lucene/util/BytesRef;)Z > (TermsEnum.java:74) > at > org.apache.solr.search.SolrIndexSearcher.lookupId(Lorg/apache/lucene/util/BytesRef;)J > (SolrIndexSearcher.java:823) > at > org.apache.solr.update.VersionInfo.getVersionFromIndex(Lorg/apache/lucene/util/BytesRef;)Ljava/lang/Long; > (VersionInfo.java:204) > at > org.apache.solr.update.UpdateLog.lookupVersion(Lorg/apache/lucene/util/BytesRef;)Ljava/lang/Long; > (UpdateLog.java:786) > at > org.apache.solr.update.VersionInfo.lookupVersion(Lorg/apache/lucene/util/BytesRef;)Ljava/lang/Long; > (VersionInfo.java:194) > at > org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(Lorg/apache/solr/update/AddUpdateCommand;)Z > (DistributedUpdateProcessor.java:1051) > {code} > We reproduced the problem locally with the following code using Lucene code. > {code:java} > public static void main(String[] args) throws IOException { > FSDirectory index = FSDirectory.open(Paths.get("the-index")); > try (IndexReader reader = new > ExitableDirectoryReader(DirectoryReader.open(index), > new QueryTimeoutImpl(1000 * 60 * 5))) { > String id = "the-id"; > BytesRef text = new BytesRef(id); > for (LeafReaderContext lf : reader.leaves()) { > TermsEnum te = lf.reader().terms("id").iterator(); > System.out.println(te.seekExact(text)); > } > } > } > {code} > > I added System.out.println("ord: " + ord); in > codecs.blocktree.SegmentTermsEnum.getFrame(int). > Please check the attached output of test program.txt. > > We found out the root cause: > we didn't implement seekExact(BytesRef) method in > FilterLeafReader.FilterTerms, so it uses the base class > TermsEnum.seekExact(BytesRef) implementation which is very inefficient in > this case. 
> {code:java} > public boolean seekExact(BytesRef text) throws IOException { > return seekCeil(text) == SeekStatus.FOUND; > } > {code} > The fix is simple, just override seekExact(BytesRef) method in > FilterLeafReader.FilterTerms > {code:java} > @Override > public boolean seekExact(BytesRef text) throws IOException { > return in.seekExact(text); > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[DISCUSS] Opening old indices for reading
Hey folks, tl;dr; I want to be able to open an IndexReader on an old index if the SegmentInfo version is supported and all segment codecs are available. Today that's not possible even if I port old formats to current versions. Our BWC policy for quite a while has been N-1 major versions. That's good and I think we should keep it that way. Only recently, caused by changes to how we encode/decode norms, we also hard-enforce the index-version-created in several places as well as the version a segment was written with. These are great enforcements and I understand why. My request here is whether we can find consensus on somehow allowing (via a special DirectoryReader for instance) such an index to be opened for reading only, without the guarantee that our high-level APIs decode norms correctly for instance. This would be enough to, for instance, consume stored fields etc. for reindexing, or, if users are aware, to do the norms decoding in the codec. I am happy to work on a proposal for how this would work. It would still enforce no writing or anything like this. I am also all for putting such a reader into misc and marking it experimental. simon - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-8639) SeqNo accounting in IW is broken if many threads start indexing while we flush.
[ https://issues.apache.org/jira/browse/LUCENE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer resolved LUCENE-8639. - Resolution: Fixed Fix Version/s: master (9.0) 7.7 8.0 > SeqNo accounting in IW is broken if many threads start indexing while we > flush. > --- > > Key: LUCENE-8639 > URL: https://issues.apache.org/jira/browse/LUCENE-8639 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Simon Willnauer >Priority: Major > Fix For: 8.0, 7.7, master (9.0) > > Time Spent: 40m > Remaining Estimate: 0h > > While this is rare in the wild we have a test failure that shows that our > seqNo accounting is broken when we carry over seqNo to a new delete queue. > We had this test-failure: > {noformat} > 6:06:08[junit4] Suite: org.apache.lucene.index.TestIndexTooManyDocs > 16:06:08[junit4] 2> ??? 14, 2019 9:05:46 ? > com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler > uncaughtException > 16:06:08[junit4] 2> WARNING: Uncaught exception in thread: > Thread[Thread-8,5,TGRP-TestIndexTooManyDocs] > 16:06:08[junit4] 2> java.lang.AssertionError: seqNo=7 vs maxSeqNo=6 > 16:06:08[junit4] 2> at > __randomizedtesting.SeedInfo.seed([43B7C75B765AFEBD]:0) > 16:06:08[junit4] 2> at > org.apache.lucene.index.DocumentsWriterDeleteQueue.getNextSequenceNumber(DocumentsWriterDeleteQueue.java:482) > 16:06:08[junit4] 2> at > org.apache.lucene.index.DocumentsWriterDeleteQueue.add(DocumentsWriterDeleteQueue.java:168) > 16:06:08[junit4] 2> at > org.apache.lucene.index.DocumentsWriterDeleteQueue.add(DocumentsWriterDeleteQueue.java:146) > 16:06:08[junit4] 2> at > org.apache.lucene.index.DocumentsWriterPerThread.finishDocument(DocumentsWriterPerThread.java:362) > 16:06:08[junit4] 2> at > org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:264) > 16:06:08[junit4] 2> at > org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494) > 16:06:08[junit4] 2> at > 
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594) > 16:06:08[junit4] 2> at > org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586) > 16:06:08[junit4] 2> at > org.apache.lucene.index.TestIndexTooManyDocs.lambda$testIndexTooManyDocs$0(TestIndexTooManyDocs.java:70) > 16:06:08[junit4] 2> at java.lang.Thread.run(Thread.java:748) > 16:06:08[junit4] 2> > 16:06:08[junit4] 2> ??? 14, 2019 9:05:46 ? > com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler > uncaughtException > 16:06:08[junit4] 2> WARNING: Uncaught exception in thread: > Thread[Thread-9,5,TGRP-TestIndexTooManyDocs] > 16:06:08[junit4] 2> java.lang.AssertionError: seqNo=6 vs maxSeqNo=6 > 16:06:08[junit4] 2> at > __randomizedtesting.SeedInfo.seed([43B7C75B765AFEBD]:0) > 16:06:08[junit4] 2> at > org.apache.lucene.index.DocumentsWriterDeleteQueue.getNextSequenceNumber(DocumentsWriterDeleteQueue.java:482) > 16:06:08[junit4] 2> at > org.apache.lucene.index.DocumentsWriterDeleteQueue.add(DocumentsWriterDeleteQueue.java:168) > 16:06:08[junit4] 2> at > org.apache.lucene.index.DocumentsWriterDeleteQueue.add(DocumentsWriterDeleteQueue.java:146) > 16:06:08[junit4] 2> at > org.apache.lucene.index.DocumentsWriterPerThread.finishDocument(DocumentsWriterPerThread.java:362) > 16:06:08[junit4] 2> at > org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:264) > 16:06:08[junit4] 2> at > org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494) > 16:06:08[junit4] 2> at > org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594) > 16:06:08[junit4] 2> at > org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586) > 16:06:08[junit4] 2> at > org.apache.lucene.index.TestIndexTooManyDocs.lambda$testIndexTooManyDocs$0(TestIndexTooManyDocs.java:70) > 16:06:08[junit4] 2> at java.lang.Thread.run(Thread.java:748) > 16:06:08[junit4] 2> > 16:06:08[junit4] 2> ??? 
14, 2019 11:05:45 ? > com.carrotsearch.randomizedtesting.ThreadLeakControl$2 evaluate > 16:06:08
[jira] [Commented] (LUCENE-8639) SeqNo accounting in IW is broken if many threads start indexing while we flush.
[ https://issues.apache.org/jira/browse/LUCENE-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743155#comment-16743155 ]

Simon Willnauer commented on LUCENE-8639:
-----------------------------------------

[~mikemccand] can you take a look at the PR?

> SeqNo accounting in IW is broken if many threads start indexing while we flush.
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-8639
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8639
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Simon Willnauer
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> While this is rare in the wild, we have a test failure showing that our seqNo accounting is broken when we carry a seqNo over to a new delete queue. We had this test failure:
> {noformat}
> 16:06:08 [junit4] Suite: org.apache.lucene.index.TestIndexTooManyDocs
> 16:06:08 [junit4] 2> ??? 14, 2019 9:05:46 ? com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
> 16:06:08 [junit4] 2> WARNING: Uncaught exception in thread: Thread[Thread-8,5,TGRP-TestIndexTooManyDocs]
> 16:06:08 [junit4] 2> java.lang.AssertionError: seqNo=7 vs maxSeqNo=6
> 16:06:08 [junit4] 2>     at __randomizedtesting.SeedInfo.seed([43B7C75B765AFEBD]:0)
> 16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterDeleteQueue.getNextSequenceNumber(DocumentsWriterDeleteQueue.java:482)
> 16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterDeleteQueue.add(DocumentsWriterDeleteQueue.java:168)
> 16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterDeleteQueue.add(DocumentsWriterDeleteQueue.java:146)
> 16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterPerThread.finishDocument(DocumentsWriterPerThread.java:362)
> 16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:264)
> 16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494)
> 16:06:08 [junit4] 2>     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
> 16:06:08 [junit4] 2>     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586)
> 16:06:08 [junit4] 2>     at org.apache.lucene.index.TestIndexTooManyDocs.lambda$testIndexTooManyDocs$0(TestIndexTooManyDocs.java:70)
> 16:06:08 [junit4] 2>     at java.lang.Thread.run(Thread.java:748)
> 16:06:08 [junit4] 2>
> 16:06:08 [junit4] 2> ??? 14, 2019 9:05:46 ? com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
> 16:06:08 [junit4] 2> WARNING: Uncaught exception in thread: Thread[Thread-9,5,TGRP-TestIndexTooManyDocs]
> 16:06:08 [junit4] 2> java.lang.AssertionError: seqNo=6 vs maxSeqNo=6
> 16:06:08 [junit4] 2>     at __randomizedtesting.SeedInfo.seed([43B7C75B765AFEBD]:0)
> 16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterDeleteQueue.getNextSequenceNumber(DocumentsWriterDeleteQueue.java:482)
> 16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterDeleteQueue.add(DocumentsWriterDeleteQueue.java:168)
> 16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterDeleteQueue.add(DocumentsWriterDeleteQueue.java:146)
> 16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterPerThread.finishDocument(DocumentsWriterPerThread.java:362)
> 16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:264)
> 16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494)
> 16:06:08 [junit4] 2>     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
> 16:06:08 [junit4] 2>     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586)
> 16:06:08 [junit4] 2>     at org.apache.lucene.index.TestIndexTooManyDocs.lambda$testIndexTooManyDocs$0(TestIndexTooManyDocs.java:70)
> 16:06:08 [junit4] 2>     at java.lang.Thread.run(Thread.java:748)
> 16:06:08 [junit4] 2>
> 16:06:08 [junit4] 2> ??? 14, 2019 11:05:45 ? com.carrotsearch.randomizedtesting.ThreadLeakControl$2 evaluate
> 16:06:08 [junit4] 2> WARNING: Suite execution timed out:
[jira] [Created] (LUCENE-8639) SeqNo accounting in IW is broken if many threads start indexing while we flush.
Simon Willnauer created LUCENE-8639:
---------------------------------------

             Summary: SeqNo accounting in IW is broken if many threads start indexing while we flush.
                 Key: LUCENE-8639
                 URL: https://issues.apache.org/jira/browse/LUCENE-8639
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Simon Willnauer

While this is rare in the wild, we have a test failure showing that our seqNo accounting is broken when we carry a seqNo over to a new delete queue. We had this test failure:

{noformat}
16:06:08 [junit4] Suite: org.apache.lucene.index.TestIndexTooManyDocs
16:06:08 [junit4] 2> ??? 14, 2019 9:05:46 ? com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
16:06:08 [junit4] 2> WARNING: Uncaught exception in thread: Thread[Thread-8,5,TGRP-TestIndexTooManyDocs]
16:06:08 [junit4] 2> java.lang.AssertionError: seqNo=7 vs maxSeqNo=6
16:06:08 [junit4] 2>     at __randomizedtesting.SeedInfo.seed([43B7C75B765AFEBD]:0)
16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterDeleteQueue.getNextSequenceNumber(DocumentsWriterDeleteQueue.java:482)
16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterDeleteQueue.add(DocumentsWriterDeleteQueue.java:168)
16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterDeleteQueue.add(DocumentsWriterDeleteQueue.java:146)
16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterPerThread.finishDocument(DocumentsWriterPerThread.java:362)
16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:264)
16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494)
16:06:08 [junit4] 2>     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
16:06:08 [junit4] 2>     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586)
16:06:08 [junit4] 2>     at org.apache.lucene.index.TestIndexTooManyDocs.lambda$testIndexTooManyDocs$0(TestIndexTooManyDocs.java:70)
16:06:08 [junit4] 2>     at java.lang.Thread.run(Thread.java:748)
16:06:08 [junit4] 2>
16:06:08 [junit4] 2> ??? 14, 2019 9:05:46 ? com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
16:06:08 [junit4] 2> WARNING: Uncaught exception in thread: Thread[Thread-9,5,TGRP-TestIndexTooManyDocs]
16:06:08 [junit4] 2> java.lang.AssertionError: seqNo=6 vs maxSeqNo=6
16:06:08 [junit4] 2>     at __randomizedtesting.SeedInfo.seed([43B7C75B765AFEBD]:0)
16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterDeleteQueue.getNextSequenceNumber(DocumentsWriterDeleteQueue.java:482)
16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterDeleteQueue.add(DocumentsWriterDeleteQueue.java:168)
16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterDeleteQueue.add(DocumentsWriterDeleteQueue.java:146)
16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterPerThread.finishDocument(DocumentsWriterPerThread.java:362)
16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:264)
16:06:08 [junit4] 2>     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:494)
16:06:08 [junit4] 2>     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
16:06:08 [junit4] 2>     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586)
16:06:08 [junit4] 2>     at org.apache.lucene.index.TestIndexTooManyDocs.lambda$testIndexTooManyDocs$0(TestIndexTooManyDocs.java:70)
16:06:08 [junit4] 2>     at java.lang.Thread.run(Thread.java:748)
16:06:08 [junit4] 2>
16:06:08 [junit4] 2> ??? 14, 2019 11:05:45 ? com.carrotsearch.randomizedtesting.ThreadLeakControl$2 evaluate
16:06:08 [junit4] 2> WARNING: Suite execution timed out: org.apache.lucene.index.TestIndexTooManyDocs
16:06:08 [junit4] 2>    1) Thread[id=20, name=SUITE-TestIndexTooManyDocs-seed#[43B7C75B765AFEBD], state=RUNNABLE, group=TGRP-TestIndexTooManyDocs]
16:06:08 [junit4] 2>     at java.lang.Thread.getStackTrace(Thread.java:1559)
16:06:08 [junit4] 2>     at com.carrotsearch.randomizedtesting.ThreadLeakControl$4.run(ThreadLeakControl.java:696)
16:06:08 [junit4] 2>     at com.carrotsearch.randomizedtesting.ThreadLeakControl$4.run(ThreadLeakControl.java:693)
16:06:08 [junit4] 2>     at java.security.AccessController.doPrivileged(Native Method)
16:06:08 [junit4] 2>     at com.carrotsearch.
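The failing assertion above (seqNo=7 vs maxSeqNo=6) checks the invariant that a delete queue never hands out a sequence number greater than the maximum it advertised when it was sealed during flush. A minimal, hypothetical Java sketch of that invariant follows; this is not Lucene's actual DocumentsWriterDeleteQueue, and the class and method names are illustrative only:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the seqNo carry-over invariant. The bug class the
// test exposes: a queue is sealed with a maxSeqNo that does not leave room
// for indexing threads already past the gate, so a late thread draws a
// seqNo above the advertised maximum.
class SeqNoQueue {
    private final AtomicLong nextSeqNo;
    private volatile long maxSeqNo = Long.MAX_VALUE; // unsealed: no upper bound yet

    SeqNoQueue(long startSeqNo) {
        this.nextSeqNo = new AtomicLong(startSeqNo);
    }

    // Called by indexing threads for every document/delete.
    long getNextSequenceNumber() {
        long seqNo = nextSeqNo.getAndIncrement();
        // This is the assertion that fired in the test failure.
        assert seqNo <= maxSeqNo : "seqNo=" + seqNo + " vs maxSeqNo=" + maxSeqNo;
        return seqNo;
    }

    // Called once during flush: seal this queue and report the seqNo the
    // replacement queue must start above. Reserving one seqNo per possibly
    // in-flight thread keeps late arrivals at or below maxSeqNo.
    long sealForCarryOver(int maxInFlightThreads) {
        maxSeqNo = nextSeqNo.get() + maxInFlightThreads;
        return maxSeqNo;
    }
}
```

Under this sketch, sealing with `maxInFlightThreads` slack means every thread that was already inside `getNextSequenceNumber()` still receives a number at or below the carried-over maximum.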
[jira] [Commented] (LUCENE-8525) throw more specific exception on data corruption
[ https://issues.apache.org/jira/browse/LUCENE-8525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740186#comment-16740186 ]

Simon Willnauer commented on LUCENE-8525:
-----------------------------------------

I agree with [~rcmuir] here. There is not much we can do to detect this particular problem in DataInput and friends. One improvement would certainly be the wording of the javadoc: we can clarify that detecting corruption and throwing _CorruptIndexException_ is best effort. Another idea is to checksum the entire file before we read the commit; we could do this either on the Elasticsearch end or by improving _SegmentInfos#readCommit_. Reading the file twice isn't a big deal, I guess.

> throw more specific exception on data corruption
> ------------------------------------------------
>
>                 Key: LUCENE-8525
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8525
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Vladimir Dolzhenko
>            Priority: Major
>
> DataInput throws a generic IOException if the data looks odd:
> [DataInput:141|https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/lucene/core/src/java/org/apache/lucene/store/DataInput.java#L141]
> There are other examples, like
> [BufferedIndexInput:219|https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/lucene/core/src/java/org/apache/lucene/store/BufferedIndexInput.java#L219],
> [CompressionMode:226|https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressionMode.java#L226]
> and maybe
> [DocIdsWriter:81|https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java#L81].
> This leads to some difficulties - see [elasticsearch #34322|https://github.com/elastic/elasticsearch/issues/34322]. It would be better if it threw a more specific exception.
> As a consequence,
> [SegmentInfos.readCommit|https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L281]
> violates its own contract:
> {code:java}
> /**
>  * @throws CorruptIndexException if the index is corrupt
>  * @throws IOException if there is a low-level IO error
>  */
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
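The "checksum the entire file before we read the commit" idea from the comment above can be sketched in plain Java with java.util.zip.CRC32: verify the stored footer checksum first, and only parse bytes that passed verification, so corruption surfaces as a dedicated corruption exception instead of a generic IOException from a confused parser. This is a hypothetical sketch, not Lucene's actual codec footer format or SegmentInfos API; the names below (ChecksumGate, CorruptFileException, verifyThenRead) are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

// Sketch of "verify before parse": assume the last 8 bytes of the file
// store the CRC32 (big-endian) of everything before them.
class ChecksumGate {
    static class CorruptFileException extends IOException {
        CorruptFileException(String msg) { super(msg); }
    }

    // Checksum the whole file up front; throw a corruption-specific
    // exception on mismatch, and only then return the body for parsing.
    static byte[] verifyThenRead(Path file) throws IOException {
        byte[] all = Files.readAllBytes(file);
        if (all.length < 8) throw new CorruptFileException("file too short: " + file);
        int bodyLen = all.length - 8;
        long stored = 0;
        for (int i = 0; i < 8; i++) {
            stored = (stored << 8) | (all[bodyLen + i] & 0xFFL);
        }
        CRC32 crc = new CRC32();
        crc.update(all, 0, bodyLen);
        if (crc.getValue() != stored) {
            throw new CorruptFileException("checksum mismatch in " + file);
        }
        byte[] body = new byte[bodyLen];
        System.arraycopy(all, 0, body, 0, bodyLen);
        return body; // safe to decode: bytes match the stored checksum
    }

    // Writer counterpart for the sketch: append the CRC32 footer.
    static void writeWithChecksum(Path file, byte[] body) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(body, 0, body.length);
        long v = crc.getValue();
        byte[] out = new byte[body.length + 8];
        System.arraycopy(body, 0, out, 0, body.length);
        for (int i = 0; i < 8; i++) {
            out[body.length + i] = (byte) (v >>> (8 * (7 - i)));
        }
        Files.write(file, out);
    }
}
```

The cost of this approach is reading the file twice (once to checksum, once to decode), which, as the comment notes, is acceptable for a small commit file.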
[jira] [Commented] (LUCENE-8609) Allow getting consistent docstats from IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722290#comment-16722290 ]

Simon Willnauer commented on LUCENE-8609:
-----------------------------------------

[~sokolov] I opened [https://github.com/mikemccand/luceneutil/pull/28/] /cc [~mikemccand]

> Allow getting consistent docstats from IndexWriter
> --------------------------------------------------
>
>                 Key: LUCENE-8609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8609
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: master (8.0), 7.7
>            Reporter: Simon Willnauer
>            Priority: Major
>             Fix For: master (8.0), 7.7
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Today we have #numDocs() and #maxDoc() on IndexWriter. This is enough to get all stats for the current index, but it's subject to concurrency and might return numbers that are not consistent, i.e. in some cases it can return maxDoc < numDocs, which is undesirable. This change adds a getDocStats() method to IndexWriter to allow fetching consistent numbers for these stats.
Re: [jira] [Commented] (LUCENE-8609) Allow getting consistent docstats from IndexWriter
What benchmarks are you talking about? Can you link them?

> On 14. Dec 2018, at 23:47, Mike Sokolov (JIRA) wrote:
>
> [ https://issues.apache.org/jira/browse/LUCENE-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721848#comment-16721848 ]
>
> Mike Sokolov commented on LUCENE-8609:
> --------------------------------------
>
> I think this will break the nightly benchmarks? Anyway, I'm currently getting compile errors there.
>
>> Allow getting consistent docstats from IndexWriter
>> --------------------------------------------------
[jira] [Resolved] (LUCENE-8609) Allow getting consistent docstats from IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer resolved LUCENE-8609.
-------------------------------------
       Resolution: Fixed
    Fix Version/s: 7.7
                   master (8.0)

thanks everybody

> Allow getting consistent docstats from IndexWriter
> --------------------------------------------------
>
>                 Key: LUCENE-8609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8609
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: master (8.0), 7.7
>            Reporter: Simon Willnauer
>            Priority: Major
>             Fix For: master (8.0), 7.7
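The race the issue describes can be made concrete with a small sketch: calling numDocs() and maxDoc() as two separate operations lets concurrent adds, deletes, or flushes interleave between the two reads, which is how callers can observe impossible pairs such as maxDoc < numDocs. Taking both numbers under one lock as an immutable snapshot makes them mutually consistent. The class below is a hypothetical illustration of that idea, not IndexWriter's real implementation:

```java
// Hypothetical sketch of the consistency problem getDocStats() addresses.
class DocStatsDemo {
    // Immutable snapshot: both values come from the same locked state.
    static final class DocStats {
        final int maxDoc;   // all docs, including deleted ones
        final int numDocs;  // live docs only
        DocStats(int maxDoc, int numDocs) { this.maxDoc = maxDoc; this.numDocs = numDocs; }
    }

    private int maxDoc;
    private int numDocs;

    synchronized void addDoc()    { maxDoc++; numDocs++; }
    synchronized void deleteDoc() { numDocs--; }

    // Racy pattern: two lock acquisitions; another thread can add or delete
    // docs between the two calls, so the pair of results may be inconsistent.
    int maxDocRacy()  { synchronized (this) { return maxDoc; } }
    int numDocsRacy() { synchronized (this) { return numDocs; } }

    // Consistent pattern: one lock acquisition, one snapshot of both values,
    // so numDocs <= maxDoc always holds for the returned pair.
    synchronized DocStats getDocStats() { return new DocStats(maxDoc, numDocs); }
}
```

In the racy pattern, a thread that reads maxDocRacy() just before a merge drops deleted docs and numDocsRacy() just after can see maxDoc < numDocs; the snapshot rules that out by construction.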