[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226494#comment-13226494 ]

Doron Cohen commented on LUCENE-3821:
-

{quote}
Not understanding really how SloppyPhraseScorer works now, but not trying to add confusion to the issue, I can't help but think this problem is a variant on LevenshteinAutomata... in fact that was the motivation for the new test, I just stole the testing methodology from there and applied it to this!
{quote}

Interesting! I was not aware of this. I went and read about this automaton; it is relevant.

{quote}
It seems many things are the same but with a few twists:
* fundamentally we are interleaving the streams from the subscorers into the 'index automaton'; the 'query automaton' is produced from the user-supplied terms
{quote}

True. In fact, the current code works hard to decide on the correct interleaving order - while if we had a perfect Levenshtein automaton that took care of the computation, we could just interleave in term-position order (forget about phrase position and all that) and let the automaton compute the distance. This might capture the difficulty in making the sloppy phrase scorer correct: it started with an algorithm that was optimized for exact matching, and adapted (hacked?) it for approximate matching. Instead, starting with a model that fits approximate matching might be easier and cleaner. I like that.

{quote}
* our 'alphabet' is the terms, and holes from position increment are just an additional symbol.
* just like the LevenshteinAutomata case, repeats are problematic because they have different characteristic vectors
* stacked terms at the same position (index or query) just make the automata more complex (so they aren't just strings)

I'm not suggesting we try to re-use any of that code at all, I don't think it will work.
But I wonder if we can re-use even some of the math to redefine the problem more formally, to figure out what minimal state/lookahead we need, for example...
{quote}

I agree, I'll think about this. In the meantime I'll commit this partial fix.

SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
---

Key: LUCENE-3821
URL: https://issues.apache.org/jira/browse/LUCENE-3821
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 3.5, 4.0
Reporter: Naomi Dushay
Assignee: Doron Cohen
Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml

The general bug is a case where a phrase with no slop is found, but if you add slop it's not. I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case; jenkins just hasn't had enough time to chew on it. ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226623#comment-13226623 ]

Doron Cohen commented on LUCENE-3821:
-

Committed:
- r1299077 3x
- r1299112 trunk

bq. I would be glad to try out a nightly build with the patch as is against our tests - I would be glad to get the 80% solution if it fixes my bug.

It's in now...

bq. But I wonder if we can re-use even some of the math to redefine the problem more formally to figure out what minimal state/lookahead we need for example...

Robert, this gave me an idea... currently, in case of collision between repeaters, we compare them and advance the lesser of the two (SloppyPhraseScorer.lesser(PhrasePositions, PhrasePositions)). It should be fairly easy to add lookahead to this logic: if one of the two is multi-term, lesser can also do a lookahead. The amount of lookahead can depend on the slop. I'll give it a try before closing this issue.
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223077#comment-13223077 ]

Doron Cohen commented on LUCENE-3821:
-

Thanks Robert, okay, I'll continue with option 2 then. In addition, we should perhaps try harder for a sloppy version of the current ExactPhraseScorer, for both performance and correctness reasons.

In ExactPhraseScorer, count[posIndex] is incremented by 1 each time the conditions for a match (still) hold. A sloppy version of this, with N terms and slop=S, could increment differently:

{noformat}
1 + N*S      at posIndex
1 + N*S - 1  at posIndex-1 and posIndex+1
1 + N*S - 2  at posIndex-2 and posIndex+2
...
1 + N*S - S  at posIndex-S and posIndex+S
{noformat}

For S=0 this falls back to incrementing by 1 and only at posIndex, same as ExactPhraseScorer, which makes sense.

Also, the success criterion in ExactPhraseScorer, when checking term k, is, before adding up 1 for term k:
* count[posIndex] == k-1

Or, after adding up 1 for term k:
* count[posIndex] == k

In the sloppy version the criterion (after adding up term k) would be:
* count[posIndex] == k*(1+N*S) - S

Again, for S=0 this falls back to the ExactPhraseScorer logic:
* count[posIndex] == k

Mike (and all), correctness-wise, what do you think? If you are wondering why the increment at the actual position is (1 + N*S): it allows matching only posIndexes where all terms contributed something. I drew an example with 5 terms and slop=2 and so far it seems correct. I also tried 2 terms and slop=5; this seems correct as well, except that when there is a match, several posIndexes will contribute to the score of the same match. I think this is not too bad, as for these parameters the behavior would be the same in all documents. I would be especially forgiving of this if it gets us some of the performance benefits of ExactPhraseScorer.
If we agree on correctness, we need to figure out how to implement it, and consider the performance effect. The tricky part is the increment at posIndex-n. Say there are 3 terms in the query and one of the terms is found at indexes 10, 15, 18, and assume the slop is 2. Since N=3, the max increment is 1 + N*S = 1 + 3*2 = 7. So the increments for this term would be (pos, incr):

{noformat}
Pos   Increment
---   ---------
 8    5
 9    6
10    7
11    6
12    5
13    5
14    6
15    7 = max(7,5)
16    6 = max(6,5)
17    6 = max(5,6)
18    7
19    6
20    5
{noformat}

So when we get to posIndex 17, we know that posIndex 15 contributes 5, but we do not yet know about the contribution of posIndex 18, which is 6 and should be used instead of 5. So some look-ahead (or some fix-back) is required, which will complicate the code. If this seems promising, we should probably open a new issue for it; I just wanted to get some feedback first.
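The decayed-increment scheme above can be sketched as follows (an illustrative snippet, not actual scorer code; the method name and shape are made up for this sketch, and contributions beyond distance S are taken as zero):

```java
// Illustrative sketch of the decayed increments described above: with N query
// terms and slop S, an occurrence at position o contributes 1 + N*S - d at
// every position at distance d <= S from o; overlapping contributions take
// the maximum. With N=3, S=2 and occurrences at 10, 15, 18 this reproduces
// the numbers in the table above.
public class DecayedIncrement {
    static int increment(int pos, int[] occurrences, int n, int slop) {
        int best = 0;
        for (int o : occurrences) {
            int d = Math.abs(pos - o);
            if (d <= slop) {
                best = Math.max(best, 1 + n * slop - d);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] occ = {10, 15, 18};
        for (int pos = 8; pos <= 20; pos++) {
            System.out.println(pos + " , " + increment(pos, occ, 3, 2));
        }
    }
}
```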
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223554#comment-13223554 ]

Doron Cohen commented on LUCENE-3821:
-

OK, great! If you did not point out a problem with this up front, there's a good chance it will work, and I'd like to give it a try. I have a first patch - not working or anything - that opens ExactPhraseScorer a bit for extension and adds a class (temporary name) NonExactPhraseScorer. The idea is to hide in the ChunkState the details of decaying frequencies due to slop. I will try it over the weekend. If we can make it work this way, I'd rather do it in this issue than commit the other new code for the fix and then replace it. If that doesn't go quickly, I'll commit the (other) changes to SloppyPhraseScorer and start a new issue.
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223700#comment-13223700 ]

Doron Cohen commented on LUCENE-3821:
-

I'm afraid it won't solve the problem. The complexity of SloppyPhraseScorer stems firstly from the slop; that part has been handled in the scorer for a long time. Two additional complications are repeating terms, and multi-term phrases. Each of these, separately, is handled as well. Their combination, however, is the cause of this discussion.

To prevent two repeating terms from landing on the same document position, we propagate the smaller of them (smaller in its phrase position, which takes into account both the doc position and the offset of that term in the query). Without this special treatment, a phrase query "a b a"~2 might match a document "a b", because both a's (query terms) would land on the same document's a. This is illegal and is prevented by such propagation. But when one of the repeating terms is a multi-term, it is not possible to know which of the repeating terms to propagate. This is the unsolved bug.

Now, back to the current ExactPhraseScorer. It does not have this problem with repeating terms - but not because of the different algorithm; rather because of the different scenario: exact phrase scoring simply does not have this problem. In exact phrase scoring, a match is declared only when all PPs are in the same phrase position. Recalling that phrase position = doc-position - query-offset, it is evident that when two PPs with different query offsets are in the same phrase position, their doc-positions cannot be the same, and therefore no special handling is needed for repeating terms in exact phrase scorers. However, once we add that sloppy-decaying frequency, we will match, at a certain posIndex, different phrase positions - this is because of the slop. So they might land on the same doc-position, and then we start again...
This is really too bad. Sorry for the lengthy post; hopefully this will help when someone wants to get into this. Back to option 2.
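The phrase-position arithmetic behind the repeating-terms problem can be shown with a tiny hand computation (an illustrative sketch only; the real scorer's matching logic is far more involved):

```java
// Why "a b a"~2 must not match the document "a b": the query offsets of the
// two 'a' terms are 0 and 2, and the document's only 'a' is at position 0.
// Without repeat handling, both query 'a's can land on that same doc position.
public class RepeatCollisionDemo {
    public static void main(String[] args) {
        int docPosOfA = 0;                 // the single 'a' in document "a b"
        int pp1 = docPosOfA - 0;           // phrase position of first query 'a'
        int pp2 = docPosOfA - 2;           // phrase position of second query 'a'
        int spread = Math.abs(pp1 - pp2);  // 2, i.e. within slop 2
        // A sloppy match would be declared even though the document contains
        // only one 'a' - hence repeating PPs must be propagated apart.
        System.out.println("spread=" + spread + " within slop: " + (spread <= 2));
    }
}
```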
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221840#comment-13221840 ]

Doron Cohen commented on LUCENE-3821:
-

The remaining failure still happens with the updated patch (same seed), and still seems to me like an ExactPhraseScorer bug. However, it is probably not a simple one, because when added to TestMultiPhraseQuery it passes - that is, no documents are matched, as expected - although this supposedly creates the exact scenario that failed above. Perhaps ExactPhraseScorer behavior too is influenced by other docs processed so far.

{code:title=Add this to TestMultiPhraseQuery}
public void test_LUCENE_XYZ() throws Exception {
  Directory indexStore = newDirectory();
  RandomIndexWriter writer = new RandomIndexWriter(random, indexStore);
  add("s o b h j t j z o", "LUCENE-XYZ", writer);
  IndexReader reader = writer.getReader();
  IndexSearcher searcher = newSearcher(reader);
  MultiPhraseQuery q = new MultiPhraseQuery();
  q.add(new Term[] {new Term("body", "j"), new Term("body", "o"), new Term("body", "s")});
  q.add(new Term[] {new Term("body", "i"), new Term("body", "b"), new Term("body", "j")});
  q.add(new Term[] {new Term("body", "t"), new Term("body", "d")});
  assertEquals("Wrong number of hits", 0, searcher.search(q, null, 1).totalHits);
  // just make sure no exc:
  searcher.explain(q, 0);
  writer.close();
  searcher.close();
  reader.close();
  indexStore.close();
}
{code}
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221867#comment-13221867 ]

Doron Cohen commented on LUCENE-3821:
-

Update: apparently MultiPhraseQuery.toString() does not print its holes. So the query that failed was not:

{noformat}field:(j o s) (i b j) (t d){noformat}

But rather:

{noformat}(j o s) ? (i b j) ? ? (t d){noformat}

Which is a different story: this query should match the document

{noformat}s o b h j t j z o{noformat}

There is a match for ExactPhraseScorer, but not for Sloppy with slop 1. So there is still work to do on SloppyPhraseScorer... (I'll fix MultiPhraseQuery.toString() as well.)
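A toString that does print holes could look roughly like this (an illustrative sketch; the method name and the parallel term-lists/positions representation are assumptions for this example, not MultiPhraseQuery's actual internals):

```java
// Sketch: render a multi-phrase query, emitting "?" for each position gap
// between consecutive term lists, as in "(j o s) ? (i b j) ? ? (t d)".
public class PhraseToStringDemo {
    static String phraseToString(String[][] termLists, int[] positions) {
        StringBuilder sb = new StringBuilder();
        int expected = 0;
        for (int i = 0; i < termLists.length; i++) {
            // one "?" per hole before this term list's query position
            for (; expected < positions[i]; expected++) {
                sb.append("? ");
            }
            sb.append('(').append(String.join(" ", termLists[i])).append(") ");
            expected = positions[i] + 1;
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String[][] lists = {{"j", "o", "s"}, {"i", "b", "j"}, {"t", "d"}};
        int[] positions = {0, 2, 5};
        System.out.println(phraseToString(lists, positions));
        // prints: (j o s) ? (i b j) ? ? (t d)
    }
}
```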
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221879#comment-13221879 ]

Doron Cohen commented on LUCENE-3821:
-

I think I understand the cause. In the current implementation there is an assumption that once we land on the first candidate document, it is possible to check whether there are repeating PPs by just comparing the in-doc positions of the terms. Tricky as it is, while this is true for plain PhrasePositions, it is not true for multi-term PhrasePositions (MPPs). Assume two MPPs, (a m n) and (b x y), and a first candidate document that starts with "a b". The in-doc positions of the two PPs would be 0 and 1 respectively (for 'a' and 'b'), and we would not even detect the fact that there are repetitions, let alone put them in the same group.

MPPs conflict with the current patch in an additional manner. It is now assumed that each repetition can be assigned a repetition group. So assume these PPs (and query positions): 0:a 1:b 3:a 4:b 7:c. There are clearly two repetition groups, {0:a, 3:a} and {1:b, 4:b}, while 7:c is not a repetition. But assume these PPs (and query positions): 0:(a b) 1:(b x) 3:a 4:b 7:(c x). We end up with a single large repetition group: {0:(a b), 1:(b x), 3:a, 4:b, 7:(c x)}.

I think if the groups are created correctly at the first candidate document, the scorer logic will still work, as a collision is declared only when two PPs are in the same in-doc position. The only impact of MPPs would be a performance cost: since repetition groups are larger, it would take longer to check if there are repetitions. We just need to figure out how to detect repetition groups without relying on in-(first-)doc positions.
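The grouping just described - positions whose term sets transitively intersect fall into one repetition group - can be computed without any reference to in-doc positions. A sketch under assumed data shapes (the class, method, and representation are made up for illustration, not the actual SloppyPhraseScorer structures):

```java
import java.util.*;

public class RepetitionGroups {
    // Returns groups of indexes whose term sets transitively intersect;
    // singletons (non-repeating terms) are dropped, since they need no
    // special repeat handling.
    static List<List<Integer>> groups(List<Set<String>> termSets) {
        int n = termSets.size();
        int[] g = new int[n];
        for (int i = 0; i < n; i++) g[i] = i;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (!Collections.disjoint(termSets.get(i), termSets.get(j))) {
                    // merge j's group into i's group
                    int from = g[j], to = g[i];
                    for (int k = 0; k < n; k++) if (g[k] == from) g[k] = to;
                }
            }
        }
        Map<Integer, List<Integer>> byGroup = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            byGroup.computeIfAbsent(g[i], x -> new ArrayList<>()).add(i);
        }
        List<List<Integer>> result = new ArrayList<>();
        for (List<Integer> grp : byGroup.values()) {
            if (grp.size() > 1) result.add(grp);
        }
        return result;
    }

    public static void main(String[] args) {
        // a b a b c  ->  two groups; 'c' is not a repetition
        List<Set<String>> plain = List.of(Set.of("a"), Set.of("b"),
            Set.of("a"), Set.of("b"), Set.of("c"));
        System.out.println(groups(plain));   // [[0, 2], [1, 3]]
        // (a b) (b x) a b (c x)  ->  one large group, as in the comment above
        List<Set<String>> multi = List.of(Set.of("a", "b"), Set.of("b", "x"),
            Set.of("a"), Set.of("b"), Set.of("c", "x"));
        System.out.println(groups(multi));   // [[0, 1, 2, 3, 4]]
    }
}
```

As the comment notes, with multi-term positions the groups get larger, which costs time but keeps the collision logic itself unchanged.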
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221737#comment-13221737 ]

Doron Cohen commented on LUCENE-3821:
-

I understand the problem. It all has to do - as Robert mentioned - with the repeating terms in the phrase query. I am working on a patch - it will change the way that repeats are handled. Repeating PPs require additional computation, and the current SloppyPhraseScorer attempts to do that additional work efficiently, but over-simplifies and fails to cover all cases. At the core of things, each time a repeating PP is selected (from the queue) and propagated, *all* its sibling repeaters are propagated as well, to prevent a case where two repeating PPs point to the same document position (which was the bug that originally triggered handling repeats in this code). But this is wrong: propagating all sibling repeaters misses some cases. Also, the code is hard to read, as Mike noted in LUCENE-2410 ([this comment|https://issues.apache.org/jira/browse/LUCENE-2410?focusedCommentId=12879443&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12879443]). So this is a chance to also make the code more maintainable. I have a working version which is not ready to commit yet; all the tests pass, except for one which I think is a bug in ExactPhraseScorer, but maybe I am missing something.
The case that fails is this:

{noformat}
AssertionError: Missing in super-set:
doc 706
q1: field:"(j o s) (i b j) (t d)"
q2: field:"(j o s) (i b j) (t d)"~1
td1: [doc=706 score=7.7783184 shardIndex=-1, doc=175 score=6.222655 shardIndex=-1]
td2: [doc=523 score=5.5001016 shardIndex=-1, doc=957 score=5.5001016 shardIndex=-1, doc=228 score=4.400081 shardIndex=-1, doc=357 score=4.400081 shardIndex=-1, doc=390 score=4.400081 shardIndex=-1, doc=503 score=4.400081 shardIndex=-1, doc=602 score=4.400081 shardIndex=-1, doc=757 score=4.400081 shardIndex=-1, doc=758 score=4.400081 shardIndex=-1]
doc 706: Document<stored,indexed,tokenized<field:s o b h j t j z o>>
{noformat}

It seems that q1 too should not match this document?
[jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
[ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219156#comment-13219156 ]

Doron Cohen commented on LUCENE-3821:
-

Fails here too, like this:

ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtestmethod=testRandomIncreasingSloppiness -Dtests.seed=-171bbb992c697625:203709d611c854a5:1ca48cb6d33b3f74 -Dargs="-Dfile.encoding=UTF-8"

I'll look into it.
[jira] [Commented] (LUCENE-3746) suggest.fst.Sort.BufferSize should not automatically fail just because of freeMemory()
[ https://issues.apache.org/jira/browse/LUCENE-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201073#comment-13201073 ]

Doron Cohen commented on LUCENE-3746:
-

Thanks Dawid!

{quote}
it's probably a system daemon thread for sending memory threshold notifications
{quote}

Yes, this makes sense. Still, the difference between the two JDKs was bothersome. Some more digging, and now I think it is clear. Here are the stack traces reported (at the end of the test) with Oracle:

{noformat}
1. Thread[ReaderThread,5,main]
2. Thread[main,5,main]
3. Thread[Reference Handler,10,system]
4. Thread[Signal Dispatcher,9,system]
5. Thread[Finalizer,8,system]
6. Thread[Attach Listener,5,system]
{noformat}

And with the IBM JDK:

{noformat}
1. Thread[Attach API wait loop,10,main]
2. Thread[Finalizer thread,5,system]
3. Thread[JIT Compilation Thread,10,system]
4. Thread[main,5,main]
5. Thread[Gc Slave Thread,5,system]
6. Thread[ReaderThread,5,main]
7. Thread[Signal Dispatcher,5,main]
8. Thread[MemoryPoolMXBean notification dispatcher,6,main]
{noformat}

The 8th thread is the one that started only after accessing the memory management layer. The thing is, in the IBM JDK that thread is created in the thread group "main", while in the Oracle JDK it is created under "system". To me the latter makes more sense.
To be more certain, I added a fake memory notification listener and checked the thread in which the notification happens:

{code}
MemoryMXBean mmxb = ManagementFactory.getMemoryMXBean();
NotificationListener listener = new NotificationListener() {
  @Override
  public void handleNotification(Notification notification, Object handback) {
    System.out.println(Thread.currentThread());
  }
};
((NotificationEmitter) mmxb).addNotificationListener(listener, null, null);
{code}

Evidently, in the IBM JDK the notification is in a "main"-group thread (also in line with the thread group in the original warning message which triggered this threads discussion):

{noformat}
Thread[MemoryPoolMXBean notification dispatcher,6,main]
{noformat}

While in the Oracle JDK the notification is in a "system"-group thread:

{noformat}
Thread[Low Memory Detector,9,system]
{noformat}

This also explains why the warning is reported only for the IBM JDK: the threads check in LTC only accounts for threads in the same thread group as the one running the specific test case. So when dispatching happens in a "system"-group thread, it is not sensed by that check at all. OK, now with the mystery solved I can commit the simpler code...

suggest.fst.Sort.BufferSize should not automatically fail just because of freeMemory()
---

Key: LUCENE-3746
URL: https://issues.apache.org/jira/browse/LUCENE-3746
Project: Lucene - Java
Issue Type: Bug
Components: modules/spellchecker
Reporter: Doron Cohen
Fix For: 3.6, 4.0
Attachments: LUCENE-3746.patch, LUCENE-3746.patch, LUCENE-3746.patch

Follow up on dev thread: [FSTCompletionTest failure "At least 0.5MB RAM buffer is needed" | http://markmail.org/message/d7ugfo5xof4h5jeh]
[jira] [Commented] (LUCENE-3746) suggest.fst.Sort.BufferSize should not automatically fail just because of freeMemory()
[ https://issues.apache.org/jira/browse/LUCENE-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199038#comment-13199038 ]

Doron Cohen commented on LUCENE-3746:
-

{quote}
[Dawid:|http://markmail.org/message/jobtemqm4u4vrxze] (maxMemory - totalMemory) because that's how much the heap can grow? The problem is none of this is atomic, so the result can be unpredictable. There are other methods in the management interface that permit somewhat more detailed checks. Don't know if they guarantee atomicity of the returned snapshot, but I doubt it.
- [MemoryMXBean.getHeapMemoryUsage()|http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/management/MemoryMXBean.html#getHeapMemoryUsage()]
- [MemoryPoolMXBean.getPeakUsage()|http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/management/MemoryPoolMXBean.html#getPeakUsage()]
{quote}

The current patch does not (yet) handle the atomicity issue Dawid described.
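The headroom estimate Dawid alludes to can be sketched as follows (an illustrative snippet; as the quote notes, the three reads are not atomic, so the value is only an estimate):

```java
// Estimate of memory still available to the JVM: the free part of the
// currently committed heap plus the amount the heap may still grow.
// Each Runtime call is a separate snapshot, so GC or allocation between
// the calls can skew the result - hence "estimate", not a guarantee.
public class HeapHeadroom {
    static long estimateHeadroomBytes() {
        Runtime rt = Runtime.getRuntime();
        long free = rt.freeMemory();     // unused part of committed heap
        long total = rt.totalMemory();   // currently committed heap
        long max = rt.maxMemory();       // heap ceiling (e.g. -Xmx)
        return free + (max - total);
    }

    public static void main(String[] args) {
        System.out.println(estimateHeadroomBytes() / (1024 * 1024) + " MB");
    }
}
```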
[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)
[ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196845#comment-13196845 ] Doron Cohen commented on LUCENE-1812: - While merging to trunk I noticed that Idea's settings for modules/queries and modules/queryparser refer to lucene/contrib instead of modules. Seems trivial to fix, but I have no Idea installed at the moment so no way to verify. Created LUCENE-3737 to handle that later. Static index pruning by in-document term frequency (Carmel pruning) --- Key: LUCENE-1812 URL: https://issues.apache.org/jira/browse/LUCENE-1812 Project: Lucene - Java Issue Type: New Feature Components: modules/other Reporter: Andrzej Bialecki Assignee: Doron Cohen Fix For: 3.6, 4.0 Attachments: pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch This module provides tools to produce a subset of input indexes by removing postings data for those terms whose in-document frequency is below a specified threshold. The net effect of this processing is a much smaller index that for common types of queries returns nearly identical top-N results as compared with the original index, but with increased performance. Optionally, stored values and term vectors can also be removed. This functionality is largely independent, so it can be used without term pruning (when the term freq. threshold is set to 1). As the threshold value increases, the total size of the index decreases, search performance increases, and recall decreases (i.e. search quality deteriorates). NOTE: phrase recall in particular deteriorates significantly at higher threshold values. The primary purpose of this class is to produce small first-tier indexes that fit completely in RAM, and to store these indexes using IndexWriter.addIndexes(IndexReader[]). Usually the performance of this class will not be sufficient to use the resulting index view for on-the-fly pruning and searching.
NOTE: If the input index is optimized (i.e. doesn't contain deletions) then the index produced via IndexWriter.addIndexes(IndexReader[]) will preserve internal document ids so that they stay in sync with the original index. This means that all other auxiliary information not necessary for first-tier processing, such as some stored fields, can also be removed, to be quickly retrieved on demand from the original index using the same internal document id. Threshold values can be specified globally (for terms in all fields) using the defaultThreshold parameter, and can be overridden using per-field or per-term values supplied in a thresholds map. Keys in this map are either field names, or terms in field:text format. The precedence of these values is the following: first a per-term threshold is used if present, then a per-field threshold if present, and finally the default threshold. A command-line tool (PruningTool) is provided for convenience. At the moment it doesn't support all functionality available through the API.
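The threshold precedence just described (per-term first, then per-field, then the default) can be sketched as follows. These are hypothetical class and method names for illustration only, not the actual pruning module's API:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the threshold lookup order: a per-term value ("field:text")
 *  wins over a per-field value ("field"), which wins over the default.
 *  Hypothetical names; the real pruning classes may differ. */
public class ThresholdResolver {
    private final int defaultThreshold;
    private final Map<String, Integer> thresholds; // keys: "field" or "field:text"

    public ThresholdResolver(int defaultThreshold, Map<String, Integer> thresholds) {
        this.defaultThreshold = defaultThreshold;
        this.thresholds = thresholds;
    }

    public int thresholdFor(String field, String text) {
        Integer t = thresholds.get(field + ":" + text); // per-term, if present
        if (t == null) t = thresholds.get(field);       // then per-field
        return t != null ? t : defaultThreshold;        // finally the default
    }

    public static void main(String[] args) {
        Map<String, Integer> m = new HashMap<>();
        m.put("body", 3);
        m.put("body:lucene", 5);
        ThresholdResolver r = new ThresholdResolver(1, m);
        System.out.println(r.thresholdFor("body", "lucene")); // per-term entry applies
    }
}
```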
[jira] [Commented] (LUCENE-3737) Idea modules settings - verify and fix
[ https://issues.apache.org/jira/browse/LUCENE-3737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196858#comment-13196858 ] Doron Cohen commented on LUCENE-3737: - In dev-tools/idea/.idea/ant.xml there are these two:
{code}
<buildFile url="file://$PROJECT_DIR$/lucene/contrib/queries/build.xml" />
<buildFile url="file://$PROJECT_DIR$/lucene/contrib/queryparser/build.xml" />
{code}
I assume this has the potential to break an Idea setup, but I haven't tried it yet - I just wanted not to forget about it, hence this issue. Is this a non-issue? Idea modules settings - verify and fix -- Key: LUCENE-3737 URL: https://issues.apache.org/jira/browse/LUCENE-3737 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Doron Cohen Assignee: Doron Cohen Priority: Trivial Idea's settings for modules/queries and modules/queryparser refer to lucene/contrib instead of modules.
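For reference, the corrected entries would presumably point at the modules tree instead, something like the following (the exact paths are an assumption on my part, not verified against the actual trunk layout):

```xml
<buildFile url="file://$PROJECT_DIR$/modules/queries/build.xml" />
<buildFile url="file://$PROJECT_DIR$/modules/queryparser/build.xml" />
```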
[jira] [Commented] (LUCENE-3737) Idea modules settings - verify and fix
[ https://issues.apache.org/jira/browse/LUCENE-3737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197014#comment-13197014 ] Doron Cohen commented on LUCENE-3737: - Yes, I only saw this on trunk. Thanks for taking care of it! Idea modules settings - verify and fix -- Key: LUCENE-3737 URL: https://issues.apache.org/jira/browse/LUCENE-3737 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Doron Cohen Assignee: Steven Rowe Priority: Trivial Fix For: 4.0 Idea's settings for modules/queries and modules/queryparser refer to lucene/contrib instead of modules.
[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)
[ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196339#comment-13196339 ] Doron Cohen commented on LUCENE-1812: - That dead code was removed and some javadocs were added. There is still room for more javadocs - e.g. for the static tool - and for better test coverage. Committed to 3x: r1237937. Static index pruning by in-document term frequency (Carmel pruning) --- Key: LUCENE-1812 URL: https://issues.apache.org/jira/browse/LUCENE-1812 Project: Lucene - Java Issue Type: New Feature Components: modules/other Reporter: Andrzej Bialecki Assignee: Doron Cohen Fix For: 3.6, 4.0 Attachments: pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch
[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)
[ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196429#comment-13196429 ] Doron Cohen commented on LUCENE-1812: - bq. Excellent, thanks for seeing this through! Yeah, with only a bit more than a year's delay ;) BTW in trunk it will be under modules. Static index pruning by in-document term frequency (Carmel pruning) --- Key: LUCENE-1812 URL: https://issues.apache.org/jira/browse/LUCENE-1812 Project: Lucene - Java Issue Type: New Feature Components: modules/other Reporter: Andrzej Bialecki Assignee: Doron Cohen Fix For: 3.6, 4.0 Attachments: pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch
[jira] [Commented] (LUCENE-3718) SamplingWrapperTest failure with certain test seed
[ https://issues.apache.org/jira/browse/LUCENE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192031#comment-13192031 ] Doron Cohen commented on LUCENE-3718: - Well, this is not a test bug after all - it actually exposes a bug in Lucene40PostingsReader. SamplingWrapperTest failure with certain test seed -- Key: LUCENE-3718 URL: https://issues.apache.org/jira/browse/LUCENE-3718 Project: Lucene - Java Issue Type: Bug Components: modules/facet Reporter: Doron Cohen Assignee: Doron Cohen Fix For: 3.6, 4.0 Build: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/12231/ 1 tests failed. REGRESSION: org.apache.lucene.facet.search.SamplingWrapperTest.testCountUsingSamping Error Message: Results are not the same! Stack Trace: org.apache.lucene.facet.FacetTestBase$NotSameResultError: Results are not the same! at org.apache.lucene.facet.FacetTestBase.assertSameResults(FacetTestBase.java:333) at org.apache.lucene.facet.search.sampling.BaseSampleTestTopK.assertSampling(BaseSampleTestTopK.java:104) at org.apache.lucene.facet.search.sampling.BaseSampleTestTopK.testCountUsingSamping(BaseSampleTestTopK.java:82) at org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:529) at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165) at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57) NOTE: reproduce with: ant test -Dtestcase=SamplingWrapperTest -Dtestmethod=testCountUsingSamping -Dtests.seed=4a5994491f79fc80:-18509d134c89c159:-34f6ecbb32e930f7 -Dtests.multiplier=3 -Dargs=-Dfile.encoding=UTF-8 NOTE: test params are: codec=Lucene40: {$facets=PostingsFormat(name=MockRandom), $full_path$=PostingsFormat(name=MockSep), content=Pulsing40(freqCutoff=19 minBlockSize=65 maxBlockSize=209), $payloads$=PostingsFormat(name=Lucene40WithOrds)}, sim=RandomSimilarityProvider(queryNorm=true,coord=true): {$facets=LM Jelinek-Mercer(0.70), content=DFR I(n)B3(800.0)}, locale=bg, 
timezone=Asia/Manila
[jira] [Commented] (LUCENE-3718) SamplingWrapperTest failure with certain test seed
[ https://issues.apache.org/jira/browse/LUCENE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192046#comment-13192046 ] Doron Cohen commented on LUCENE-3718: - Fix committed in r1235190 (trunk). I added no CHANGES entry - that seems like overkill here... other opinions? SamplingWrapperTest failure with certain test seed -- Key: LUCENE-3718 URL: https://issues.apache.org/jira/browse/LUCENE-3718 Project: Lucene - Java Issue Type: Bug Components: modules/facet Reporter: Doron Cohen Assignee: Doron Cohen Fix For: 3.6, 4.0 Attachments: LUCENE-3718.patch, LUCENE-3718.patch
[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)
[ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192101#comment-13192101 ] Doron Cohen commented on LUCENE-1812: - I ran 'javadocs' under 3x/lucene/contrib/pruning and 'javadocs-all' under 3x/lucene. The latter failed due to multiple package.html files under o.a.l.index - one in core and one under contrib/pruning. Entirely renaming the package to o.a.l.pruning.index won't work because PruningReader accesses the package-protected SegmentTermVector. I can move the other classes to that new package and keep only PruningReader in the index friend package (unless there are javadoc/ant tricks that would avoid this error and still generate valid javadocs in both cases). Static index pruning by in-document term frequency (Carmel pruning) --- Key: LUCENE-1812 URL: https://issues.apache.org/jira/browse/LUCENE-1812 Project: Lucene - Java Issue Type: New Feature Components: modules/other Reporter: Andrzej Bialecki Assignee: Doron Cohen Fix For: 3.6, 4.0 Attachments: pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch
[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)
[ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191206#comment-13191206 ] Doron Cohen commented on LUCENE-1812: - Getting to this at last. I did not handle the above TODOs - I'd rather commit so they can be handled later separately (progress, not perfection, as Mike says). Changes in this patch: - PruningReader also overrides getSequentialSubReaders(), otherwise no pruning takes place on sub-readers (and tests fail). - StorePruningPolicy fixed to use the FieldInfos API. I modified for Idea and maven by following templates of other contrib components, but I have no way to test this and would appreciate a review. Static index pruning by in-document term frequency (Carmel pruning) --- Key: LUCENE-1812 URL: https://issues.apache.org/jira/browse/LUCENE-1812 Project: Lucene - Java Issue Type: New Feature Components: modules/other Reporter: Andrzej Bialecki Assignee: Doron Cohen Fix For: 3.6, 4.0 Attachments: pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch
[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)
[ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13191209#comment-13191209 ] Doron Cohen commented on LUCENE-1812: - I now see that all other contrib components have svn:ignore for *.iml and pom.xml - I'll add that for pruning as well (though it is not in the attached patch). Static index pruning by in-document term frequency (Carmel pruning) --- Key: LUCENE-1812 URL: https://issues.apache.org/jira/browse/LUCENE-1812 Project: Lucene - Java Issue Type: New Feature Components: modules/other Reporter: Andrzej Bialecki Assignee: Doron Cohen Fix For: 3.6, 4.0 Attachments: pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch
[jira] [Commented] (LUCENE-1812) Static index pruning by in-document term frequency (Carmel pruning)
[ https://issues.apache.org/jira/browse/LUCENE-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13191222#comment-13191222 ] Doron Cohen commented on LUCENE-1812: - bq. I didn't test them, but I will once they have been committed. Great, thanks! Static index pruning by in-document term frequency (Carmel pruning) --- Key: LUCENE-1812 URL: https://issues.apache.org/jira/browse/LUCENE-1812 Project: Lucene - Java Issue Type: New Feature Components: modules/other Reporter: Andrzej Bialecki Assignee: Doron Cohen Fix For: 3.6, 4.0 Attachments: pruning.patch, pruning.patch, pruning.patch, pruning.patch, pruning.patch
[jira] [Commented] (LUCENE-3718) SamplingWrapperTest failure with certain test seed
[ https://issues.apache.org/jira/browse/LUCENE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191963#comment-13191963 ] Doron Cohen commented on LUCENE-3718: - Failure consistently recreated with these parameters. It is most likely a test bug, but still annoying. Should also rename the misspelled method: testCountUsingSamping() should be testCountUsingSampling(). SamplingWrapperTest failure with certain test seed -- Key: LUCENE-3718 URL: https://issues.apache.org/jira/browse/LUCENE-3718 Project: Lucene - Java Issue Type: Bug Components: modules/facet Reporter: Doron Cohen Assignee: Doron Cohen Fix For: 3.6, 4.0
[jira] [Commented] (LUCENE-3703) DirectoryTaxonomyReader.refresh misbehaves with ref counts
[ https://issues.apache.org/jira/browse/LUCENE-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13189036#comment-13189036 ] Doron Cohen commented on LUCENE-3703: - Missed that test comment about no need for random directory. About the decRef dup code, yeah, that's what I meant, but okay. I think this is ready to commit. DirectoryTaxonomyReader.refresh misbehaves with ref counts -- Key: LUCENE-3703 URL: https://issues.apache.org/jira/browse/LUCENE-3703 Project: Lucene - Java Issue Type: Bug Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 3.6, 4.0 Attachments: LUCENE-3703.patch, LUCENE-3703.patch DirectoryTaxonomyReader uses the internal IndexReader in order to track its own reference counting. However, when you call refresh(), it reopens the internal IndexReader, and from that point, all previous reference counting gets lost (since the new IndexReader's refCount is 1). The solution is to track reference counting in DTR itself. I wrote a simple unit test which exposes the bug (will be attached with the patch shortly).
[jira] [Commented] (LUCENE-3703) DirectoryTaxonomyReader.refresh misbehaves with ref counts
[ https://issues.apache.org/jira/browse/LUCENE-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13188975#comment-13188975 ] Doron Cohen commented on LUCENE-3703: - Patch looks good, builds and passes for me, thanks for fixing this Shai. A few comments: * CHANGES: rephrase the e.g. part like this: (e.g. if application called incRef/decRef). * New test: ** LTC.newDirectory() instead of new RAMDirectory(). ** text messages in the asserts. * DTR: ** Would it be simpler to make close() synchronized (just like IR.close())? ** Would it - again - be simpler to keep maintaining the ref-counts in the internal IR and just, in refresh, decRef as needed in the old one and incRef accordingly in the new one? This way we continue to delegate that logic to IR, and do not duplicate it. ** Current patch removes the ensureOpen() check from getRefCount(). I think this is correct - in fact I needed that when debugging this. Perhaps we should document it in the CHANGES entry. DirectoryTaxonomyReader.refresh misbehaves with ref counts -- Key: LUCENE-3703 URL: https://issues.apache.org/jira/browse/LUCENE-3703 Project: Lucene - Java Issue Type: Bug Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 3.6, 4.0 Attachments: LUCENE-3703.patch DirectoryTaxonomyReader uses the internal IndexReader in order to track its own reference counting. However, when you call refresh(), it reopens the internal IndexReader, and from that point, all previous reference counting gets lost (since the new IndexReader's refCount is 1). The solution is to track reference counting in DTR itself. I wrote a simple unit test which exposes the bug (will be attached with the patch shortly).
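The delegation suggested above (keep maintaining the ref-counts in the internal reader, and on refresh transfer the outstanding count from the old reader to the new one) can be illustrated in isolation. RefCounted below is a toy stand-in for IndexReader's refCount machinery, not Lucene API; the class and method names are illustrative only.

```java
import java.util.concurrent.atomic.AtomicInteger;

class RefCounted {
    private final AtomicInteger refCount = new AtomicInteger(1);
    void incRef() { refCount.incrementAndGet(); }
    void decRef() { refCount.decrementAndGet(); }
    int getRefCount() { return refCount.get(); }
}

public class RefreshHandOff {
    /** Returns {newReader count, oldReader count} after the hand-off. */
    static int[] demo() {
        RefCounted oldReader = new RefCounted();
        oldReader.incRef();                      // application holds one extra ref
        int outstanding = oldReader.getRefCount();

        RefCounted newReader = new RefCounted(); // reopened reader starts at 1
        for (int i = 1; i < outstanding; i++) {
            newReader.incRef();                  // carry the extra refs over
        }
        for (int i = 0; i < outstanding; i++) {
            oldReader.decRef();                  // fully release the old reader
        }
        return new int[] { newReader.getRefCount(), oldReader.getRefCount() };
    }

    public static void main(String[] args) {
        int[] counts = demo();
        System.out.println(counts[0] + " " + counts[1]); // prints "2 0"
    }
}
```

This keeps all counting logic in one place, so refresh() never leaves the application's extra references behind.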
[jira] [Commented] (LUCENE-3635) Allow setting arbitrary objects on PerfRunData
[ https://issues.apache.org/jira/browse/LUCENE-3635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172215#comment-13172215 ] Doron Cohen commented on LUCENE-3635: - Patch looks good. bq. I do not propose to move IR/IW/TR/TW etc. into that map. If however people think that we should, I can do that as well. I'd rather keep these explicit as they are now. bq. I wonder if we should have this Map require Closeable so that we can close the objects on PerfRunData.close() Closing would be convenient, but I think requiring a Closeable is too restrictive. Instead, you could add something like this to close(): {code} for (Object o : perfObjects.values()) { if (o instanceof Closeable) { IOUtils.close((Closeable) o); } } {code} This is done only once at the end, so instanceof is not a perf issue here. If we close like this, we also need to document it at setPerfObject(). I think, BTW, that PFD.close() is not called by the Benchmark, it has to be explicitly invoked by the user. Allow setting arbitrary objects on PerfRunData -- Key: LUCENE-3635 URL: https://issues.apache.org/jira/browse/LUCENE-3635 Project: Lucene - Java Issue Type: Improvement Components: modules/benchmark Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.6, 4.0 Attachments: LUCENE-3635.patch PerfRunData is used as the intermediary object between PerfRunTasks. Just like we can set IndexReader/Writer on it, it will be good if it allows setting other arbitrary objects that are e.g. created by one task and used by another. A recent example is the enhancement to the benchmark package following the addition of the facet module. We had to add TaxoReader/Writer. The proposal is to add a HashMap<String, Object> that custom PerfTasks can set()/get(). I do not propose to move IR/IW/TR/TW etc. into that map. If however people think that we should, I can do that as well.
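The {code} snippet above can be fleshed out into a self-contained sketch of the proposed perf-objects map. The names here (PerfObjects, setPerfObject, getPerfObject) are illustrative stand-ins, not the committed benchmark API.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class PerfObjects {
    private final Map<String, Object> perfObjects = new HashMap<>();

    public void setPerfObject(String key, Object obj) {
        perfObjects.put(key, obj);
    }

    public Object getPerfObject(String key) {
        return perfObjects.get(key);
    }

    /**
     * Close any stored object that happens to implement Closeable;
     * everything else is simply dropped. Documented behavior, as the
     * comment above suggests, so task authors know what to expect.
     */
    public void close() throws IOException {
        for (Object o : perfObjects.values()) {
            if (o instanceof Closeable) {
                ((Closeable) o).close();
            }
        }
        perfObjects.clear();
    }
}
```

Since close() runs once at the end of a benchmark, the instanceof check adds no measurable overhead.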
[jira] [Commented] (LUCENE-3604) 3x/lucene/contrib/CHANGES.txt has two API Changes subsections for 3.5.0
[ https://issues.apache.org/jira/browse/LUCENE-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13158306#comment-13158306 ] Doron Cohen commented on LUCENE-3604: - Fixed the 3x file in r1207018 - ordering the API Changes entries by their date (by svn log). Keeping open for fixing the Changes.html that already appears on the Web site. 3x/lucene/contrib/CHANGES.txt has two API Changes subsections for 3.5.0 - Key: LUCENE-3604 URL: https://issues.apache.org/jira/browse/LUCENE-3604 Project: Lucene - Java Issue Type: Bug Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor There are two API Changes sections, which is confusing when looking at the txt version of the file. The HTML expands only the first of the two, unless expand-all is clicked.
[jira] [Commented] (LUCENE-3604) 3x/lucene/contrib/CHANGES.txt has two API Changes subsections for 3.5.0
[ https://issues.apache.org/jira/browse/LUCENE-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13159086#comment-13159086 ] Doron Cohen commented on LUCENE-3604: - bq. The new version will show up on the website once the periodic resync happens. [3.5-contrib-changes|http://lucene.apache.org/java/3_5_0/changes/Contrib-Changes.html#3.5.0.api_changes] now shows the correct API changes. Thanks Steven! 3x/lucene/contrib/CHANGES.txt has two API Changes subsections for 3.5.0 - Key: LUCENE-3604 URL: https://issues.apache.org/jira/browse/LUCENE-3604 Project: Lucene - Java Issue Type: Bug Reporter: Doron Cohen Assignee: Steven Rowe Priority: Minor Fix For: 3.5 There are two API Changes sections, which is confusing when looking at the txt version of the file. The HTML expands only the first of the two, unless expand-all is clicked.
[jira] [Commented] (LUCENE-3596) DirectoryTaxonomyWriter extensions should be able to set internal index writer config attributes such as info stream
[ https://issues.apache.org/jira/browse/LUCENE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13157717#comment-13157717 ] Doron Cohen commented on LUCENE-3596: - Also, there seems to be a bug in the current taxonomy writer test - TestIndexClose - where the IndexWriterConfig's merge policy might allow merging segments out-of-order. That test calls LTC.newIndexWriterConfig() and it is just by luck that this test has not failed so far. This is a bad type of failure for an application (is there ever a good type? ;)), because by the time the bug is exposed it would show up as a wrong facet returned in faceted search, and good luck figuring out, that late, that this is because an index writer created at an earlier time allowed out-of-order merging... Therefore, it would be useful if, in addition to the javadocs about the required type of merge policy, we would also throw an exception (IllegalArgument or IO) if the IWC's merge policy allows merging out-of-order. This should be checked in two locations: - when createIWC() returns - when openIndex() returns, by examining the IWC of the index. The second check is more involved, as it is done after the index was already opened, so the index must be closed prior to throwing that exception. However, MergePolicy does not have in its contract anything like Collector.acceptsDocsOutOfOrder(), so it is not possible to verify this at all. Adding such a method to MergePolicy seems to me overkill for this particular case, unless there is additional interest in such a declaration? Otherwise, it is possible to require that the merge policy be a descendant of LogMergePolicy. This, on the other hand, would not allow testing this class with other order-preserving policies, such as NoMerge. So I am not sure what is the best way to proceed in this regard. I think there are two options actually: # just javadoc that fact, and fix the test to always create an order-preserving MP. # add that declaration to MP.
Unless there are opinions favoring the second option, I'll go with the first one. In addition (this is true for both options), I will move the call to createIWC into the constructor and modify the openIndex signature to accept an IWC instead of the open mode, as it seems wrong - API wise - that one extension point (createIWC) is invoked by another extension point (openIndex) - better have them both be invoked from the constructor, making it harder for someone to, by mistake, totally ignore in createIndex() the value returned by createIWC(). DirectoryTaxonomyWriter extensions should be able to set internal index writer config attributes such as info stream Key: LUCENE-3596 URL: https://issues.apache.org/jira/browse/LUCENE-3596 Project: Lucene - Java Issue Type: Improvement Components: modules/facet Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor Attachments: LUCENE-3596.patch Current protected openIndexWriter(Directory directory, OpenMode openMode) does not provide access to the IWC it creates. So extensions must reimplement this method completely in order to set e.g. the info stream for the internal index writer. This came up in [user question: Taxonomy indexer debug |http://lucene.472066.n3.nabble.com/Taxonomy-indexer-debug-td3533341.html]
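The "require a LogMergePolicy descendant" variant weighed above can be sketched with stand-in classes - these are minimal models, not Lucene's real MergePolicy hierarchy, and as the comment notes this instanceof check would also reject other order-preserving policies such as NoMerge.

```java
// Stand-ins for Lucene's merge-policy classes (illustrative only).
class MergePolicy {}
class LogMergePolicy extends MergePolicy {}

public class MergePolicyCheck {
    /**
     * Reject a merge policy that is not known to preserve segment order.
     * The taxonomy writer relies on in-order merging, so failing fast here
     * is much cheaper than debugging wrong facets later.
     */
    static void validateOrderPreserving(MergePolicy mp) {
        if (!(mp instanceof LogMergePolicy)) {
            throw new IllegalArgumentException(
                "taxonomy writer requires an order-preserving merge policy, got: "
                + mp.getClass().getSimpleName());
        }
    }

    public static void main(String[] args) {
        validateOrderPreserving(new LogMergePolicy()); // accepted
        try {
            validateOrderPreserving(new MergePolicy());
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The trade-off is exactly the one discussed: the check is simple and early, but it tests the policy's type rather than its actual merge-ordering behavior.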
[jira] [Commented] (LUCENE-3596) DirectoryTaxonomyWriter extensions should be able to set internal index writer config attributes such as info stream
[ https://issues.apache.org/jira/browse/LUCENE-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13157605#comment-13157605 ] Doron Cohen commented on LUCENE-3596: - bq. and getIWC (if you intend to add it). Yes that's what I would like to add. These docs are missing then anyhow, with or without getIWC(). This added extensibility is useful, although behavior regarding the info stream differs between trunk and 3x - i.e. in 3x one can set that stream with the current extension point as well. DirectoryTaxonomyWriter extensions should be able to set internal index writer config attributes such as info stream Key: LUCENE-3596 URL: https://issues.apache.org/jira/browse/LUCENE-3596 Project: Lucene - Java Issue Type: Improvement Components: modules/facet Reporter: Doron Cohen Priority: Minor Current protected openIndexWriter(Directory directory, OpenMode openMode) does not provide access to the IWC it creates. So extensions must reimplement this method completely in order to set e.g. the info stream for the internal index writer. This came up in [user question: Taxonomy indexer debug |http://lucene.472066.n3.nabble.com/Taxonomy-indexer-debug-td3533341.html]
[jira] [Commented] (LUCENE-3588) Try harder to prevent SIGSEGV on cloned MMapIndexInputs
[ https://issues.apache.org/jira/browse/LUCENE-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13155897#comment-13155897 ] Doron Cohen commented on LUCENE-3588: - Patch (last one) works well for me - the new test fails without the fix and passes with the fix. It relies on shallow cloning of 'clones' - and so would break if WHM starts to implement Cloneable for some reason, but then the 'assert clone.clones == this.clones' in clone() guarantees early detection of this in the tests, cool. Try harder to prevent SIGSEGV on cloned MMapIndexInputs --- Key: LUCENE-3588 URL: https://issues.apache.org/jira/browse/LUCENE-3588 Project: Lucene - Java Issue Type: Improvement Components: core/store Affects Versions: 3.4, 3.5 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.6, 4.0 Attachments: LUCENE-3588-simpler.patch, LUCENE-3588-simpler.patch, LUCENE-3588-simpler.patch, LUCENE-3588.patch, LUCENE-3588.patch, LUCENE-3588.patch We are unmapping mmapped byte buffers which is disallowed by the JDK, because it has the risk of SIGSEGV when you access the mapped byte buffer after unmapping. We currently prevent this for the main IndexInput by setting its buffer to null, so we NPE if somebody tries to access the underlying buffer. I recently fixed also the stupid curBuf (LUCENE-3200) by setting to null. The big problem are cloned IndexInputs which are generally not closed. Those still contain references to the unmapped ByteBuffer, which lead to SIGSEGV easily. The patch from Mike in LUCENE-3439 prevents most of this in Lucene 3.5, but it's still not 100% safe (as it uses non-volatiles). This patch will fix the remaining issues by also setting the buffers of clones to null when the original is closed. The trick is to record weak references of all clones created and close them together with the original. This uses a ConcurrentHashMap<WeakReference<MMapIndexInput>,?>
as the store, with the logic borrowed from WeakHashMap to clean up the GCed references (using ReferenceQueue). If we respin 3.5, we should maybe get this in as well.
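A toy version of the clone-tracking trick described above - the byte[] field stands in for the mapped ByteBuffer, and the ReferenceQueue cleanup and concurrency details of the real patch are omitted. The point is that close() on the original nulls every live clone's buffer, so stale access fails fast with an NPE instead of risking a SIGSEGV on unmapped memory.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

public class TrackedInput implements Cloneable {
    byte[] buffer = new byte[16];   // stand-in for the mapped buffer

    // Weak keys let GCed clones vanish from the map automatically.
    final Map<TrackedInput, Boolean> clones =
        Collections.synchronizedMap(new WeakHashMap<>());

    @Override
    public TrackedInput clone() {
        try {
            TrackedInput c = (TrackedInput) super.clone();
            // Shallow clone shares the 'clones' map (as the comment above
            // notes); register the clone with the shared map.
            clones.put(c, Boolean.TRUE);
            return c;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }

    public void close() {
        buffer = null;               // main input fails fast from now on
        for (TrackedInput c : clones.keySet()) {
            c.buffer = null;         // ...and so does every live clone
        }
        clones.clear();
    }
}
```

After close(), any surviving clone that still tries to read sees a null buffer immediately, which is exactly the detectable failure mode the patch aims for.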
[jira] [Commented] (LUCENE-3573) TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern
[ https://issues.apache.org/jira/browse/LUCENE-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13151188#comment-13151188 ] Doron Cohen commented on LUCENE-3573: - Hmm, now that there is a test for LTW.rollback(), my changes fail LTW's testRollback(), because LTW.close() now may call IW.commit(Map) (which it did not do before my changes). To fix this: - added a private doClose() which closes IW and nullifies it, and calls closeResources(). - rollback() calls doClose() instead of close(). Also, rollback() is now synchronized. TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern Key: LUCENE-3573 URL: https://issues.apache.org/jira/browse/LUCENE-3573 Project: Lucene - Java Issue Type: Bug Components: modules/facet Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor Attachments: LUCENE-3573.patch, LUCENE-3573.patch When recreating the taxonomy index, TR's assumption that categories are only added does not hold anymore. As a result, calling TR.refresh() will be incorrect at best, but usually throw an AIOOBE.
[jira] [Commented] (LUCENE-3573) TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern
[ https://issues.apache.org/jira/browse/LUCENE-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149684#comment-13149684 ] Doron Cohen commented on LUCENE-3573: - I agree about keeping the same notions as IR. bq. returns null (no changes, or the taxonomy wasn't recreated) In fact I was thinking of a different contract. So we have two approaches here for the returned value: * Option A: ## *new TR* - if the taxonomy was recreated. ## *null* - if the taxonomy was either not modified or just grew. * Option B: ## *new TR* - if the taxonomy was modified (either recreated or just grew) ## *null* - if the taxonomy was not modified. Option A is simpler to implement, but I think it has two drawbacks: * it is confusingly different from that of IR * the fact that the TR was refreshed is hidden from the caller. Option B is a bit more involved to implement: * would need to copy arrays' data from the old TR to the new one in case the taxonomy only grew I started to implement option B but I am now rethinking this... bq. Was there any reason to add it to TestTaxonomyCombined? Good point, I should probably move this to TestDirectoryTaxonomyReader. TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern Key: LUCENE-3573 URL: https://issues.apache.org/jira/browse/LUCENE-3573 Project: Lucene - Java Issue Type: Bug Components: modules/facet Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor Attachments: LUCENE-3573.patch When recreating the taxonomy index, TR's assumption that categories are only added does not hold anymore. As a result, calling TR.refresh() will be incorrect at best, but usually throw an AIOOBE.
[jira] [Commented] (LUCENE-3573) TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern
[ https://issues.apache.org/jira/browse/LUCENE-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149687#comment-13149687 ] Doron Cohen commented on LUCENE-3573: - One more thing - In approach B, the fact that the taxonomy just grew simply allows an optimization (read only the new ordinals), and so it is not a part of the API logic, and the only logic is - was the taxonomy modified or not. - In approach A, this fact is part of the API logic. TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern Key: LUCENE-3573 URL: https://issues.apache.org/jira/browse/LUCENE-3573 Project: Lucene - Java Issue Type: Bug Components: modules/facet Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor Attachments: LUCENE-3573.patch When recreating the taxonomy index, TR's assumption that categories are only added does not hold anymore. As a result, calling TR.refresh() will be incorrect at best, but usually throw an AIOOBE.
[jira] [Commented] (LUCENE-3573) TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern
[ https://issues.apache.org/jira/browse/LUCENE-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150279#comment-13150279 ] Doron Cohen commented on LUCENE-3573: - bq. So in fact, let's not call it openIfChanged, because may not be meaningful. Yes this bothered me too. bq. so maybe refreshIfChanged? ... let's stick to refresh() (but...) The current refresh impl is efficient in that (1) arrays only grow if needed and (2) caches are only cleaned of 'invalid ordinals'. In that, it relies on the fact that the taxonomy can only grow (unless it is recreated, hence this issue). So I now think it would be best to modify refresh() slightly - in case it detects that the taxonomy was recreated, it will throw a new (checked) exception - telling the application that this TR cannot be refreshed, but the app can open a new TR. This way there is no 3-way logic - either the TR was refreshed or it was not. And while we are at it, refresh() is void. I think it would be useful to return a boolean, indicating whether any refresh took place. TaxonomyReader.refresh() is broken, replace its logic with reopen(), following IR.reopen pattern Key: LUCENE-3573 URL: https://issues.apache.org/jira/browse/LUCENE-3573 Project: Lucene - Java Issue Type: Bug Components: modules/facet Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor Attachments: LUCENE-3573.patch When recreating the taxonomy index, TR's assumption that categories are only added does not hold anymore. As a result, calling TR.refresh() will be incorrect at best, but usually throw an AIOOBE.
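The contract proposed above (refresh() returns whether anything changed, and throws a checked exception when it detects the taxonomy was recreated, so the caller must open a fresh reader) could look roughly like the sketch below. The epoch and size fields are hypothetical stand-ins for however DTR would actually detect recreation and growth; the names are illustrative, not Lucene API.

```java
// Checked exception: "this TR cannot be refreshed, open a new one".
class TaxonomyRecreatedException extends Exception {
    TaxonomyRecreatedException(String msg) { super(msg); }
}

public class TaxoReaderSketch {
    private final long indexEpoch;  // changes when the taxonomy is recreated
    private int size;               // number of categories seen so far

    TaxoReaderSketch(long epoch, int size) {
        this.indexEpoch = epoch;
        this.size = size;
    }

    /** @return true if new categories were picked up, false if unchanged. */
    boolean refresh(long currentEpoch, int currentSize)
            throws TaxonomyRecreatedException {
        if (currentEpoch != indexEpoch) {
            // Grow-only assumption broken: refuse instead of corrupting state.
            throw new TaxonomyRecreatedException(
                "taxonomy was recreated; open a new reader");
        }
        if (currentSize == size) {
            return false;           // nothing changed, nothing to do
        }
        size = currentSize;         // grow-only update (read new ordinals)
        return true;
    }
}
```

This keeps the two-way logic argued for above: the reader either refreshed or it did not, and recreation is an explicit, checked failure rather than a silent third state.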
[jira] [Commented] (LUCENE-3564) rename IndexWriter.rollback to .rollbackAndClose
[ https://issues.apache.org/jira/browse/LUCENE-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13145045#comment-13145045 ] Doron Cohen commented on LUCENE-3564: - My personal preference for this API is the current simple and short name *rollback()*. rename IndexWriter.rollback to .rollbackAndClose Key: LUCENE-3564 URL: https://issues.apache.org/jira/browse/LUCENE-3564 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.5, 4.0 Spinoff from LUCENE-3454, where Shai noticed that rollback is trappy since it [unexpectedly] closes the IW. I think we should rename it to rollbackAndClose.
[jira] [Commented] (LUCENE-3454) rename optimize to a less cool-sounding name
[ https://issues.apache.org/jira/browse/LUCENE-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13145047#comment-13145047 ] Doron Cohen commented on LUCENE-3454: - bq. Perhaps I am the only one, but I find these ifNeeded, maybeThis, maybeThat method names so ugly. I prefer JavaDoc for trying to catch the subtleties. I feel that way too. But a name change here seems in order, because as pointed out above, there is an issue with the current catchy name *optimize()*. My personal preference among the names suggested above is Mike's last one: *forceMerge(int)*: - it describes what's done - does not suggest it does wonders - requires the caller to think twice, since they are deciding to force a certain behavior rename optimize to a less cool-sounding name Key: LUCENE-3454 URL: https://issues.apache.org/jira/browse/LUCENE-3454 Project: Lucene - Java Issue Type: Improvement Affects Versions: 3.4, 4.0 Reporter: Robert Muir Assignee: Michael McCandless Attachments: LUCENE-3454.patch I think users see the name optimize and feel they must do this, because who wants a suboptimal system? But this probably just results in wasted time and resources. Maybe rename to collapseSegments or something?
[jira] [Commented] (LUCENE-3506) tests for verifying that assertions are enabled do nothing since they ignore AssertionError
[ https://issues.apache.org/jira/browse/LUCENE-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13136823#comment-13136823 ] Doron Cohen commented on LUCENE-3506: - bq. I just committed this change to the IntelliJ IDEA configuration Thanks for fixing for IntelliJ! tests for verifying that assertions are enabled do nothing since they ignore AssertionError --- Key: LUCENE-3506 URL: https://issues.apache.org/jira/browse/LUCENE-3506 Project: Lucene - Java Issue Type: Bug Components: general/test Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor Attachments: LUCENE-3506.patch, LUCENE-3506.patch Follow-up from LUCENE-3501
[jira] [Commented] (LUCENE-3506) tests for verifying that assertions are enabled do nothing since they ignore AssertionError
[ https://issues.apache.org/jira/browse/LUCENE-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13136826#comment-13136826 ] Doron Cohen commented on LUCENE-3506: - {quote} bq.Also, we've often done performance tests as unit tests in the past. Is there an easy way to disable this assertions enabled test? You can also enable assertions just for the class/package which checks if assertions are enabled, Yonik. This should make the check pass and disable all other assertions (for benchmarking). I don't remember the syntax off the top of my head though. {quote} Yonik, is this sufficient for running the perf tests? Otherwise I can add a -D flag for disabling testing this in LTC. tests for verifying that assertions are enabled do nothing since they ignore AssertionError --- Key: LUCENE-3506 URL: https://issues.apache.org/jira/browse/LUCENE-3506 Project: Lucene - Java Issue Type: Bug Components: general/test Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor Attachments: LUCENE-3506.patch, LUCENE-3506.patch Follow-up from LUCENE-3501
[jira] [Commented] (LUCENE-3506) tests for verifying that assertions are enabled do nothing since they ignore AssertionError
[ https://issues.apache.org/jira/browse/LUCENE-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13136915#comment-13136915 ] Doron Cohen commented on LUCENE-3506: - For easier perf testing I added a -D flag to tell LTC not to fail each and every test if Java assertions are not enabled: {noformat} -Dtests.asserts.gracious=true {noformat} (Tests requiring Java assertions - e.g. TestAssertions - will still fail, on purpose.) - r1189655 - trunk - r1189663 - 3x tests for verifying that assertions are enabled do nothing since they ignore AssertionError --- Key: LUCENE-3506 URL: https://issues.apache.org/jira/browse/LUCENE-3506 Project: Lucene - Java Issue Type: Bug Components: general/test Reporter: Doron Cohen Assignee: Doron Cohen Priority: Minor Attachments: LUCENE-3506.patch, LUCENE-3506.patch Follow-up from LUCENE-3501
[jira] [Commented] (LUCENE-3506) tests for verifying that assertions are enabled do nothing since they ignore AssertionError
[ https://issues.apache.org/jira/browse/LUCENE-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13135086#comment-13135086 ]

Doron Cohen commented on LUCENE-3506:
-------------------------------------

bq. (Whereas today if you run that test w/o assertions you get a failure, albeit a confusing one).

Actually, today when you run the tests - with assertions or without them - you get no failures at all, which is what I was trying to fix here (unless I missed something seriously). This is because the original tests, after deciding to fail, invoked fail(); this threw AssertionError, which was then ignored as part of their wrong logic.

bq. I'm confused here - the changes to TestSegmentMerger look like they'll allow the test to pass when assertions are disabled?

Right. I fixed it such that *only if* assertions are enabled do the tests verify that the expected assertion errors are not thrown, so they allow you to run the tests also without enabling assertions. See my comment above re only one test. I take it that this kind of flexibility is not required, so I will change it so that these tests fail if assertions are not enabled.

bq. The other day I committed an accidental change to common-build that disabled assertions, and it was a little confusing to track down.

I see, so we make the entire test framework fail if assertions are not enabled. I'll update the patch.
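The wrong logic being discussed can be shown as a minimal, self-contained sketch. The class and method names here are hypothetical (the real tests use JUnit's fail() inside Lucene test classes); the point is only the catch-swallows-the-failure pattern:

```java
// Hypothetical sketch of the broken pattern: a check that is supposed to fail
// when Java assertions are disabled, but swallows the very AssertionError
// that signals the failure.
public class AssertionCheckDemo {

    // Broken: returns true whether or not the JVM runs with -ea.
    public static boolean brokenAssertionsEnabledCheck() {
        try {
            assert false; // throws AssertionError only when -ea is set
            // Assertions are disabled, so we "fail" - but JUnit 3's fail()
            // also throws AssertionError...
            throw new AssertionError("assertions are not enabled");
        } catch (AssertionError e) {
            // ...and this catch silently swallows both cases, hiding the failure.
            return true;
        }
    }

    // Correct: use the assert's side effect; no AssertionError to mis-catch.
    public static boolean assertionsEnabled() {
        boolean enabled = false;
        assert enabled = true; // assignment only evaluated when -ea is set
        return enabled;
    }

    public static void main(String[] args) {
        System.out.println("broken check says enabled: " + brokenAssertionsEnabledCheck());
        System.out.println("assertions actually enabled: " + assertionsEnabled());
    }
}
```

Running with and without -ea shows the broken check reporting "enabled" in both cases, while the side-effect version tracks the real JVM setting.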
[jira] [Commented] (LUCENE-3501) random sampler is not random (and so facet SamplingWrapperTest occasionally fails)
[ https://issues.apache.org/jira/browse/LUCENE-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13124997#comment-13124997 ]

Doron Cohen commented on LUCENE-3501:
-------------------------------------

Fixed in trunk: r1181760

Shai's comment on catching AssertionError made me search for other cases of catching this error in Lucene. A few such cases exist, and they all seem wrong: they call fail() when assertions turn out not to be enabled, but then fail to detect that failure, since they silently ignore the AssertionError thrown by fail() itself. Opened LUCENE-3506 for this.

random sampler is not random (and so facet SamplingWrapperTest occasionally fails)
----------------------------------------------------------------------------------

                Key: LUCENE-3501
                URL: https://issues.apache.org/jira/browse/LUCENE-3501
            Project: Lucene - Java
         Issue Type: Bug
         Components: modules/facet
           Reporter: Doron Cohen
           Assignee: Doron Cohen
           Priority: Minor
        Attachments: LUCENE-3501.patch

RandomSample is not random at all: it does not even import java.util.Random, and its behavior is deterministic. In addition, the test testCountUsingSamping() never retries as it was supposed to (for taking care of the hoped-for randomness).
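The fix direction implied by the issue description - actually using java.util.Random so sampling is random across seeds yet reproducible for a fixed seed - can be sketched with illustrative names (this is not the real RandomSample API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sampler sketch: selects roughly ratio of the input doc ids
// using an injected java.util.Random, so a test can seed it (e.g. from the
// seed printed by -Dtests.seed) and reproduce a failing sample exactly.
public class RandomSampleSketch {

    public static List<Integer> sample(List<Integer> docIds, double ratio, Random random) {
        List<Integer> sampled = new ArrayList<>();
        for (int doc : docIds) {
            if (random.nextDouble() < ratio) { // independent coin flip per doc
                sampled.add(doc);
            }
        }
        return sampled;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) docs.add(i);
        List<Integer> a = sample(docs, 0.1, new Random(42));
        List<Integer> b = sample(docs, 0.1, new Random(42));
        System.out.println(a.equals(b)); // same seed, same sample: reproducible
    }
}
```

Injecting the Random (rather than constructing one internally) is what lets a retrying test like testCountUsingSamping() try a fresh seed on each retry while still being reproducible from a logged seed.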
[jira] [Commented] (LUCENE-3501) random sampler is not random (and so facet SamplingWrapperTest occasionally fails)
[ https://issues.apache.org/jira/browse/LUCENE-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13123703#comment-13123703 ]

Doron Cohen commented on LUCENE-3501:
-------------------------------------

The error (from Jenkins) was:
{noformat}
junit.framework.AssertionFailedError: Results are not the same!
	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149)
	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51)
	at org.apache.lucene.facet.FacetTestBase.assertSameResults(FacetTestBase.java:316)
	at org.apache.lucene.facet.search.sampling.BaseSampleTestTopK.assertSampling(BaseSampleTestTopK.java:93)
	at org.apache.lucene.facet.search.sampling.BaseSampleTestTopK.testCountUsingSamping(BaseSampleTestTopK.java:76)
	at org.apache.lucene.util.LuceneTestCase$2$1.evaluate(LuceneTestCase.java:610)

reproduce with: ant test -Dtestcase=SamplingWrapperTest -Dtestmethod=testCountUsingSamping -Dtests.seed=39c6b88dcada2192:-cf936a4278714b1:-770b2814b4a6acd7
{noformat}
[jira] [Commented] (LUCENE-3262) Facet benchmarking
[ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13123714#comment-13123714 ]

Doron Cohen commented on LUCENE-3262:
-------------------------------------

bq. I reduced those to 1-20 per document with depth of 1-3 and got results I could live with.

I agree; I tried this too now and the comparison is more reasonable. Perhaps what reasonable numbers are (for #facets/doc and their depth) is debatable, but I agree that 200 facets per document is too many. Changing the defaults to 20/3 and preparing to commit.

Facet benchmarking
------------------

                Key: LUCENE-3262
                URL: https://issues.apache.org/jira/browse/LUCENE-3262
            Project: Lucene - Java
         Issue Type: New Feature
         Components: modules/benchmark, modules/facet
           Reporter: Shai Erera
           Assignee: Doron Cohen
        Attachments: CorpusGenerator.java, LUCENE-3262.patch, LUCENE-3262.patch, LUCENE-3262.patch, TestPerformanceHack.java

A spin-off from LUCENE-3079. We should define a few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implementing FST-based caches, etc.). Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point. We've also done some preliminary work on extending Benchmark for faceting, so I'll attach that here as well. We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.
[jira] [Commented] (LUCENE-3262) Facet benchmarking
[ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13123735#comment-13123735 ]

Doron Cohen commented on LUCENE-3262:
-------------------------------------

Committed to 3x in r1180637, thanks Gilad!

Now porting to trunk; it is more involved than anticipated, because of contrib/modules differences. Managed to make the tests pass, and the benchmark alg of choice to run. However, I noticed that in 3x that alg - when indexing Reuters - added the entire collection, that is 21578 docs, while in trunk it only added about 400 docs. Might be something in my set-up, digging...
[jira] [Commented] (LUCENE-3501) random sampler is not random (and so facet SamplingWrapperTest occasionally fails)
[ https://issues.apache.org/jira/browse/LUCENE-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13123744#comment-13123744 ]

Doron Cohen commented on LUCENE-3501:
-------------------------------------

Thanks for reviewing, Shai! I'll change as you propose (confirming your understanding) and commit tomorrow.
[jira] [Commented] (LUCENE-3262) Facet benchmarking
[ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13122580#comment-13122580 ]

Doron Cohen commented on LUCENE-3262:
-------------------------------------

bq. changes entry

Right, I always forget to include it in the patch and add it only afterward - I should change that...

Also, I am not comfortable with the use of a config property in AddDocTask to tell it that facets should be added - it seems too implicit to me, all of a sudden... So I think it would be better to refactor the doc creation in AddDoc into a method, and add an AddFacetedDocTask that extends AddDoc and overrides the creation of the doc to be added, calling super and then adding the facets into it.
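The proposed refactoring - factor doc creation into an overridable method, then have the faceted task call super and stack facets on top - can be sketched with simplified stand-in classes (the real benchmark tasks build Lucene Documents from a PerfRunData; all names below are illustrative only):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the subclass-override refactoring with hypothetical stand-ins
// for the benchmark's Document/task classes.
public class FacetTaskSketch {

    static class Doc {
        final List<String> fields = new ArrayList<>();
    }

    static class AddDocTask {
        // Doc creation factored out so subclasses can extend rather than copy it.
        protected Doc createDocument() {
            Doc doc = new Doc();
            doc.fields.add("body");
            return doc;
        }

        public Doc run() {
            return createDocument();
        }
    }

    static class AddFacetedDocTask extends AddDocTask {
        @Override
        protected Doc createDocument() {
            Doc doc = super.createDocument();  // reuse the plain-doc creation
            doc.fields.add("facet:root/child"); // then add the facet fields
            return doc;
        }
    }

    public static void main(String[] args) {
        System.out.println(new AddDocTask().run().fields);        // [body]
        System.out.println(new AddFacetedDocTask().run().fields); // [body, facet:root/child]
    }
}
```

The override keeps the facets-vs-no-facets decision explicit in the task name rather than hidden behind a config property.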
[jira] [Commented] (LUCENE-3262) Facet benchmarking
[ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13122598#comment-13122598 ]

Doron Cohen commented on LUCENE-3262:
-------------------------------------

Actually, since the doc is created at setup(), it is sufficient to make the doc protected (it was private). Also, that with.facets property is useful for comparisons, so I kept it (now used only in AddFacetedDocTask) but changed its default to true.
[jira] [Commented] (LUCENE-3262) Facet benchmarking
[ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13123134#comment-13123134 ]

Doron Cohen commented on LUCENE-3262:
-------------------------------------

bq. Someone can use AddFacetedDocTask w/ and w/o facets? What for?

It is useful for specifying the property like this:
{code}
with.facets=facets:true:false
...
{ MAddDocs AddFacetedDoc : 400
{code}
and then getting in the report something like this:
{noformat}
Report sum by Prefix (MAddDocs) and Round (4 about 4 out of 42)
Operation      round   facets   runCnt   recsPerRun      rec/s   elapsedSec
MAddDocs_400       0     true        1          400     246.61         1.62
MAddDocs_400 -     1 -  false -      1 -        400 - 1,801.80 - -     0.22
MAddDocs_400       2     true        1          400     412.80         0.97
MAddDocs_400 -     3 -  false -      1 -        400 - 2,139.04 - -     0.19
{noformat}
[jira] [Commented] (LUCENE-3262) Facet benchmarking
[ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13120003#comment-13120003 ]

Doron Cohen commented on LUCENE-3262:
-------------------------------------

I am working on a patch for this, much along the lines of the Solr benchmark patch in SOLR-2646. Currently the direction is:
- Add to PerfRunData:
-- Taxonomy Directory
-- Taxonomy Writer
-- Taxonomy Reader
- Add tasks for manipulating facets and taxonomies:
-- create/open/commit/close Taxonomy Index
-- open/close Taxonomy Reader
-- AddDoc with facets
- FacetDocMaker will also build the categories into the document
- FacetSource will bring back categories to be added to the current doc
- ReadTask will be extended to also support faceted search. This is different from the Solr benchmark approach, where a SolrSearchTask does not extend ReadTask but rather extends PerfTask. Not sure yet if this is the way to go - still work to be done here.

Should have a starting patch in a day or two.