Problems installing PyLucene on Ubuntu 12.04
Hello,

I'm facing problems installing PyLucene on an Ubuntu 12.04 server (32-bit). Perhaps someone can give me some helpful advice? I've followed the official installation instructions [1]. It seems that building and installing JCC works fine. Also, running make to build PyLucene seems to succeed. But if I run make test, I get the errors attached below.

Thank you in advance!
Uwe

1: http://lucene.apache.org/pylucene/install.html

Output of make test (shortened):

[...]
==
ERROR: test_FieldEnumeration (__main__.PythonDirectoryTests)
--
Traceback (most recent call last):
  File "/root/pylucene-4.6.1-1/test/test_PyLucene.py", line 236, in test_FieldEnumeration
    self.test_indexDocument()
  File "/root/pylucene-4.6.1-1/test/test_PyLucene.py", line 84, in test_indexDocument
    self.closeStore(store, writer)
  File "test/test_PythonDirectory.py", line 255, in closeStore
    for arg in args:
  File "test/test_PythonDirectory.py", line 255, in closeStore
    for arg in args:
  File "/usr/lib/python2.7/bdb.py", line 48, in trace_dispatch
    return self.dispatch_line(frame)
  File "/usr/lib/python2.7/bdb.py", line 67, in dispatch_line
    if self.quitting: raise BdbQuit
BdbQuit
==
ERROR: test_IncrementalLoop (__main__.PythonDirectoryTests)
--
Traceback (most recent call last):
  File "test/test_PythonDirectory.py", line 268, in test_IncrementalLoop
    self.test_indexDocument()
  File "/root/pylucene-4.6.1-1/test/test_PyLucene.py", line 84, in test_indexDocument
    self.closeStore(store, writer)
  File "test/test_PythonDirectory.py", line 255, in closeStore
    for arg in args:
  File "test/test_PythonDirectory.py", line 255, in closeStore
    for arg in args:
  File "/usr/lib/python2.7/bdb.py", line 48, in trace_dispatch
    return self.dispatch_line(frame)
  File "/usr/lib/python2.7/bdb.py", line 67, in dispatch_line
    if self.quitting: raise BdbQuit
BdbQuit
==
ERROR: test_getFieldInfos (__main__.PythonDirectoryTests)
--
Traceback (most recent call last):
  File "/root/pylucene-4.6.1-1/test/test_PyLucene.py", line 282, in test_getFieldInfos
    self.test_indexDocument()
  File "/root/pylucene-4.6.1-1/test/test_PyLucene.py", line 84, in test_indexDocument
    self.closeStore(store, writer)
  File "test/test_PythonDirectory.py", line 255, in closeStore
    for arg in args:
  File "test/test_PythonDirectory.py", line 255, in closeStore
    for arg in args:
  File "/usr/lib/python2.7/bdb.py", line 48, in trace_dispatch
    return self.dispatch_line(frame)
  File "/usr/lib/python2.7/bdb.py", line 67, in dispatch_line
    if self.quitting: raise BdbQuit
BdbQuit
==
ERROR: test_indexDocument (__main__.PythonDirectoryTests)
--
Traceback (most recent call last):
  File "/root/pylucene-4.6.1-1/test/test_PyLucene.py", line 84, in test_indexDocument
    self.closeStore(store, writer)
  File "test/test_PythonDirectory.py", line 255, in closeStore
    for arg in args:
  File "test/test_PythonDirectory.py", line 255, in closeStore
    for arg in args:
  File "/usr/lib/python2.7/bdb.py", line 48, in trace_dispatch
    return self.dispatch_line(frame)
  File "/usr/lib/python2.7/bdb.py", line 67, in dispatch_line
    if self.quitting: raise BdbQuit
BdbQuit
==
ERROR: test_indexDocumentWithText (__main__.PythonDirectoryTests)
--
Traceback (most recent call last):
  File "/root/pylucene-4.6.1-1/test/test_PyLucene.py", line 112, in test_indexDocumentWithText
    self.closeStore(store, writer)
  File "test/test_PythonDirectory.py", line 255, in closeStore
    for arg in args:
  File "test/test_PythonDirectory.py", line 255, in closeStore
    for arg in args:
  File "/usr/lib/python2.7/bdb.py", line 48, in trace_dispatch
    return self.dispatch_line(frame)
  File "/usr/lib/python2.7/bdb.py", line 67, in dispatch_line
    if self.quitting: raise BdbQuit
BdbQuit
==
ERROR: test_indexDocumentWithUnicodeText (__main__.PythonDirectoryTests)
--
Traceback (most recent call last):
  File "/root/pylucene-4.6.1-1/test/test_PyLucene.py", line 143, in test_indexDocumentWithUnicodeText
    self.closeStore(store, writer)
  File "test/test_PythonDirectory.py", line 255, in closeStore
    for arg in args:
  File "test/test_PythonDirectory.py", line 255, in closeStore
    for arg in
[jira] [Updated] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Audenaerde updated LUCENE-5476:
-----------------------------------
    Attachment: LUCENE-5476.patch

I have a patch ready that implements the random sampling using an override of {{getMatchingDocs()}}. It passes the test, so it should be OK :). It is slower, however (only a 3x speedup), but maybe there is room for optimization?

Exact:   168 ms
Sampled:  55 ms

Facet sampling
--------------
    Key: LUCENE-5476
    URL: https://issues.apache.org/jira/browse/LUCENE-5476
    Project: Lucene - Core
    Issue Type: Improvement
    Reporter: Rob Audenaerde
    Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java

With LUCENE-5339, facet sampling disappeared. When trying to display facet counts on large datasets (10M documents), counting facets is rather expensive, as all the hits are collected and processed. Sampling greatly reduced this cost and thus provided a nice speedup. Could it be brought back?
[jira] [Commented] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922234#comment-13922234 ]

Shai Erera commented on LUCENE-5476:
------------------------------------

Thanks Rob. A few comments:

* I don't think that we need both totalHits and segmentHits. I know it may not look expensive, but if you think about these ++ for millions of hits, they add up. Instead, I think we should stick w/ the original per-segment totalHits and in RandomSamplingFC do a first pass to sum up the global totalHits.
** With that, I think no changes are needed to FacetsCollector?

About RandomSamplingFacetsCollector:

* I think we should fix the class jdoc's last Note as follows: "Note: if you use a counting {@link Facets} implementation, you can fix the sampled counts by calling ..."
* Also, I think instead of correctFacetCounts we should call it amortizeFacetCounts or something like that. We do not implement here the exact facet-counting method that existed before.
* I see that you removed sampleRatio, and now the ratio is computed as threshold/totalDocs, but I think that's ... wrong? I.e. if threshold=10 and totalHits=1000, I'll still get only 10 documents. But this is not what threshold means.
** I think we should have minSampleSize, below which we don't sample at all (that's the _threshold_).
** sampleRatio (e.g. 1%) is used only if totalHits > minSampleSize, and even then, we make sure that we sample at least minSampleSize.
** If we will have maxSampleSize as well, we can take that into account too, but it's OK if we don't do this in this issue.
* createSample seems to be memory-less -- i.e. it doesn't carry over the bin residue to the next segment. Not sure if it's critical though, but it might affect the total sample size. If you feel like getting closer to the optimum, want to fix it? Otherwise, can you please drop a TODO?
* Also, do you want to test using WAH8DocIdSet instead of FixedBitSet for the sampled docs? If not, could you please drop a TODO that we could use a more efficient bitset impl, since it's a sparse vector?

About the test:

* Could you remove the 's' from "collectors" in the test name?
* Could you move to numDocs being a random number -- something like atLeast(8000)?
* I don't mean to nitpick, but if you obtain an NRT reader, there's no need to commit() :)
* Make the two collector instances take 100/10% of the numDocs when you fix it.
* Maybe use random.nextInt(10) for the facets instead of alternating sequentially?
* I don't understand how you know that numChildren=5 when you ask for the 10 top children. Isn't it possible that w/ some random seed the number of children will be different?
** In fact, I think the random collectors should be initialized w/ a random seed that depends on the test? Currently they aren't, and so always use 0xdeadbeef?
* You have some sops left at the end of the test.
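[Editor's note: a minimal sketch of the sample-size rule Shai describes above -- minSampleSize as a floor below which no sampling happens, sampleRatio applied above it, and an optional maxSampleSize cap. The names follow the discussion, not the actual patch API.]

{code}
// Illustrative sketch only; parameter names are assumptions from the thread.
final class SampleSizeRule {
  private final int minSampleSize;   // below this, don't sample at all
  private final int maxSampleSize;   // optional upper cap
  private final double sampleRatio;  // e.g. 0.01 for 1%

  SampleSizeRule(int minSampleSize, int maxSampleSize, double sampleRatio) {
    this.minSampleSize = minSampleSize;
    this.maxSampleSize = maxSampleSize;
    this.sampleRatio = sampleRatio;
  }

  /** Returns the number of hits to sample, or totalHits if sampling isn't worthwhile. */
  int sampleSize(int totalHits) {
    if (totalHits <= minSampleSize) {
      return totalHits;                          // too few hits: keep them all
    }
    int size = (int) (totalHits * sampleRatio);  // normal case: apply the ratio
    size = Math.max(size, minSampleSize);        // but never go below the floor
    return Math.min(size, maxSampleSize);        // nor above the cap
  }
}
{code}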
[jira] [Assigned] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir reassigned LUCENE-5493:
-----------------------------------
    Assignee: Robert Muir

Rename Sorter, NumericDocValuesSorter, and fix javadocs
-------------------------------------------------------
    Key: LUCENE-5493
    URL: https://issues.apache.org/jira/browse/LUCENE-5493
    Project: Lucene - Core
    Issue Type: Bug
    Reporter: Robert Muir
    Assignee: Robert Muir

It's not clear to users that these are for the super-expert use case of pre-sorting the index. From the names and documentation, they think they should use them instead of Sort/SortField. These need to be renamed or, even better, the API fixed so they aren't public classes.
[jira] [Created] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
Robert Muir created LUCENE-5493:
-----------------------------------
    Summary: Rename Sorter, NumericDocValuesSorter, and fix javadocs
    Key: LUCENE-5493
    URL: https://issues.apache.org/jira/browse/LUCENE-5493
    Project: Lucene - Core
    Issue Type: Bug
    Reporter: Robert Muir
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922280#comment-13922280 ]

Uwe Schindler commented on LUCENE-5493:
---------------------------------------

I agree. Initially, when I read the mail on the mailing list, I was confused about what the user was doing! I had no idea that he tried to mix IndexSorter's APIs with a custom collector, which is likely to fail.
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922283#comment-13922283 ]

Robert Muir commented on LUCENE-5493:
-------------------------------------

If you look at the javadocs for all of Lucene and then read the description of these classes, you can see how the user was easily confused. Because of the problems here, my first plan of attack will be to remove these classes from public view completely. If I cannot do that, I will rename them and add warnings.
[jira] [Commented] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922293#comment-13922293 ]

Rob Audenaerde commented on LUCENE-5476:
----------------------------------------

Thanks Shai, I have fixed the points you noted about the collector. I renamed the sampleThreshold to sampleSize. It currently picks a samplingRatio that will reduce the number of hits to the sampleSize, if the number of hits is greater.

I have a general question about your remarks on the test, besides fixing the obvious (names, commit, sops). Is there a reason to add more randomness to one test? I normally try to test one aspect per unit test. And if I also want to test some other aspect, like random document counts (to test the sample ratio, for example), I add more tests.

{quote}
Make the two collector instances take 100/10% of the numDocs when you fix it.
{quote}

Sorry, I don't get what you mean by this.

{quote}
I don't understand how you know that numChildren=5 when you ask for the 10 top children. Isn't it possible that w/ some random seed the number of children will be different?
In fact, I think the random collectors should be initialized w/ a random seed that depends on the test? Currently they aren't, and so always use 0xdeadbeef?
{quote}

There will be 5 facet values (0, 2, 4, 6 and 8), as only the even documents (i % 10) are hits. There is a REAL small chance that one of the five values will be entirely missed when sampling. But that is {{0.8 (chance not to take a value) ^ 2000 * 5 (any of the five can be missing) ~ 10^-193}}, so that is probably not going to happen :).
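[Editor's note: the arithmetic in that estimate checks out. Each sampled hit misses a given facet value with probability 0.8, so over 2000 samples, union-bounding over the five values:]

\[
5 \cdot 0.8^{2000} = 5 \cdot 10^{2000 \log_{10} 0.8} \approx 5 \cdot 10^{-193.8} \approx 10^{-193}
\]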
[jira] [Updated] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Audenaerde updated LUCENE-5476:
-----------------------------------
    Attachment: LUCENE-5476.patch
[jira] [Comment Edited] (LUCENE-5422) Postings lists deduplication
[ https://issues.apache.org/jira/browse/LUCENE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922311#comment-13922311 ]

Vishmi Money edited comment on LUCENE-5422 at 3/6/14 11:03 AM:
---------------------------------------------------------------

Dmitry Kan, Otis Gospodnetic, thank you very much for your explanations; I now have a clear idea about the two issues. As new documents are added, segments are merged into the index, but if some documents are deleted, we have to keep track of those using skip entries. Meanwhile, we have to preserve or improve the performance of the operation. That is the area discussed in LUCENE-2082. In LUCENE-5422, we want to make synonyms and exact/inexact terms point to the same posting list, while also providing wildcard support. The main objective is to save space. Meanwhile, we also have to avoid index bloat as much as possible. LUCENE-5422 relates to LUCENE-2082 because LUCENE-5422 has to deal with segment merging anyway. This is the idea I got; please let me know if I am wrong about something.

Currently I am following the Lucene 4.7.0 documentation and familiarizing myself with the source code and coding conventions. I also follow Michael McCandless's blog and have read a few related posts, such as "Visualizing Lucene's segment merges" and "Building a new Lucene posting format". I also started reading the Lucene in Action (second edition) book, but then noticed that it covers Lucene 3.0. As Lucene 4.0 switched to a new pluggable codec architecture, I wonder whether all the content of the book is still relevant. Shall I proceed with the reading, or should I only look at the documentation for Lucene 4.0 and above?
[jira] [Commented] (LUCENE-5422) Postings lists deduplication
[ https://issues.apache.org/jira/browse/LUCENE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922311#comment-13922311 ]

Vishmi Money commented on LUCENE-5422:
--------------------------------------

Dmitry Kan, Otis Gospodnetic, thank you very much for your explanations; I now have a clear idea about the two issues. As new documents are added, segments are merged into the index, but if some documents are deleted, we have to keep track of those using skip entries. Meanwhile, we have to preserve or improve the performance of the operation. That is the area discussed in LUCENE-2082. In LUCENE-5422, we want to make synonyms and exact/inexact terms point to the same posting list, while also providing wildcard support. The main objective is to save space. Meanwhile, we also have to avoid index bloat as much as possible. LUCENE-5422 relates to LUCENE-2082 because LUCENE-5422 has to deal with segment merging anyway. This is the idea I got; please let me know if I am wrong about something.

Currently I am following the Lucene 4.7.0 documentation and familiarizing myself with the source code and coding conventions. I also follow Michael McCandless's blog and have read a few related posts, such as "Visualizing Lucene's segment merges" and "Building a new Lucene posting format". I also started reading the Lucene in Action (second edition) book, but then noticed that it covers Lucene 3.0. As Lucene 4.0 switched to a new pluggable codec architecture, I wonder whether all the content of the book is still relevant. Shall I proceed with the reading, or should I only look at the documentation for Lucene 4.0 and above?

Postings lists deduplication
----------------------------
    Key: LUCENE-5422
    URL: https://issues.apache.org/jira/browse/LUCENE-5422
    Project: Lucene - Core
    Issue Type: Improvement
    Components: core/codecs, core/index
    Reporter: Dmitry Kan
    Labels: gsoc2014

The context: http://markmail.org/thread/tywtrjjcfdbzww6f

Robert Muir and I discussed what Robert eventually named "postings lists deduplication" at the Berlin Buzzwords 2013 conference. The idea is to allow multiple terms to point to the same postings list to save space. This can be achieved by a new index codec implementation, but this jira is open to other ideas as well. The application / impact of this is positive for synonyms, exact / inexact terms, leading wildcard support via storing a reversed term, etc. For example, at the moment, when supporting exact (unstemmed) and inexact (stemmed) searches, we store both the unstemmed and stemmed variants of a word form, and that leads to index bloating. That is why we had to remove the leading wildcard support via reversing a token at index and query time, because of the same index size considerations.

Comment from Mike McCandless: Neat idea! Would this idea allow a single term to point to (the union of) N other posting lists? It seems like that's necessary e.g. to handle the exact/inexact case. And then, to produce the DocsAndPositionsEnum you'd need to do a merge sort across those N posting lists? Such a thing might also be doable as a runtime-only wrapper around the postings API (FieldsProducer), if you could at runtime do the reverse expansion (e.g. stem -> all of its surface forms).

Comment from Robert Muir: I think the exact/inexact is trickier (detecting it would be the hard part), and you are right, another solution might work better. But for the reverse wildcard and synonyms situation, it seems we could even detect it on write if we created some hash of the previous term's postings. If the hash matches for the current term, we know it might be a duplicate and would have to actually do the costly check that they are the same. Maybe there are better ways to do it, but it might be a fun posting-format experiment to try.
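[Editor's note: a rough sketch of the write-time duplicate detection Robert describes -- hash each term's postings as they are written, and on a hash match fall back to a full comparison before sharing the list. The types and the hash choice are illustrative assumptions, not a Lucene API.]

{code}
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch: detect candidate duplicate postings lists at write time via hashing.
final class PostingsDedupSketch {
  // Maps a cheap hash of a postings list to the first list that produced it.
  private final Map<Long, int[]> seen = new HashMap<>();

  /**
   * Returns a previously written postings list to share if {@code postings}
   * is an exact duplicate, or null if this list is new. The hash only
   * nominates candidates; equality is always verified before sharing.
   */
  int[] dedup(int[] postings) {
    long h = hash(postings);
    int[] candidate = seen.get(h);
    if (candidate != null && Arrays.equals(candidate, postings)) {
      return candidate;              // verified duplicate: share the existing list
    }
    seen.putIfAbsent(h, postings);   // remember the first list per hash bucket
    return null;
  }

  private static long hash(int[] postings) {
    long h = 1125899906842597L;      // simple rolling hash over doc IDs
    for (int doc : postings) {
      h = 31 * h + doc;
    }
    return h;
  }
}
{code}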
[jira] [Comment Edited] (LUCENE-5422) Postings lists deduplication
[ https://issues.apache.org/jira/browse/LUCENE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922311#comment-13922311 ]

Vishmi Money edited comment on LUCENE-5422 at 3/6/14 11:06 AM:
---------------------------------------------------------------

Dmitry Kan, Otis Gospodnetic, thank you very much for your explanations; I now have a clear idea about the two issues. As new documents are added, segments are merged into the index, but if some documents are deleted, we have to keep track of those using skip entries. Meanwhile, we have to preserve or improve the performance of the operation. That is the area discussed in LUCENE-2082. In LUCENE-5422, we want to make synonyms and exact/inexact terms point to the same posting list, while also providing wildcard support. The main objective is to save space. Meanwhile, we also have to avoid index bloat as much as possible. LUCENE-5422 relates to LUCENE-2082 because LUCENE-5422 has to deal with segment merging anyway. This is the idea I got; please let me know if I am wrong about something.

Currently I am following the Lucene 4.7.0 documentation and familiarizing myself with the source code and coding conventions. I also follow Michael McCandless's blog and have read a few related posts, such as "Visualizing Lucene's segment merges" and "Building a new Lucene posting format". I also started reading the Lucene in Action (second edition) book, but then noticed that it covers Lucene 3.0. As Lucene 4.0 switched to a new pluggable codec architecture, I wonder whether all the content of the book is still relevant. Shall I proceed with the reading, or should I only look at the documentation for Lucene 4.0 and above?
[jira] [Comment Edited] (LUCENE-5422) Postings lists deduplication
[ https://issues.apache.org/jira/browse/LUCENE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922311#comment-13922311 ]

Vishmi Money edited comment on LUCENE-5422 at 3/6/14 11:16 AM:
---------------------------------------------------------------

[~dmitry_key], [~otis], thank you very much for your explanations; I now have a clear idea about the two issues. As new documents are added, segments are merged into the index, but if some documents are deleted, we have to keep track of those using skip entries. Meanwhile, we have to preserve or improve the performance of the operation. That is the area discussed in LUCENE-2082. In LUCENE-5422, we want to make synonyms and exact/inexact terms point to the same posting list, while also providing wildcard support. The main objective is to save space. Meanwhile, we also have to avoid index bloat as much as possible. LUCENE-5422 relates to LUCENE-2082 because LUCENE-5422 has to deal with segment merging anyway. This is the idea I got; please let me know if I am wrong about something.

Currently I am following the Lucene 4.7.0 documentation and familiarizing myself with the source code and coding conventions. I also follow Michael McCandless's blog and have read a few related posts, such as "Visualizing Lucene's segment merges" and "Building a new Lucene posting format". I also started reading the Lucene in Action (second edition) book, but then noticed that it covers Lucene 3.0. As Lucene 4.0 switched to a new pluggable codec architecture, I wonder whether all the content of the book is still relevant. Shall I proceed with the reading, or should I only look at the documentation for Lucene 4.0 and above?
Suggestions about writing / extending QueryParsers
Hi all,

I'm thinking about writing/extending a QueryParser for MLT queries. I've never really looked into that code too much; while I'm doing that now, I'm wondering if anyone has suggestions on how to start with such a topic. Should I write a new grammar for that? Or can I just extend an existing grammar/class?

Thanks in advance,
Tommaso
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922335#comment-13922335 ]

Robert Muir commented on LUCENE-5493:
-------------------------------------

Upon review of the APIs, I think the ideal for the user is to remove the current sorter/comparators, so that when you want to use the sorting merge policy, you just pass it a normal org.apache.lucene.search.Sort. I know it seems a little crazy, but IMO the logic is duplicated. So someone should just be doing:

{code}
Sort sort = new Sort(new SortField("field1", SortField.Type.DOUBLE), new SortField(...));
iwc.setMergePolicy(new SortingMergePolicy(mp, sort));
{code}

This would let people sort in reverse, by doubles/floats, by a combination of fields, by expressions, whatever. And it would deconfuse the API.
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922354#comment-13922354 ]

Michael McCandless commented on LUCENE-5493:
--------------------------------------------

That would be great, if we could just use Sort here! +1 to deconfuse the API.
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922358#comment-13922358 ]

Robert Muir commented on LUCENE-5493:
-------------------------------------

Another reason is that IndexSearcher already knows about Sort, so this would give us a path to better integration here in the future. If we did it right, no additional info/APIs from the user would be needed other than setting the merge policy at index time: indexSearcher.search(query, filter, int, sort), for example, could do the right thing for a segment if the passed-in query-time sort is covered by the sort order of the index. But that's for the future.
[jira] [Commented] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922374#comment-13922374 ]

Gilad Barkai commented on LUCENE-5476:
--------------------------------------

Hi Rob, the patch looks great. A few comments:

* Some imports are not used (o.a.l.u.Bits, o.a.l.s.Collector, o.a.l.s.DocIdSet).
* Perhaps the parameters initialized in the RandomSamplingFacetsCollector c'tor could be made {{final}}.
* XORShift64Random.XORShift64Random() (the default c'tor) is never used. Perhaps it was needed for usability when this was thought to become a core utility and was left in by mistake? Should it be called somewhere?
* {{getMatchingDocs()}}
** When {{!sampleNeeded()}} there's a call to {{super.getMatchingDocs()}}; this may be a redundant method call, as 5 lines above we call it and the code always computes the {{totalHits}} first. Perhaps the original matching docs could be stored as a member? This would also help some implementations of correcting the sampled facet results.
** {{totalHits}} is redundantly computed again in lines 147-152.
* {{needsSampling()}} could perhaps be protected, allowing other criteria for sampling to be added.
* {{createSample()}}
** {{randomIndex}} is initialized to {{0}}, effectively making the first document of every segment's first bin the one selected as the representative of that bin, neglecting the rest of the bin (regardless of the seed). So if a bin is the size of 1000 documents, then there are 999 documents that would always be neglected, regardless of the seed. It may be better to initialize it as {{randomIndex = random.nextInt(binsize)}}, as happens for the 2nd and later bins.
** While creating a new {{MatchingDocs}} with the sampled set, the original {{totalHits}} and original {{scores}} are used. I'm not 100% sure the first is an issue, but any facet accumulation that relies on document scores would be hit by the second, as the {{scores}} (at least per the javadocs) are defined as non-sparse.
[jira] [Commented] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922411#comment-13922411 ]

Rob Audenaerde commented on LUCENE-5476:
----------------------------------------

Thanks,

{quote}
When !sampleNeeded() there's a call to super.getMatchingDocs(); this may be a redundant method call, as 5 lines above we call it and the code always computes the totalHits first. Perhaps the original matching docs could be stored as a member? This would also help some implementations of correcting the sampled facet results.
totalHits is redundantly computed again in lines 147-152.
{quote}

How could I have missed this... Must take a break, I think.

On {{createSample}}: I always take the first document because I did not implement carrying over between segments. If I picked a random index and that index were greater than the number of documents in the segment, the segment would not be sampled at all. That results in 'too few' sampled documents. Always taking the first might result in 'too many', but that gave a better overall distribution and average. Given your argument about not-so-random documents, and the fact that carry-over should not be that hard, I should implement carry-over anyway.
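[Editor's note: a minimal sketch of the bin sampling with carry-over being discussed -- one document kept per bin, the offset within each bin drawn at random, and a partially filled bin at a segment boundary carried into the next segment. All names are illustrative assumptions, not the patch.]

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch: keep one randomly chosen document out of every `binSize` hits,
// carrying the partially filled bin across segment boundaries.
final class BinSamplerSketch {
  private final int binSize;
  private final Random random;
  private int posInBin;    // how far we are into the current bin
  private int pickIndex;   // which position in the current bin gets kept

  BinSamplerSketch(int binSize, long seed) {
    this.binSize = binSize;
    this.random = new Random(seed);
    this.pickIndex = random.nextInt(binSize); // random, not always the first
  }

  /** Feed one segment's hits; returns the hits sampled from this segment. */
  List<Integer> sampleSegment(int[] segmentDocs) {
    List<Integer> kept = new ArrayList<>();
    for (int doc : segmentDocs) {
      if (posInBin == pickIndex) {
        kept.add(doc);                  // this bin's representative
      }
      if (++posInBin == binSize) {      // bin complete: start a new one
        posInBin = 0;
        pickIndex = random.nextInt(binSize);
      }
      // posInBin/pickIndex deliberately survive to the next call, so a bin
      // straddling a segment boundary is still sampled exactly once.
    }
    return kept;
  }
}
{code}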
[jira] [Updated] (SOLR-5265) Add backward compatibility tests to JavaBinCodec's format.
[ https://issues.apache.org/jira/browse/SOLR-5265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Varun Thacker updated SOLR-5265:
--------------------------------
    Attachment: SOLR-5265.patch

An attempt at tackling this Jira:
1. The test will ensure that if we ever change the byte values of existing variables in JavaBinCodec, or if we write a type differently, it will fail.
2. If new types are added to JavaBinCodec, the test case and the binary file will have to be updated again.

There are a couple of nocommits, but I wanted to know if I am on the right track.

Add backward compatibility tests to JavaBinCodec's format.
----------------------------------------------------------
    Key: SOLR-5265
    URL: https://issues.apache.org/jira/browse/SOLR-5265
    Project: Solr
    Issue Type: Test
    Reporter: Adrien Grand
    Priority: Blocker
    Fix For: 4.7
    Attachments: SOLR-5265.patch

Since Solr guarantees backward compatibility of JavaBinCodec's format between releases, we should have tests for it.
RE: Suggestions about writing / extending QueryParsers
Hi Tommaso,

It will depend on how different your target syntax will be. If you extend the classic parser (or, rather, QueryParserBase), there is a fair amount of overhead and extras that you might not want or need. On the other hand, the query syntax and the methods will be familiar to the Lucene community, and there is a large number of test cases already built for you. On the third hand, if you need to modify the low-level parsing stuff, you'll have to be familiar with javacc.

There's the "flexible" family, which should allow for easy modifications, and the "xml" family could offer an easy interface between a custom lexer and a parser. The SimpleQueryParser offers a model of building something fairly simple and yet very elegant from scratch.

In deciding where to start, another consideration might be how easy it will be to integrate at the Solr level. Make sure to include field-based hooks for processing multiterms, prefix and range queries.

For LUCENE-5205, I eventually chose to subclass QueryParserBase, and I had to override a fair amount of code because every terminal had to be a SpanQuery -- most of the queryparser infrastructure is built for traditional queries.

So, what features do you want to add for MLT? What capabilities do you need?

Cheers,
Tim
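[Editor's note: as a concrete starting point for the "extend the classic parser" route, a minimal subclass might look like the following. The MLT-specific behavior is left as a stub, since what it should do is exactly what's under discussion; Lucene 4.x package names are assumed.]

{code}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

// Sketch: hook into the classic parser without touching the javacc grammar.
// Overriding the new*Query factory methods is usually enough when the surface
// syntax stays the same and only query construction changes.
public class MLTQueryParser extends QueryParser {

  public MLTQueryParser(Version matchVersion, String field, Analyzer analyzer) {
    super(matchVersion, field, analyzer);
  }

  @Override
  protected Query newTermQuery(Term term) {
    // Hypothetical extension point: build an MLT-flavored query for this term
    // (e.g. wrap, boost, or expand it) instead of a plain TermQuery.
    return super.newTermQuery(term);
  }
}
{code}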
[jira] [Updated] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-5493:
--------------------------------
    Attachment: LUCENE-5493-poc.patch

Here is a very simple proof-of-concept patch. I made a SortSorter (extends the existing Sorter API and takes o.a.l.s.Sort), removed NumericDocValuesSorter, and replaced it with this more general Sort-Sorter in all tests, and they pass.

So my next step would be to remove public APIs like Sorter/DocMap and make that all internal. SortingMP and EarlyTerminatingSortingCollector would just take Sort directly. BlockJoinSorter needs to be cut over to a regular comparator. And in suggest/ there is a custom comparator... that I think doesn't need to be custom and is just sorting on a DV field.
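[Editor's note: the core of what a Sorter/DocMap computes, as a simplified stand-in -- a permutation of doc IDs ordered by a per-document value (here a plain long[] playing the role of a numeric DocValues field). The real APIs take a reader and, per this proposal, a Sort; this is an illustration of the idea only.]

{code}
import java.util.Arrays;
import java.util.Comparator;

// Simplified sketch of the Sorter/DocMap idea: map old doc IDs to their
// positions in value-sorted order.
final class SortSorterSketch {
  interface DocMap { int oldToNew(int docID); }

  static DocMap sortByValue(final long[] values) {
    Integer[] order = new Integer[values.length];
    for (int i = 0; i < order.length; i++) order[i] = i;
    // Sort old doc IDs by their value; ties keep index order for stability.
    Arrays.sort(order, new Comparator<Integer>() {
      public int compare(Integer a, Integer b) {
        int cmp = Long.compare(values[a], values[b]);
        return cmp != 0 ? cmp : a.compareTo(b);
      }
    });
    final int[] newPositions = new int[values.length];
    for (int newID = 0; newID < order.length; newID++) {
      newPositions[order[newID]] = newID;
    }
    return new DocMap() {
      public int oldToNew(int docID) { return newPositions[docID]; }
    };
  }
}
{code}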
[jira] [Commented] (SOLR-5265) Add backward compatibility tests to JavaBinCodec's format.
[ https://issues.apache.org/jira/browse/SOLR-5265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922452#comment-13922452 ]

Noble Paul commented on SOLR-5265:
----------------------------------

* don't use toString() to compare the actual test values
* don't use FileInputStream; use getClass().getResourceAsStream("/solrj/updateReq_4_5.bin") as done in TestUpdateRequestCodec
* add the rest of the types
* we also need forward-compatibility tests
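[Editor's note: a rough sketch of the kind of back-compat check being discussed -- unmarshal a binary file written once by an older release (loaded from the classpath, per Noble's suggestion) and compare values structurally. The resource name and expected payload are placeholders, not the actual patch.]

{code}
import java.io.InputStream;
import org.apache.solr.common.util.JavaBinCodec;
import org.apache.solr.common.util.NamedList;

// Sketch of a JavaBinCodec backward-compatibility check. The .bin resource is
// assumed to have been serialized by an older release and committed to the
// test resources, so any change to the wire format breaks this check.
public class JavaBinBackCompatCheck {
  public void checkBackCompat() throws Exception {
    InputStream is = getClass().getResourceAsStream("/solrj/javabin_backcompat.bin"); // hypothetical file
    try {
      NamedList<?> nl = (NamedList<?>) new JavaBinCodec().unmarshal(is);
      // Compare individual values rather than toString() output.
      Object v = nl.get("intValue"); // hypothetical key written by the old release
      if (!Integer.valueOf(42).equals(v)) {
        throw new AssertionError("JavaBin format changed incompatibly, got: " + v);
      }
    } finally {
      is.close();
    }
  }
}
{code}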
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922468#comment-13922468 ]

ASF subversion and git services commented on LUCENE-5493:
----------------------------------------------------------

Commit 1574867 from [~rcmuir] in branch 'dev/branches/lucene5493' [ https://svn.apache.org/r1574867 ]

LUCENE-5493: commit current state
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922467#comment-13922467 ]

ASF subversion and git services commented on LUCENE-5493:
----------------------------------------------------------

Commit 1574866 from [~rcmuir] in branch 'dev/branches/lucene5493' [ https://svn.apache.org/r1574866 ]

LUCENE-5493: create branch
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922507#comment-13922507 ]

Adrien Grand commented on LUCENE-5493:
--------------------------------------

+1 on making those classes wrap a `Sort`. I had started working on it for LUCENE-5314 but never got a chance to get a patch ready.
[jira] [Created] (SOLR-5820) OverseerCollectionProcessor#lookupReplicas has a timeout that is too short and a bad error message on timeout.
Mark Miller created SOLR-5820:
---------------------------------
    Summary: OverseerCollectionProcessor#lookupReplicas has a timeout that is too short and a bad error message on timeout.
    Key: SOLR-5820
    URL: https://issues.apache.org/jira/browse/SOLR-5820
    Project: Solr
    Issue Type: Bug
    Reporter: Mark Miller
    Assignee: Mark Miller
    Fix For: 4.8, 5.0
[jira] [Commented] (SOLR-5820) OverseerCollectionProcessor#lookupReplicas has a timeout that is too short and a bad error message on timeout.
[ https://issues.apache.org/jira/browse/SOLR-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922524#comment-13922524 ]

Mark Miller commented on SOLR-5820:
-----------------------------------

Test failures in creating collections led me to this.
[jira] [Commented] (SOLR-5820) OverseerCollectionProcessor#lookupReplicas has a timeout that is too short and a bad error message on timeout.
[ https://issues.apache.org/jira/browse/SOLR-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922533#comment-13922533 ]

Mark Miller commented on SOLR-5820:
-----------------------------------

This also ends up being a fairly ugly fail: because of the retries, the user basically ends up seeing that creating the collection failed because it already exists.
[jira] [Commented] (SOLR-5820) OverseerCollectionProcessor#lookupReplicas has a timeout that is too short and a bad error message on timeout.
[ https://issues.apache.org/jira/browse/SOLR-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922540#comment-13922540 ]

ASF subversion and git services commented on SOLR-5820:
--------------------------------------------------------

Commit 1574883 from [~markrmil...@gmail.com] in branch 'dev/trunk' [ https://svn.apache.org/r1574883 ]

SOLR-5820: OverseerCollectionProcessor#lookupReplicas has a timeout that is too short and a bad error message on timeout.
Re: Suggestions about writing / extending QueryParsers
Tommaso, Do say more about what you're thinking of. I'm currently getting my dev environment up to look into enhancing the MoreLikeThisHandler to be able to handle function query boosts. This should be eminently possible, from my initial research. However, if you're thinking of something more powerful, perhaps we can work together. Upayavira On Thu, Mar 6, 2014, at 11:23 AM, Tommaso Teofili wrote: Hi all, I'm thinking about writing/extending a QueryParser for MLT queries. I've never really looked into that code much; I'm doing so now, and I'm wondering if anyone has suggestions on how to start with such a topic. Should I write a new grammar for that? Or can I just extend an existing grammar/class? Thanks in advance, Tommaso
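On the grammar question: one way to start is to skip JavaCC entirely and hook into Solr's query-parser plugin mechanism, delegating the base parsing to an existing parser. A minimal sketch, assuming the 4.x QParserPlugin/QParser API; the class name and the delegation to the 'lucene' parser are illustrative choices, not an actual MLT implementation:
{code}
import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

// Hypothetical plugin; it would be registered in solrconfig.xml, e.g.
// <queryParser name="mlt" class="com.example.MLTQParserPlugin"/>
public class MLTQParserPlugin extends QParserPlugin {

  @Override
  public void init(NamedList args) {}

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      @Override
      public Query parse() throws SyntaxError {
        // Reuse an existing grammar for the base query; MLT-specific
        // term selection and boosting would wrap the result here.
        return QParser.getParser(qstr, "lucene", req).getQuery();
      }
    };
  }
}
{code}
Registered that way, the parser becomes reachable via local-params syntax such as {!mlt}, which is often less work than writing and maintaining a new grammar.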
[jira] [Commented] (SOLR-5820) OverseerCollectionProcessor#lookupReplicas has a timeout that is too short and a bad error message on timeout.
[ https://issues.apache.org/jira/browse/SOLR-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922544#comment-13922544 ] ASF subversion and git services commented on SOLR-5820: --- Commit 1574884 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1574884 ] SOLR-5820: OverseerCollectionProcessor#lookupReplicas has a timeout that is too short and a bad error message on timeout. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5422) Postings lists deduplication
[ https://issues.apache.org/jira/browse/LUCENE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922553#comment-13922553 ] Otis Gospodnetic commented on LUCENE-5422: -- Maybe [~mikemccand] can comment, but I think you are right as far as the Codecs part of Lucene and LIA are concerned. Postings lists deduplication Key: LUCENE-5422 URL: https://issues.apache.org/jira/browse/LUCENE-5422 Project: Lucene - Core Issue Type: Improvement Components: core/codecs, core/index Reporter: Dmitry Kan Labels: gsoc2014 The context: http://markmail.org/thread/tywtrjjcfdbzww6f Robert Muir and I discussed what Robert eventually named postings lists deduplication at the Berlin Buzzwords 2013 conference. The idea is to allow multiple terms to point to the same postings list to save space. This can be achieved by a new index codec implementation, but this jira is open to other ideas as well. The application / impact of this is positive for synonyms, exact / inexact terms, leading wildcard support via storing reversed terms, etc. For example, at the moment, when supporting exact (unstemmed) and inexact (stemmed) searches, we store both the unstemmed and the stemmed variant of a word form, and that leads to index bloat. For the same index size considerations, we had to remove leading wildcard support via reversing a token at index and query time. Comment from Mike McCandless: Neat idea! Would this idea allow a single term to point to (the union of) N other postings lists? It seems like that's necessary e.g. to handle the exact/inexact case. And then, to produce the Docs/AndPositionsEnum you'd need to do the merge sort across those N postings lists? Such a thing might also be do-able as a runtime-only wrapper around the postings API (FieldsProducer), if you could at runtime do the reverse expansion (e.g. stem -> all of its surface forms). Comment from Robert Muir: I think the exact/inexact case is trickier (detecting it would be the hard part), and you are right, another solution might work better. But for the reverse wildcard and synonyms situation, it seems we could even detect it on write if we created some hash of the previous term's postings. If the hash matches for the current term, we know it might be a duplicate and would have to actually do the costly check that they are the same. Maybe there are better ways to do it, but it might be a fun postings-format experiment to try. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
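Robert's write-time detection can be modelled without touching any codec API. A minimal sketch, assuming each term arrives with its complete sorted doc-ID list (real postings also carry freqs and positions, which would have to be part of the fingerprint and the equality check):
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy write-time dedup: a cheap hash flags a *candidate* duplicate of an
// earlier postings list; only then do we pay for the full equality check.
public class PostingsDedupSketch {

  private final Map<Integer, List<int[]>> byHash =
      new HashMap<Integer, List<int[]>>();

  /** Returns an identical, previously seen postings list to share, or null. */
  public int[] dedup(int[] postings) {
    int hash = Arrays.hashCode(postings);        // cheap fingerprint
    List<int[]> candidates = byHash.get(hash);
    if (candidates == null) {
      candidates = new ArrayList<int[]>();
      byHash.put(hash, candidates);
    }
    for (int[] candidate : candidates) {
      if (Arrays.equals(candidate, postings)) {  // costly exact check
        return candidate;                        // share instead of re-storing
      }
    }
    candidates.add(postings);
    return null;
  }

  public static void main(String[] args) {
    PostingsDedupSketch dedup = new PostingsDedupSketch();
    System.out.println(dedup.dedup(new int[] {1, 5, 9}));         // null: new
    System.out.println(dedup.dedup(new int[] {1, 5, 9}) != null); // true: dup
  }
}
{code}
In a real postings format the stored side would be a file pointer rather than the array itself, so two terms whose lists match would simply share the same pointer.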
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922620#comment-13922620 ] ASF subversion and git services commented on LUCENE-5493: - Commit 1574909 from [~rcmuir] in branch 'dev/branches/lucene5493' [ https://svn.apache.org/r1574909 ] LUCENE-5493: make BlockJoinSorter a ComparatorSource taking parent/child Sort -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5422) Postings lists deduplication
[ https://issues.apache.org/jira/browse/LUCENE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922627#comment-13922627 ] Michael McCandless commented on LUCENE-5422: I think reading blogs, javadocs, and CHANGES entries are all good ways to come up to speed on the recent changes in Lucene. And, yes, LUCENE-2082 is about more efficient merging, by appending raw postings bytes instead of the decode + re-encode that's done today. It's analogous to how Lucene used to fully decode and then re-encode each Document (stored fields) during merging, but today we just do bulk copying of bytes when possible (same for term vectors). I think this issue needs better scoping / maybe a clearer use case, to understand exactly when the postings list deduping should kick in. And if this incurs a search-time cost (e.g. a merge sort of N postings lists to make them look like a single postings list), that's an added cost that may be the wrong tradeoff (smaller index but slower searching) in most cases. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
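The search-time cost Mike describes is essentially a k-way merge: a term that points at the union of N stored lists has to merge-sort them on the fly to present one docs enum. A self-contained sketch over plain sorted int[] doc-ID lists (illustration only; a real enum would also interleave freqs and positions):
{code}
import java.util.Comparator;
import java.util.PriorityQueue;

// Merges N sorted doc-ID lists into one sorted, de-duplicated stream,
// the way a unioned term would have to at search time.
public class PostingsUnionSketch {

  static void printUnion(int[][] lists) {
    // Queue entries are {docId, listIndex, offsetInList}, ordered by docId.
    PriorityQueue<int[]> pq = new PriorityQueue<int[]>(
        Math.max(1, lists.length),
        new Comparator<int[]>() {
          public int compare(int[] a, int[] b) { return a[0] - b[0]; }
        });
    for (int i = 0; i < lists.length; i++) {
      if (lists[i].length > 0) pq.add(new int[] {lists[i][0], i, 0});
    }
    int last = -1;
    while (!pq.isEmpty()) {
      int[] top = pq.poll();
      if (top[0] != last) {       // collapse docs present in several lists
        System.out.print(top[0] + " ");
        last = top[0];
      }
      int next = top[2] + 1;      // advance within the source list
      if (next < lists[top[1]].length) {
        pq.add(new int[] {lists[top[1]][next], top[1], next});
      }
    }
    System.out.println();
  }

  public static void main(String[] args) {
    printUnion(new int[][] {{1, 4, 9}, {2, 4, 7}, {0, 9}});
    // prints: 0 1 2 4 7 9
  }
}
{code}
Every emitted doc pays an O(log N) queue operation, which is the "slower searching" side of the tradeoff.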
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922641#comment-13922641 ] ASF subversion and git services commented on LUCENE-5493: - Commit 1574918 from [~rcmuir] in branch 'dev/branches/lucene5493' [ https://svn.apache.org/r1574918 ] LUCENE-5493: hide Sorter, SortSorter, fix tests, change suggest to use public Sort API, cut over collector to take Sort -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5492) IndexFileDeleter AssertionError in presence of *_upgraded.si files
[ https://issues.apache.org/jira/browse/LUCENE-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922646#comment-13922646 ] Tim Smith commented on LUCENE-5492: --- Narrowing it down: definitely seeing a reference count issue. This only seems to occur when using the DirectoryReader.open(IndexWriter ...) methods. For one particular commit point, segments_4, I see the following refcount behavior:
* incref segments_4
** incref _0_upgraded.si refcount=3
** decref _0_upgraded.si refcount=2
* incref segments_4
** NOTE: _0_upgraded.si not incref'd this time
* ...
* delete segments_4
** decref _0_upgraded.si ERROR
IndexFileDeleter AssertionError in presence of *_upgraded.si files -- Key: LUCENE-5492 URL: https://issues.apache.org/jira/browse/LUCENE-5492 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.7 Reporter: Tim Smith When calling IndexWriter.deleteUnusedFiles against an index that contains 3.x segments, I am seeing the following exception:
{code}
java.lang.AssertionError: failAndDumpStackJunitStatment: RefCount is 0 pre-decrement for file _0_upgraded.si
at org.apache.lucene.index.IndexFileDeleter$RefCount.DecRef(IndexFileDeleter.java:630)
at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:514)
at org.apache.lucene.index.IndexFileDeleter.deleteCommits(IndexFileDeleter.java:286)
at org.apache.lucene.index.IndexFileDeleter.revisitPolicy(IndexFileDeleter.java:393)
at org.apache.lucene.index.IndexWriter.deleteUnusedFiles(IndexWriter.java:4617)
{code}
I believe this is caused by IndexFileDeleter not being aware of the Lucene3x Segment Infos Format (notably the _upgraded.si files created to upgrade an old index). This is new in 4.7 and did not occur in 4.6.1. Still trying to track down a workaround/fix. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
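The trace above reads like asymmetric accounting: the second incref of segments_4 skips _0_upgraded.si, so deleting the commit decrements the file once more than it was incremented. A toy model of that invariant (not Lucene's actual IndexFileDeleter, from which the quoted assertion comes); run with -ea so the assert fires:
{code}
import java.util.HashMap;
import java.util.Map;

// Toy per-file reference counting in the style of an index file deleter:
// each commit referencing a file must incRef it exactly once, and each
// deleted commit must decRef it exactly once, or the count goes negative.
public class RefCountSketch {

  private final Map<String, Integer> refCounts = new HashMap<String, Integer>();

  void incRef(String file) {
    Integer rc = refCounts.get(file);
    refCounts.put(file, rc == null ? 1 : rc + 1);
  }

  void decRef(String file) {
    Integer rc = refCounts.get(file);
    // The invariant whose violation is reported in the comment above.
    assert rc != null && rc > 0 : "RefCount is 0 pre-decrement for file " + file;
    if (rc != null && rc > 1) {
      refCounts.put(file, rc - 1);
    } else {
      refCounts.remove(file);  // last reference gone: file becomes deletable
    }
  }

  public static void main(String[] args) {
    RefCountSketch deleter = new RefCountSketch();
    deleter.incRef("_0_upgraded.si");  // commit 1 references the file
    deleter.incRef("_0_upgraded.si");  // commit 2 references it too
    deleter.decRef("_0_upgraded.si");  // commit 1 deleted
    deleter.decRef("_0_upgraded.si");  // commit 2 deleted: count is now 0
    deleter.decRef("_0_upgraded.si");  // a commit that never incRef'd: trips
  }
}
{code}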
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922656#comment-13922656 ] ASF subversion and git services commented on LUCENE-5493: - Commit 1574925 from [~rcmuir] in branch 'dev/branches/lucene5493' [ https://svn.apache.org/r1574925 ] LUCENE-5493: remove dead code -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922658#comment-13922658 ] ASF subversion and git services commented on LUCENE-5493: - Commit 1574926 from [~mikemccand] in branch 'dev/branches/lucene5493' [ https://svn.apache.org/r1574926 ] LUCENE-5493: small clean ups -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-5637) Per-request cache statistics
[ https://issues.apache.org/jira/browse/SOLR-5637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shikhar Bhushan updated SOLR-5637: -- Fix Version/s: (was: 4.7) Per-request cache statistics Key: SOLR-5637 URL: https://issues.apache.org/jira/browse/SOLR-5637 Project: Solr Issue Type: New Feature Reporter: Shikhar Bhushan Priority: Minor Attachments: SOLR-5367.patch, SOLR-5367.patch We have found it very useful to have information on the number of cache hits and misses for key Solr caches (filterCache, documentCache, etc.) at the request level. This is currently implemented in our codebase using custom {{SolrCache}} implementations. I am working on moving to maintaining stats in the {{SolrRequestInfo}} thread-local, and adding hooks in get() methods of SolrCache implementations. This will be glued up using the {{DebugComponent}} and can be requested using a debug.cache parameter. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-5637) Per-request cache statistics
[ https://issues.apache.org/jira/browse/SOLR-5637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shikhar Bhushan updated SOLR-5637: -- Attachment: SOLR-5637.patch updated patch against lucene_solr_4_7 branch -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
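The thread-local design described in the issue can be sketched without any Solr types. Everything here is hypothetical (the class, recordHit/recordMiss, and the idea that a wrapped SolrCache get() would call them); it only illustrates the mechanism, not the attached patch:
{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-request cache statistics, kept in a thread-local so a
// cache's get() can record hits/misses without plumbing a context through.
public class RequestCacheStats {

  // {hits, misses} per cache name, one map per request-handling thread.
  private static final ThreadLocal<Map<String, long[]>> STATS =
      new ThreadLocal<Map<String, long[]>>() {
        @Override protected Map<String, long[]> initialValue() {
          return new HashMap<String, long[]>();
        }
      };

  private static long[] counters(String cache) {
    Map<String, long[]> m = STATS.get();
    long[] c = m.get(cache);
    if (c == null) { c = new long[2]; m.put(cache, c); }
    return c;
  }

  public static void recordHit(String cache)  { counters(cache)[0]++; }
  public static void recordMiss(String cache) { counters(cache)[1]++; }

  /** Snapshot for the debug section; also resets for the next request. */
  public static Map<String, long[]> snapshotAndClear() {
    Map<String, long[]> m = STATS.get();
    STATS.remove();
    return m;
  }
}
{code}
A cache get() hook would then be a two-liner: on a null lookup call recordMiss("filterCache"), otherwise recordHit("filterCache"), and the debug-component side would call snapshotAndClear() when assembling the debug.cache output.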
[jira] [Commented] (LUCENE-5492) IndexFileDeleter AssertionError in presence of *_upgraded.si files
[ https://issues.apache.org/jira/browse/LUCENE-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922663#comment-13922663 ] Michael McCandless commented on LUCENE-5492: Hmm, I wonder if this is related to LUCENE-5434; we added incRef/decRef for NRT readers pulled from IndexWriter. If you revert that change locally, do you still see this happening? -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Problems installing Pylucene on Ubuntu 12.04
On Thu, 6 Mar 2014, Ritzschke, Uwe wrote: [...] But if I run make test, I get the errors attached below. It looks like there is a left-over 'import pdb; pdb.set_trace()' statement in the test_PythonDirectory.py test, at line 260. Please remove it and re-run the tests. Thanks! Andi.. [...]
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922667#comment-13922667 ] ASF subversion and git services commented on LUCENE-5493: - Commit 1574928 from [~rcmuir] in branch 'dev/branches/lucene5493' [ https://svn.apache.org/r1574928 ] LUCENE-5493: minor cleanups/opto -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5768) Add a distrib.singlePass parameter to make GET_FIELDS phase fetch all fields and skip EXECUTE_QUERY
[ https://issues.apache.org/jira/browse/SOLR-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922669#comment-13922669 ] Shikhar Bhushan commented on SOLR-5768: --- seems like the JIRA title has it the other way round :) Add a distrib.singlePass parameter to make GET_FIELDS phase fetch all fields and skip EXECUTE_QUERY --- Key: SOLR-5768 URL: https://issues.apache.org/jira/browse/SOLR-5768 Project: Solr Issue Type: Improvement Reporter: Shalin Shekhar Mangar Priority: Minor Fix For: 4.8, 5.0 Suggested by Yonik on solr-user: http://www.mail-archive.com/solr-user@lucene.apache.org/msg95045.html {quote} Although it seems like it should be relatively simple to make it work with other fields as well, by passing down the complete fl requested if some optional parameter is set (distrib.singlePass?) {quote} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
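For background: a distributed query normally runs in two phases, where the first (EXECUTE_QUERY) returns only ids and sort values from each shard and the second (GET_FIELDS) fetches stored fields for the merged top documents; hence the comment that the title is reversed, since a single pass would make the first phase fetch the fields and skip GET_FIELDS. A toy model of just the control flow; Shard and the method names are invented for illustration, not Solr's API:
{code}
import java.util.ArrayList;
import java.util.List;

// Toy model of two-phase vs. single-pass distributed retrieval.
public class SinglePassSketch {

  interface Shard {
    List<String> topIds(String q);             // phase 1: ids + sort values
    List<String> fields(List<String> ids);     // phase 2: stored fields by id
    List<String> topDocsWithFields(String q);  // single pass: everything
  }

  static List<String> query(List<Shard> shards, String q, boolean singlePass) {
    List<String> results = new ArrayList<String>();
    if (singlePass) {
      // One round trip per shard; each response is larger because it
      // already carries the full fl, but the second phase is skipped.
      for (Shard s : shards) results.addAll(s.topDocsWithFields(q));
    } else {
      List<String> merged = new ArrayList<String>();
      for (Shard s : shards) merged.addAll(s.topIds(q));        // round trip 1
      for (Shard s : shards) results.addAll(s.fields(merged));  // round trip 2
    }
    return results;
  }
}
{code}
The tradeoff is round trips versus response size: single pass halves the network chatter but makes every shard ship stored fields for documents that may not survive the merge.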
[jira] [Created] (LUCENE-5494) ArrayIndexOutOfBounds - WordBreakSolrSpellChecker.java:266
Mark Peck created LUCENE-5494: -- Summary: ArrayIndexOutOfBounds - WordBreakSolrSpellChecker.java:266 Key: LUCENE-5494 URL: https://issues.apache.org/jira/browse/LUCENE-5494 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.3 Environment: SolrNet, Uri interface Reporter: Mark Peck Priority: Minor When running the following query:
{code}
http://localhost:8983/solr/search/select?q=(%22active%2Bhuman%2Bcox-2%22+OR+(%22active%22+AND+%22human%22+AND+%22cox-2%22))&spellcheck=true
{code}
We get the following error output:
{code:xml}
<lst name="error">
  <str name="msg">9</str>
  <str name="trace">java.lang.ArrayIndexOutOfBoundsException: 9
    at org.apache.solr.spelling.WordBreakSolrSpellChecker.getSuggestions(WordBreakSolrSpellChecker.java:266)
    at org.apache.solr.spelling.ConjunctionSolrSpellChecker.getSuggestions(ConjunctionSolrSpellChecker.java:120)
    at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:172)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:365)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:926)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:988)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:635)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Unknown Source)</str>
  <int name="code">500</int>
</lst>
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2894) Implement distributed pivot faceting
[ https://issues.apache.org/jira/browse/SOLR-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922686#comment-13922686 ] Brett Lucey commented on SOLR-2894: --- Elran - A facet limit of -1 in distributed pivot facets is not a use case we use in our environment, but we did go ahead and make the fixes in order to support the community. I've tested the changes locally on a box with success and added unit tests around it, but we have not yet deployed those changes to a production cluster. The exception you were seeing was directly related to the facet limit being negative, and that has been fixed in the patch I uploaded yesterday. Implement distributed pivot faceting Key: SOLR-2894 URL: https://issues.apache.org/jira/browse/SOLR-2894 Project: Solr Issue Type: Improvement Reporter: Erik Hatcher Fix For: 4.7 Attachments: SOLR-2894-reworked.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch Following up on SOLR-792, pivot faceting currently only supports undistributed mode. Distributed pivot faceting needs to be implemented. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5628) Cloud test harness manifesting reproducible failures in TestDistribDocBasedVersion
[ https://issues.apache.org/jira/browse/SOLR-5628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922687#comment-13922687 ] ASF subversion and git services commented on SOLR-5628: --- Commit 1574941 from hoss...@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1574941 ] SOLR-5628: work around for this test to avoid whatever bug is in the cloud test framework Cloud test harness manifesting reproducible failures in TestDistribDocBasedVersion -- Key: SOLR-5628 URL: https://issues.apache.org/jira/browse/SOLR-5628 Project: Solr Issue Type: Bug Reporter: Hoss Man Jenkins uncovered a test seed that causes a reproducible IndexWriter assertion failure in TestDistribDocBasedVersion on the 4x branch. McCandless helped dig in, and we believe that something in the way the solr test framework is set up is causing the test to delete the index dirs before the IndexWriter is closed. Meanwhile, it appears that recent changes to 4x have caused the nature of the failure to change, so that now -- in addition to the IndexWriter assertion failure -- the test cleanup also stalls out and the test runner has to terminate some stalled threads. Details to follow in a comment, but here's the reproduce line...
{noformat}
ant test -Dtestcase=TestDistribDocBasedVersion -Dtests.seed=791402573DC76F3C -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=ar_IQ -Dtests.timezone=Antarctica/Rothera -Dtests.file.encoding=US-ASCII
{noformat}
And the mail thread regarding this... https://mail-archives.apache.org/mod_mbox/lucene-dev/201401.mbox/%3Calpine.DEB.2.02.1401100930260.20275@frisbee%3E -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-5494) ArrayIndexOutOfBounds - WordBreakSolrSpellChecker.java:266
[ https://issues.apache.org/jira/browse/LUCENE-5494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Peck updated LUCENE-5494: --- Description: [...] (!) We have ascertained that this only happens when the '-2' is added to the search term. was: [...]
[jira] [Updated] (SOLR-5628) Cloud test harness can cause index files to be deleted before IndexWriter is closed
[ https://issues.apache.org/jira/browse/SOLR-5628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated SOLR-5628: --- Description: This bug was originally opened because Jenkins uncovered a test seed that causes a reproducible IndexWriter assertion failure in TestDistribDocBasedVersion on the 4x branch. McCandless helped dig in, and we believe that something in the way the solr test framework is set up is causing the test to delete the index dirs before the IndexWriter is closed. Meanwhile, the failures later reproduced in other seeds on both 4x and trunk -- and it appears that recent changes caused the nature of the failure to change, so that now -- in addition to the IndexWriter assertion failure -- the test cleanup also stalls out and the test runner has to terminate some stalled threads. One interesting factor about this test is that at the end of the test there were docs that had been added but not committed -- which is probably unusual for most tests, and may explain why more cloud tests aren't exhibiting similar symptoms more often. *When a useless (from the perspective of what the test is trying to verify) commit was added to the test, the failing seed stopped reproducing.* An example of how to reliably reproduce this problem on an (older version of) trunk...
{noformat}
svn update -r 1574381
ant clean
cd solr/core
ant test -Dtestcase=TestDistribDocBasedVersion -Dtests.seed=1249227945045A2E -Dtests.slow=true -Dtests.locale=ko_KR -Dtests.timezone=America/Monterrey -Dtests.file.encoding=ISO-8859-1
{noformat}
Original email thread... https://mail-archives.apache.org/mod_mbox/lucene-dev/201401.mbox/%3Calpine.DEB.2.02.1401100930260.20275@frisbee%3E was: [...] Summary: Cloud test harness can cause index files to be deleted before IndexWriter is closed (was: Cloud test harness manifesting reproducible failures in TestDistribDocBasedVersion) I've edited the description to reflect the updated state of things, since I've been able to commit a work-around to the original test that manifested the problem with the cloud test framework.
[jira] [Commented] (SOLR-5628) Cloud test harness can cause index files to be deleted before IndexWriter is closed
[ https://issues.apache.org/jira/browse/SOLR-5628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922697#comment-13922697 ] ASF subversion and git services commented on SOLR-5628: --- Commit 1574942 from hoss...@apache.org in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1574942 ] SOLR-5628: work around for this test to avoid whatever bug is in the cloud test framework (merge r1574941) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922701#comment-13922701 ] ASF subversion and git services commented on LUCENE-5493: - Commit 1574945 from [~rcmuir] in branch 'dev/branches/lucene5493' [ https://svn.apache.org/r1574945 ] LUCENE-5493: simplify this test -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922709#comment-13922709 ] ASF subversion and git services commented on LUCENE-5493: - Commit 1574949 from [~rcmuir] in branch 'dev/branches/lucene5493' [ https://svn.apache.org/r1574949 ] LUCENE-5493: merge Sorter and SortSorter (in progress) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5796) With many collections, leader re-election takes too long when a node dies or is rebooted, leading to some shards getting into a conflicting state about who is the leader
[ https://issues.apache.org/jira/browse/SOLR-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922711#comment-13922711 ] Timothy Potter commented on SOLR-5796: -- I'm still a little concerned about a couple of things: 1) why is the leader state stored in two places in ZooKeeper (/clusterstate.json and /collections/coll/leaders/shard#)? I'm sure there is a good reason for this but can't see why ;-) 2) if the timeout still occurs (as we don't want to wait forever), can't the node with the conflict just favor what's in the leader path, assuming that replica is active and agrees? In other words, instead of throwing an exception and then just ending up in a down state, why can't the replica seeing the conflict just go with what ZooKeeper says? I'm digging into leader failover timing / error handling today. Thanks. Tim
With many collections, leader re-election takes too long when a node dies or is rebooted, leading to some shards getting into a conflicting state about who is the leader. Key: SOLR-5796 URL: https://issues.apache.org/jira/browse/SOLR-5796 Project: Solr Issue Type: Bug Components: SolrCloud Environment: Found on branch_4x Reporter: Timothy Potter Assignee: Mark Miller Fix For: 4.8, 5.0 Attachments: SOLR-5796.patch I'm doing some testing with a 4-node SolrCloud cluster against the latest rev in branch_4x having many collections, 150 to be exact, each having 4 shards with rf=3, so 450 cores per node. Nodes are decent in terms of resources: -Xmx6g with 4 CPUs - m3.xlarge's in EC2. The problem occurs when rebooting one of the nodes, say as part of a rolling restart of the cluster. If I kill one node and then wait for an extended period of time, such as 3 minutes, then all of the leaders on the downed node (roughly 150) have time to fail over to another node in the cluster. When I restart the downed node, since leaders have all failed over successfully, the new node starts up and all cores assume the replica role in their respective shards. This is goodness and expected. However, if I don't wait long enough for the leader failover process to complete on the other nodes before restarting the downed node, then some bad things happen. Specifically, when the dust settles, many of the previous leaders on the node I restarted get stuck in the conflicting state seen in the ZkController, starting around line 852 in branch_4x:
{code}
while (!leaderUrl.equals(clusterStateLeaderUrl)) {
  if (tries == 60) {
    throw new SolrException(ErrorCode.SERVER_ERROR,
        "There is conflicting information about the leader of shard: "
            + cloudDesc.getShardId() + " our state says: "
            + clusterStateLeaderUrl + " but zookeeper says: " + leaderUrl);
  }
  Thread.sleep(1000);
  tries++;
  clusterStateLeaderUrl = zkStateReader.getLeaderUrl(collection, shardId,
      timeoutms);
  leaderUrl = getLeaderProps(collection, cloudDesc.getShardId(), timeoutms)
      .getCoreUrl();
}
{code}
As you can see, the code is trying to give this problem a little time to work itself out, 1 minute to be exact. Unfortunately, that doesn't seem to be long enough for a busy cluster that has many collections. Now, one might argue that 450 cores per node is asking too much of Solr; however, I think this points to a bigger issue: a node coming up isn't aware that it went down, and leader election running on the other nodes is just being slow. Moreover, once this problem occurs, it's not clear how to fix it besides shutting the node down again and waiting for leader failover to complete. It's also interesting to me that /clusterstate.json was updated by the healthy node taking over the leader role but the /collections/coll/leaders/shard# was not updated? I added some debugging and it seems like the overseer queue is extremely backed up with work. Maybe the solution here is to just wait longer, but I also want to get some feedback from the community on other options? I know there are some plans to help scale the Overseer (i.e. SOLR-5476), so maybe that helps, and I'm trying to add more debug to see if this is really due to overseer backlog (which I suspect it is). In general, I'm a little confused by the keeping of leader state in multiple places in ZK. Is there any background information on why we have leader state in /clusterstate.json and in the leader path znode? Also, here are some interesting side
[jira] [Commented] (SOLR-3854) SolrCloud does not work with https
[ https://issues.apache.org/jira/browse/SOLR-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922717#comment-13922717 ] ASF subversion and git services commented on SOLR-3854: --- Commit 1574951 from [~steve_rowe] in branch 'dev/trunk' [ https://svn.apache.org/r1574951 ] SOLR-3854: IntelliJ config: add solr example lib test dependency to map-reduce and dataimporthandler contribs SolrCloud does not work with https -- Key: SOLR-3854 URL: https://issues.apache.org/jira/browse/SOLR-3854 Project: Solr Issue Type: Bug Reporter: Sami Siren Assignee: Mark Miller Fix For: 4.7, 5.0 Attachments: SOLR-3854.patch, SOLR-3854.patch, SOLR-3854.patch, SOLR-3854.patch, SOLR-3854.patch, SOLR-3854.patch, SOLR-3854.patch, SOLR-3854.patch, SOLR-3854.patch, SOLR-3854.patch, SOLR-3854.patch, SOLR-3854v2.patch, SOLR-3854v3.patch, SOLR-3854v4.patch There are a few places in the current codebase that assume http is used. This prevents using https when running solr in cloud mode. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3854) SolrCloud does not work with https
[ https://issues.apache.org/jira/browse/SOLR-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922718#comment-13922718 ] ASF subversion and git services commented on SOLR-3854: --- Commit 1574953 from [~steve_rowe] in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1574953 ] SOLR-3854: IntelliJ config: add solr example lib test dependency to map-reduce and dataimporthandler contribs (merged trunk r1574951) SolrCloud does not work with https -- Key: SOLR-3854 URL: https://issues.apache.org/jira/browse/SOLR-3854 Project: Solr Issue Type: Bug Reporter: Sami Siren Assignee: Mark Miller Fix For: 4.7, 5.0 Attachments: SOLR-3854.patch, SOLR-3854.patch, SOLR-3854.patch, SOLR-3854.patch, SOLR-3854.patch, SOLR-3854.patch, SOLR-3854.patch, SOLR-3854.patch, SOLR-3854.patch, SOLR-3854.patch, SOLR-3854.patch, SOLR-3854v2.patch, SOLR-3854v3.patch, SOLR-3854v4.patch There are a few places in current codebase that assume http is used. This prevents using https when running solr in cloud mode. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922723#comment-13922723 ] ASF subversion and git services commented on LUCENE-5493: - Commit 1574954 from [~rcmuir] in branch 'dev/branches/lucene5493' [ https://svn.apache.org/r1574954 ] LUCENE-5493: javadocs Rename Sorter, NumericDocValuesSorter, and fix javadocs --- Key: LUCENE-5493 URL: https://issues.apache.org/jira/browse/LUCENE-5493 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-5493-poc.patch Its not clear to users that these are for this super-expert thing of pre-sorting the index. From the names and documentation they think they should use them instead of Sort/SortField. These need to be renamed or, even better, the API fixed so they aren't public classes. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922746#comment-13922746 ] ASF subversion and git services commented on LUCENE-5493: - Commit 1574962 from [~rcmuir] in branch 'dev/branches/lucene5493' [ https://svn.apache.org/r1574962 ] LUCENE-5493: fix precommit Rename Sorter, NumericDocValuesSorter, and fix javadocs --- Key: LUCENE-5493 URL: https://issues.apache.org/jira/browse/LUCENE-5493 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-5493-poc.patch Its not clear to users that these are for this super-expert thing of pre-sorting the index. From the names and documentation they think they should use them instead of Sort/SortField. These need to be renamed or, even better, the API fixed so they aren't public classes. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922757#comment-13922757 ] ASF subversion and git services commented on LUCENE-5493: - Commit 1574965 from [~mikemccand] in branch 'dev/branches/lucene5493' [ https://svn.apache.org/r1574965 ] LUCENE-5493: don't do forceMerge on initital build of AnalyzingInfixSuggester Rename Sorter, NumericDocValuesSorter, and fix javadocs --- Key: LUCENE-5493 URL: https://issues.apache.org/jira/browse/LUCENE-5493 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-5493-poc.patch Its not clear to users that these are for this super-expert thing of pre-sorting the index. From the names and documentation they think they should use them instead of Sort/SortField. These need to be renamed or, even better, the API fixed so they aren't public classes. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-5495) Boolean Filter does not handle FilterClauses with only bits() implemented
John Wang created LUCENE-5495: - Summary: Boolean Filter does not handle FilterClauses with only bits() implemented Key: LUCENE-5495 URL: https://issues.apache.org/jira/browse/LUCENE-5495 Project: Lucene - Core Issue Type: Bug Components: core/search Affects Versions: 4.6.1 Reporter: John Wang Some Filter implementations produce DocIdSets without the iterator() implementation, such as o.a.l.facet.range.Range.getFilter(). Currently, such filters cannot be added to a BooleanFilter, because BooleanFilter expects all FilterClauses to hold Filters that have iterator() implemented. This patch improves the behavior by taking Filters with bits() implemented and treating them separately. This would be faster in the case of Filters with a forward index as the underlying data structure, where there is no need to scan the index to build an iterator. See the attached unit test, which fails without this patch. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-5495) Boolean Filter does not handle FilterClauses with only bits() implemented
[ https://issues.apache.org/jira/browse/LUCENE-5495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Wang updated LUCENE-5495: -- Attachment: LUCENE-5495.patch Boolean Filter does not handle FilterClauses with only bits() implemented - Key: LUCENE-5495 URL: https://issues.apache.org/jira/browse/LUCENE-5495 Project: Lucene - Core Issue Type: Bug Components: core/search Affects Versions: 4.6.1 Reporter: John Wang Attachments: LUCENE-5495.patch Some Filter implementations produce DocIdSets without the iterator() implementation, such as o.a.l.facet.range.Range.getFilter(). Currently, such filters cannot be added to a BooleanFilter, because BooleanFilter expects all FilterClauses to hold Filters that have iterator() implemented. This patch improves the behavior by taking Filters with bits() implemented and treating them separately. This would be faster in the case of Filters with a forward index as the underlying data structure, where there is no need to scan the index to build an iterator. See the attached unit test, which fails without this patch. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922772#comment-13922772 ] ASF subversion and git services commented on LUCENE-5493: - Commit 1574969 from [~mikemccand] in branch 'dev/branches/lucene5493' [ https://svn.apache.org/r1574969 ] LUCENE-5493: fix solr Rename Sorter, NumericDocValuesSorter, and fix javadocs --- Key: LUCENE-5493 URL: https://issues.apache.org/jira/browse/LUCENE-5493 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-5493-poc.patch Its not clear to users that these are for this super-expert thing of pre-sorting the index. From the names and documentation they think they should use them instead of Sort/SortField. These need to be renamed or, even better, the API fixed so they aren't public classes. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-5470) Refactoring multiterm analysis
[ https://issues.apache.org/jira/browse/LUCENE-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated LUCENE-5470: Attachment: LUCENE-5470_QPBase.patch LUCENE-5470_QBuilder.patch Two patches. One consolidates analysis in QueryBuilder and one consolidates it in QueryParserBase. I've added a check for position increment. I don't think this will create false exceptions, but let me know if anyone thinks otherwise. If we go with QueryBuilder, I'm not wedded to static. Refactoring multiterm analysis -- Key: LUCENE-5470 URL: https://issues.apache.org/jira/browse/LUCENE-5470 Project: Lucene - Core Issue Type: Bug Components: core/queryparser Affects Versions: 5.0 Reporter: Tim Allison Priority: Minor Attachments: LUCENE-5470.patch, LUCENE-5470_QBuilder.patch, LUCENE-5470_QPBase.patch There are currently three methods to analyze multiterms in Lucene and Solr: 1) QueryParserBase 2) AnalyzingQueryParser 3) TextField (Solr) The code in QueryParserBase and in TextField does not consume the tokenstream if more than one token is generated by the analyzer. (Admittedly, thanks to the magic of MultitermAwareComponents in Solr, this type of exception probably never happens and the unconsumed stream problem is probably non-existent in Solr.) I propose consolidating the multiterm analysis code into one place: QueryBuilder in Lucene core. This is part of a refactoring that will also help reduce duplication of code with LUCENE-5205. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
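As a concrete illustration of the behavior being consolidated, a helper along these lines (hypothetical, not the patch itself) analyzes a multiterm part, insists on exactly one token, and fully consumes and closes the stream either way:

{code}
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute;
import org.apache.lucene.util.BytesRef;

static BytesRef analyzeSingleToken(Analyzer analyzer, String field, String part)
    throws IOException {
  try (TokenStream ts = analyzer.tokenStream(field, part)) {
    TermToBytesRefAttribute bytesAtt = ts.addAttribute(TermToBytesRefAttribute.class);
    ts.reset();
    if (!ts.incrementToken()) {
      throw new IllegalArgumentException("analyzer eliminated multiterm part: " + part);
    }
    bytesAtt.fillBytesRef(); // 4.x API: copy the current term into the shared BytesRef
    BytesRef token = BytesRef.deepCopyOf(bytesAtt.getBytesRef());
    if (ts.incrementToken()) {
      // the case the description warns about: more than one token generated
      throw new IllegalArgumentException("analyzer produced more than one token: " + part);
    }
    ts.end();
    return token;
  }
}
{code}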
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922783#comment-13922783 ] ASF subversion and git services commented on LUCENE-5493: - Commit 1574972 from [~rcmuir] in branch 'dev/branches/lucene5493' [ https://svn.apache.org/r1574972 ] LUCENE-5493: add CHANGES Rename Sorter, NumericDocValuesSorter, and fix javadocs --- Key: LUCENE-5493 URL: https://issues.apache.org/jira/browse/LUCENE-5493 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-5493-poc.patch, LUCENE-5493.patch Its not clear to users that these are for this super-expert thing of pre-sorting the index. From the names and documentation they think they should use them instead of Sort/SortField. These need to be renamed or, even better, the API fixed so they aren't public classes. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-5493: Attachment: LUCENE-5493.patch Here is a patch for review. The public API is much simpler and I think it makes the SortingMP a lot more flexible and easier to use. Rename Sorter, NumericDocValuesSorter, and fix javadocs --- Key: LUCENE-5493 URL: https://issues.apache.org/jira/browse/LUCENE-5493 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-5493-poc.patch, LUCENE-5493.patch Its not clear to users that these are for this super-expert thing of pre-sorting the index. From the names and documentation they think they should use them instead of Sort/SortField. These need to be renamed or, even better, the API fixed so they aren't public classes. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
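If the API lands the way the patch description suggests, with a plain Sort/SortField driving the sorter instead of the old Sorter/NumericDocValuesSorter classes (an assumption on my part; the attached patch is authoritative), wiring it up might look like:

{code}
// Sketch under the assumed post-patch API: SortingMergePolicy (lucene/misc)
// configured with an ordinary Sort rather than a custom Sorter implementation.
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_47, analyzer);
Sort indexOrder = new Sort(new SortField("timestamp", SortField.Type.LONG));
iwc.setMergePolicy(new SortingMergePolicy(iwc.getMergePolicy(), indexOrder));
IndexWriter writer = new IndexWriter(dir, iwc);
{code}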
[jira] [Commented] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922786#comment-13922786 ] Tim Allison commented on LUCENE-5205: - [~rcmuir], if you have a chance to review and commit the Feb 28 patch for cleaning up the test cases, I'd greatly appreciate it! Thank you, again.

[PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser --- Key: LUCENE-5205 URL: https://issues.apache.org/jira/browse/LUCENE-5205 Project: Lucene - Core Issue Type: Improvement Components: core/queryparser Reporter: Tim Allison Labels: patch Fix For: 4.7 Attachments: LUCENE-5205-cleanup-tests.patch, LUCENE-5205-date-pkg-prvt.patch, LUCENE-5205.patch.gz, LUCENE-5205.patch.gz, LUCENE-5205_dateTestReInitPkgPrvt.patch, LUCENE-5205_smallTestMods.patch, LUCENE_5205.patch, SpanQueryParser_v1.patch.gz, patch.txt

This parser extends QueryParserBase and includes functionality from:
* Classic QueryParser: most of its syntax
* SurroundQueryParser: recursive parsing for near and not clauses
* ComplexPhraseQueryParser: can handle near queries that include multiterms (wildcard, fuzzy, regex, prefix)
* AnalyzingQueryParser: has an option to analyze multiterms

At a high level, there's a first-pass BooleanQuery/field parser, and then a span query parser handles all terminal nodes and phrases.

Same as classic syntax:
* term: test
* fuzzy: roam~0.8, roam~2
* wildcard: te?t, test*, t*st
* regex: /[mb]oat/
* phrase: "jakarta apache"
* phrase with slop: "jakarta apache"~3
* default or clause: jakarta apache
* grouping or clause: (jakarta apache)
* boolean and +/-: (lucene OR apache) NOT jakarta; +lucene +apache -jakarta
* multiple fields: title:lucene author:hatcher

Main additions in SpanQueryParser syntax vs. classic syntax:
* Can require in-order for phrases with slop with the ~> operator: "jakarta apache"~>3
* Can specify not near: "fever bieber"!~3,10 :: find fever but not if bieber appears within 3 words before or 10 words after it.
* Fully recursive phrasal queries with [ and ]; as in: [[jakarta apache]~3 lucene]~4 :: find jakarta within 3 words of apache, and that hit has to be within four words before lucene.
* Can also use [] for single-level phrasal queries instead of " " as in: [jakarta apache]
* Can use "or" grouping clauses in phrasal queries: apache (lucene solr)~3 :: find apache and then either lucene or solr within three words.
* Can use multiterms in phrasal queries: jakarta~1 ap*che~2
* Did I mention full recursion: [[jakarta~1 ap*che]~2 (solr~ /l[ou]+[cs][en]+/)]~10 :: find something like jakarta within two words of ap*che, and that hit has to be within ten words of something like solr or that lucene regex.
* Can require at least x number of hits at boolean level: apache AND (lucene solr tika)~2
* Can use negative-only query: -jakarta :: find all docs that don't contain jakarta
* Can use an edit distance > 2 for fuzzy query via SlowFuzzyQuery (beware of potential performance issues!).

Trivial additions:
* Can specify prefix length in fuzzy queries: jakarta~1,2 (edit distance = 1, prefix = 2)
* Can specify Optimal String Alignment (OSA) vs Levenshtein for edit distance <= 2: jakarta~1 (OSA) vs jakarta~1 (Levenshtein)

This parser can be very useful for concordance tasks (see also LUCENE-5317 and LUCENE-5318) and for analytical search. Until LUCENE-2878 is closed, this might have a use for fans of SpanQuery. Most of the documentation is in the javadoc for SpanQueryParser.
Any and all feedback is welcome. Thank you. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
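For readers who want to try the patch, usage might look roughly like the sketch below. Everything here is assumed from the patch's description rather than a released Lucene API, so the package, constructor, and parse signature are guesses for illustration only.

{code}
// Hypothetical usage of the patch's SpanQueryParser (signatures assumed).
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
SpanQueryParser parser = new SpanQueryParser(Version.LUCENE_47, "content", analyzer);
// One of the recursive examples above: jakarta within 3 words of apache,
// and that hit within four words before lucene.
Query q = parser.parse("[[jakarta apache]~3 lucene]~4");
{code}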
[jira] [Updated] (SOLR-5768) Add a distrib.singlePass parameter to make EXECUTE_QUERY phase fetch all fields and skip GET_FIELDS
[ https://issues.apache.org/jira/browse/SOLR-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-5768: Summary: Add a distrib.singlePass parameter to make EXECUTE_QUERY phase fetch all fields and skip GET_FIELDS (was: Add a distrib.singlePass parameter to make GET_FIELDS phase fetch all fields and skip EXECUTE_QUERY) Add a distrib.singlePass parameter to make EXECUTE_QUERY phase fetch all fields and skip GET_FIELDS --- Key: SOLR-5768 URL: https://issues.apache.org/jira/browse/SOLR-5768 Project: Solr Issue Type: Improvement Reporter: Shalin Shekhar Mangar Priority: Minor Fix For: 4.8, 5.0 Suggested by Yonik on solr-user: http://www.mail-archive.com/solr-user@lucene.apache.org/msg95045.html {quote} Although it seems like it should be relatively simple to make it work with other fields as well, by passing down the complete fl requested if some optional parameter is set (distrib.singlePass?) {quote} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
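For reference, once this lands the parameter would presumably be passed like any other request param, e.g. (host, collection, and field list illustrative):

{noformat}
http://localhost:8983/solr/collection1/select?q=*:*&fl=id,name,price&distrib.singlePass=true
{noformat}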
[jira] [Commented] (SOLR-5768) Add a distrib.singlePass parameter to make GET_FIELDS phase fetch all fields and skip EXECUTE_QUERY
[ https://issues.apache.org/jira/browse/SOLR-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922796#comment-13922796 ] Shalin Shekhar Mangar commented on SOLR-5768: - see it is already implemented ;) thanks, I'll fix. Add a distrib.singlePass parameter to make GET_FIELDS phase fetch all fields and skip EXECUTE_QUERY --- Key: SOLR-5768 URL: https://issues.apache.org/jira/browse/SOLR-5768 Project: Solr Issue Type: Improvement Reporter: Shalin Shekhar Mangar Priority: Minor Fix For: 4.8, 5.0 Suggested by Yonik on solr-user: http://www.mail-archive.com/solr-user@lucene.apache.org/msg95045.html {quote} Although it seems like it should be relatively simple to make it work with other fields as well, by passing down the complete fl requested if some optional parameter is set (distrib.singlePass?) {quote} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
JDK 8 : Third Release Candidate - Build 132 is available on java.net
Hi Uwe, Dawid, JDK 8 Third Release Candidate, Build 132, is now available for download and test: http://jdk8.java.net/download.html Please log all show-stopper issues as soon as possible. Thanks for your support, Rory -- Rgds, Rory O'Donnell Quality Engineering Manager Oracle EMEA, Dublin, Ireland
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922811#comment-13922811 ] Michael McCandless commented on LUCENE-5493: +1, looks great. This also makes it trivial to do impact-sorted postings by an arbitrary expression. Rename Sorter, NumericDocValuesSorter, and fix javadocs --- Key: LUCENE-5493 URL: https://issues.apache.org/jira/browse/LUCENE-5493 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-5493-poc.patch, LUCENE-5493.patch Its not clear to users that these are for this super-expert thing of pre-sorting the index. From the names and documentation they think they should use them instead of Sort/SortField. These need to be renamed or, even better, the API fixed so they aren't public classes. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-5821) Search inconsistency on SolrCloud replicas
Maxim Novikov created SOLR-5821: --- Summary: Search inconsistency on SolrCloud replicas Key: SOLR-5821 URL: https://issues.apache.org/jira/browse/SOLR-5821 Project: Solr Issue Type: Bug Components: SolrCloud Affects Versions: 4.6.1 Environment: CentOS 6.5, Tomcat 8.0.3, Solr 4.6.1 1 shard, 2 replicas (servers with identical hardware/software) Reporter: Maxim Novikov Priority: Critical We use the following infrastructure: SolrCloud with 1 shard and 2 replicas. The index is built using DataImportHandler (importing data from the database). The number of items in the index can vary from 100 to 100,000,000. After indexing part of the data (not necessarily all the data, it is enough to have a small number of items in the search index), we can observe that Solr instances (replicas) return different results for the same search queries. I believe it happens because some of the results have the same scores, and Solr instances return those in a random order. PS This is a critical issue for us as we use a load balancer to scale Solr through replicas, and as a result of this issue, we retrieve various results for the same queries all the time. They are not necessarily completely different, but even a couple of items that differ is a deal breaker. The expected behaviour would be to always get identical results for the same search queries from all replicas. Otherwise, this cloud thing works just unreliably. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
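Not a fix for the root cause reported here, but a common mitigation (standard Solr sort syntax; the id field below is illustrative and should be the schema's uniqueKey) is to add a deterministic tie-breaker after score, so that replicas order equal-score documents identically:

{noformat}
/select?q=some+query&sort=score desc,id asc
{noformat}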
[jira] [Commented] (LUCENE-5492) IndexFileDeleter AssertionError in presence of *_upgraded.si files
[ https://issues.apache.org/jira/browse/LUCENE-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922847#comment-13922847 ] Tim Smith commented on LUCENE-5492: --- That seems to be the culprit. In my IndexWriter subclass, I overrode incRefDeleter and decRefDeleter to be no-ops and it no longer fails horribly. Hopefully this doesn't have any negative effects (it looks like that was all that was in the patch on LUCENE-5434, so worst-case scenario I just don't get to take advantage of the benefits there).

IndexFileDeleter AssertionError in presence of *_upgraded.si files -- Key: LUCENE-5492 URL: https://issues.apache.org/jira/browse/LUCENE-5492 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.7 Reporter: Tim Smith When calling IndexWriter.deleteUnusedFiles against an index that contains 3.x segments, I am seeing the following exception:

{code}
java.lang.AssertionError: failAndDumpStackJunitStatment: RefCount is 0 pre-decrement for file _0_upgraded.si
  at org.apache.lucene.index.IndexFileDeleter$RefCount.DecRef(IndexFileDeleter.java:630)
  at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:514)
  at org.apache.lucene.index.IndexFileDeleter.deleteCommits(IndexFileDeleter.java:286)
  at org.apache.lucene.index.IndexFileDeleter.revisitPolicy(IndexFileDeleter.java:393)
  at org.apache.lucene.index.IndexWriter.deleteUnusedFiles(IndexWriter.java:4617)
{code}

I believe this is caused by IndexFileDeleter not being aware of the Lucene3x SegmentInfos format (notably the _upgraded.si files created to upgrade an old index). This is new in 4.7 and did not occur in 4.6.1. Still trying to track down a workaround/fix. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
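A reconstruction of the workaround described in the comment (Tim's actual subclass isn't shown, so the no-op overrides below are an approximation, with signatures as I read the 4.7 IndexWriter):

{code}
IndexWriter writer = new IndexWriter(dir, iwc) {
  @Override
  public void incRefDeleter(SegmentInfos segmentInfos) throws IOException {
    // no-op: skip the LUCENE-5434 deleter ref-counting
  }

  @Override
  public void decRefDeleter(SegmentInfos segmentInfos) throws IOException {
    // no-op: avoids the "RefCount is 0 pre-decrement" assertion on *_upgraded.si
  }
};
{code}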
[jira] [Updated] (SOLR-5821) Search inconsistency on SolrCloud replicas
[ https://issues.apache.org/jira/browse/SOLR-5821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Novikov updated SOLR-5821: Environment: CentOS 6.5, 8Gb RAM, 4 CPUs, 100Gb HDD Tomcat 8.0.3, Solr 4.6.1 1 shard, 2 replicas (servers with identical hardware/software) was: CentOS 6.5, Tomcat 8.0.3, Solr 4.6.1 1 shard, 2 replicas (servers with identical hardware/software) Search inconsistency on SolrCloud replicas -- Key: SOLR-5821 URL: https://issues.apache.org/jira/browse/SOLR-5821 Project: Solr Issue Type: Bug Components: SolrCloud Affects Versions: 4.6.1 Environment: CentOS 6.5, 8Gb RAM, 4 CPUs, 100Gb HDD Tomcat 8.0.3, Solr 4.6.1 1 shard, 2 replicas (servers with identical hardware/software) Reporter: Maxim Novikov Priority: Critical Labels: cloud, inconsistency, replica, search We use the following infrastructure: SolrCloud with 1 shard and 2 replicas. The index is built using DataImportHandler (importing data from the database). The number of items in the index can vary from 100 to 100,000,000. After indexing part of the data (not necessarily all the data, it is enough to have a small number of items in the search index), we can observe that Solr instances (replicas) return different results for the same search queries. I believe it happens because some of the results have the same scores, and Solr instances return those in a random order. PS This is a critical issue for us as we use a load balancer to scale Solr through replicas, and as a result of this issue, we retrieve various results for the same queries all the time. They are not necessarily completely different, but even a couple of items that differ is a deal breaker. The expected behaviour would be to always get identical results for the same search queries from all replicas. Otherwise, this cloud thing works just unreliably. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-5821) Search inconsistency on SolrCloud replicas
[ https://issues.apache.org/jira/browse/SOLR-5821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Novikov updated SOLR-5821: Environment: SolrCloud: 1 shard, 2 replicas Both instances/replicas have identical hardware/software: CPU(s): 4 RAM: 8Gb HDD: 100Gb OS: CentOS 6.5 ZooKeeper 3.4.5 Tomcat 8.0.3 Solr 4.6.1 Servers are utilized to run Solr only. was: CentOS 6.5, 8Gb RAM, 4 CPUs, 100Gb HDD Tomcat 8.0.3, Solr 4.6.1 1 shard, 2 replicas (servers with identical hardware/software) Search inconsistency on SolrCloud replicas -- Key: SOLR-5821 URL: https://issues.apache.org/jira/browse/SOLR-5821 Project: Solr Issue Type: Bug Components: SolrCloud Affects Versions: 4.6.1 Environment: SolrCloud: 1 shard, 2 replicas Both instances/replicas have identical hardware/software: CPU(s): 4 RAM: 8Gb HDD: 100Gb OS: CentOS 6.5 ZooKeeper 3.4.5 Tomcat 8.0.3 Solr 4.6.1 Servers are utilized to run Solr only. Reporter: Maxim Novikov Priority: Critical Labels: cloud, inconsistency, replica, search We use the following infrastructure: SolrCloud with 1 shard and 2 replicas. The index is built using DataImportHandler (importing data from the database). The number of items in the index can vary from 100 to 100,000,000. After indexing part of the data (not necessarily all the data, it is enough to have a small number of items in the search index), we can observe that Solr instances (replicas) return different results for the same search queries. I believe it happens because some of the results have the same scores, and Solr instances return those in a random order. PS This is a critical issue for us as we use a load balancer to scale Solr through replicas, and as a result of this issue, we retrieve various results for the same queries all the time. They are not necessarily completely different, but even a couple of items that differ is a deal breaker. The expected behaviour would be to always get identical results for the same search queries from all replicas. Otherwise, this cloud thing works just unreliably. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922884#comment-13922884 ] Shai Erera commented on LUCENE-5476: bq. but any facet accumulation which would rely on document scores would be hit by the second as the scores That's a great point Gilad. We need a test which covers that with the random sampling collector. bq. Is there a reason to add more randomness to one test? It depends. I have a problem with numDocs=10,000 and percents being 10% .. it creates too-perfect numbers, if you know what I mean. I prefer a random number of documents to add some spice to the test. Since we're testing a random sampler, I don't think it makes sense to test it with a fixed seed (0xdeadbeef) ... this collector is all about randomness, so we should stress the randomness done there. Given our test framework, randomness is not a big deal at all, since once we get a test failure, we can deterministically reproduce the failure (when there is no multi-threading). So I say YES, in this test I think we should have randomness. But e.g. when you add a test which ensures the collector works well w/ sampled docs and scores, I don't think you should add randomness -- it's OK to test it once. Also, in terms of test coverage, there are other cases which I think would be good if they were tested: * Docs + Scores (discussed above) * Multi-segment indexes (ensuring we work well there) * Different number of hits per-segment (to make sure our sampling on tiny segments works well too) * ... I wouldn't for example use RandomIndexWriter because we're only testing search. If we want many segments, we should commit/nrt-open every few segments, disable the merge policy etc. These can be separate, real unit tests. bq. Sorry, I don't get what you mean by this. I meant that if you set {{numDocs = atLeast(8000)}}, then the 10% sampler should not be hardcoded to 1,000, but {{numDocs * 0.1}}. bq. the original totalHits .. is used I think that's OK. In fact, if we don't record that, it would be hard to fix the counts, no? {quote} There will be 5 facet values (0, 2, 4, 6 and 8), as only the even documents (i % 10) are hits. There is a REAL small chance that one of the five values will be entirely missed when sampling. But is that 0.8 (chance not to take a value) ^ 2000 * 5 (any can be missing) ~ 10^-193, so that is probably not going to happen {quote} Ahh thanks, I missed that. I agree it's very improbable that one of the values is missing, but if we can avoid that at all it's better. First, it's not one of the values, we could be missing even 2, right -- it really depends on randomness. I find this assert just redundant -- if we always expect 5, we shouldn't assert that we received 5. If we say that very infrequently we might get <5 and we're OK with it .. what's the point of asserting that at all? bq. I renamed the sampleThreshold to sampleSize. It currently picks a samplingRatio that will reduce the number of hits to the sampleSize, if the number of hits is greater. It looks like it hasn't changed? I mean besides the rename. So if I set sampleSize=100K, it's 100K whether there are 101K docs or 100M docs, right? Is that your intention?
Facet sampling -- Key: LUCENE-5476 URL: https://issues.apache.org/jira/browse/LUCENE-5476 Project: Lucene - Core Issue Type: Improvement Reporter: Rob Audenaerde Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java With LUCENE-5339 facet sampling disappeared. When trying to display facet counts on large datasets (10M documents) counting facets is rather expensive, as all the hits are collected and processed. Sampling greatly reduced this and thus provided a nice speedup. Could it be brought back? -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
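Spelling out the back-of-the-envelope from the quoted comment: each of the 5 facet values occurs in 2000 of the hits, and (in that test) each hit is skipped with probability 0.8, so missing a value entirely means skipping all 2000 of its documents.

{code}
double pMissOneValue = Math.pow(0.8, 2000);   // ~ 1.6e-194
double pAnyValueMissing = 5 * pMissOneValue;  // union bound over 5 values, ~ 1e-193
{code}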
[jira] [Comment Edited] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922884#comment-13922884 ] Shai Erera edited comment on LUCENE-5476 at 3/6/14 6:56 PM: bq. but any facet accumulation which would rely on document scores would be hit by the second as the scores That's a great point Gilad. We need a test which covers that with the random sampling collector. bq. Is there a reason to add more randomness to one test? It depends. I have a problem with numDocs=10,000 and percents being 10% .. it creates too-perfect numbers, if you know what I mean. I prefer a random number of documents to add some spice to the test. Since we're testing a random sampler, I don't think it makes sense to test it with a fixed seed (0xdeadbeef) ... this collector is all about randomness, so we should stress the randomness done there. Given our test framework, randomness is not a big deal at all, since once we get a test failure, we can deterministically reproduce the failure (when there is no multi-threading). So I say YES, in this test I think we should have randomness. But e.g. when you add a test which ensures the collector works well w/ sampled docs and scores, I don't think you should add randomness -- it's OK to test it once. Also, in terms of test coverage, there are other cases which I think would be good if they were tested: * Docs + Scores (discussed above) * Multi-segment indexes (ensuring we work well there) * Different number of hits per-segment (to make sure our sampling on tiny segments works well too) * ... I wouldn't for example use RandomIndexWriter because we're only testing search (and so it just adds noise in this test). If we want many segments, we should commit/nrt-open every few docs, disable the merge policy etc. These can be separate, real unit tests. bq. Sorry, I don't get what you mean by this. I meant that if you set {{numDocs = atLeast(8000)}}, then the 10% sampler should not be hardcoded to 1,000, but {{numDocs * 0.1}}. bq. the original totalHits .. is used I think that's OK. In fact, if we don't record that, it would be hard to fix the counts, no? {quote} There will be 5 facet values (0, 2, 4, 6 and 8), as only the even documents (i % 10) are hits. There is a REAL small chance that one of the five values will be entirely missed when sampling. But is that 0.8 (chance not to take a value) ^ 2000 * 5 (any can be missing) ~ 10^-193, so that is probably not going to happen {quote} Ahh thanks, I missed that. I agree it's very improbable that one of the values is missing, but if we can avoid that at all it's better. First, it's not one of the values, we could be missing even 2, right -- it really depends on randomness. I find this assert just redundant -- if we always expect 5, we shouldn't assert that we received 5. If we say that very infrequently we might get <5 and we're OK with it .. what's the point of asserting that at all? bq. I renamed the sampleThreshold to sampleSize. It currently picks a samplingRatio that will reduce the number of hits to the sampleSize, if the number of hits is greater. It looks like it hasn't changed? I mean besides the rename. So if I set sampleSize=100K, it's 100K whether there are 101K docs or 100M docs, right? Is that your intention? was (Author: shaie): bq. but any facet accumulation which would rely on document scores would be hit by the second as the scores That's a great point Gilad. We need a test which covers that with the random sampling collector. bq. Is there a reason to add more randomness to one test? It depends. I have a problem with numDocs=10,000 and percents being 10% .. it creates too-perfect numbers, if you know what I mean. I prefer a random number of documents to add some spice to the test. Since we're testing a random sampler, I don't think it makes sense to test it with a fixed seed (0xdeadbeef) ... this collector is all about randomness, so we should stress the randomness done there. Given our test framework, randomness is not a big deal at all, since once we get a test failure, we can deterministically reproduce the failure (when there is no multi-threading). So I say YES, in this test I think we should have randomness. But e.g. when you add a test which ensures the collector works well w/ sampled docs and scores, I don't think you should add randomness -- it's OK to test it once. Also, in terms of test coverage, there are other cases which I think would be good if they were tested: * Docs + Scores (discussed above) * Multi-segment indexes (ensuring we work well there) * Different number of hits per-segment (to make sure our sampling on tiny segments works well too) * ... I wouldn't for example use RandomIndexWriter because we're only testing search. If we want many segments, we should commit/nrt-open every few segments, disable the merge policy etc. These can
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922896#comment-13922896 ] ASF subversion and git services commented on LUCENE-5493: - Commit 1575008 from [~rcmuir] in branch 'dev/branches/lucene5493' [ https://svn.apache.org/r1575008 ] LUCENE-5493: javadocs cleanups Rename Sorter, NumericDocValuesSorter, and fix javadocs --- Key: LUCENE-5493 URL: https://issues.apache.org/jira/browse/LUCENE-5493 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-5493-poc.patch, LUCENE-5493.patch Its not clear to users that these are for this super-expert thing of pre-sorting the index. From the names and documentation they think they should use them instead of Sort/SortField. These need to be renamed or, even better, the API fixed so they aren't public classes. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5493) Rename Sorter, NumericDocValuesSorter, and fix javadocs
[ https://issues.apache.org/jira/browse/LUCENE-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922918#comment-13922918 ] ASF subversion and git services commented on LUCENE-5493: - Commit 1575017 from [~rcmuir] in branch 'dev/branches/lucene5493' [ https://svn.apache.org/r1575017 ] LUCENE-5493: add missing experimental tag Rename Sorter, NumericDocValuesSorter, and fix javadocs --- Key: LUCENE-5493 URL: https://issues.apache.org/jira/browse/LUCENE-5493 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-5493-poc.patch, LUCENE-5493.patch Its not clear to users that these are for this super-expert thing of pre-sorting the index. From the names and documentation they think they should use them instead of Sort/SortField. These need to be renamed or, even better, the API fixed so they aren't public classes. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [JENKINS] Lucene-Solr-SmokeRelease-trunk - Build # 163 - Failure
On Mar 6, 2014, at 1:24 PM, Michael McCandless luc...@mikemccandless.com wrote: Should we stop running solr tests in the smoke tester? I think the current best bet if people insist on running Solr tests in the smoke tester is to do it with -Dtests.slow=false. - Mark - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5495) Boolean Filter does not handle FilterClauses with only bits() implemented
[ https://issues.apache.org/jira/browse/LUCENE-5495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922940#comment-13922940 ] Lei Wang commented on LUCENE-5495: --

+ public BitsDocIdSet(Bits bits, int length) {
+   this.bits = bits;
+   this.length = length;
+ }

We can assert bits is not a DocIdSet here. If it is, this adds overhead only.

+ final List<Bits> mustBitsList = new ArrayList<Bits>();
+ final List<Bits> mustNotBitsList = new ArrayList<Bits>();

May need a SHOULD list also?

+ if (bits != null) {
+   mustNotBitsList.add(bits);
+ }

If bits is already a FixedBitSet or OpenBitSet, merging them into res might be faster? Same for the other lists.

Boolean Filter does not handle FilterClauses with only bits() implemented - Key: LUCENE-5495 URL: https://issues.apache.org/jira/browse/LUCENE-5495 Project: Lucene - Core Issue Type: Bug Components: core/search Affects Versions: 4.6.1 Reporter: John Wang Attachments: LUCENE-5495.patch Some Filter implementations produce DocIdSets without the iterator() implementation, such as o.a.l.facet.range.Range.getFilter(). Currently, such filters cannot be added to a BooleanFilter, because BooleanFilter expects all FilterClauses to hold Filters that have iterator() implemented. This patch improves the behavior by taking Filters with bits() implemented and treating them separately. This would be faster in the case of Filters with a forward index as the underlying data structure, where there is no need to scan the index to build an iterator. See the attached unit test, which fails without this patch. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-5495) Boolean Filter does not handle FilterClauses with only bits() implemented
[ https://issues.apache.org/jira/browse/LUCENE-5495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922940#comment-13922940 ] Lei Wang edited comment on LUCENE-5495 at 3/6/14 7:35 PM: --

{noformat}
+ public BitsDocIdSet(Bits bits, int length) {
+   this.bits = bits;
+   this.length = length;
+ }
{noformat}

We can assert bits is not a DocIdSet here. If it is, this adds overhead only.

{noformat}
+ final List<Bits> mustBitsList = new ArrayList<Bits>();
+ final List<Bits> mustNotBitsList = new ArrayList<Bits>();
{noformat}

May need a SHOULD list also?

{noformat}
+ if (bits != null) {
+   mustNotBitsList.add(bits);
+ }
{noformat}

If bits is already a FixedBitSet or OpenBitSet, merging them into res might be faster? Same for the other lists.

was (Author: wonlay):
+ public BitsDocIdSet(Bits bits, int length) {
+   this.bits = bits;
+   this.length = length;
+ }
We can assert bits is not a DocIdSet here. If it is, this adds overhead only.
+ final List<Bits> mustBitsList = new ArrayList<Bits>();
+ final List<Bits> mustNotBitsList = new ArrayList<Bits>();
May need a SHOULD list also?
+ if (bits != null) {
+   mustNotBitsList.add(bits);
+ }
If bits is already a FixedBitSet or OpenBitSet, merging them into res might be faster? Same for the other lists.

Boolean Filter does not handle FilterClauses with only bits() implemented - Key: LUCENE-5495 URL: https://issues.apache.org/jira/browse/LUCENE-5495 Project: Lucene - Core Issue Type: Bug Components: core/search Affects Versions: 4.6.1 Reporter: John Wang Attachments: LUCENE-5495.patch Some Filter implementations produce DocIdSets without the iterator() implementation, such as o.a.l.facet.range.Range.getFilter(). Currently, such filters cannot be added to a BooleanFilter, because BooleanFilter expects all FilterClauses to hold Filters that have iterator() implemented. This patch improves the behavior by taking Filters with bits() implemented and treating them separately. This would be faster in the case of Filters with a forward index as the underlying data structure, where there is no need to scan the index to build an iterator. See the attached unit test, which fails without this patch. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
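For readers without the patch in front of them, a minimal self-contained version of a bits()-only DocIdSet like the patch's BitsDocIdSet might look as follows. Only the constructor signature is quoted from the patch; the rest is a sketch against the 4.x DocIdSet API.

{code}
import java.io.IOException;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.Bits;

// Wraps random-access Bits as a DocIdSet that exposes no iterator, mirroring
// filters built on forward indexes such as o.a.l.facet.range.Range.getFilter().
final class BitsDocIdSet extends DocIdSet {
  private final Bits bits;
  private final int length;

  public BitsDocIdSet(Bits bits, int length) {
    this.bits = bits;
    this.length = length;
  }

  @Override
  public Bits bits() throws IOException {
    return bits; // random access is supported...
  }

  @Override
  public DocIdSetIterator iterator() throws IOException {
    // ...but iteration is not -- exactly the case BooleanFilter can't handle today.
    throw new UnsupportedOperationException("access this DocIdSet via bits() only");
  }
}
{code}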
Re: [JENKINS] Lucene-Solr-SmokeRelease-trunk - Build # 163 - Failure
On Thu, Mar 6, 2014 at 2:33 PM, Mark Miller markrmil...@gmail.com wrote: On Mar 6, 2014, at 1:24 PM, Michael McCandless luc...@mikemccandless.com wrote: Should we stop running solr tests in the smoke tester? I think the current best bet if people insist on running Solr tests in the smoke tester is to do it with -Dtests.slow=false. I was the one that added it. I don't insist on running them in the smokeTester, I just felt like it was the right thing to do. Do you think we should turn them off? - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5596) OverseerTest.testOverseerFailure - leader node already exists.
[ https://issues.apache.org/jira/browse/SOLR-5596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922947#comment-13922947 ] Mark Miller commented on SOLR-5596: --- So we still hit this - pretty surprising. I've gone over the test a couple times and have not spotted the problem yet, but I think it must be an issue with the test. OverseerTest.testOverseerFailure - leader node already exists. -- Key: SOLR-5596 URL: https://issues.apache.org/jira/browse/SOLR-5596 Project: Solr Issue Type: Bug Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.8, 5.0 Seeing this a bunch on jenkins - previous leader ephemeral node is still around for some reason. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [JENKINS] Lucene-Solr-SmokeRelease-trunk - Build # 163 - Failure
On Mar 6, 2014, at 2:36 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Mar 6, 2014 at 2:33 PM, Mark Miller markrmil...@gmail.com wrote: On Mar 6, 2014, at 1:24 PM, Michael McCandless luc...@mikemccandless.com wrote: Should we stop running solr tests in the smoke tester? I think the current best bet if people insist on running Solr tests in the smoke tester is to do it with -Dtests.slow=false. I was the one that added it. I don't insist on running them in the smokeTester, I just felt like it was the right thing to do. Do you think we should turn them off? Like I said, I think if we want to run them currently, we should do it with -Dtests.slow=false. I do think it would be nice to be able to run them all, but I think step one is probably going from no tests to -Dtests.slow=false. With a little effort from other Solr devs, we are not too far off from being able to do the whole suite at this point - there has been a bunch of progress from a variety of sources over the past few weeks. - Mark http://about.me/markrmiller - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [JENKINS] Lucene-Solr-SmokeRelease-trunk - Build # 163 - Failure
I think it's a good step. Mainly I hate not running them at all like before. At least this way we are testing, even if some are disabled. I'll fix it. On Thu, Mar 6, 2014 at 2:45 PM, Mark Miller markrmil...@gmail.com wrote: On Mar 6, 2014, at 2:36 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Mar 6, 2014 at 2:33 PM, Mark Miller markrmil...@gmail.com wrote: On Mar 6, 2014, at 1:24 PM, Michael McCandless luc...@mikemccandless.com wrote: Should we stop running solr tests in the smoke tester? I think the current best bet if people insist on running Solr tests in the smoke tester is to do it with -Dtests.slow=false. I was the one that added it. I don't insist on running them in the smokeTester, I just felt like it was the right thing to do. Do you think we should turn them off? Like I said, I think if we want to run them currently, we should do it with -Dtests.slow=false. I do think it would be nice to be able to run them all, but I think step one is probably going from no tests to -Dtests.slow=false. With a little effort from other Solr devs, we are not too far off from being able to do the whole suite at this point - there has been a bunch of progress from a variety of sources over the past few weeks. - Mark http://about.me/markrmiller - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Dataimport handler
I'm using the DataImportHandler to index data from a MySQL DB. Been running it just fine with full-imports. I'm now trying to implement the delta-import functionality. To implement the delta query, you need to read the last_index_time from a properties file to know what new data to index. So I'm using the parameter ${dataimporter.last_index_time} within my query. The problem is that when I use this, the date is always: Thu Jan 01 00:00:00 UTC 1970. It never actually reads the correct date stored in the dataimport.properties file, so my delta query does not work. Has anybody seen this issue? Seems like it's always using the beginning date for epoch, or unix timestamp 0. --Pritesh P.S. If you want to see the delta query, see below.

deltaQuery=SELECT node.nid from node where node.type = 'news' and node.status = 1 and (node.changed > UNIX_TIMESTAMP('${dataimporter.last_index_time}') or node.created > UNIX_TIMESTAMP('${dataimporter.last_index_time}'))

deltaImportQuery=SELECT node.nid, node.vid, node.type, node.language, node.title, node.uid, node.status, FROM_UNIXTIME(node.created,'%Y-%m-%dT%TZ') as created, FROM_UNIXTIME(node.changed,'%Y-%m-%dT%TZ') as changed, node.comment, node.promote, node.moderate, node.sticky, node.tnid, node.translate, content_type_news.field_image_credit_value, content_type_news.field_image_caption_value, content_type_news.field_subhead_value, content_type_news.field_author_value, content_type_news.field_dateline_value, content_type_news.field_article_image_fid, content_type_news.field_article_image_list, content_type_news.field_article_image_data, content_type_news.field_news_blurb_value, content_type_news.field_news_blurb_format, content_type_news.field_news_syndicate_value, content_type_news.field_news_video_reference_nid, content_type_news.field_news_inline_location_value, content_type_news.field_article_contributor_nid, content_type_news.field_news_title_value, page_title.page_title FROM node LEFT JOIN content_type_news ON node.nid = content_type_news.nid LEFT JOIN page_title ON node.nid = page_title.id where node.type = 'news' and node.status = 1 and node.nid = '${deltaimport.delta.nid}'
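For comparison, a healthy dataimport.properties written by DIH normally looks something like this (the timestamps and the entity-prefixed key are illustrative):

    #Thu Mar 06 19:00:00 UTC 2014
    last_index_time=2014-03-06 19\:00\:00
    news.last_index_time=2014-03-06 19\:00\:00

If the file can't be found or read, ${dataimporter.last_index_time} falls back to the epoch default, which matches the Thu Jan 01 00:00:00 UTC 1970 you're seeing, so it's worth checking that dataimport.properties exists in the core's conf directory and is writable by the servlet container.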
[jira] [Commented] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922974#comment-13922974 ] Robert Muir commented on LUCENE-5205: - Sorry Tim! I'll try to get to this today. [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser --- Key: LUCENE-5205 URL: https://issues.apache.org/jira/browse/LUCENE-5205 Project: Lucene - Core Issue Type: Improvement Components: core/queryparser Reporter: Tim Allison Labels: patch Fix For: 4.7 Attachments: LUCENE-5205-cleanup-tests.patch, LUCENE-5205-date-pkg-prvt.patch, LUCENE-5205.patch.gz, LUCENE-5205.patch.gz, LUCENE-5205_dateTestReInitPkgPrvt.patch, LUCENE-5205_smallTestMods.patch, LUCENE_5205.patch, SpanQueryParser_v1.patch.gz, patch.txt This parser extends QueryParserBase and includes functionality from: * Classic QueryParser: most of its syntax * SurroundQueryParser: recursive parsing for near and not clauses. * ComplexPhraseQueryParser: can handle near queries that include multiterms (wildcard, fuzzy, regex, prefix), * AnalyzingQueryParser: has an option to analyze multiterms. At a high level, there's a first pass BooleanQuery/field parser and then a span query parser handles all terminal nodes and phrases. Same as classic syntax: * term: test * fuzzy: roam~0.8, roam~2 * wildcard: te?t, test*, t*st * regex: /\[mb\]oat/ * phrase: jakarta apache * phrase with slop: jakarta apache~3 * default or clause: jakarta apache * grouping or clause: (jakarta apache) * boolean and +/-: (lucene OR apache) NOT jakarta; +lucene +apache -jakarta * multiple fields: title:lucene author:hatcher Main additions in SpanQueryParser syntax vs. classic syntax: * Can require in order for phrases with slop with the \~ operator: jakarta apache\~3 * Can specify not near: fever bieber!\~3,10 :: find fever but not if bieber appears within 3 words before or 10 words after it. * Fully recursive phrasal queries with \[ and \]; as in: \[\[jakarta apache\]~3 lucene\]\~4 :: find jakarta within 3 words of apache, and that hit has to be within four words before lucene * Can also use \[\] for single level phrasal queries instead of as in: \[jakarta apache\] * Can use or grouping clauses in phrasal queries: apache (lucene solr)\~3 :: find apache and then either lucene or solr within three words. * Can use multiterms in phrasal queries: jakarta\~1 ap*che\~2 * Did I mention full recursion: \[\[jakarta\~1 ap*che\]\~2 (solr~ /l\[ou\]\+\[cs\]\[en\]\+/)]\~10 :: Find something like jakarta within two words of ap*che and that hit has to be within ten words of something like solr or that lucene regex. * Can require at least x number of hits at boolean level: apache AND (lucene solr tika)~2 * Can use negative only query: -jakarta :: Find all docs that don't contain jakarta * Can use an edit distance 2 for fuzzy query via SlowFuzzyQuery (beware of potential performance issues!). Trivial additions: * Can specify prefix length in fuzzy queries: jakarta~1,2 (edit distance =1, prefix =2) * Can specifiy Optimal String Alignment (OSA) vs Levenshtein for distance =2: (jakarta~1 (OSA) vs jakarta~1(Levenshtein) This parser can be very useful for concordance tasks (see also LUCENE-5317 and LUCENE-5318) and for analytical search. Until LUCENE-2878 is closed, this might have a use for fans of SpanQuery. Most of the documentation is in the javadoc for SpanQueryParser. Any and all feedback is welcome. Thank you. 
-- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-5496) Nuke fuzzyMinSim and replace with maxEdits for FuzzyQuery and its friends
Tim Allison created LUCENE-5496: --- Summary: Nuke fuzzyMinSim and replace with maxEdits for FuzzyQuery and its friends Key: LUCENE-5496 URL: https://issues.apache.org/jira/browse/LUCENE-5496 Project: Lucene - Core Issue Type: Task Components: core/queryparser, core/search Affects Versions: 4.8, 5.0 Reporter: Tim Allison Priority: Minor As we get closer to 5.0, I propose adding some deprecations in the queryparsers realm of 4.x. Are we ready to get rid of all fuzzyMinSims in trunk? -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5205) [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser
[ https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922978#comment-13922978 ] Tim Allison commented on LUCENE-5205: - You've had far bigger fish to fry...np at all! [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to classic QueryParser --- Key: LUCENE-5205 URL: https://issues.apache.org/jira/browse/LUCENE-5205 Project: Lucene - Core Issue Type: Improvement Components: core/queryparser Reporter: Tim Allison Labels: patch Fix For: 4.7 Attachments: LUCENE-5205-cleanup-tests.patch, LUCENE-5205-date-pkg-prvt.patch, LUCENE-5205.patch.gz, LUCENE-5205.patch.gz, LUCENE-5205_dateTestReInitPkgPrvt.patch, LUCENE-5205_smallTestMods.patch, LUCENE_5205.patch, SpanQueryParser_v1.patch.gz, patch.txt This parser extends QueryParserBase and includes functionality from: * Classic QueryParser: most of its syntax * SurroundQueryParser: recursive parsing for near and not clauses. * ComplexPhraseQueryParser: can handle near queries that include multiterms (wildcard, fuzzy, regex, prefix), * AnalyzingQueryParser: has an option to analyze multiterms. At a high level, there's a first pass BooleanQuery/field parser and then a span query parser handles all terminal nodes and phrases. Same as classic syntax: * term: test * fuzzy: roam~0.8, roam~2 * wildcard: te?t, test*, t*st * regex: /\[mb\]oat/ * phrase: jakarta apache * phrase with slop: jakarta apache~3 * default or clause: jakarta apache * grouping or clause: (jakarta apache) * boolean and +/-: (lucene OR apache) NOT jakarta; +lucene +apache -jakarta * multiple fields: title:lucene author:hatcher Main additions in SpanQueryParser syntax vs. classic syntax: * Can require in order for phrases with slop with the \~ operator: jakarta apache\~3 * Can specify not near: fever bieber!\~3,10 :: find fever but not if bieber appears within 3 words before or 10 words after it. * Fully recursive phrasal queries with \[ and \]; as in: \[\[jakarta apache\]~3 lucene\]\~4 :: find jakarta within 3 words of apache, and that hit has to be within four words before lucene * Can also use \[\] for single level phrasal queries instead of as in: \[jakarta apache\] * Can use or grouping clauses in phrasal queries: apache (lucene solr)\~3 :: find apache and then either lucene or solr within three words. * Can use multiterms in phrasal queries: jakarta\~1 ap*che\~2 * Did I mention full recursion: \[\[jakarta\~1 ap*che\]\~2 (solr~ /l\[ou\]\+\[cs\]\[en\]\+/)]\~10 :: Find something like jakarta within two words of ap*che and that hit has to be within ten words of something like solr or that lucene regex. * Can require at least x number of hits at boolean level: apache AND (lucene solr tika)~2 * Can use negative only query: -jakarta :: Find all docs that don't contain jakarta * Can use an edit distance 2 for fuzzy query via SlowFuzzyQuery (beware of potential performance issues!). Trivial additions: * Can specify prefix length in fuzzy queries: jakarta~1,2 (edit distance =1, prefix =2) * Can specifiy Optimal String Alignment (OSA) vs Levenshtein for distance =2: (jakarta~1 (OSA) vs jakarta~1(Levenshtein) This parser can be very useful for concordance tasks (see also LUCENE-5317 and LUCENE-5318) and for analytical search. Until LUCENE-2878 is closed, this might have a use for fans of SpanQuery. Most of the documentation is in the javadoc for SpanQueryParser. Any and all feedback is welcome. Thank you. 
[jira] [Updated] (LUCENE-5496) Nuke fuzzyMinSim and replace with maxEdits for FuzzyQuery and its friends
[ https://issues.apache.org/jira/browse/LUCENE-5496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated LUCENE-5496:
--------------------------------
Attachment: LUCENE-5496_4x_deprecations.patch

Deprecations for 4x. This doesn't touch:
* SlowFuzzyQuery -- deprecated anyway
* FuzzyTermsEnum -- on the theory that if you're extending this, you know what's coming
* EDismax in Solr -- ditto

I'm not even sure we need these added deprecations in 4x, but I'm attaching this in case the community would like to add them.

Nuke fuzzyMinSim and replace with maxEdits for FuzzyQuery and its friends
-------------------------------------------------------------------------

Key: LUCENE-5496
URL: https://issues.apache.org/jira/browse/LUCENE-5496
Project: Lucene - Core
Issue Type: Task
Components: core/queryparser, core/search
Affects Versions: 4.8, 5.0
Reporter: Tim Allison
Priority: Minor
Attachments: LUCENE-5496_4x_deprecations.patch

As we get closer to 5.0, I propose adding some deprecations in the queryparsers realm of 4.x. Are we ready to get rid of all fuzzyMinSims in trunk?
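As a point of reference, the edit-distance-based constructor already exists on 4.x, so the deprecation mostly steers users away from float similarities; a minimal sketch using only the stock FuzzyQuery API:

{noformat}
// Sketch: expressing fuzziness directly as maxEdits rather than as a
// float similarity (fuzzyMinSim) that has to be mapped to an edit distance.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;

public class MaxEditsSketch {
    public static void main(String[] args) {
        Term term = new Term("title", "jakarta");
        // Old world: a float similarity, e.g. jakarta~0.8 in parser syntax.
        // New world: say what you mean -- at most 1 edit, via the existing
        // FuzzyQuery(Term, int maxEdits) constructor.
        FuzzyQuery q = new FuzzyQuery(term, 1);
        System.out.println(q);
    }
}
{noformat}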
[jira] [Commented] (SOLR-5477) Async execution of OverseerCollectionProcessor tasks
[ https://issues.apache.org/jira/browse/SOLR-5477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922994#comment-13922994 ]

Anshum Gupta commented on SOLR-5477:
------------------------------------

[~markrmil...@gmail.com] I haven't added anything for SolrJ, so for now it doesn't really support async calls. I am assuming that by "collection API SolrJ calls" you mean methods like CollectionAdminRequest.createCollection(). Also, I'm working on adding some stress tests, i.e. something that fires multiple async requests.

Async execution of OverseerCollectionProcessor tasks
----------------------------------------------------

Key: SOLR-5477
URL: https://issues.apache.org/jira/browse/SOLR-5477
Project: Solr
Issue Type: Sub-task
Components: SolrCloud
Reporter: Noble Paul
Assignee: Anshum Gupta
Attachments: SOLR-5477-CoreAdminStatus.patch, SOLR-5477-updated.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch, SOLR-5477.patch

Typical collection admin commands are long-running, and it is very common to have the requests time out. It is more of a problem if the cluster is very large. Add an option to run these commands asynchronously, as follows (see the sketch after this list):
* Add an extra param async=true for all collection commands.
* The task is written to ZK and the caller is returned a task id.
* A separate collection admin command will be added to poll the status of the task: command=status&id=7657668909. If no id is passed, all running async tasks should be listed.
* A separate queue is created to store in-process tasks. After a task is completed, its queue entry is removed.
* OverseerCollectionProcessor will perform these tasks in multiple threads.
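Since SolrJ support isn't wired up yet, exercising the async path presumably means raw HTTP for now; a hedged sketch using only the parameter names from the description above (the endpoint shape and the status command's final spelling are assumptions and may differ in the committed patch):

{noformat}
// Sketch of driving the proposed async collection API over plain HTTP.
// async=true and command=status&id=... come from the issue description;
// treat both as proposals, not the final API.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class AsyncCollectionSketch {
    public static void main(String[] args) throws Exception {
        // Fire the long-running command; the response should carry a task id.
        System.out.println(get("http://localhost:8983/solr/admin/collections"
                + "?action=CREATE&name=test&numShards=2&async=true"));
        // Poll the task (7657668909 is the example id from the description).
        System.out.println(get("http://localhost:8983/solr/admin/collections"
                + "?command=status&id=7657668909"));
    }

    static String get(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        StringBuilder sb = new StringBuilder();
        for (String line; (line = in.readLine()) != null; ) {
            sb.append(line).append('\n');
        }
        in.close();
        return sb.toString();
    }
}
{noformat}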
[jira] [Commented] (SOLR-5822) ChaosMonkeyNothingIsSafeTest.testDistribSearch fail, shard inconsistency, off by 1
[ https://issues.apache.org/jira/browse/SOLR-5822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923001#comment-13923001 ]

Mark Miller commented on SOLR-5822:
-----------------------------------

{noformat}
[exec] [junit4] 2> 1114710 T4403 C687 P55523 oasup.LogUpdateProcessor.finish [collection1] webapp= path=/update params={wt=javabin&version=2&CONTROL=TRUE} {add=[0-672 (1461849649443766272)]} 0 1
[exec] [junit4] 2> 1114718 T4530 C692 P29636 oasup.LogUpdateProcessor.finish [collection1] webapp= path=/update params={wt=javabin&version=2&distrib.from=https://127.0.0.1:10132/collection1/&update.distrib=FROMLEADER} {add=[0-672 (1461849649447960576)]} 0 1
[exec] [junit4] 2> 1114722 T4838 C693 P17000 oasup.LogUpdateProcessor.finish [collection1] webapp= path=/update params={wt=javabin&version=2&distrib.from=https://127.0.0.1:10132/collection1/&update.distrib=FROMLEADER} {add=[0-672 (1461849649447960576)]} 0 5
[exec] [junit4] 2> 1114723 T4483 C686 P10132 oasup.LogUpdateProcessor.finish [collection1] webapp= path=/update params={wt=javabin&version=2} {add=[0-672 (1461849649447960576)]} 0 9
[exec] [junit4] 2> 1117236 T4403 C694 P55523 oasup.LogUpdateProcessor.finish [collection1] webapp= path=/update params={wt=javabin&version=2&CONTROL=TRUE} {delete=[0-672 (-1461849652091420672)]} 0 1
[exec] [junit4] 2> 1117242 T4530 C695 P29636 oasup.LogUpdateProcessor.finish [collection1] webapp= path=/update params={wt=javabin&version=2&distrib.from=https://127.0.0.1:10132/collection1/&update.distrib=FROMLEADER} {delete=[0-672 (-1461849652095614976)]} 0 0
[exec] [junit4] 2> 1123567 T4867 C702 P17000 oasup.LogUpdateProcessor.finish [collection1] webapp= path=/update params={wt=javabin&version=2&distrib.from=https://127.0.0.1:10132/collection1/&update.distrib=FROMLEADER} {delete=[0-672 (-1461849652095614976)]} 0 1
[exec] [junit4] 2> 1123568 T4483 C703 P10132 oasup.LogUpdateProcessor.finish [collection1] webapp= path=/update params={wt=javabin&version=2} {delete=[0-672 (-1461849652095614976)]} 0 6329
[exec] [junit4] 2> ## Only in https://127.0.0.1:17000/collection1: [{id=0-672, _version_=1461849649447960576}]
{noformat}

ChaosMonkeyNothingIsSafeTest.testDistribSearch fail, shard inconsistency, off by 1
----------------------------------------------------------------------------------

Key: SOLR-5822
URL: https://issues.apache.org/jira/browse/SOLR-5822
Project: Solr
Issue Type: Bug
Components: SolrCloud
Reporter: Mark Miller
Assignee: Mark Miller
Fix For: 4.8, 5.0

ChaosMonkeyNothingIsSafeTest.testDistribSearch

[exec] [junit4] Throwable #1: java.lang.AssertionError: shard2 is not consistent. Got 300 from https://127.0.0.1:17000/collection1 lastClient and got 299 from https://127.0.0.1:10132/collection1
[jira] [Created] (SOLR-5822) ChaosMonkeyNothingIsSafeTest.testDistribSearch fail, shard inconsistency, off by 1
Mark Miller created SOLR-5822:
------------------------------

Summary: ChaosMonkeyNothingIsSafeTest.testDistribSearch fail, shard inconsistency, off by 1
Key: SOLR-5822
URL: https://issues.apache.org/jira/browse/SOLR-5822
Project: Solr
Issue Type: Bug
Components: SolrCloud
Reporter: Mark Miller
Assignee: Mark Miller
Fix For: 4.8, 5.0

ChaosMonkeyNothingIsSafeTest.testDistribSearch

[exec] [junit4] Throwable #1: java.lang.AssertionError: shard2 is not consistent. Got 300 from https://127.0.0.1:17000/collection1 lastClient and got 299 from https://127.0.0.1:10132/collection1
[jira] [Commented] (SOLR-5822) ChaosMonkeyNothingIsSafeTest.testDistribSearch fail, shard inconsistency, off by 1
[ https://issues.apache.org/jira/browse/SOLR-5822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923004#comment-13923004 ]

Mark Miller commented on SOLR-5822:
-----------------------------------

Looks like it's finding 0-672 on P17000, even though it looks like the delete of 0-672 on P17000 was received fine.

ChaosMonkeyNothingIsSafeTest.testDistribSearch fail, shard inconsistency, off by 1
----------------------------------------------------------------------------------

Key: SOLR-5822
URL: https://issues.apache.org/jira/browse/SOLR-5822
Project: Solr
Issue Type: Bug
Components: SolrCloud
Reporter: Mark Miller
Assignee: Mark Miller
Fix For: 4.8, 5.0

ChaosMonkeyNothingIsSafeTest.testDistribSearch

[exec] [junit4] Throwable #1: java.lang.AssertionError: shard2 is not consistent. Got 300 from https://127.0.0.1:17000/collection1 lastClient and got 299 from https://127.0.0.1:10132/collection1
[jira] [Updated] (SOLR-5822) ChaosMonkeyNothingIsSafeTest.testDistribSearch fail, shard inconsistency, off by 1
[ https://issues.apache.org/jira/browse/SOLR-5822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated SOLR-5822:
------------------------------
Attachment: solr.logs

ChaosMonkeyNothingIsSafeTest.testDistribSearch fail, shard inconsistency, off by 1
----------------------------------------------------------------------------------

Key: SOLR-5822
URL: https://issues.apache.org/jira/browse/SOLR-5822
Project: Solr
Issue Type: Bug
Components: SolrCloud
Reporter: Mark Miller
Assignee: Mark Miller
Fix For: 4.8, 5.0
Attachments: solr.logs

ChaosMonkeyNothingIsSafeTest.testDistribSearch

[exec] [junit4] Throwable #1: java.lang.AssertionError: shard2 is not consistent. Got 300 from https://127.0.0.1:17000/collection1 lastClient and got 299 from https://127.0.0.1:10132/collection1
[jira] [Commented] (LUCENE-5487) Can we separate top scorer from sub scorer?
[ https://issues.apache.org/jira/browse/LUCENE-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923119#comment-13923119 ]

ASF subversion and git services commented on LUCENE-5487:
----------------------------------------------------------

Commit 1575057 from [~mikemccand] in branch 'dev/branches/lucene5487' [ https://svn.apache.org/r1575057 ]

LUCENE-5487: add TopScorers to FilteredQuery too; fix Solr; resolve all nocommits

Can we separate top scorer from sub scorer?
-------------------------------------------

Key: LUCENE-5487
URL: https://issues.apache.org/jira/browse/LUCENE-5487
Project: Lucene - Core
Issue Type: Improvement
Components: core/search
Reporter: Michael McCandless
Assignee: Michael McCandless
Attachments: LUCENE-5487.patch, LUCENE-5487.patch

This is just an exploratory patch ... still many nocommits, but I think it may be promising. I find the two booleans we pass to Weight.scorer confusing, because they really only apply to whoever will call score(Collector) (just IndexSearcher and BooleanScorer). The params are pointless for the vast majority of scorers, because very, very few query scorers really need to change how top-scoring is done, and those scorers can *only* score top-level (they throw UOE from nextDoc/advance). It seems like these two types of scorers should be separately typed.
[jira] [Commented] (LUCENE-5487) Can we separate top scorer from sub scorer?
[ https://issues.apache.org/jira/browse/LUCENE-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923134#comment-13923134 ]

Michael McCandless commented on LUCENE-5487:
--------------------------------------------

Rob suggested BulkScorer as a better name than TopScorer ... I like it ... I'll rename it.

Can we separate top scorer from sub scorer?
-------------------------------------------

Key: LUCENE-5487
URL: https://issues.apache.org/jira/browse/LUCENE-5487
Project: Lucene - Core
Issue Type: Improvement
Components: core/search
Reporter: Michael McCandless
Assignee: Michael McCandless
Attachments: LUCENE-5487.patch, LUCENE-5487.patch

This is just an exploratory patch ... still many nocommits, but I think it may be promising. I find the two booleans we pass to Weight.scorer confusing, because they really only apply to whoever will call score(Collector) (just IndexSearcher and BooleanScorer). The params are pointless for the vast majority of scorers, because very, very few query scorers really need to change how top-scoring is done, and those scorers can *only* score top-level (they throw UOE from nextDoc/advance). It seems like these two types of scorers should be separately typed.
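For readers following along, the shape of the proposed split might look roughly like the sketch below; the class and method names track the discussion here, but the real signatures are in the attached patches, so treat this as illustrative:

{noformat}
// Rough sketch of separating bulk (top-level) scoring from sub-scoring.
// Names and signatures are assumptions based on the comments above.
import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// A scorer that only scores a whole segment into a Collector; it has no
// nextDoc()/advance() contract, so "top-level only" scorers no longer
// need to throw UnsupportedOperationException from those methods.
abstract class BulkScorerSketch {
    public abstract void score(Collector collector) throws IOException;
}

// Weight would then expose the two roles separately, instead of threading
// scoreDocsInOrder/topScorer booleans through a single scorer() method.
abstract class WeightSketch {
    public abstract Scorer scorer(AtomicReaderContext context) throws IOException;
    public abstract BulkScorerSketch bulkScorer(AtomicReaderContext context) throws IOException;
}
{noformat}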
Stalled unit tests
I'm sure that I'm just missing something obvious, but I'm having trouble getting the unit tests to run to completion on my laptop and was hoping that someone would be kind enough to point me in the right direction.

I've cloned the repository from GitHub (http://git.apache.org/lucene-solr.git) and checked out the latest commit on branch_4x:

commit 6e06247cec1410f32592bfd307c1020b814def06
Author: Robert Muir rm...@apache.org
Date: Thu Mar 6 19:54:07 2014 +0000

    disable slow solr tests in smoketester

    git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x@1575025 13f79535-47bb-0310-9956-ffa450edef68

Executing "ant clean test" from the top-level directory of the project shows the tests running, but they seem to get stuck in a loop with some stalled-heartbeat messages. If I run the tests directly from lucene/ then they complete successfully after about 10 minutes.

I'm using Java 6 under OS X (10.9.2):

$ java -version
java version "1.6.0_65"
Java(TM) SE Runtime Environment (build 1.6.0_65-b14-462-11M4609)
Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-462, mixed mode)

My terminal lists repeating stalled-heartbeat messages like so:

HEARTBEAT J2 PID(20104@onyx.local): 2014-03-06T16:53:35, stalled for 2111s at: HdfsLockFactoryTest.testBasic
HEARTBEAT J0 PID(20106@onyx.local): 2014-03-06T16:53:47, stalled for 2108s at: TestSurroundQueryParser.testQueryParser
HEARTBEAT J1 PID(20103@onyx.local): 2014-03-06T16:54:11, stalled for 2167s at: TestRecoveryHdfs.testBuffering
HEARTBEAT J3 PID(20105@onyx.local): 2014-03-06T16:54:23, stalled for 2165s at: HdfsDirectoryTest.testEOF

My machine does have 3 Java processes chewing CPU; see the attached jstack dumps for more information. Should I expect the tests to complete on my platform? Do I need to specify any special flags to give them more memory or to avoid any bad apples?

Thanks in advance,
--Terry

20103.jstack.txt.gz Description: GNU Zip compressed data
20104.jstack.txt.gz Description: GNU Zip compressed data
20105.jstack.txt.gz Description: GNU Zip compressed data
[jira] [Updated] (LUCENE-5488) FilteredQuery.explain does not honor FilterStrategy
[ https://issues.apache.org/jira/browse/LUCENE-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Wang updated LUCENE-5488:
------------------------------
Attachment: LUCENE-5488.patch

FilteredQuery.explain does not honor FilterStrategy
---------------------------------------------------

Key: LUCENE-5488
URL: https://issues.apache.org/jira/browse/LUCENE-5488
Project: Lucene - Core
Issue Type: Bug
Components: core/search
Affects Versions: 4.6.1
Reporter: John Wang
Attachments: LUCENE-5488.patch, LUCENE-5488.patch

Some Filter implementations produce DocIdSets without an iterator() implementation, such as o.a.l.facet.range.Range.getFilter(). This is done with the intention that they be used in conjunction with FilteredQuery, with FilterStrategy set to QUERY_FIRST_FILTER_STRATEGY, for performance reasons. However, this behavior is not honored by FilteredQuery.explain, where docidset.iterator is called regardless, causing such valid usages of the above filter types to fail. The fix is to check bits() first and fall back to iterator() if bits() is null; in that case, the input Filter is indeed bad. See the attached unit test, which fails without this patch.
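The described fix boils down to a check along these lines; this is a paraphrase of the patch's intent using only the stock DocIdSet API, not the patch itself:

{noformat}
// Sketch: prefer random-access bits() and fall back to iterator() only
// when bits() is unavailable, mirroring QUERY_FIRST_FILTER_STRATEGY.
import java.io.IOException;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.Bits;

public class ExplainMatchSketch {
    static boolean filterMatches(DocIdSet docIdSet, int doc) throws IOException {
        if (docIdSet == null) {
            return false;
        }
        Bits bits = docIdSet.bits();
        if (bits != null) {
            return bits.get(doc); // random access: no iterator() needed
        }
        DocIdSetIterator it = docIdSet.iterator();
        // Only here is a null iterator() a genuinely bad Filter.
        return it != null && it.advance(doc) == doc;
    }
}
{noformat}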