[jira] [Commented] (LUCENE-4902) Add a FilterDirectoryReader
[ https://issues.apache.org/jira/browse/LUCENE-4902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13623748#comment-13623748 ]

Adrien Grand commented on LUCENE-4902:
--------------------------------------

+1

> Add a FilterDirectoryReader
> ---
>
> Key: LUCENE-4902
> URL: https://issues.apache.org/jira/browse/LUCENE-4902
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Minor
> Attachments: LUCENE-4902.patch, LUCENE-4902.patch
>
> A FilterDirectoryReader would allow you to easily wrap all subreaders of a
> DirectoryReader with FilterAtomicReaders.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4903) Add AssertingScorer
[ https://issues.apache.org/jira/browse/LUCENE-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4903:
---------------------------------

    Attachment: LUCENE-4903.patch

Patch
 * checks for in-order scoring when applicable
 * checks score values (not INFINITY or NaN)
 * checks that Scorer.score() is not called before iteration started or after it finished
 * reuses assertions of DocsEnum on Scorer
 * makes sure that nextDoc() and advance(target) are not called directly on "top scorers" (only from score(Collector)).
 * makes more tests use LuceneTestCase.newSearcher (most of the patch size)

> Add AssertingScorer
> ---
>
> Key: LUCENE-4903
> URL: https://issues.apache.org/jira/browse/LUCENE-4903
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-4903.patch
>
> I think we would benefit from having an AssertingScorer that would assert
> that scorers are advanced correctly, return valid scores (eg. not NaN), ...
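The checks listed in this patch description can be illustrated with a small wrapper. The following is only a sketch under stated assumptions: `SimpleScorer` and `CheckingScorer` are hypothetical names, the interface is a simplified stand-in and not Lucene's `Scorer` API, and explicit exceptions are used instead of `assert` so the checks fire even when assertions are disabled.

```java
/** Hypothetical, simplified scorer contract (not Lucene's Scorer API). */
interface SimpleScorer {
    int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Moves to the next document and returns its ID, or NO_MORE_DOCS when exhausted. */
    int nextDoc();

    /** Returns the score of the current document. */
    float score();
}

/** Wraps a SimpleScorer and fails fast when the caller or the delegate misbehaves. */
final class CheckingScorer implements SimpleScorer {
    private final SimpleScorer in;
    private int doc = -1; // -1 means iteration has not started

    CheckingScorer(SimpleScorer in) {
        this.in = in;
    }

    @Override
    public int nextDoc() {
        if (doc == NO_MORE_DOCS) {
            throw new IllegalStateException("nextDoc() called after exhaustion");
        }
        int next = in.nextDoc();
        if (next <= doc) {
            throw new IllegalStateException("doc IDs must strictly increase: " + doc + " -> " + next);
        }
        doc = next;
        return doc;
    }

    @Override
    public float score() {
        if (doc == -1 || doc == NO_MORE_DOCS) {
            throw new IllegalStateException("score() called while not positioned on a document");
        }
        float s = in.score();
        if (Float.isNaN(s) || Float.isInfinite(s)) {
            throw new IllegalStateException("invalid score: " + s);
        }
        return s;
    }
}

final class CheckingScorerDemo {
    /** Builds a toy scorer over parallel arrays of doc IDs and scores. */
    static SimpleScorer over(int[] docs, float[] scores) {
        return new SimpleScorer() {
            private int i = -1;

            @Override
            public int nextDoc() {
                i++;
                return i < docs.length ? docs[i] : NO_MORE_DOCS;
            }

            @Override
            public float score() {
                return scores[i];
            }
        };
    }

    public static void main(String[] args) {
        SimpleScorer s = new CheckingScorer(over(new int[] {1, 4}, new float[] {0.5f, Float.NaN}));
        s.nextDoc();
        System.out.println("score of first doc: " + s.score());
        s.nextDoc();
        try {
            s.score(); // the delegate returns NaN here
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The wrapper only tracks the last doc ID it returned, which is enough to cover the "not before iteration started or after it finished" rule and the in-order rule in this simplified setting.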
[jira] [Commented] (SOLR-4676) Share a Lucene FieldType instance instead of creating on each call to createField()
[ https://issues.apache.org/jira/browse/SOLR-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13623762#comment-13623762 ]

Adrien Grand commented on SOLR-4676:
------------------------------------

{quote}
I agree with both of these statements. Can we remove createField() and eliminate this trap? DocumentBuilder only calls createFields() and thats... the only thing that should be calling this method?
{quote}

+1

> Share a Lucene FieldType instance instead of creating on each call to
> createField()
> ---
>
> Key: SOLR-4676
> URL: https://issues.apache.org/jira/browse/SOLR-4676
> Project: Solr
> Issue Type: Improvement
> Components: Schema and Analysis
> Reporter: David Smiley
> Assignee: David Smiley
> Priority: Minor
> Attachments: SOLR-4676_Share_Lucene_FieldType_in_SchemaField.patch
>
> I think the Lucene FieldType instances should be cached on Solr's SchemaField
> so that they don't have to be needlessly re-created for each indexed value
> that runs through Solr in SchemaField.createField(). The only obstacle I see
> to this is that getIndexOptions(field,val) takes the value, and if that value
> were to alter the logic then the FieldType can't be shared. This is a
> protected method and I don't see anything that overrides it, and the default
> implementation doesn't use the value. So I think it can be removed. Patch in
> progress...
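The caching idea in the issue description can be sketched as follows. This is an illustration only: `FieldSettings`, `IndexedValue` and `CachedSchemaField` are hypothetical stand-ins, not Solr's `SchemaField` or Lucene's `FieldType`. The sketch assumes, as the description argues, that the settings never depend on the value being indexed, which is what makes sharing safe.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical immutable per-field settings; a stand-in for a Lucene FieldType. */
final class FieldSettings {
    /** Counts constructions so the effect of sharing is observable. */
    static final AtomicInteger CREATED = new AtomicInteger();

    final boolean indexed;
    final boolean stored;

    FieldSettings(boolean indexed, boolean stored) {
        this.indexed = indexed;
        this.stored = stored;
        CREATED.incrementAndGet();
    }
}

/** One value to index, paired with the (shared) settings of its field. */
final class IndexedValue {
    final String name;
    final String value;
    final FieldSettings settings;

    IndexedValue(String name, String value, FieldSettings settings) {
        this.name = name;
        this.value = value;
        this.settings = settings;
    }
}

/** Hypothetical schema field that builds its settings once and reuses them. */
final class CachedSchemaField {
    private final String name;
    private final FieldSettings settings; // built once, shared by every created field

    CachedSchemaField(String name, boolean indexed, boolean stored) {
        this.name = name;
        this.settings = new FieldSettings(indexed, stored);
    }

    /** Creates a field for one value; the value does not influence the settings. */
    IndexedValue createField(String value) {
        return new IndexedValue(name, value, settings);
    }

    public static void main(String[] args) {
        CachedSchemaField title = new CachedSchemaField("title", true, false);
        for (int i = 0; i < 1000; i++) {
            title.createField("value " + i);
        }
        System.out.println("settings objects created: " + FieldSettings.CREATED.get());
    }
}
```

If a per-value hook such as the `getIndexOptions(field,val)` method mentioned above could change the settings, the shared instance would no longer be valid, which is why the description proposes removing that hook.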
[jira] [Updated] (LUCENE-4858) Early termination with SortingMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4858:
---------------------------------

    Attachment: LUCENE-4858.patch

Thanks Shai, this looks good! I modified your patch a bit to fix the collector constructor visibility (from protected to public) and added some documentation.

I'd like to discuss whether we should actually add the name of the Sorter class in the "sorter" property of the diagnostics. I would rather remove it so that renaming a Sorter class doesn't break compatibility, what do you think?

> Early termination with SortingMergePolicy
> ---
>
> Key: LUCENE-4858
> URL: https://issues.apache.org/jira/browse/LUCENE-4858
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: 4.3
> Attachments: LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch,
> LUCENE-4858.patch, LUCENE-4858.patch
>
> Spin-off of LUCENE-4752, see
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565
> and
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282
> When an index is sorted per-segment, queries that sort according to the index
> sort order could be early terminated.
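The early-termination idea from the issue description can be sketched without any Lucene classes. Everything below is hypothetical (`EarlyTerminationSketch`, `FirstNCollector`, `StopSegment`) and assumes that each segment is already sorted ascending by the same key the query sorts on; it is not the collector from the patch.

```java
import java.util.ArrayList;
import java.util.List;

final class EarlyTerminationSketch {
    /** Thrown to skip the rest of a segment; no stack trace because it is pure control flow. */
    static final class StopSegment extends RuntimeException {
        StopSegment() {
            super(null, null, false, false);
        }
    }

    /** Keeps the first numHits values of every segment, assuming each segment is sorted ascending. */
    static final class FirstNCollector {
        private final int numHits;
        private int collectedInSegment;
        private final List<Long> candidates = new ArrayList<>();
        int visited; // number of documents actually looked at

        FirstNCollector(int numHits) {
            if (numHits < 1) {
                throw new IllegalArgumentException("numHits must be >= 1");
            }
            this.numHits = numHits;
        }

        void startSegment() {
            collectedInSegment = 0;
        }

        void collect(long sortValue) {
            visited++;
            candidates.add(sortValue);
            if (++collectedInSegment >= numHits) {
                throw new StopSegment();
            }
        }

        /** Returns the numHits smallest collected values in ascending order. */
        List<Long> top() {
            List<Long> sorted = new ArrayList<>(candidates);
            sorted.sort(null); // natural ordering
            return new ArrayList<>(sorted.subList(0, Math.min(numHits, sorted.size())));
        }
    }

    /** Runs the collector over all segments, abandoning each one after numHits documents. */
    static FirstNCollector search(long[][] sortedSegments, int numHits) {
        FirstNCollector collector = new FirstNCollector(numHits);
        for (long[] segment : sortedSegments) {
            collector.startSegment();
            try {
                for (long value : segment) {
                    collector.collect(value);
                }
            } catch (StopSegment e) {
                // Later documents in this segment sort after the ones already collected.
            }
        }
        return collector;
    }

    public static void main(String[] args) {
        FirstNCollector c = search(new long[][] {{1, 5, 9, 12}, {2, 3, 40}}, 2);
        System.out.println(c.top() + " after visiting " + c.visited + " documents");
    }
}
```

The shortcut is only valid when the query sort matches the index sort; for any other sort, every document of every segment still has to be visited.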
[jira] [Commented] (LUCENE-4903) Add AssertingScorer
[ https://issues.apache.org/jira/browse/LUCENE-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13624419#comment-13624419 ]

Adrien Grand commented on LUCENE-4903:
--------------------------------------

The problem is that scorers are hard to track: scoring usually happens by calling Scorer.score(Collector), which itself calls Collector.setScorer(Scorer). Since the asserting scorer delegates to the wrapped one, the asserting scorer gets lost. This is why Collector.setScorer tries to get it back by using a weak hash map.

I'm not totally happy with it either and would really like to make Scorer.score(Collector) use methods from the asserting scorer directly. We can't rely on Scorer.score(Collector)'s default implementation since it relies on Scorer.nextDoc and some scorers such as BooleanScorer don't implement this method.

> Add AssertingScorer
> ---
>
> Key: LUCENE-4903
> URL: https://issues.apache.org/jira/browse/LUCENE-4903
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-4903.patch
>
> I think we would benefit from having an AssertingScorer that would assert
> that scorers are advanced correctly, return valid scores (eg. not NaN), ...
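The weak hash map trick described in this comment can be sketched in isolation. The class names below (`RawScorer`, `WrappedScorer`) and the `scoreAll` method are hypothetical and much simpler than Lucene's `Scorer` and `Collector`; the sketch only shows how a wrapper that "gets lost" during delegation can be recovered from the wrapped instance.

```java
import java.lang.ref.WeakReference;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Consumer;

/** Hypothetical stand-in for a scorer that hands itself to a callback while scoring. */
class RawScorer {
    void scoreAll(Consumer<RawScorer> setScorer) {
        setScorer.accept(this); // hands out the raw instance; a wrapper around it is not visible here
    }
}

/** Wrapper that can be found again from the instance it wraps. */
final class WrappedScorer extends RawScorer {
    // Weak keys: an entry disappears once the raw scorer is unreachable.
    // Weak values: the wrapper references its key, so a strong value would pin the entry forever.
    private static final Map<RawScorer, WeakReference<WrappedScorer>> WRAPPERS =
        Collections.synchronizedMap(new WeakHashMap<>());

    private final RawScorer in;

    private WrappedScorer(RawScorer in) {
        this.in = in;
    }

    static WrappedScorer wrap(RawScorer in) {
        WrappedScorer wrapper = new WrappedScorer(in);
        WRAPPERS.put(in, new WeakReference<>(wrapper));
        return wrapper;
    }

    /** Returns the wrapper registered for a raw scorer, or null if none is known. */
    static WrappedScorer lookup(RawScorer raw) {
        WeakReference<WrappedScorer> ref = WRAPPERS.get(raw);
        return ref == null ? null : ref.get();
    }

    @Override
    void scoreAll(Consumer<RawScorer> setScorer) {
        // The delegate passes itself to the callback, so translate it back to the wrapper.
        in.scoreAll(raw -> {
            WrappedScorer wrapper = lookup(raw);
            setScorer.accept(wrapper != null ? wrapper : raw);
        });
    }

    public static void main(String[] args) {
        WrappedScorer wrapper = wrap(new RawScorer());
        wrapper.scoreAll(seen -> System.out.println("callback saw the wrapper: " + (seen == wrapper)));
    }
}
```

The lookup is a workaround rather than a clean design, which matches the reservation expressed in the comment: it would be simpler if scoring went through the wrapper's own methods in the first place.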
[jira] [Resolved] (LUCENE-4911) Missing word "cela" in conf/lang/stopwords_fr.txt
[ https://issues.apache.org/jira/browse/LUCENE-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand resolved LUCENE-4911.
----------------------------------

    Resolution: Fixed

Pierre, I just applied your patch to Lucene's stop list (http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/french_stop.txt?view=diff&r1=1465255&r2=1465256&pathrev=1465256). Thank you! This fix should be available in Lucene/Solr 4.3.

I also sent an email to snowball-discuss to mention this improvement: http://lists.tartarus.org/mailman/private/snowball-discuss/2013-April/001462.html

> Missing word "cela" in conf/lang/stopwords_fr.txt
> ---
>
> Key: LUCENE-4911
> URL: https://issues.apache.org/jira/browse/LUCENE-4911
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 4.2
> Reporter: Pierre Kobylanski
> Assignee: Adrien Grand
> Priority: Trivial
> Attachments: stopwords_fr.txt.patch
>
> Original Estimate: 10m
> Remaining Estimate: 10m
>
> NB: Not sure this defect is assigned to the right component.
> In file example/solr/collection1/conf/lang/stopwords_fr.txt,
> there is the word "celà". Though incorrect in French (cf
> http://fr.wiktionary.org/wiki/cel%C3%A0), it's common, but we may also add
> the correct spelling (e.g. "cela", without accent) to that stopwords list.
> Another thing: I noticed that "celà" is the only word of the list followed by
> an unbreakable space. Is that wanted?
[jira] [Commented] (LUCENE-4858) Early termination with SortingMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626429#comment-13626429 ]

Adrien Grand commented on LUCENE-4858:
--------------------------------------

Thanks for updating the patch, Shai.

bq. Adrien, do we have anything else to do here, or are we ready to go? If so, I'll add a CHANGES entry and commit later.

The patch looks good to me. Maybe NumericDocValuesSorter.getID() could just return 'fieldName'? I think it's not necessary to describe the doc values type since they are exclusive and doc values are the natural way to sort documents by field values in Lucene? Otherwise +1.

> Early termination with SortingMergePolicy
> ---
>
> Key: LUCENE-4858
> URL: https://issues.apache.org/jira/browse/LUCENE-4858
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: 4.3
> Attachments: LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch,
> LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch
>
> Spin-off of LUCENE-4752, see
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565
> and
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282
> When an index is sorted per-segment, queries that sort according to the index
> sort order could be early terminated.
[jira] [Commented] (LUCENE-4858) Early termination with SortingMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626529#comment-13626529 ]

Adrien Grand commented on LUCENE-4858:
--------------------------------------

bq. The reason I did that is in case someone will want to sort by a stored field and numeric field which have same names.

A Sorter which sorts by stored field values would indeed need to add more information to its ID (at least to say that it is a stored field).

bq. "numericdv_field" is really unique, as you cannot have two numeric DV fields with the same name, but different meaning.

Since doc values types are exclusive, could we then just say that these are doc values without mentioning the type? I think this would help keep up with changes to doc values types (for example there used to be BYTES_FIXED_SORTED and BYTES_VAR_SORTED, which have been merged into SORTED) and/or additions (SORTED_SET). I would also prefer having something even more human-readable (like "DocValues(fieldName=$fieldName,order=asc|desc)"?).

> Early termination with SortingMergePolicy
> ---
>
> Key: LUCENE-4858
> URL: https://issues.apache.org/jira/browse/LUCENE-4858
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: 4.3
> Attachments: LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch,
> LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch
>
> Spin-off of LUCENE-4752, see
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565
> and
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282
> When an index is sorted per-segment, queries that sort according to the index
> sort order could be early terminated.
[jira] [Commented] (LUCENE-4858) Early termination with SortingMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626548#comment-13626548 ]

Adrien Grand commented on LUCENE-4858:
--------------------------------------

Sounds good to me!

> Early termination with SortingMergePolicy
> ---
>
> Key: LUCENE-4858
> URL: https://issues.apache.org/jira/browse/LUCENE-4858
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: 4.3
> Attachments: LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch,
> LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch
>
> Spin-off of LUCENE-4752, see
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565
> and
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282
> When an index is sorted per-segment, queries that sort according to the index
> sort order could be early terminated.
[jira] [Commented] (LUCENE-4858) Early termination with SortingMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626637#comment-13626637 ]

Adrien Grand commented on LUCENE-4858:
--------------------------------------

+1

> Early termination with SortingMergePolicy
> ---
>
> Key: LUCENE-4858
> URL: https://issues.apache.org/jira/browse/LUCENE-4858
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: 4.3
> Attachments: LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch,
> LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch
>
> Spin-off of LUCENE-4752, see
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565
> and
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282
> When an index is sorted per-segment, queries that sort according to the index
> sort order could be early terminated.
[jira] [Commented] (LUCENE-4903) Add AssertingScorer
[ https://issues.apache.org/jira/browse/LUCENE-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626638#comment-13626638 ]

Adrien Grand commented on LUCENE-4903:
--------------------------------------

This is a good idea, I didn't know of this class. I'll update the patch!

> Add AssertingScorer
> ---
>
> Key: LUCENE-4903
> URL: https://issues.apache.org/jira/browse/LUCENE-4903
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-4903.patch
>
> I think we would benefit from having an AssertingScorer that would assert
> that scorers are advanced correctly, return valid scores (eg. not NaN), ...
[jira] [Commented] (SOLR-4581) sort-order of facet-counts depends on facet.mincount
[ https://issues.apache.org/jira/browse/SOLR-4581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626772#comment-13626772 ]

Adrien Grand commented on SOLR-4581:
------------------------------------

Thanks for fixing the bug Yonik!

> sort-order of facet-counts depends on facet.mincount
> ---
>
> Key: SOLR-4581
> URL: https://issues.apache.org/jira/browse/SOLR-4581
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.2
> Reporter: Alexander Buhr
> Assignee: Yonik Seeley
> Fix For: 4.3, 5.0
> Attachments: SOLR-4581.patch, SOLR-4581.patch
>
> I just upgraded to Solr 4.2 and cannot explain the following behaviour:
> I am using a solr.TrieDoubleField named 'ListPrice_EUR_INV' as a facet-field.
> The solr-response for the query
> {noformat}'solr/Products/select?q=*%3A*&wt=xml&indent=true&facet=true&facet.field=ListPrice_EUR_INV&f.ListPrice_EUR_INV.facet.sort=index'{noformat}
> includes the following facet-counts:
> {noformat}
> 1
> 1
> 1
> {noformat}
> If I also set the parameter *'facet.mincount=1'* in the query, the order of
> the facet-counts is reversed.
> {noformat}
> 1
> 1
> 1
> {noformat}
> I would have expected, that the sort-order of the facet-counts is not
> affected by the facet.mincount parameter, as it is in Solr 4.1.
> Is this related to SOLR-2850?
[jira] [Created] (LUCENE-4921) Create a DocValuesFormat for sparse doc values
Adrien Grand created LUCENE-4921:
---------------------------------

    Summary: Create a DocValuesFormat for sparse doc values
    Key: LUCENE-4921
    URL: https://issues.apache.org/jira/browse/LUCENE-4921
    Project: Lucene - Core
    Issue Type: Improvement
    Components: core/codecs
    Reporter: Adrien Grand
    Priority: Trivial

We could have a special DocValuesFormat in lucene/codecs to better handle sparse doc values.

See http://search-lucene.com/m/HUeYW1RlEtc
[jira] [Commented] (LUCENE-4904) Sorter API: Make NumericDocValuesSorter able to sort in reverse order
[ https://issues.apache.org/jira/browse/LUCENE-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626982#comment-13626982 ]

Adrien Grand commented on LUCENE-4904:
--------------------------------------

We can add this ReverseOrderSorter, but as far as NumericDocValuesSorter is concerned, I would rather have the abstraction at the level of the DocComparator rather than the Sorter. This would allow {{Sorter.sort(int,DocComparator)}} to quickly return null without allocating (potentially lots of) memory for the doc maps if the reader is already sorted. Additionally, this would allow for more readable diagnostics (such as "DocValues(fieldName,desc)" instead of "Reverse(DocValues(fieldName,asc))").

> Sorter API: Make NumericDocValuesSorter able to sort in reverse order
> ---
>
> Key: LUCENE-4904
> URL: https://issues.apache.org/jira/browse/LUCENE-4904
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Trivial
> Labels: newdev
> Fix For: 4.3
> Attachments: LUCENE-4904.patch, LUCENE-4904.patch, LUCENE-4904.patch
>
> Today it is only able to sort in ascending order.
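The argument for reversing at the comparator level can be illustrated with a sketch. The names here (`DocValuesSortSketch`, `byValue`, `id`, and this simplified `DocComparator` and `sort`) are hypothetical and only loosely modelled on the API named in the comment; the sketch assumes a plain array of per-document values.

```java
import java.util.Arrays;

final class DocValuesSortSketch {
    /** Compares two documents by ID; a simplified analogue of the DocComparator mentioned above. */
    interface DocComparator {
        int compare(int docA, int docB);
    }

    /** Builds a comparator over per-document values; descending order just swaps the arguments. */
    static DocComparator byValue(long[] values, boolean ascending) {
        DocComparator asc = (a, b) -> Long.compare(values[a], values[b]);
        return ascending ? asc : (a, b) -> asc.compare(b, a);
    }

    /** Readable identifier carrying the order directly, instead of nesting a "Reverse(...)" wrapper. */
    static String id(String fieldName, boolean ascending) {
        return "DocValues(" + fieldName + "," + (ascending ? "asc" : "desc") + ")";
    }

    /** Returns doc IDs in sorted order, or null (without allocating) if they are already sorted. */
    static int[] sort(int maxDoc, DocComparator comparator) {
        boolean sorted = true;
        for (int i = 1; i < maxDoc && sorted; i++) {
            sorted = comparator.compare(i - 1, i) <= 0;
        }
        if (sorted) {
            return null;
        }
        Integer[] docs = new Integer[maxDoc];
        for (int i = 0; i < maxDoc; i++) {
            docs[i] = i;
        }
        Arrays.sort(docs, (a, b) -> comparator.compare(a, b)); // stable for object arrays
        int[] order = new int[maxDoc];
        for (int i = 0; i < maxDoc; i++) {
            order[i] = docs[i];
        }
        return order;
    }

    public static void main(String[] args) {
        long[] prices = {3, 1, 2};
        System.out.println(id("price", false) + " -> " + Arrays.toString(sort(3, byValue(prices, false))));
        System.out.println(id("price", true) + " -> " + Arrays.toString(sort(3, byValue(prices, true))));
    }
}
```

Because `sort` only sees a comparator, the "already sorted" check works the same way for both directions, and no wrapper sorter is needed to reverse the result.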
[jira] [Updated] (LUCENE-4903) Add AssertingScorer
[ https://issues.apache.org/jira/browse/LUCENE-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4903:
---------------------------------

    Attachment: LUCENE-4903.patch

New patch:
 * borrows Robert's idea to not delegate if the method has not been overridden,
 * AssertingScorer.score(Collector) either calls score(Collector) or score(Collector, NO_MORE_DOCS, nextDoc()) depending on random().nextBoolean()
 * modifies some join scorers so that nextDoc throws UOE instead of iterating out of order
 * adds an assertion to Scorer.score(Collector) to make sure that iteration has not started before this method is called
 * adds an assertion to Scorer.score(Collector, int, int) to make sure that docID() == firstDocID

> Add AssertingScorer
> ---
>
> Key: LUCENE-4903
> URL: https://issues.apache.org/jira/browse/LUCENE-4903
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-4903.patch, LUCENE-4903.patch
>
> I think we would benefit from having an AssertingScorer that would assert
> that scorers are advanced correctly, return valid scores (eg. not NaN), ...
[jira] [Comment Edited] (LUCENE-4903) Add AssertingScorer
[ https://issues.apache.org/jira/browse/LUCENE-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627291#comment-13627291 ]

Adrien Grand edited comment on LUCENE-4903 at 4/10/13 12:05 AM:
----------------------------------------------------------------

New patch:
 * borrows Robert's idea to not delegate if the method has not been overridden,
 * AssertingScorer.score(Collector) either calls score(Collector) or score(Collector, NO_MORE_DOCS, nextDoc()) depending on random().nextBoolean()
 * modifies some join scorers so that nextDoc throws UOE instead of iterating out of order
 * adds an assertion to Scorer.score(Collector) to make sure that iteration has not started before this method is called
 * adds an assertion to Scorer.score(Collector, int, int) to make sure that docID() == firstDocID

was (Author: jpountz):
New patch:
 * borrows Robert's idea to no delegate if the method has not been overridden,
 * AssertingScorer.score(Collector) either calls score(Collector) or score(Collector, NO_MORE_DOCS, nextDoc()) depending on random().nextBoolean()
 * modifies some join scorers so that nextDoc throws UOE instead of iterating out of order
 * adds an assertion to Scorer.score(Collector) to make sure that iteration has not started before this method is called
 * adds an assertion to Scorer.score(Collector, int, int) to make sure that docID() == firstDocID

> Add AssertingScorer
> ---
>
> Key: LUCENE-4903
> URL: https://issues.apache.org/jira/browse/LUCENE-4903
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-4903.patch, LUCENE-4903.patch
>
> I think we would benefit from having an AssertingScorer that would assert
> that scorers are advanced correctly, return valid scores (eg. not NaN), ...
[jira] [Commented] (LUCENE-4903) Add AssertingScorer
[ https://issues.apache.org/jira/browse/LUCENE-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627594#comment-13627594 ]

Adrien Grand commented on LUCENE-4903:
--------------------------------------

bq. So we don't need the weak map anymore right?

It could still be useful to Scorers that override {{score(Collector collector)}} and call {{collector.setScorer(this)}} in the body of this method I think.

bq. maybe AssertingWeight's scorer() method should create a new Random(random.nextLong()) to pass to the AssertingScorer when it creates it?

Good point. I'll update the patch.

> Add AssertingScorer
> ---
>
> Key: LUCENE-4903
> URL: https://issues.apache.org/jira/browse/LUCENE-4903
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-4903.patch, LUCENE-4903.patch
>
> I think we would benefit from having an AssertingScorer that would assert
> that scorers are advanced correctly, return valid scores (eg. not NaN), ...
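The `new Random(random.nextLong())` suggestion quoted in this comment can be tried out with plain `java.util.Random`. This is a sketch with hypothetical names (`DerivedRandomSketch`, `Component`), not the AssertingWeight code: it assumes the goal is to give each created object its own generator, so that the random choices it makes later do not consume values from the parent's sequence.

```java
import java.util.Random;

final class DerivedRandomSketch {
    /** Hypothetical component that makes random choices during its lifetime. */
    static final class Component {
        private final Random random;

        Component(Random random) {
            this.random = random;
        }

        boolean choose() {
            return random.nextBoolean();
        }
    }

    /** Gives the component a private Random seeded from the parent; the parent advances exactly once. */
    static Component create(Random parent) {
        return new Component(new Random(parent.nextLong()));
    }

    public static void main(String[] args) {
        Random parent = new Random(42L);
        Component component = create(parent);
        for (int i = 0; i < 3; i++) {
            System.out.println("choice " + i + ": " + component.choose());
        }
        // The parent's next value does not depend on how often the component drew.
        System.out.println("parent continues with: " + parent.nextInt());
    }
}
```

With this arrangement a test run stays reproducible from the parent seed alone, regardless of how many random decisions each component ends up making.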
[jira] [Commented] (LUCENE-4904) Sorter API: Make NumericDocValuesSorter able to sort in reverse order
[ https://issues.apache.org/jira/browse/LUCENE-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627619#comment-13627619 ]

Adrien Grand commented on LUCENE-4904:
--------------------------------------

bq. This got me thinking if ascending/descending should be on the Sorter.sort API

I think it shouldn't for the reasons you mentioned. The patch looks good to me, +1 to commit!

> Sorter API: Make NumericDocValuesSorter able to sort in reverse order
> ---
>
> Key: LUCENE-4904
> URL: https://issues.apache.org/jira/browse/LUCENE-4904
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Trivial
> Labels: newdev
> Fix For: 4.3
> Attachments: LUCENE-4904.patch, LUCENE-4904.patch, LUCENE-4904.patch,
> LUCENE-4904.patch
>
> Today it is only able to sort in ascending order.
[jira] [Commented] (LUCENE-4904) Sorter API: Make NumericDocValuesSorter able to sort in reverse order
[ https://issues.apache.org/jira/browse/LUCENE-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627653#comment-13627653 ]

Adrien Grand commented on LUCENE-4904:
--------------------------------------

It is OK for me.

> Sorter API: Make NumericDocValuesSorter able to sort in reverse order
> ---
>
> Key: LUCENE-4904
> URL: https://issues.apache.org/jira/browse/LUCENE-4904
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Trivial
> Labels: newdev
> Fix For: 4.3
> Attachments: LUCENE-4904.patch, LUCENE-4904.patch, LUCENE-4904.patch,
> LUCENE-4904.patch
>
> Today it is only able to sort in ascending order.
[jira] [Created] (LUCENE-4924) Make DocIdSetIterator.docID() return -1 when not positioned
Adrien Grand created LUCENE-4924:
---------------------------------

    Summary: Make DocIdSetIterator.docID() return -1 when not positioned
    Key: LUCENE-4924
    URL: https://issues.apache.org/jira/browse/LUCENE-4924
    Project: Lucene - Core
    Issue Type: Improvement
    Reporter: Adrien Grand
    Priority: Minor
    Fix For: 5.0

Today DocIdSetIterator.docID() can either return -1 or NO_MORE_DOCS when the enum is not positioned. I would like to only allow it to return -1 so that we can have better assertions.

(This proposal is for trunk only.)
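The proposed contract can be shown with a toy iterator. `ArrayDocIdIterator` below is a hypothetical class over a sorted array, not Lucene's `DocIdSetIterator`; it assumes the reading that `docID()` is -1 before the first call to `nextDoc()` or `advance()` and `NO_MORE_DOCS` only once the iterator is exhausted.

```java
final class ArrayDocIdIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs; // sorted, distinct, non-negative doc IDs
    private int index = -1;
    private int doc = -1; // -1: not positioned yet

    ArrayDocIdIterator(int... docs) {
        this.docs = docs.clone();
    }

    /** Returns -1 before iteration starts, the current doc while iterating, NO_MORE_DOCS at the end. */
    int docID() {
        return doc;
    }

    int nextDoc() {
        if (doc == NO_MORE_DOCS) {
            return doc; // stay exhausted
        }
        index++;
        doc = index < docs.length ? docs[index] : NO_MORE_DOCS;
        return doc;
    }

    /** Moves to the first document at or beyond target, always advancing at least once. */
    int advance(int target) {
        int current;
        do {
            current = nextDoc();
        } while (current < target);
        return current;
    }

    /** With a single "unpositioned" value, freshness is one comparison. */
    static boolean isUnpositioned(ArrayDocIdIterator it) {
        return it.docID() == -1;
    }

    public static void main(String[] args) {
        ArrayDocIdIterator it = new ArrayDocIdIterator(3, 7);
        System.out.println("unpositioned before iteration: " + isUnpositioned(it));
        System.out.println("first doc: " + it.nextDoc());
        System.out.println("advance(5): " + it.advance(5));
        System.out.println("exhausted: " + (it.nextDoc() == NO_MORE_DOCS));
    }
}
```

If an unpositioned iterator could also report `NO_MORE_DOCS`, a check like `isUnpositioned` could not tell a fresh iterator from an exhausted one, which is the kind of assertion the issue wants to make possible.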
[jira] [Assigned] (LUCENE-4924) Make DocIdSetIterator.docID() return -1 when not positioned
[ https://issues.apache.org/jira/browse/LUCENE-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand reassigned LUCENE-4924:
------------------------------------

    Assignee: Adrien Grand

> Make DocIdSetIterator.docID() return -1 when not positioned
> ---
>
> Key: LUCENE-4924
> URL: https://issues.apache.org/jira/browse/LUCENE-4924
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: 5.0
>
> Today DocIdSetIterator.docID() can either return -1 or NO_MORE_DOCS when the
> enum is not positioned. I would like to only allow it to return -1 so that we
> can have better assertions.
> (This proposal is for trunk only.)
[jira] [Created] (LUCENE-4925) IndexSearcher.search is broken when IndexSearcher.executor != null and the sort contains SortField.FIELD_SCORE
Adrien Grand created LUCENE-4925:
---------------------------------

    Summary: IndexSearcher.search is broken when IndexSearcher.executor != null and the sort contains SortField.FIELD_SCORE
    Key: LUCENE-4925
    URL: https://issues.apache.org/jira/browse/LUCENE-4925
    Project: Lucene - Core
    Issue Type: Bug
    Affects Versions: 4.2.1
    Reporter: Adrien Grand
    Assignee: Adrien Grand
    Fix For: 4.3

When executor != null, IndexSearcher performs two passes to compute the top docs. This doesn't work when the sort contains SortField.FIELD_SCORE because the second pass doesn't have access to scores computed in the first pass. Since search(...) doesn't compute scores when there is a sort, they are all Float.NaN.
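Why all-NaN scores break a sort on score can be seen in a few lines. The sketch below is hypothetical (`ScoreMergeSketch`, `Hit`, `mergeByScore`) and is not IndexSearcher's merge code; it only assumes that a second pass orders hits by descending score with `Float.compare`, which treats NaN as equal to NaN.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ScoreMergeSketch {
    /** A hit produced by the first pass; score is Float.NaN when it was never computed. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Merges per-slice hits by descending score, breaking ties by ascending doc ID. */
    static List<Integer> mergeByScore(List<List<Hit>> slices) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> slice : slices) {
            all.addAll(slice);
        }
        all.sort((a, b) -> {
            int byScore = Float.compare(b.score, a.score); // NaN compares equal to NaN
            return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
        });
        List<Integer> docs = new ArrayList<>();
        for (Hit hit : all) {
            docs.add(hit.doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<List<Hit>> scored = Arrays.asList(
            Arrays.asList(new Hit(0, 0.1f), new Hit(1, 0.9f)),
            Arrays.asList(new Hit(2, 0.5f), new Hit(3, 0.7f)));
        List<List<Hit>> unscored = Arrays.asList(
            Arrays.asList(new Hit(0, Float.NaN), new Hit(1, Float.NaN)),
            Arrays.asList(new Hit(2, Float.NaN), new Hit(3, Float.NaN)));
        System.out.println("with scores:    " + mergeByScore(scored));
        System.out.println("without scores: " + mergeByScore(unscored));
    }
}
```

When every score is NaN the comparator sees only ties, so the merged order falls back to the tie-breaker and says nothing about relevance. Making sure the scores are actually computed before the merge is one way to avoid this; the issue text does not say which remedy the patch uses.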
[jira] [Updated] (LUCENE-4925) IndexSearcher.search is broken when IndexSearcher.executor != null and the sort contains SortField.FIELD_SCORE
[ https://issues.apache.org/jira/browse/LUCENE-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4925: - Attachment: LUCENE-4925.patch Patch. Without the patch applied, the new test in TestSort fails whenever LuceneTestCase.newSearcher returns an IndexSearcher that collects segments in parallel. > IndexSearcher.search is broken when IndexSearcher.executor != null and the > sort contains SortField.FIELD_SCORE > -- > > Key: LUCENE-4925 > URL: https://issues.apache.org/jira/browse/LUCENE-4925 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 4.2.1 >Reporter: Adrien Grand >Assignee: Adrien Grand > Fix For: 4.3 > > Attachments: LUCENE-4925.patch > > > When executor != null, IndexSearcher performs two passes to compute the top > docs. This doesn't work when the sort contains SortField.FIELD_SCORE because > the second pass doesn't have access to scores computed in the first pass. > Since search(...) doesn't compute scores when there is a sort, they are all > Float.NaN.
[jira] [Resolved] (LUCENE-4925) IndexSearcher.search is broken when IndexSearcher.executor != null and the sort contains SortField.FIELD_SCORE
[ https://issues.apache.org/jira/browse/LUCENE-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4925. -- Resolution: Fixed > IndexSearcher.search is broken when IndexSearcher.executor != null and the > sort contains SortField.FIELD_SCORE > -- > > Key: LUCENE-4925 > URL: https://issues.apache.org/jira/browse/LUCENE-4925 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 4.2.1 >Reporter: Adrien Grand >Assignee: Adrien Grand > Fix For: 4.3 > > Attachments: LUCENE-4925.patch > > > When executor != null, IndexSearcher performs two passes to compute the top > docs. This doesn't work when the sort contains SortField.FIELD_SCORE because > the second pass doesn't have access to scores computed in the first pass. > Since search(...) doesn't compute scores when there is a sort, they are all > Float.NaN.
[jira] [Resolved] (LUCENE-4903) Add AssertingScorer
[ https://issues.apache.org/jira/browse/LUCENE-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4903. -- Resolution: Fixed I just committed. Hopefully this will find bugs in Scorers! > Add AssertingScorer > --- > > Key: LUCENE-4903 > URL: https://issues.apache.org/jira/browse/LUCENE-4903 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand >Priority: Minor > Attachments: LUCENE-4903.patch, LUCENE-4903.patch > > > I think we would benefit from having an AssertingScorer that would assert > that scorers are advanced correctly, return valid scores (eg. not NaN), ...
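The idea of an asserting wrapper can be sketched as follows. This is a minimal illustration, not the committed AssertingScorer: SimpleScorer is a hypothetical two-method interface (Lucene's Scorer API is larger), and the wrapper only covers the checks named in the description, namely correct advancing and valid score values.

```java
class AssertingScorerSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical minimal scorer interface; Lucene's Scorer has a larger API. */
    interface SimpleScorer {
        int nextDoc();
        float score();
    }

    /** Wraps a scorer and verifies both how it is used and what it returns. */
    static final class Asserting implements SimpleScorer {
        private final SimpleScorer in;
        private int doc = -1; // -1 means iteration has not started

        Asserting(SimpleScorer in) {
            this.in = in;
        }

        @Override
        public int nextDoc() {
            check(doc != NO_MORE_DOCS, "nextDoc() called after iteration finished");
            int next = in.nextDoc();
            check(next > doc, "docs must be returned in increasing order");
            doc = next;
            return doc;
        }

        @Override
        public float score() {
            check(doc != -1 && doc != NO_MORE_DOCS, "score() called while not positioned");
            float score = in.score();
            check(!Float.isNaN(score) && !Float.isInfinite(score), "invalid score: " + score);
            return score;
        }

        private static void check(boolean condition, String message) {
            if (!condition) {
                throw new AssertionError(message);
            }
        }
    }

    /** Toy scorer over fixed docs and scores, used only to exercise the wrapper. */
    static SimpleScorer fixed(int[] docs, float[] scores) {
        return new SimpleScorer() {
            private int i = -1;

            @Override
            public int nextDoc() {
                i++;
                return i < docs.length ? docs[i] : NO_MORE_DOCS;
            }

            @Override
            public float score() {
                return scores[i];
            }
        };
    }
}
```

Wrapping every scorer in tests this way turns silent contract violations, such as a NaN score, into immediate failures.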
[jira] [Commented] (LUCENE-4911) Missing word "cela" in conf/lang/stopwords_fr.txt
[ https://issues.apache.org/jira/browse/LUCENE-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629459#comment-13629459 ] Adrien Grand commented on LUCENE-4911: -- For your information, Martin Porter (himself!) added cela to the upstream stop list (http://lists.tartarus.org/mailman/private/snowball-discuss/2013-April/001466.html). > Missing word "cela" in conf/lang/stopwords_fr.txt > - > > Key: LUCENE-4911 > URL: https://issues.apache.org/jira/browse/LUCENE-4911 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 4.2 >Reporter: Pierre Kobylanski >Assignee: Adrien Grand >Priority: Trivial > Attachments: stopwords_fr.txt.patch > > Original Estimate: 10m > Remaining Estimate: 10m > > NB: Not sure this defect is assigned to the right component. > In file example/solr/collection1/conf/lang/stopwords_fr.txt, > there is the word "celà". Though incorrect in French (cf > http://fr.wiktionary.org/wiki/cel%C3%A0), it's common, but we may also add > the correct spelling (e.g. "cela", without accent) to that stopwords list. > Another thing: I noticed that "celà" is the only word of the list followed by > an unbreakable space. Is that wanted?
[jira] [Created] (LUCENE-4928) Compressed stored fields: make the maximum number of docs in a chunk configurable
Adrien Grand created LUCENE-4928: Summary: Compressed stored fields: make the maximum number of docs in a chunk configurable Key: LUCENE-4928 URL: https://issues.apache.org/jira/browse/LUCENE-4928 Project: Lucene - Core Issue Type: Bug Reporter: Adrien Grand Assignee: Adrien Grand Priority: Minor Fix For: 4.3 When documents are very small (a few bytes), there can be so many of them in a single chunk that merging can become very slow. Making the maximum number of documents per chunk configurable could help.
[jira] [Commented] (LUCENE-4928) Compressed stored fields: make the maximum number of docs in a chunk configurable
[ https://issues.apache.org/jira/browse/LUCENE-4928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629604#comment-13629604 ] Adrien Grand commented on LUCENE-4928: -- I'm looking at the term vectors format, and it can't have a configurable number of documents per chunk without changing the format (it would need to store the max number of documents per chunk so that, at merge time, it can decide whether it can bulk-merge the next chunk). So for now I think we can just have a hard limit and make it configurable in the future if we have a need for it? > Compressed stored fields: make the maximum number of docs in a chunk > configurable > - > > Key: LUCENE-4928 > URL: https://issues.apache.org/jira/browse/LUCENE-4928 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Assignee: Adrien Grand >Priority: Minor > Fix For: 4.3 > > > When documents are very small (a few bytes), there can be so many of them in > a single chunk that merging can become very slow. Making the maximum number > of documents per chunk configurable could help.
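The effect of a per-chunk document limit can be sketched like this. It is an illustration under assumptions, not the codec's code: chunkDocCounts is a hypothetical helper, the byte threshold and document limit are made-up parameters, and the flush rule (close the chunk once either threshold is reached) is assumed for the example.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkingSketch {
    /**
     * Splits a stream of documents into chunks and returns the number of
     * documents in each chunk. A chunk is closed as soon as it holds at least
     * chunkSizeBytes bytes or maxDocsPerChunk documents (assumed rule).
     */
    static List<Integer> chunkDocCounts(int[] docSizes, int chunkSizeBytes, int maxDocsPerChunk) {
        List<Integer> chunks = new ArrayList<>();
        int bufferedBytes = 0;
        int bufferedDocs = 0;
        for (int size : docSizes) {
            bufferedBytes += size;
            bufferedDocs++;
            if (bufferedBytes >= chunkSizeBytes || bufferedDocs >= maxDocsPerChunk) {
                chunks.add(bufferedDocs);
                bufferedBytes = 0;
                bufferedDocs = 0;
            }
        }
        if (bufferedDocs > 0) {
            chunks.add(bufferedDocs); // flush the last, partially filled chunk
        }
        return chunks;
    }

    public static void main(String[] args) {
        int[] tinyDocs = new int[10];
        java.util.Arrays.fill(tinyDocs, 2); // ten documents of 2 bytes each
        System.out.println(chunkDocCounts(tinyDocs, 16, Integer.MAX_VALUE)); // [8, 2]
        System.out.println(chunkDocCounts(tinyDocs, 16, 4));                 // [4, 4, 2]
    }
}
```

Without the document limit, the number of tiny documents per chunk grows with the byte threshold. With it, the count per chunk stays bounded regardless of how small the documents are.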
[jira] [Updated] (LUCENE-4928) Compressed stored fields: make the maximum number of docs in a chunk configurable
[ https://issues.apache.org/jira/browse/LUCENE-4928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4928: - Attachment: LUCENE-4928.patch Proposed patch. > Compressed stored fields: make the maximum number of docs in a chunk > configurable > - > > Key: LUCENE-4928 > URL: https://issues.apache.org/jira/browse/LUCENE-4928 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Assignee: Adrien Grand >Priority: Minor > Fix For: 4.3 > > Attachments: LUCENE-4928.patch > > > When documents are very small (a few bytes), there can be so many of them in > a single chunk that merging can become very slow. Making the maximum number > of documents per chunk configurable could help.
[jira] [Assigned] (SOLR-4706) LZ4.decompress() throws ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SOLR-4706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand reassigned SOLR-4706: -- Assignee: Adrien Grand > LZ4.decompress() throws ArrayIndexOutOfBoundsException > --- > > Key: SOLR-4706 > URL: https://issues.apache.org/jira/browse/SOLR-4706 > Project: Solr > Issue Type: Bug > Components: search, SearchComponents - other >Affects Versions: 4.2, 4.2.1 >Reporter: Victor Ruiz >Assignee: Adrien Grand > > The exception is thrown for all components I'm using: RealTimeGetHandler, > TermVectorComponent, MoreLikethis, SearchHandler. > Here 2 trace errors: > http://localhost:8984/solr/osr/mlt?q=itemid:76069564&mlt.boost=true&fq=domainid:13554&fq= > date_i:[NOW/DAY-30DAY TO NOW/DAY+1DAY]&fq=category:(kunst_und_kultur schweiz > literatur)&rows=250 > {quote} > \{"response":\{"numFound":70253,"start":0,"maxScore":1.311772,"docs":\[\{"itemid":"116987750","score":1.311772},\{"itemid":"77298475","score":1.2506518}, > \{"itemid":"78497083","score":0.48435652},\{"itemid":"101957016","score":0.4811761},\{"itemid":"76771601","score":0.4811761},\{"itemid":"90468738","score":0.4811761},\{"itemid":"79075873","score":0.4811761},\{"itemid":"76837622","score":0.48091167},\{"itemid":"77206876","sco\{"error":\{"trace":"java.lang.ArrayIndexOutOfBoundsException\n\tat > org.apache.lucene.codecs.compressing.LZ4.decompress(LZ4.java:132)\n\tat > org.apache.lucene.codecs.compressing.CompressionMode$4.decompress(CompressionMode.java:135)\n\tat > > org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:258)\n\tat > org.apache.lucene.index.SegmentReader.document(SegmentReader.java:139)\n\tat > org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:116)\n\tat > > org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:643)\n\tat > > org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:270)\n\tat > > 
org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:177)\n\tat > > org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)\n\tat > > org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)\n\tat > > org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:95)\n\tat > > org.apache.solr.response.JSONResponseWriter.write(JSONResponseWriter.java:60)\n\tat > > org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:627)\n\tat > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:358)\n\tat > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)\n\tat > > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)\n\tat > > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)\n\tat > > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)\n\tat > > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)\n\tat > > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)\n\tat > org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)\n\tat > org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)\n\tat > > org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)\n\tat > > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)\n\tat > org.mortbay.jetty.Server.handle(Server.java:326)\n\tat > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)\n\tat > org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:926)\n\tat > org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)\n\tat > org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)\n\tat > org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)\n\tat > 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)\n\tat > > org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)\n","code":500}} > {quote} > http://localhost:8984/solr/osr/get?id=105266867 > {quote} > \{"responseHeader":\{"status":500,"QTime":1},"response":\{"numFound":1,"start":0,"docs":\[\{"itemid":"105266867","text":"exklusiver > kann man kaum würzen safran ist das teuerste gewürz der welt handverlesen > und in mühevoller kleinstarbeit hergestellt ist safran sehr selten und wird > in winzigen mengen gehandelt und > verwendet","title":"safran","domainid":4287,"date_i":"2012-11-21T17:01:23Z","date":"2012-11-21T17:01:09Z","category":\["kultur","literatur","gesellschaft","umwelt","trinken","essen"]}]},"termVectors":\["uniqueKe
[jira] [Commented] (SOLR-4706) LZ4.decompress() throws ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SOLR-4706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629951#comment-13629951 ] Adrien Grand commented on SOLR-4706: Thanks for reporting the issue Victor. Can you reproduce the issue if you reindex your documents? I'd be happy to take a look at the index too if you can share it with us. > LZ4.decompress() throws ArrayIndexOutOfBoundsException > --- > > Key: SOLR-4706 > URL: https://issues.apache.org/jira/browse/SOLR-4706 > Project: Solr > Issue Type: Bug > Components: search, SearchComponents - other >Affects Versions: 4.2, 4.2.1 >Reporter: Victor Ruiz > > The exception is thrown for all components I'm using: RealTimeGetHandler, > TermVectorComponent, MoreLikethis, SearchHandler. > Here 2 trace errors: > http://localhost:8984/solr/osr/mlt?q=itemid:76069564&mlt.boost=true&fq=domainid:13554&fq= > date_i:[NOW/DAY-30DAY TO NOW/DAY+1DAY]&fq=category:(kunst_und_kultur schweiz > literatur)&rows=250 > {quote} > \{"response":\{"numFound":70253,"start":0,"maxScore":1.311772,"docs":\[\{"itemid":"116987750","score":1.311772},\{"itemid":"77298475","score":1.2506518}, > \{"itemid":"78497083","score":0.48435652},\{"itemid":"101957016","score":0.4811761},\{"itemid":"76771601","score":0.4811761},\{"itemid":"90468738","score":0.4811761},\{"itemid":"79075873","score":0.4811761},\{"itemid":"76837622","score":0.48091167},\{"itemid":"77206876","sco\{"error":\{"trace":"java.lang.ArrayIndexOutOfBoundsException\n\tat > org.apache.lucene.codecs.compressing.LZ4.decompress(LZ4.java:132)\n\tat > org.apache.lucene.codecs.compressing.CompressionMode$4.decompress(CompressionMode.java:135)\n\tat > > org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:258)\n\tat > org.apache.lucene.index.SegmentReader.document(SegmentReader.java:139)\n\tat > org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:116)\n\tat > > 
org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:643)\n\tat > > org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:270)\n\tat > > org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:177)\n\tat > > org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)\n\tat > > org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)\n\tat > > org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:95)\n\tat > > org.apache.solr.response.JSONResponseWriter.write(JSONResponseWriter.java:60)\n\tat > > org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:627)\n\tat > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:358)\n\tat > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)\n\tat > > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)\n\tat > > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)\n\tat > > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)\n\tat > > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)\n\tat > > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)\n\tat > org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)\n\tat > org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)\n\tat > > org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)\n\tat > > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)\n\tat > org.mortbay.jetty.Server.handle(Server.java:326)\n\tat > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)\n\tat > org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:926)\n\tat > 
org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)\n\tat > org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)\n\tat > org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)\n\tat > org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)\n\tat > > org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)\n","code":500}} > {quote} > http://localhost:8984/solr/osr/get?id=105266867 > {quote} > \{"responseHeader":\{"status":500,"QTime":1},"response":\{"numFound":1,"start":0,"docs":\[\{"itemid":"105266867","text":"exklusiver > kann man kaum würzen safran ist das teuerste gewürz der welt handverlesen > und in mühevoller kleinstarbeit hergestellt ist safran sehr selten und wird > in winzigen mengen gehandelt und > verwendet","title":"safran","domainid
[jira] [Updated] (SOLR-4707) LZ4.decompress() throws ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SOLR-4707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated SOLR-4707: --- Assignee: (was: Adrien Grand) > LZ4.decompress() throws ArrayIndexOutOfBoundsException > --- > > Key: SOLR-4707 > URL: https://issues.apache.org/jira/browse/SOLR-4707 > Project: Solr > Issue Type: Bug > Components: replication (java) >Affects Versions: 4.2, 4.2.1 >Reporter: Victor Ruiz > > The exception is thrown for all components I'm using: RealTimeGetHandler, > TermVectorComponent, MoreLikethis, SearchHandler. > Here 2 trace errors: > http://localhost:8984/solr/osr/mlt?q=itemid:76069564&mlt.boost=true&fq=domainid:13554&fq= > date_i:[NOW/DAY-30DAY TO NOW/DAY+1DAY]&fq=category:(kunst_und_kultur schweiz > literatur)&rows=250 > {quote} > \{"response":\{"numFound":70253,"start":0,"maxScore":1.311772,"docs":\[\{"itemid":"116987750","score":1.311772},\{"itemid":"77298475","score":1.2506518}, > \{"itemid":"78497083","score":0.48435652},\{"itemid":"101957016","score":0.4811761},\{"itemid":"76771601","score":0.4811761},\{"itemid":"90468738","score":0.4811761},\{"itemid":"79075873","score":0.4811761},\{"itemid":"76837622","score":0.48091167},\{"itemid":"77206876","sco\{"error":\{"trace":"java.lang.ArrayIndexOutOfBoundsException\n\tat > org.apache.lucene.codecs.compressing.LZ4.decompress(LZ4.java:132)\n\tat > org.apache.lucene.codecs.compressing.CompressionMode$4.decompress(CompressionMode.java:135)\n\tat > > org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:258)\n\tat > org.apache.lucene.index.SegmentReader.document(SegmentReader.java:139)\n\tat > org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:116)\n\tat > > org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:643)\n\tat > > org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:270)\n\tat > > 
org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:177)\n\tat > > org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)\n\tat > > org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)\n\tat > > org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:95)\n\tat > > org.apache.solr.response.JSONResponseWriter.write(JSONResponseWriter.java:60)\n\tat > > org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:627)\n\tat > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:358)\n\tat > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)\n\tat > > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)\n\tat > > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)\n\tat > > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)\n\tat > > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)\n\tat > > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)\n\tat > org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)\n\tat > org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)\n\tat > > org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)\n\tat > > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)\n\tat > org.mortbay.jetty.Server.handle(Server.java:326)\n\tat > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)\n\tat > org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:926)\n\tat > org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)\n\tat > org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)\n\tat > org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)\n\tat > 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)\n\tat > > org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)\n","code":500}} > {quote} > http://localhost:8984/solr/osr/tv?q=itemid:105266867 > {quote} > \{"responseHeader":\{"status":500,"QTime":1},"response":\{"numFound":1,"start":0,"docs":\[\{"itemid":"105266867","text":"exklusiver > kann man kaum würzen safran ist das teuerste gewürz der welt handverlesen > und in mühevoller kleinstarbeit hergestellt ist safran sehr selten und wird > in winzigen mengen gehandelt und > verwendet","title":"safran","domainid":4287,"date_i":"2012-11-21T17:01:23Z","date":"2012-11-21T17:01:09Z","category":\["kultur","literatur","gesellschaft","umwelt","trinken","essen"]}]},"termVectors":\["uniqueKeyFieldName","itemid","105266867",["uniqu
[jira] [Updated] (LUCENE-4924) Make DocIdSetIterator.docID() return -1 when not positioned
[ https://issues.apache.org/jira/browse/LUCENE-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4924: - Attachment: LUCENE-4924.patch Patch. > Make DocIdSetIterator.docID() return -1 when not positioned > --- > > Key: LUCENE-4924 > URL: https://issues.apache.org/jira/browse/LUCENE-4924 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand >Priority: Minor > Fix For: 5.0 > > Attachments: LUCENE-4924.patch > > > Today DocIdSetIterator.docID() can either return -1 or NO_MORE_DOCS when the > enum is not positioned. I would like to only allow it to return -1 so that we > can have better assertions. > (This proposal is for trunk only.)
[jira] [Updated] (LUCENE-4924) Make DocIdSetIterator.docID() return -1 when not positioned
[ https://issues.apache.org/jira/browse/LUCENE-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4924: - Attachment: LUCENE-4924.patch Thanks Robert, I ran the Lucene tests and they all passed. I updated the patch to make the CHANGES entry clearer. > Make DocIdSetIterator.docID() return -1 when not positioned > --- > > Key: LUCENE-4924 > URL: https://issues.apache.org/jira/browse/LUCENE-4924 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand >Priority: Minor > Fix For: 5.0 > > Attachments: LUCENE-4924.patch, LUCENE-4924.patch, LUCENE-4924.patch > > > Today DocIdSetIterator.docID() can either return -1 or NO_MORE_DOCS when the > enum is not positioned. I would like to only allow it to return -1 so that we > can have better assertions. > (This proposal is for trunk only.)
[jira] [Commented] (LUCENE-4924) Make DocIdSetIterator.docID() return -1 when not positioned
[ https://issues.apache.org/jira/browse/LUCENE-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631697#comment-13631697 ] Adrien Grand commented on LUCENE-4924: -- I plan to commit soon and backport everything to 4.x except the CHANGES entry and the DocIdSetIterator.docID() javadoc change. > Make DocIdSetIterator.docID() return -1 when not positioned > --- > > Key: LUCENE-4924 > URL: https://issues.apache.org/jira/browse/LUCENE-4924 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand >Priority: Minor > Fix For: 5.0 > > Attachments: LUCENE-4924.patch, LUCENE-4924.patch, LUCENE-4924.patch > > > Today DocIdSetIterator.docID() can either return -1 or NO_MORE_DOCS when the > enum is not positioned. I would like to only allow it to return -1 so that we > can have better assertions. > (This proposal is for trunk only.)
[jira] [Resolved] (LUCENE-4928) Compressed stored fields: make the maximum number of docs in a chunk configurable
[ https://issues.apache.org/jira/browse/LUCENE-4928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4928. -- Resolution: Fixed > Compressed stored fields: make the maximum number of docs in a chunk > configurable > - > > Key: LUCENE-4928 > URL: https://issues.apache.org/jira/browse/LUCENE-4928 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Assignee: Adrien Grand >Priority: Minor > Fix For: 4.3 > > Attachments: LUCENE-4928.patch > > > When documents are very small (a few bytes), there can be so many of them in > a single chunk that merging can become very slow. Making the maximum number > of documents per chunk configurable could help.
[jira] [Resolved] (LUCENE-4924) Make DocIdSetIterator.docID() return -1 when not positioned
[ https://issues.apache.org/jira/browse/LUCENE-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4924. -- Resolution: Fixed Thank you Robert and Yonik! > Make DocIdSetIterator.docID() return -1 when not positioned > --- > > Key: LUCENE-4924 > URL: https://issues.apache.org/jira/browse/LUCENE-4924 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand >Priority: Minor > Fix For: 5.0 > > Attachments: LUCENE-4924.patch, LUCENE-4924.patch, LUCENE-4924.patch > > > Today DocIdSetIterator.docID() can either return -1 or NO_MORE_DOCS when the > enum is not positioned. I would like to only allow it to return -1 so that we > can have better assertions. > (This proposal is for trunk only.)
[jira] [Commented] (LUCENE-4934) AssertingIndexSearcher should do basic QueryUtils/etc checks on every query
[ https://issues.apache.org/jira/browse/LUCENE-4934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631824#comment-13631824 ] Adrien Grand commented on LUCENE-4934: -- +1 > AssertingIndexSearcher should do basic QueryUtils/etc checks on every query > --- > > Key: LUCENE-4934 > URL: https://issues.apache.org/jira/browse/LUCENE-4934 > Project: Lucene - Core > Issue Type: Test >Reporter: Robert Muir > > We can start with QueryUtils.check(query): which does some basic > hashcode/equals checks. > Ideally we'd strengthen the checks as we fix problems: e.g. add explanations > verifications (checkExplanations) and then finally the more intense check() > that does more verifications with deleted docs/next/advance.
[jira] [Assigned] (LUCENE-4936) docvalues date compression
[ https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand reassigned LUCENE-4936: Assignee: Adrien Grand > docvalues date compression > -- > > Key: LUCENE-4936 > URL: https://issues.apache.org/jira/browse/LUCENE-4936 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Robert Muir >Assignee: Adrien Grand > Attachments: LUCENE-4936.patch > > > DocValues fields can be very wasteful if you are storing dates (like solr's > TrieDateField does if you enable docvalues) and don't actually need all the > precision: e.g. "date-only" fields like date of birth with no time component, > time fields without milliseconds precision, and so on. > Ideally we'd compute GCD of all the values to save space > (numberOfTrailingZeros is not really enough here), but i think we should at > least look for values like 8640, 360, and 1000 to be practical.
[jira] [Commented] (LUCENE-4937) sort order different in branch_4x than trunk
[ https://issues.apache.org/jira/browse/LUCENE-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13634033#comment-13634033 ] Adrien Grand commented on LUCENE-4937: -- Thanks Uwe! > sort order different in branch_4x than trunk > > > Key: LUCENE-4937 > URL: https://issues.apache.org/jira/browse/LUCENE-4937 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Assignee: Uwe Schindler > Fix For: 4.3 > > Attachments: LUCENE-4937.patch, LUCENE-4937.patch, > LUCENE-4937_test.patch, SOLR-4723_test.patch > > > I will buy a beer to whoever figures out why +0 sorts before -0 in branch_4x, > but works correctly in trunk :) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4936) docvalues date compression
[ https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4936: - Attachment: LUCENE-4936.patch Patch: * Adds MathUtil.gcd(long, long) * Adds "GCD compression" to Lucene42, Disk and CheapBastard. * Improves BaseDocValuesFormatTest which almost only tested "TABLE_COMPRESSED" with Lucene42DVF * No more attempts to compress storage when the values are known to be dense, such as SORTED ords. I measured how much slower doc values indexing is with these new checks, and the slowdown is completely unnoticeable with random or dense values since the GCD quickly reaches 1. When the GCD is larger, it only made indexing 2% slower (every doc has a single field which is a NumericDocValuesField). So I think it's fine. > docvalues date compression > -- > > Key: LUCENE-4936 > URL: https://issues.apache.org/jira/browse/LUCENE-4936 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Robert Muir >Assignee: Adrien Grand > Attachments: LUCENE-4936.patch, LUCENE-4936.patch > > > DocValues fields can be very wasteful if you are storing dates (like solr's > TrieDateField does if you enable docvalues) and don't actually need all the > precision: e.g. "date-only" fields like date of birth with no time component, > time fields without milliseconds precision, and so on. > Ideally we'd compute GCD of all the values to save space > (numberOfTrailingZeros is not really enough here), but i think we should at > least look for values like 8640, 360, and 1000 to be practical.
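Editorial sketch of the idea behind "GCD compression" as described in the message above: keep a running GCD over the values and stop as soon as it reaches 1, which is why random or dense values add almost no cost. This is not the code from the patch. The class name, the local gcd helper (a stand-in for the MathUtil.gcd(long, long) the patch mentions) and the milliseconds-per-day constant are assumptions made for illustration.

```java
final class GcdCompressionSketch {
    // Euclid's algorithm on magnitudes.
    // Assumption: neither argument is Long.MIN_VALUE (Math.abs would overflow).
    static long gcd(long a, long b) {
        a = Math.abs(a);
        b = Math.abs(b);
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // Running GCD of all values; gives up as soon as the GCD drops to 1,
    // because nothing can be saved by dividing by 1.
    static long gcdOf(long[] values) {
        long g = 0; // gcd(0, v) == |v|, so 0 is a neutral starting point
        for (long v : values) {
            g = gcd(g, v);
            if (g == 1) {
                break;
            }
        }
        return g;
    }

    public static void main(String[] args) {
        long day = 86_400_000L; // milliseconds per day, illustrative only
        long[] dates = {3 * day, 10 * day, 17 * day};
        long g = gcdOf(dates);
        System.out.println("gcd = " + g);
        for (long v : dates) {
            // Each stored quotient is much smaller than the raw value.
            System.out.println(v + " -> " + (v / g));
        }
    }
}
```

With date-only values every entry is a multiple of one day, so only the small quotients need to be stored alongside the GCD.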
[jira] [Updated] (LUCENE-4936) docvalues date compression
[ https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4936: - Attachment: LUCENE-4936.patch New patch: * Computes the GCD based on deltas in order to be able to compress non-UTC dates. * Adds support for TABLE_COMPRESSED to DiskDVF. * Adds tests that ensure that these new compression methods are actually used whenever applicable. * Adds a quick description of the compression method to Lucene42DVF javadocs. > docvalues date compression > -- > > Key: LUCENE-4936 > URL: https://issues.apache.org/jira/browse/LUCENE-4936 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Robert Muir >Assignee: Adrien Grand > Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch > > > DocValues fields can be very wasteful if you are storing dates (like solr's > TrieDateField does if you enable docvalues) and don't actually need all the > precision: e.g. "date-only" fields like date of birth with no time component, > time fields without milliseconds precision, and so on. > Ideally we'd compute GCD of all the values to save space > (numberOfTrailingZeros is not really enough here), but i think we should at > least look for values like 8640, 360, and 1000 to be practical. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
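Editorial sketch of why computing the GCD on deltas helps with non-UTC dates: a constant time-zone offset breaks divisibility of the raw values, but not of the differences from the minimum. This is a minimal illustration under assumptions, not the patch itself; the class and method names are hypothetical, and the sketch assumes a non-empty input whose deltas do not overflow.

```java
final class DeltaGcdSketch {
    static long gcd(long a, long b) {
        a = Math.abs(a);
        b = Math.abs(b);
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // Returns {min, gcd of (v - min)}.
    // Assumptions: values is non-empty and v - min never overflows.
    static long[] minAndGcd(long[] values) {
        long min = values[0];
        for (long v : values) {
            min = Math.min(min, v);
        }
        long g = 0;
        for (long v : values) {
            g = gcd(g, v - min);
        }
        return new long[] {min, g};
    }

    // Encodes each value as the quotient (v - min) / gcd.
    static long[] encode(long[] values, long min, long g) {
        long[] out = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = g == 0 ? 0 : (values[i] - min) / g; // g == 0 means all values are equal
        }
        return out;
    }

    // Inverse of encode for a single quotient.
    static long decode(long quotient, long min, long g) {
        return min + g * quotient;
    }

    public static void main(String[] args) {
        long day = 86_400_000L;     // illustrative: milliseconds per day
        long offset = 7_200_000L;   // illustrative: a two-hour zone offset
        long[] localMidnights = {3 * day - offset, 10 * day - offset, 17 * day - offset};
        long[] mg = minAndGcd(localMidnights);
        long[] encoded = encode(localMidnights, mg[0], mg[1]);
        for (int i = 0; i < encoded.length; i++) {
            System.out.println(localMidnights[i] + " -> " + encoded[i]
                + " -> " + decode(encoded[i], mg[0], mg[1]));
        }
    }
}
```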
[jira] [Commented] (LUCENE-4936) docvalues date compression
[ https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636406#comment-13636406 ] Adrien Grand commented on LUCENE-4936: -- Thank you Uwe! Unfortunately, I just figured out that the patch is broken when v - minValue overflows (in Consumer.addNumericField). I need to think about a way to fix it... > docvalues date compression > -- > > Key: LUCENE-4936 > URL: https://issues.apache.org/jira/browse/LUCENE-4936 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Robert Muir >Assignee: Adrien Grand > Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch > > > DocValues fields can be very wasteful if you are storing dates (like solr's > TrieDateField does if you enable docvalues) and don't actually need all the > precision: e.g. "date-only" fields like date of birth with no time component, > time fields without milliseconds precision, and so on. > Ideally we'd compute GCD of all the values to save space > (numberOfTrailingZeros is not really enough here), but i think we should at > least look for values like 8640, 360, and 1000 to be practical. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4936) docvalues date compression
[ https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4936: - Fix Version/s: 4.4 > docvalues date compression > -- > > Key: LUCENE-4936 > URL: https://issues.apache.org/jira/browse/LUCENE-4936 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Robert Muir >Assignee: Adrien Grand > Fix For: 4.4 > > Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch > > > DocValues fields can be very wasteful if you are storing dates (like solr's > TrieDateField does if you enable docvalues) and don't actually need all the > precision: e.g. "date-only" fields like date of birth with no time component, > time fields without milliseconds precision, and so on. > Ideally we'd compute GCD of all the values to save space > (numberOfTrailingZeros is not really enough here), but i think we should at > least look for values like 8640, 360, and 1000 to be practical. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4936) docvalues date compression
[ https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4936: - Attachment: LUCENE-4936.patch Here is a work-around for the issue: the consumer stops trying to perform GCD compression as soon as it encounters a value outside the [ -MAX_VALUE/2 - MAX_VALUE/2 ] range. This prevents overflows from happening and I can't think of a reasonable use-case that would benefit from GCD compression and have values outside of this range? > docvalues date compression > -- > > Key: LUCENE-4936 > URL: https://issues.apache.org/jira/browse/LUCENE-4936 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Robert Muir >Assignee: Adrien Grand > Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch > > > DocValues fields can be very wasteful if you are storing dates (like solr's > TrieDateField does if you enable docvalues) and don't actually need all the > precision: e.g. "date-only" fields like date of birth with no time component, > time fields without milliseconds precision, and so on. > Ideally we'd compute GCD of all the values to save space > (numberOfTrailingZeros is not really enough here), but i think we should at > least look for values like 8640, 360, and 1000 to be practical.
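Editorial sketch of the overflow and of the range check described in the work-around above. If every value lies between -Long.MAX_VALUE/2 and Long.MAX_VALUE/2, then v - minValue is at most Long.MAX_VALUE - 1 and cannot wrap around. The class and method names are hypothetical; the actual patch may structure the check differently.

```java
final class OverflowGuardSketch {
    static final long HALF = Long.MAX_VALUE / 2;

    // True when v is inside the range in which deltas cannot overflow.
    static boolean safeForGcd(long v) {
        return v >= -HALF && v <= HALF;
    }

    // True when GCD compression may be attempted for the whole input.
    static boolean allSafe(long[] values) {
        for (long v : values) {
            if (!safeForGcd(v)) {
                return false; // stop trying as soon as one value is out of range
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long v = Long.MAX_VALUE;
        long minValue = -2L;
        // The subtraction silently wraps around and yields a negative delta.
        System.out.println("v - minValue = " + (v - minValue));
        System.out.println("safe? " + allSafe(new long[] {v, minValue}));
    }
}
```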
[jira] [Comment Edited] (LUCENE-4936) docvalues date compression
[ https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636488#comment-13636488 ] Adrien Grand edited comment on LUCENE-4936 at 4/19/13 3:31 PM: --- Here is a work-around for the issue: the consumer stops trying to perform GCD compression as soon as it encounters a value outside the [ -MAX_VALUE/2 , MAX_VALUE/2 ] range. This prevents overflows from happening and I can't think of a reasonable use-case that would benefit from GCD compression and have values outside of this range? was (Author: jpountz): Here is a work-around for the issue: the consumer stops trying to perform GCD compression as soon as it encounters a value outside the [ -MAX_VALUE/2 - MAX_VALUE/2 ] range. This prevents overflows from happening and I can't think of a reasonable use-case that would benefit from GCD compression and have values outside of this range? > docvalues date compression > -- > > Key: LUCENE-4936 > URL: https://issues.apache.org/jira/browse/LUCENE-4936 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Robert Muir >Assignee: Adrien Grand > Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch > > > DocValues fields can be very wasteful if you are storing dates (like solr's > TrieDateField does if you enable docvalues) and don't actually need all the > precision: e.g. "date-only" fields like date of birth with no time component, > time fields without milliseconds precision, and so on. > Ideally we'd compute GCD of all the values to save space > (numberOfTrailingZeros is not really enough here), but i think we should at > least look for values like 8640, 360, and 1000 to be practical.
[jira] [Commented] (LUCENE-4936) docvalues date compression
[ https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636503#comment-13636503 ] Adrien Grand commented on LUCENE-4936: -- Thank you Robert, I'd love to have a review to make sure the patch is correct, especially for MathUtil.gcd and the DVConsumer.addNumericField logic. > docvalues date compression > -- > > Key: LUCENE-4936 > URL: https://issues.apache.org/jira/browse/LUCENE-4936 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Robert Muir >Assignee: Adrien Grand > Fix For: 4.4 > > Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch > > > DocValues fields can be very wasteful if you are storing dates (like solr's > TrieDateField does if you enable docvalues) and don't actually need all the > precision: e.g. "date-only" fields like date of birth with no time component, > time fields without milliseconds precision, and so on. > Ideally we'd compute GCD of all the values to save space > (numberOfTrailingZeros is not really enough here), but i think we should at > least look for values like 8640, 360, and 1000 to be practical. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4936) docvalues date compression
[ https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4936: - Attachment: LUCENE-4936.patch Simple ideas are often the best ones, the new patch has a single loop! Thanks Robert! > docvalues date compression > -- > > Key: LUCENE-4936 > URL: https://issues.apache.org/jira/browse/LUCENE-4936 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Robert Muir >Assignee: Adrien Grand > Fix For: 4.4 > > Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch, LUCENE-4936.patch > > > DocValues fields can be very wasteful if you are storing dates (like solr's > TrieDateField does if you enable docvalues) and don't actually need all the > precision: e.g. "date-only" fields like date of birth with no time component, > time fields without milliseconds precision, and so on. > Ideally we'd compute GCD of all the values to save space > (numberOfTrailingZeros is not really enough here), but i think we should at > least look for values like 8640, 360, and 1000 to be practical. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4936) docvalues date compression
[ https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4936: - Attachment: LUCENE-4936.patch +1 to the proposed changes! Here is an updated patch that fixes the DVProducer constructors to open the data file and check the header in a try/finally block (so that data files are closed even if the header check fails). > docvalues date compression > -- > > Key: LUCENE-4936 > URL: https://issues.apache.org/jira/browse/LUCENE-4936 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Robert Muir >Assignee: Adrien Grand > Fix For: 4.4 > > Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch > > > DocValues fields can be very wasteful if you are storing dates (like solr's > TrieDateField does if you enable docvalues) and don't actually need all the > precision: e.g. "date-only" fields like date of birth with no time component, > time fields without milliseconds precision, and so on. > Ideally we'd compute GCD of all the values to save space > (numberOfTrailingZeros is not really enough here), but i think we should at > least look for values like 8640, 360, and 1000 to be practical. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
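Editorial sketch of the try/finally pattern mentioned above: open the file, check the header, and close the file if the check throws. The FakeInput type and the header value are hypothetical stand-ins so that the example runs without Lucene; the real DVProducer constructors work on index inputs and a codec header check.

```java
import java.io.Closeable;
import java.io.IOException;

final class ProducerOpenSketch {
    // Hypothetical stand-in for an opened data file.
    static final class FakeInput implements Closeable {
        final int header;
        boolean closed;

        FakeInput(int header) {
            this.header = header;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    static final int EXPECTED_HEADER = 42; // illustrative value only

    // Returns the input if its header is valid; otherwise closes it and rethrows.
    static FakeInput open(FakeInput in) throws IOException {
        boolean success = false;
        try {
            if (in.header != EXPECTED_HEADER) {
                throw new IOException("bad header: " + in.header);
            }
            success = true;
            return in;
        } finally {
            if (!success) {
                in.close(); // do not leak the file handle when the header check fails
            }
        }
    }

    public static void main(String[] args) {
        FakeInput bad = new FakeInput(7);
        try {
            open(bad);
        } catch (IOException e) {
            System.out.println(e.getMessage() + ", closed = " + bad.closed);
        }
    }
}
```

The success flag keeps the happy path from closing the input that the caller is about to use.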
[jira] [Updated] (LUCENE-4936) docvalues date compression
[ https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4936: - Attachment: LUCENE-4936.patch +1 to the proposed changes! Here is an updated patch that fixes the DVProducer constructors to open the data file and check the header in a try/finally block (so that data files are closed even if the header check fails). > docvalues date compression > -- > > Key: LUCENE-4936 > URL: https://issues.apache.org/jira/browse/LUCENE-4936 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Robert Muir >Assignee: Adrien Grand > Fix For: 4.4 > > Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch, LUCENE-4936.patch > > > DocValues fields can be very wasteful if you are storing dates (like solr's > TrieDateField does if you enable docvalues) and don't actually need all the > precision: e.g. "date-only" fields like date of birth with no time component, > time fields without milliseconds precision, and so on. > Ideally we'd compute GCD of all the values to save space > (numberOfTrailingZeros is not really enough here), but i think we should at > least look for values like 8640, 360, and 1000 to be practical. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4936) docvalues date compression
[ https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4936: - Attachment: (was: LUCENE-4936.patch) > docvalues date compression > -- > > Key: LUCENE-4936 > URL: https://issues.apache.org/jira/browse/LUCENE-4936 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Robert Muir >Assignee: Adrien Grand > Fix For: 4.4 > > Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch > > > DocValues fields can be very wasteful if you are storing dates (like solr's > TrieDateField does if you enable docvalues) and don't actually need all the > precision: e.g. "date-only" fields like date of birth with no time component, > time fields without milliseconds precision, and so on. > Ideally we'd compute GCD of all the values to save space > (numberOfTrailingZeros is not really enough here), but i think we should at > least look for values like 8640, 360, and 1000 to be practical. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Issue Comment Deleted] (LUCENE-4936) docvalues date compression
[ https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4936: - Comment: was deleted (was: +1 to the proposed changes! Here is an updated patch that fixes the DVProducer constructors to open the data file and check the header in a try/finally block (so that data files are closed even if the header check fails).) > docvalues date compression > -- > > Key: LUCENE-4936 > URL: https://issues.apache.org/jira/browse/LUCENE-4936 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Robert Muir >Assignee: Adrien Grand > Fix For: 4.4 > > Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch > > > DocValues fields can be very wasteful if you are storing dates (like solr's > TrieDateField does if you enable docvalues) and don't actually need all the > precision: e.g. "date-only" fields like date of birth with no time component, > time fields without milliseconds precision, and so on. > Ideally we'd compute GCD of all the values to save space > (numberOfTrailingZeros is not really enough here), but i think we should at > least look for values like 8640, 360, and 1000 to be practical. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-4946) Refactor SorterTemplate
Adrien Grand created LUCENE-4946: Summary: Refactor SorterTemplate Key: LUCENE-4946 URL: https://issues.apache.org/jira/browse/LUCENE-4946 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial When working on TimSort (LUCENE-4839), I was a little frustrated at not being able to add galloping support because it would have required adding new primitive operations in addition to compare and swap. I started working on a prototype that uses inheritance to allow some sorting algorithms to rely on additional primitive operations. You can have a look at https://github.com/jpountz/sorts/tree/master/src/java/net/jpountz/sorts (but beware: it is a prototype and still lacks proper documentation and good tests). I think it would offer several advantages: - no more need to implement setPivot and comparePivot when using in-place merge sort or insertion sort, - the ability to use faster stable sorting algorithms at the cost of some memory overhead (our in-place merge sort is very slow), - the ability to properly implement algorithms that are useful on specific datasets but require different primitive operations (such as TimSort for partially-sorted data). If you are interested in comparing these implementations with Arrays.sort, there is a Benchmark class in src/examples. What do you think?
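Editorial sketch of the inheritance idea proposed above: a base class that only needs compare and swap, and a subclass that declares extra primitive operations for algorithms that need them. All class and method names here are hypothetical and are not taken from the linked prototype; insertion sort is used only because it is short.

```java
import java.util.Arrays;

// Base class: algorithms here may rely on compare and swap only.
abstract class SorterSketch {
    protected abstract int compare(int i, int j);
    protected abstract void swap(int i, int j);

    // Insertion sort over [from, to) using swaps.
    void sort(int from, int to) {
        for (int i = from + 1; i < to; i++) {
            for (int j = i; j > from && compare(j - 1, j) > 0; j--) {
                swap(j - 1, j);
            }
        }
    }
}

// Subclass: declares additional primitives that some algorithms require.
abstract class ShiftingSorterSketch extends SorterSketch {
    protected abstract void save(int i);              // remember element i
    protected abstract int compareSaved(int j);       // compare saved element with element j
    protected abstract void copy(int src, int dest);  // overwrite dest with src
    protected abstract void restore(int dest);        // write saved element to dest

    // Insertion sort that shifts elements instead of swapping them.
    @Override
    void sort(int from, int to) {
        for (int i = from + 1; i < to; i++) {
            save(i);
            int j = i;
            while (j > from && compareSaved(j - 1) < 0) {
                copy(j - 1, j);
                j--;
            }
            restore(j);
        }
    }
}

// Concrete sorter over an int[] that supplies every primitive.
final class IntArraySorterSketch extends ShiftingSorterSketch {
    private final int[] a;
    private int saved;

    IntArraySorterSketch(int[] a) {
        this.a = a;
    }

    @Override protected int compare(int i, int j) { return Integer.compare(a[i], a[j]); }
    @Override protected void swap(int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
    @Override protected void save(int i) { saved = a[i]; }
    @Override protected int compareSaved(int j) { return Integer.compare(saved, a[j]); }
    @Override protected void copy(int src, int dest) { a[dest] = a[src]; }
    @Override protected void restore(int dest) { a[dest] = saved; }

    public static void main(String[] args) {
        int[] data = {3, 1, 2};
        new IntArraySorterSketch(data).sort(0, data.length);
        System.out.println(Arrays.toString(data));
    }
}
```

Callers that only use swap-based algorithms would extend the base class and never have to implement the extra primitives.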
[jira] [Commented] (LUCENE-4936) docvalues date compression
[ https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13638090#comment-13638090 ] Adrien Grand commented on LUCENE-4936: -- I guess the point was to avoid one level of indirection in case all values can be stored using a single byte. Maybe "(maxValue - minValue) > 256" should be replaced with "(maxValue - minValue) >= uniqueValues.size()"? This would ensure that table compression isn't used if values are already dense? > docvalues date compression > -- > > Key: LUCENE-4936 > URL: https://issues.apache.org/jira/browse/LUCENE-4936 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Robert Muir >Assignee: Adrien Grand > Fix For: 4.4 > > Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch > > > DocValues fields can be very wasteful if you are storing dates (like solr's > TrieDateField does if you enable docvalues) and don't actually need all the > precision: e.g. "date-only" fields like date of birth with no time component, > time fields without milliseconds precision, and so on. > Ideally we'd compute GCD of all the values to save space > (numberOfTrailingZeros is not really enough here), but i think we should at > least look for values like 8640, 360, and 1000 to be practical.
[jira] [Commented] (LUCENE-4936) docvalues date compression
[ https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13638114#comment-13638114 ] Adrien Grand commented on LUCENE-4936: -- One advantage of DELTA_COMPRESSED is that it uses different numbers of bits per value per block. Even if max-min=200, it could still happen that most blocks only require 6 or 7 bits per value. If there are many blocks, this could save substantial disk/memory. > docvalues date compression > -- > > Key: LUCENE-4936 > URL: https://issues.apache.org/jira/browse/LUCENE-4936 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Robert Muir >Assignee: Adrien Grand > Fix For: 4.4 > > Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch > > > DocValues fields can be very wasteful if you are storing dates (like solr's > TrieDateField does if you enable docvalues) and don't actually need all the > precision: e.g. "date-only" fields like date of birth with no time component, > time fields without milliseconds precision, and so on. > Ideally we'd compute GCD of all the values to save space > (numberOfTrailingZeros is not really enough here), but i think we should at > least look for values like 8640, 360, and 1000 to be practical. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
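Editorial sketch of the point made above about DELTA_COMPRESSED: when each block picks its own bit width, a large global range does not force every block to pay for it. The helper below is a simplification with hypothetical names; it ignores per-block metadata and lets a constant block use 0 bits per value, which may differ from what the actual format does.

```java
final class BlockBitsSketch {
    // Bits needed to represent a non-negative delta (0 for a delta of 0).
    static int bitsRequired(long maxDelta) {
        return 64 - Long.numberOfLeadingZeros(maxDelta);
    }

    // Total payload bits when every block of blockSize values uses its own
    // minimum and its own bit width. Assumes blockSize > 0.
    static long blockwiseBits(long[] values, int blockSize) {
        long total = 0;
        for (int start = 0; start < values.length; start += blockSize) {
            int end = Math.min(values.length, start + blockSize);
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (int i = start; i < end; i++) {
                min = Math.min(min, values[i]);
                max = Math.max(max, values[i]);
            }
            total += (long) (end - start) * bitsRequired(max - min);
        }
        return total;
    }

    // Total payload bits when one bit width is used for all values.
    // Assumes values is non-empty.
    static long globalBits(long[] values) {
        return blockwiseBits(values, values.length);
    }

    public static void main(String[] args) {
        long[] values = {0, 1, 2, 3, 200, 201, 202, 203};
        System.out.println("global:    " + globalBits(values) + " bits");
        System.out.println("blockwise: " + blockwiseBits(values, 4) + " bits");
    }
}
```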
[jira] [Commented] (LUCENE-4936) docvalues date compression
[ https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13638117#comment-13638117 ] Adrien Grand commented on LUCENE-4936: -- bq. In this case should we just take bitsRequired on both sides? Yes, this makes sense ! > docvalues date compression > -- > > Key: LUCENE-4936 > URL: https://issues.apache.org/jira/browse/LUCENE-4936 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Robert Muir >Assignee: Adrien Grand > Fix For: 4.4 > > Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch > > > DocValues fields can be very wasteful if you are storing dates (like solr's > TrieDateField does if you enable docvalues) and don't actually need all the > precision: e.g. "date-only" fields like date of birth with no time component, > time fields without milliseconds precision, and so on. > Ideally we'd compute GCD of all the values to save space > (numberOfTrailingZeros is not really enough here), but i think we should at > least look for values like 8640, 360, and 1000 to be practical. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4955) NGramTokenFilter increments positions for each gram
[ https://issues.apache.org/jira/browse/LUCENE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641567#comment-13641567 ] Adrien Grand commented on LUCENE-4955: -- Given that offsets can't go backwards and that tokens in the same position must have the same start offset, I think that the only way to get NGramTokenFilter out of TestRandomChains' exclusion list (LUCENE-4641) is to fix position increments (this issue), change the order tokens are emitted in (LUCENE-3920) and stop modifying offsets? I know some people rely on the current behavior but I think it's more important to get this filter out of TestRandomChains' exclusions since it causes highlighting bugs and makes the term vectors files unnecessarily larger. > NGramTokenFilter increments positions for each gram > --- > > Key: LUCENE-4955 > URL: https://issues.apache.org/jira/browse/LUCENE-4955 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 4.3 >Reporter: Simon Willnauer > Fix For: 5.0, 4.4 > > Attachments: highlighter-test.patch, LUCENE-4955.patch > > > NGramTokenFilter increments positions for each gram rather than for the actual > token which can lead to rather funny problems especially with highlighting. > Whether this filter should be used for highlighting is a different story but today > this seems to be a common practice in many situations to highlight sub-term > matches. > I have a test for highlighting that uses ngram failing with a StringIOOB > since tokens are sorted by position which causes offsets to be mixed up due > to ngram token filter.
[jira] [Commented] (LUCENE-4955) NGramTokenFilter increments positions for each gram
[ https://issues.apache.org/jira/browse/LUCENE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641706#comment-13641706 ] Adrien Grand commented on LUCENE-4955: -- +1 I'll work on fixing NGramTokenizer and NGramTokenFilter. > NGramTokenFilter increments positions for each gram > --- > > Key: LUCENE-4955 > URL: https://issues.apache.org/jira/browse/LUCENE-4955 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 4.3 >Reporter: Simon Willnauer > Fix For: 5.0, 4.4 > > Attachments: highlighter-test.patch, highlighter-test.patch, > LUCENE-4955.patch > > > NGramTokenFilter increments positions for each gram rather than for the actual > token which can lead to rather funny problems especially with highlighting. > Whether this filter should be used for highlighting is a different story but today > this seems to be a common practice in many situations to highlight sub-term > matches. > I have a test for highlighting that uses ngram failing with a StringIOOB > since tokens are sorted by position which causes offsets to be mixed up due > to ngram token filter.
[jira] [Assigned] (LUCENE-4959) Incorrect return value from SimpleNaiveBayesClassifier.assignClass
[ https://issues.apache.org/jira/browse/LUCENE-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand reassigned LUCENE-4959: Assignee: Adrien Grand > Incorrect return value from SimpleNaiveBayesClassifier.assignClass > --- > > Key: LUCENE-4959 > URL: https://issues.apache.org/jira/browse/LUCENE-4959 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 5.0, 4.2.1 >Reporter: Alexey Kutin >Assignee: Adrien Grand > Labels: classification > > The local copy of BytesRef referenced by foundClass is affected by subsequent > TermsEnum.iterator.next() calls as the shared BytesRef.bytes changes. > If a term "test" gives a good match and a next term in the terms collection > is "classification" with a lower match score then the return result will be > "clas" -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
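Editorial sketch of the bug described in the issue above, reproduced without Lucene: an enumerator reuses one byte buffer for every term, so a reference that shares the buffer changes when the enumerator advances. The Ref and ReusingEnum classes are hypothetical stand-ins for BytesRef and TermsEnum, and the sketch assumes terms of at most 64 bytes. In Lucene itself, copying the bytes (for example with BytesRef.deepCopyOf) is the usual way to keep a term beyond the next call to next().

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SharedBufferSketch {
    // Hypothetical stand-in for a BytesRef-like slice: a byte[] plus a length.
    static final class Ref {
        final byte[] bytes;
        int length;

        Ref(byte[] bytes, int length) {
            this.bytes = bytes;
            this.length = length;
        }

        // Copies the length but keeps pointing at the same byte[] (the bug).
        Ref shallowCopy() {
            return new Ref(bytes, length);
        }

        // Copies the bytes too, so later buffer reuse cannot affect the copy.
        Ref deepCopy() {
            return new Ref(Arrays.copyOf(bytes, length), length);
        }

        String utf8() {
            return new String(bytes, 0, length, StandardCharsets.UTF_8);
        }
    }

    // Hypothetical enumerator that reuses one buffer for every term.
    static final class ReusingEnum {
        private final String[] terms;
        private final Ref current = new Ref(new byte[64], 0);
        private int pos = -1;

        ReusingEnum(String... terms) {
            this.terms = terms;
        }

        Ref next() {
            if (++pos >= terms.length) {
                return null;
            }
            byte[] b = terms[pos].getBytes(StandardCharsets.UTF_8);
            System.arraycopy(b, 0, current.bytes, 0, b.length);
            current.length = b.length;
            return current;
        }
    }

    // Returns {value seen through the shallow copy, value seen through the deep copy}.
    static String[] demo() {
        ReusingEnum te = new ReusingEnum("test", "classification");
        Ref best = te.next();          // "test" is the best match so far
        Ref shallow = best.shallowCopy();
        Ref deep = best.deepCopy();
        te.next();                     // buffer now holds "classification"
        return new String[] {shallow.utf8(), deep.utf8()};
    }

    public static void main(String[] args) {
        String[] r = demo();
        System.out.println("shallow copy: " + r[0]); // prints "shallow copy: clas"
        System.out.println("deep copy:    " + r[1]); // prints "deep copy:    test"
    }
}
```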
[jira] [Commented] (LUCENE-4957) Stop IndexWriter from writing broken term vector offset data in 5.0
[ https://issues.apache.org/jira/browse/LUCENE-4957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642201#comment-13642201 ] Adrien Grand commented on LUCENE-4957: -- +1 > Stop IndexWriter from writing broken term vector offset data in 5.0 > --- > > Key: LUCENE-4957 > URL: https://issues.apache.org/jira/browse/LUCENE-4957 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir > > Today we allow this in (some analyzers are broken), and only reject them if > someone is indexing offsets into the postings lists. > But we should ban this also when term vectors are enabled. Its time to stop > writing this broken data and let broken analyzers be broken. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4955) NGramTokenFilter increments positions for each gram
[ https://issues.apache.org/jira/browse/LUCENE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4955: - Attachment: LUCENE-4955.patch I tried to iterate on Simon's patch: * NGramTokenFilter doesn't modify offsets and emits all n-grams of a single term at the same position * NGramTokenizer uses a sliding window. * NGramTokenizer and NGramTokenFilter removed from TestRandomChains exclusions. It was very hard to add the compatibility version support to NGramTokenizer so there are now two distinct classes and the factory picks the right one depending on the Lucene match version. Simon's highlighting test now fails because the highlighted content is different, but not because of a broken token stream. > NGramTokenFilter increments positions for each gram > --- > > Key: LUCENE-4955 > URL: https://issues.apache.org/jira/browse/LUCENE-4955 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 4.3 >Reporter: Simon Willnauer > Fix For: 5.0, 4.4 > > Attachments: highlighter-test.patch, highlighter-test.patch, > LUCENE-4955.patch, LUCENE-4955.patch > > > NGramTokenFilter increments positions for each gram rather for the actual > token which can lead to rather funny problems especially with highlighting. > if this filter should be used for highlighting is a different story but today > this seems to be a common practice in many situations to highlight sub-term > matches. > I have a test for highlighting that uses ngram failing with a StringIOOB > since tokens are sorted by position which causes offsets to be mixed up due > to ngram token filter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
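As a rough illustration of the filter behaviour described in the comment above (all n-grams of a term at one position, offsets left untouched), here is a minimal sketch in plain Java. `Gram` and `grams` are hypothetical names, the emission order (by start, then by length) is an assumption, and the code works on `char` indices rather than code points.

```java
import java.util.ArrayList;
import java.util.List;

final class NGramPositionSketch {
    /** Hypothetical token record: text, position increment and offsets. */
    static final class Gram {
        final String text;
        final int positionIncrement;
        final int startOffset;
        final int endOffset;
        Gram(String text, int positionIncrement, int startOffset, int endOffset) {
            this.text = text;
            this.positionIncrement = positionIncrement;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
        @Override public String toString() {
            return text + "(+" + positionIncrement + ")[" + startOffset + "," + endOffset + ")";
        }
    }

    /**
     * Emits all n-grams of one term. Only the first gram advances the position;
     * every gram keeps the offsets of the original term.
     */
    static List<Gram> grams(String term, int termStartOffset, int minGram, int maxGram) {
        List<Gram> out = new ArrayList<>();
        int termEndOffset = termStartOffset + term.length();
        boolean first = true;
        for (int start = 0; start < term.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= term.length(); len++) {
                out.add(new Gram(term.substring(start, start + len),
                        first ? 1 : 0, termStartOffset, termEndOffset));
                first = false;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [ab(+1)[10,13), abc(+0)[10,13), bc(+0)[10,13)]
        System.out.println(grams("abc", 10, 2, 3));
    }
}
```

Because only the first gram carries a non-zero increment, sorting tokens by position can no longer interleave grams of different terms, which is what tripped up the highlighter test.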
[jira] [Resolved] (LUCENE-4955) NGramTokenFilter increments positions for each gram
[ https://issues.apache.org/jira/browse/LUCENE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4955. -- Resolution: Fixed > NGramTokenFilter increments positions for each gram > --- > > Key: LUCENE-4955 > URL: https://issues.apache.org/jira/browse/LUCENE-4955 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 4.3 >Reporter: Simon Willnauer > Fix For: 5.0, 4.4 > > Attachments: highlighter-test.patch, highlighter-test.patch, > LUCENE-4955.patch, LUCENE-4955.patch > > > NGramTokenFilter increments positions for each gram rather for the actual > token which can lead to rather funny problems especially with highlighting. > if this filter should be used for highlighting is a different story but today > this seems to be a common practice in many situations to highlight sub-term > matches. > I have a test for highlighting that uses ngram failing with a StringIOOB > since tokens are sorted by position which causes offsets to be mixed up due > to ngram token filter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-4955) NGramTokenFilter increments positions for each gram
[ https://issues.apache.org/jira/browse/LUCENE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand reassigned LUCENE-4955: Assignee: Adrien Grand > NGramTokenFilter increments positions for each gram > --- > > Key: LUCENE-4955 > URL: https://issues.apache.org/jira/browse/LUCENE-4955 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 4.3 >Reporter: Simon Willnauer >Assignee: Adrien Grand > Fix For: 5.0, 4.4 > > Attachments: highlighter-test.patch, highlighter-test.patch, > LUCENE-4955.patch, LUCENE-4955.patch > > > NGramTokenFilter increments positions for each gram rather for the actual > token which can lead to rather funny problems especially with highlighting. > if this filter should be used for highlighting is a different story but today > this seems to be a common practice in many situations to highlight sub-term > matches. > I have a test for highlighting that uses ngram failing with a StringIOOB > since tokens are sorted by position which causes offsets to be mixed up due > to ngram token filter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-3920) ngram tokenizer/filters create nonsense offsets if followed by a word combiner
[ https://issues.apache.org/jira/browse/LUCENE-3920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-3920. -- Resolution: Fixed Assignee: Adrien Grand Fixed by LUCENE-4955. > ngram tokenizer/filters create nonsense offsets if followed by a word combiner > -- > > Key: LUCENE-3920 > URL: https://issues.apache.org/jira/browse/LUCENE-3920 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 3.6, 4.0-ALPHA >Reporter: Robert Muir >Assignee: Adrien Grand > Attachments: LUCENE-3920_test.patch > > > It seems like maybe its possibly applying the offsets from the wrong token? > Because after shingling, the resulting token has a startOffset thats after > the endoffset. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-1227) NGramTokenizer to handle more than 1024 chars
[ https://issues.apache.org/jira/browse/LUCENE-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-1227. -- Resolution: Fixed LUCENE-4955 fixed this issue. > NGramTokenizer to handle more than 1024 chars > - > > Key: LUCENE-1227 > URL: https://issues.apache.org/jira/browse/LUCENE-1227 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Hiroaki Kawai >Priority: Minor > Attachments: LUCENE-1227.patch, NGramTokenizer.patch, > NGramTokenizer.patch > > > Current NGramTokenizer can't handle character stream that is longer than > 1024. This is too short for non-whitespace-separated languages. > I created a patch for this issues. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-2947) NGramTokenizer shouldn't trim whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-2947. -- Resolution: Fixed NGramTokenizer doesn't trim whitespaces anymore (LUCENE-4955). > NGramTokenizer shouldn't trim whitespace > > > Key: LUCENE-2947 > URL: https://issues.apache.org/jira/browse/LUCENE-2947 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Affects Versions: 3.0.3 >Reporter: David Byrne >Priority: Minor > Attachments: LUCENE-2947.patch, NGramTokenizerTest.java > > > Before I tokenize my strings, I am padding them with white space: > String foobar = " " + foo + " " + bar + " "; > When constructing term vectors from ngrams, this strategy has a couple > benefits. First, it places special emphasis on the starting and ending of a > word. Second, it improves the similarity between phrases with swapped words. > " foo bar " matches " bar foo " more closely than "foo bar" matches "bar > foo". > The problem is that Lucene's NGramTokenizer trims whitespace. This forces me > to do some preprocessing on my strings before I can tokenize them: > foobar.replaceAll(" ","$"); //arbitrary char not in my data > This is undocumented, so users won't realize their strings are being > trim()'ed, unless they look through the source, or examine the tokens > manually. > I am proposing NGramTokenizer should be changed to respect whitespace. Is > there a compelling reason against this? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-1224) NGramTokenFilter creates bad TokenStream
[ https://issues.apache.org/jira/browse/LUCENE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-1224. -- Resolution: Fixed All n-grams now have the same position and offsets as the original token (LUCENE-4955). > NGramTokenFilter creates bad TokenStream > > > Key: LUCENE-1224 > URL: https://issues.apache.org/jira/browse/LUCENE-1224 > Project: Lucene - Core > Issue Type: Bug > Components: modules/analysis >Reporter: Hiroaki Kawai >Priority: Minor > Fix For: 4.3 > > Attachments: LUCENE-1224.patch, NGramTokenFilter.patch, > NGramTokenFilter.patch > > > With current trunk NGramTokenFilter(min=2,max=4) , I index "abcdef" string > into an index, but I can't query it with "abc". If I query with "ab", I can > get a hit result. > The reason is that the NGramTokenFilter generates a badly ordered TokenStream. > Query is based on the Token order in the TokenStream, that how stemming or > phrase should be analyzed is based on the order (Token.positionIncrement). > With current filter, query string "abc" is tokenized to : ab bc abc > meaning "query a string that has ab bc abc in this order". > Expected filter will generate : ab abc(positionIncrement=0) bc > meaning "query a string that has (ab|abc) bc in this order" > I'd like to submit a patch for this issue. :-) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-1227) NGramTokenizer to handle more than 1024 chars
[ https://issues.apache.org/jira/browse/LUCENE-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643326#comment-13643326 ] Adrien Grand commented on LUCENE-1227: -- David, sorry, I didn't know about your patch and happened to fix this issue as part of LUCENE-4955. Your patch seems to operate very similarly and adds support for whitespace collapsing, is that correct? Don't hesitate to tell me if you think the current implementation needs improvement. > NGramTokenizer to handle more than 1024 chars > - > > Key: LUCENE-1227 > URL: https://issues.apache.org/jira/browse/LUCENE-1227 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/analysis >Reporter: Hiroaki Kawai >Priority: Minor > Attachments: LUCENE-1227.patch, NGramTokenizer.patch, > NGramTokenizer.patch > > > Current NGramTokenizer can't handle character stream that is longer than > 1024. This is too short for non-whitespace-separated languages. > I created a patch for this issues. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-4963) Deprecate broken TokenFilter constructors
Adrien Grand created LUCENE-4963: Summary: Deprecate broken TokenFilter constructors Key: LUCENE-4963 URL: https://issues.apache.org/jira/browse/LUCENE-4963 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Fix For: 4.4 We have some TokenFilters which are only broken with specific options. This includes: * TrimFilter when updateOffsets=true * StopFilter, JapanesePartOfSpeechStopFilter, KeepWordFilter, LengthFilter, TypeTokenFilter when enablePositionIncrements=false I think we should deprecate these behaviors in 4.4 and remove them in trunk. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4959) Incorrect return value from SimpleNaiveBayesClassifier.assignClass
[ https://issues.apache.org/jira/browse/LUCENE-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4959. -- Resolution: Fixed Thanks Alexey! > Incorrect return value from SimpleNaiveBayesClassifier.assignClass > --- > > Key: LUCENE-4959 > URL: https://issues.apache.org/jira/browse/LUCENE-4959 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 5.0, 4.2.1 >Reporter: Alexey Kutin >Assignee: Adrien Grand > Labels: classification > Attachments: LUCENE-4959.patch > > > The local copy of BytesRef referenced by foundClass is affected by subsequent > TermsEnum.iterator.next() calls as the shared BytesRef.bytes changes. > If a term "test" gives a good match and a next term in the terms collection > is "classification" with a lower match score then the return result will be > "clas" -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4966) Add CachingWrapperFilter.sizeInBytes()
[ https://issues.apache.org/jira/browse/LUCENE-4966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644512#comment-13644512 ] Adrien Grand commented on LUCENE-4966: -- +1 I wish we had such methods for the terms index, norms/doc values, stored fields/term vectors index, etc. too in order to get a better understanding of how Lucene uses memory. > Add CachingWrapperFilter.sizeInBytes() > -- > > Key: LUCENE-4966 > URL: https://issues.apache.org/jira/browse/LUCENE-4966 > Project: Lucene - Core > Issue Type: Improvement >Affects Versions: 5.0, 4.4 >Reporter: Michael McCandless >Assignee: Michael McCandless > Attachments: LUCENE-4966.patch > > > I think it's useful to be able to check how much RAM a given CWF is using ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4936) docvalues date compression
[ https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4936. -- Resolution: Fixed > docvalues date compression > -- > > Key: LUCENE-4936 > URL: https://issues.apache.org/jira/browse/LUCENE-4936 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Robert Muir >Assignee: Adrien Grand > Fix For: 4.4 > > Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, > LUCENE-4936.patch, LUCENE-4936.patch > > > DocValues fields can be very wasteful if you are storing dates (like solr's > TrieDateField does if you enable docvalues) and don't actually need all the > precision: e.g. "date-only" fields like date of birth with no time component, > time fields without milliseconds precision, and so on. > Ideally we'd compute GCD of all the values to save space > (numberOfTrailingZeros is not really enough here), but i think we should at > least look for values like 8640, 360, and 1000 to be practical. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
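The GCD idea in the description can be sketched in a few lines of plain Java. This is an illustration of the arithmetic only, not the committed patch: the class and method names are hypothetical, and the sketch assumes that `value - min` does not overflow a `long`.

```java
final class GcdCompressionSketch {
    /** Euclid's algorithm on non-negative longs; gcd(0, x) is x. */
    static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    /** GCD of all deltas from the minimum value (0 if all values are equal). */
    static long commonDivisor(long[] values) {
        long min = Long.MAX_VALUE;
        for (long v : values) min = Math.min(min, v);
        long g = 0;
        for (long v : values) g = gcd(g, v - min); // assumes no overflow
        return g;
    }

    /** Bits needed to store an unsigned value in [0, maxValue], at least 1. */
    static int bitsRequired(long maxValue) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    public static void main(String[] args) {
        long day = 86_400_000L; // milliseconds per day
        long[] dates = { 10 * day, 12 * day, 15 * day };
        long g = commonDivisor(dates);
        System.out.println(g);                     // 86400000
        System.out.println(bitsRequired(5 * day)); // 29 bits per raw delta
        System.out.println(bitsRequired(5));       // 3 bits per delta divided by g
    }
}
```

Each value would then be stored as `(value - min) / g` and decoded as `min + g * stored`, so date-only fields shrink from 29 bits to 3 bits per value in this toy example.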
[jira] [Updated] (LUCENE-4963) Deprecate broken TokenFilter constructors
[ https://issues.apache.org/jira/browse/LUCENE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4963: - Attachment: LUCENE-4963.patch Thanks Uwe for the advice. Here is a first patch: * Deprecate constructors that expose broken options and make them throw an IllegalArgumentException when the lucene match version is >= 4.4 * Remove the same constructors from TestRandomChains' exclusion list. * Since enablePositionIncrements=false was used by the Analyzing and Fuzzy suggesters to ignore position holes, I had to make it an option in the suggesters themselves instead of the token streams. * More documentation in the oal.analysis package: PositionLengthAttribute and guidelines on writing non-corrupt token streams. > Deprecate broken TokenFilter constructors > - > > Key: LUCENE-4963 > URL: https://issues.apache.org/jira/browse/LUCENE-4963 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand > Fix For: 4.4 > > Attachments: LUCENE-4963.patch > > > We have some TokenFilters which are only broken with specific options. This > includes: > * TrimFilter when updateOffsets=true > * StopFilter, JapanesePartOfSpeechStopFilter, KeepWordFilter, LengthFilter, > TypeTokenFilter when enablePositionIncrements=false > I think we should deprecate these behaviors in 4.4 and remove them in trunk. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
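The first bullet of the patch summary (deprecated constructors that reject the broken option from a given match version on) could look roughly like the sketch below. `MatchVersion` and `HoleAwareFilter` are hypothetical names invented for illustration; they are not Lucene's `Version` or any real filter.

```java
final class DeprecatedOptionSketch {
    /** Hypothetical stand-in for a compatibility version. */
    enum MatchVersion {
        V4_3, V4_4;
        boolean onOrAfter(MatchVersion other) { return compareTo(other) >= 0; }
    }

    /** Hypothetical filter with an option that is only accepted for old versions. */
    static final class HoleAwareFilter {
        final boolean enablePositionIncrements;

        /** Preferred constructor: the safe behaviour is the only behaviour. */
        HoleAwareFilter(MatchVersion version) {
            this(version, true);
        }

        /** @deprecated the broken option is kept only for pre-4.4 compatibility. */
        @Deprecated
        HoleAwareFilter(MatchVersion version, boolean enablePositionIncrements) {
            if (!enablePositionIncrements && version.onOrAfter(MatchVersion.V4_4)) {
                throw new IllegalArgumentException(
                        "enablePositionIncrements=false is not supported as of 4.4");
            }
            this.enablePositionIncrements = enablePositionIncrements;
        }
    }

    public static void main(String[] args) {
        new HoleAwareFilter(MatchVersion.V4_3, false); // still accepted for old indexes
        try {
            new HoleAwareFilter(MatchVersion.V4_4, false);
        } catch (IllegalArgumentException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```

Keeping the old constructor but failing fast on the broken combination lets existing 4.3-era configurations keep working while new code cannot opt into the corrupt behaviour.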
[jira] [Commented] (LUCENE-4963) Deprecate broken TokenFilter constructors
[ https://issues.apache.org/jira/browse/LUCENE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645378#comment-13645378 ] Adrien Grand commented on LUCENE-4963: -- Hi Uwe, thanks for doing the review! The patch applies to trunk and I plan to remove deprecations in a second step. Is it OK with you? > Deprecate broken TokenFilter constructors > - > > Key: LUCENE-4963 > URL: https://issues.apache.org/jira/browse/LUCENE-4963 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand > Fix For: 4.4 > > Attachments: LUCENE-4963.patch > > > We have some TokenFilters which are only broken with specific options. This > includes: > * TrimFilter when updateOffsets=true > * StopFilter, JapanesePartOfSpeechStopFilter, KeepWordFilter, LengthFilter, > TypeTokenFilter when enablePositionIncrements=false > I think we should deprecate these behaviors in 4.4 and remove them in trunk. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4970) NGramPhraseQuery is not boosted.
[ https://issues.apache.org/jira/browse/LUCENE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645442#comment-13645442 ] Adrien Grand commented on LUCENE-4970: -- Hi Shingo, you are right. NGramPhraseQuery.rewrite should propagate the boost to the rewritten query. Would you like to submit a patch? (see http://wiki.apache.org/lucene-java/HowToContribute) > NGramPhraseQuery is not boosted. > > > Key: LUCENE-4970 > URL: https://issues.apache.org/jira/browse/LUCENE-4970 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 4.1 >Reporter: Shingo Sasaki > > If I apply setBoost() method to NGramPhraseQuery, Score will not change. > I think setBoost() is forgotten after the query is optimized in the rewrite() method. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
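The fix discussed here amounts to copying the boost onto whatever query `rewrite` returns. A minimal sketch of that pattern, with a hypothetical `PhraseQ` class in place of Lucene's query classes and with term positions ignored for brevity:

```java
import java.util.ArrayList;
import java.util.List;

final class BoostRewriteSketch {
    /** Hypothetical phrase query: an ordered list of n-gram terms plus a boost. */
    static final class PhraseQ {
        final List<String> terms = new ArrayList<>();
        float boost = 1.0f;
    }

    /**
     * Drops redundant n-grams (keeps every n-th term and the last one) and,
     * crucially, carries the boost over to the rewritten query.
     */
    static PhraseQ rewrite(PhraseQ original, int n) {
        if (original.terms.size() < 2) {
            return original; // nothing to optimize
        }
        PhraseQ optimized = new PhraseQ();
        int last = original.terms.size() - 1;
        for (int i = 0; i <= last; i++) {
            if (i % n == 0 || i == last) {
                optimized.terms.add(original.terms.get(i));
            }
        }
        optimized.boost = original.boost; // the step that was missing
        return optimized;
    }

    public static void main(String[] args) {
        PhraseQ q = new PhraseQ();
        q.terms.addAll(List.of("ab", "bc", "cd", "de"));
        q.boost = 2.5f;
        PhraseQ r = rewrite(q, 2);
        System.out.println(r.terms + " boost=" + r.boost); // [ab, cd, de] boost=2.5
    }
}
```

Without the marked assignment the rewritten query silently falls back to the default boost of 1.0, which is why calling setBoost() appeared to have no effect on scores.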
[jira] [Assigned] (LUCENE-4970) NGramPhraseQuery is not boosted.
[ https://issues.apache.org/jira/browse/LUCENE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand reassigned LUCENE-4970: Assignee: Adrien Grand > NGramPhraseQuery is not boosted. > > > Key: LUCENE-4970 > URL: https://issues.apache.org/jira/browse/LUCENE-4970 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 4.1 >Reporter: Shingo Sasaki >Assignee: Adrien Grand > > If I apply setBoost() method to NGramPhraseQuery, Score will not change. > I think setBoost() is forgotten after the query is optimized in the rewrite() method. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4970) NGramPhraseQuery is not boosted.
[ https://issues.apache.org/jira/browse/LUCENE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4970. -- Resolution: Fixed Committed, thank you Shingo! > NGramPhraseQuery is not boosted. > > > Key: LUCENE-4970 > URL: https://issues.apache.org/jira/browse/LUCENE-4970 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 4.1 >Reporter: Shingo Sasaki >Assignee: Adrien Grand > Attachments: LUCENE-4970.patch > > > If I apply setBoost() method to NGramPhraseQuery, Score will not change. > I think setBoost() is forgotten after the query is optimized in the rewrite() method. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4970) NGramPhraseQuery is not boosted.
[ https://issues.apache.org/jira/browse/LUCENE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4970: - Fix Version/s: 4.4 > NGramPhraseQuery is not boosted. > > > Key: LUCENE-4970 > URL: https://issues.apache.org/jira/browse/LUCENE-4970 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 4.1 >Reporter: Shingo Sasaki >Assignee: Adrien Grand > Fix For: 4.4 > > Attachments: LUCENE-4970.patch > > > If I apply setBoost() method to NGramPhraseQuery, Score will not change. > I think setBoost() is forgotten after the query is optimized in the rewrite() method. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4946) Refactor SorterTemplate
[ https://issues.apache.org/jira/browse/LUCENE-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4946: - Attachment: LUCENE-4946.patch This patch contains one base class Sorter and 3 implementations: * IntroSorter (improved quicksort like we had before but I think the name is better since it makes it clear that the worst case complexity is O(n ln(n)) instead of O(n^2) as with traditional quicksort) * InPlaceMergeSort, the merge sort we had before. * TimSort, an improved version of the previous implementation that can gallop to make sorting even faster on partially-sorted data. One major difference is that the end offsets are now exclusive. I tend to find it less confusing since you would now call {{sort(0, array.length)}} instead of {{sort(0, array.length - 1)}}. Please let me know if you would like to review the patch! > Refactor SorterTemplate > --- > > Key: LUCENE-4946 > URL: https://issues.apache.org/jira/browse/LUCENE-4946 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand >Priority: Trivial > Attachments: LUCENE-4946.patch > > > When working on TimSort (LUCENE-4839), I was a little frustrated of not being > able to add galloping support because it would have required to add new > primitive operations in addition to compare and swap. > I started working on a prototype that uses inheritance to allow some sorting > algorithms to rely on additional primitive operations. You can have a look at > https://github.com/jpountz/sorts/tree/master/src/java/net/jpountz/sorts (but > beware it is a prototype and still misses proper documentation and good > tests). 
> I think it would offer several advantages: > - no more need to implement setPivot and comparePivot when using in-place > merge sort or insertion sort, > - the ability to use faster stable sorting algorithms at the cost of some > memory overhead (our in-place merge sort is very slow), > - the ability to implement properly algorithms that are useful on specific > datasets but require different primitive operations (such as TimSort for > partially-sorted data). > If you are interested in comparing these implementations with Arrays.sort, > there is a Benchmark class in src/examples. > What do you think? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
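To make the compare/swap design and the exclusive end offset concrete, here is a minimal sketch of such a base class. It is not the code from the patch: `SorterSketch` is a hypothetical name, and a plain insertion sort stands in for IntroSorter, InPlaceMergeSort and TimSort to keep the example short.

```java
import java.util.Arrays;

abstract class SorterSketch {
    /** Compares the entries at slots i and j, like Comparator.compare. */
    protected abstract int compare(int i, int j);

    /** Swaps the entries at slots i and j. */
    protected abstract void swap(int i, int j);

    /** Sorts the slice [from, to): from is inclusive, to is exclusive. */
    final void sort(int from, int to) {
        if (from < 0 || to < from) {
            throw new IllegalArgumentException("invalid range [" + from + ", " + to + ")");
        }
        // Insertion sort, used here only because it needs nothing beyond compare and swap.
        for (int i = from + 1; i < to; i++) {
            for (int j = i; j > from && compare(j - 1, j) > 0; j--) {
                swap(j - 1, j);
            }
        }
    }

    /** Convenience wrapper that sorts a slice of an int[] in place. */
    static void sortInts(int[] a, int from, int to) {
        new SorterSketch() {
            @Override protected int compare(int i, int j) { return Integer.compare(a[i], a[j]); }
            @Override protected void swap(int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
        }.sort(from, to);
    }

    public static void main(String[] args) {
        int[] a = { 5, 4, 3, 2, 1 };
        sortInts(a, 0, a.length); // whole array, no "length - 1" needed
        System.out.println(Arrays.toString(a)); // [1, 2, 3, 4, 5]
    }
}
```

An algorithm that needs extra primitives (for example saving an element to a temporary slot for galloping) would add further abstract methods in a subclass, which is the inheritance-based extension point the description argues for.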
[jira] [Updated] (LUCENE-4946) Refactor SorterTemplate
[ https://issues.apache.org/jira/browse/LUCENE-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4946: - Attachment: LUCENE-4946.patch Add missing @lucene.internal. > Refactor SorterTemplate > --- > > Key: LUCENE-4946 > URL: https://issues.apache.org/jira/browse/LUCENE-4946 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand >Priority: Trivial > Attachments: LUCENE-4946.patch, LUCENE-4946.patch > > > When working on TimSort (LUCENE-4839), I was a little frustrated of not being > able to add galloping support because it would have required to add new > primitive operations in addition to compare and swap. > I started working on a prototype that uses inheritance to allow some sorting > algorithms to rely on additional primitive operations. You can have a look at > https://github.com/jpountz/sorts/tree/master/src/java/net/jpountz/sorts (but > beware it is a prototype and still misses proper documentation and good > tests). > I think it would offer several advantages: > - no more need to implement setPivot and comparePivot when using in-place > merge sort or insertion sort, > - the ability to use faster stable sorting algorithms at the cost of some > memory overhead (our in-place merge sort is very slow), > - the ability to implement properly algorithms that are useful on specific > datasets but require different primitive operations (such as TimSort for > partially-sorted data). > If you are interested in comparing these implementations with Arrays.sort, > there is a Benchmark class in src/examples. > What do you think? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4946) Refactor SorterTemplate
[ https://issues.apache.org/jira/browse/LUCENE-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648271#comment-13648271 ] Adrien Grand commented on LUCENE-4946: -- bq. Its also useful for other projects, so its maybe a good idea to make an Apache Commons project out of it. Why not. Or maybe use an already existing commons project such as commons collections? I'll look into that. bq. I found some code duplication I'll fix that. The reason is that I modified ArrayUtil and CollectionUtil which have their own private Sorter implementations and then I added tests which required me to have concrete implementations in src/test. I'll merge them. bq. We should remove the following from NOTICE.txt I'll fix that too. bq. Perhaps the best way to change it would be to give (startIndex, elementsCount) which still reads (0, array.length) in most cases and does not have the problems mentioned above... I have no strong opinion about that. I think the reason I like the (from,to) option better is that List.subList and Arrays.copyOfRange have the same arguments. For example someone who wants to sort a sub-list with the JDK would do {{Collections.sort(list.subList(from,to))}}. So I think it'd be nice to make it directly translatable to {{new InPlaceMergeSorter() \{ compare/swap \}.sort(from, to)}}. > Refactor SorterTemplate > --- > > Key: LUCENE-4946 > URL: https://issues.apache.org/jira/browse/LUCENE-4946 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand >Priority: Trivial > Attachments: LUCENE-4946.patch, LUCENE-4946.patch > > > When working on TimSort (LUCENE-4839), I was a little frustrated of not being > able to add galloping support because it would have required to add new > primitive operations in addition to compare and swap. > I started working on a prototype that uses inheritance to allow some sorting > algorithms to rely on additional primitive operations. 
You can have a look at > https://github.com/jpountz/sorts/tree/master/src/java/net/jpountz/sorts (but > beware it is a prototype and still misses proper documentation and good > tests). > I think it would offer several advantages: > - no more need to implement setPivot and comparePivot when using in-place > merge sort or insertion sort, > - the ability to use faster stable sorting algorithms at the cost of some > memory overhead (our in-place merge sort is very slow), > - the ability to implement properly algorithms that are useful on specific > datasets but require different primitive operations (such as TimSort for > partially-sorted data). > If you are interested in comparing these implementations with Arrays.sort, > there is a Benchmark class in src/examples. > What do you think? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
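To make the (from, to) convention discussed above concrete, here is a minimal sketch of a sorter driven only by compare and swap, with a half-open range like List.subList(from, to). RangeSorter and sortRange are hypothetical names used for illustration; this is not Lucene's InPlaceMergeSorter, and it uses a plain insertion sort rather than a merge sort.

```java
import java.util.Arrays;

public class RangeSortSketch {

    /** Hypothetical stand-in: a sorter that only knows how to compare and swap slots. */
    abstract static class RangeSorter {
        protected abstract int compare(int i, int j);

        protected abstract void swap(int i, int j);

        /** Sorts the half-open range [from, to), mirroring List.subList(from, to). */
        public final void sort(int from, int to) {
            if (from < 0 || from > to) {
                throw new IllegalArgumentException("invalid range: " + from + ", " + to);
            }
            // Simple insertion sort: stable and in place, but O(n^2).
            for (int i = from + 1; i < to; i++) {
                for (int j = i; j > from && compare(j - 1, j) > 0; j--) {
                    swap(j - 1, j);
                }
            }
        }
    }

    /** Sorts values[from..to) in place by supplying the two primitive operations. */
    static void sortRange(final int[] values, int from, int to) {
        new RangeSorter() {
            @Override
            protected int compare(int i, int j) {
                return Integer.compare(values[i], values[j]);
            }

            @Override
            protected void swap(int i, int j) {
                int tmp = values[i];
                values[i] = values[j];
                values[j] = tmp;
            }
        }.sort(from, to);
    }

    public static void main(String[] args) {
        int[] values = {9, 5, 3, 8, 1, 7};
        sortRange(values, 1, 5); // only indices 1..4 are touched
        System.out.println(Arrays.toString(values)); // [9, 1, 3, 5, 8, 7]
    }
}
```

With this shape, sorting a sub-range reads much like Collections.sort(list.subList(from, to)), which is the argument made in the comment for preferring (from, to) over (startIndex, elementsCount).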
[jira] [Updated] (LUCENE-4946) Refactor SorterTemplate
[ https://issues.apache.org/jira/browse/LUCENE-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-4946: - Attachment: LUCENE-4946.patch New Patch: * no more code duplication between ArrayUtil and the test classes * ArrayUtil exposes a NATURAL_COMPARATOR to sort arrays based on the natural order (for objects that implement Comparable) * Removed references to CGlib in the NOTICE. > Refactor SorterTemplate > --- > > Key: LUCENE-4946 > URL: https://issues.apache.org/jira/browse/LUCENE-4946 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand >Priority: Trivial > Attachments: LUCENE-4946.patch, LUCENE-4946.patch, LUCENE-4946.patch > > > When working on TimSort (LUCENE-4839), I was a little frustrated of not being > able to add galloping support because it would have required to add new > primitive operations in addition to compare and swap. > I started working on a prototype that uses inheritance to allow some sorting > algorithms to rely on additional primitive operations. You can have a look at > https://github.com/jpountz/sorts/tree/master/src/java/net/jpountz/sorts (but > beware it is a prototype and still misses proper documentation and good > tests). > I think it would offer several advantages: > - no more need to implement setPivot and comparePivot when using in-place > merge sort or insertion sort, > - the ability to use faster stable sorting algorithms at the cost of some > memory overhead (our in-place merge sort is very slow), > - the ability to implement properly algorithms that are useful on specific > datasets but require different primitive operations (such as TimSort for > partially-sorted data). > If you are interested in comparing these implementations with Arrays.sort, > there is a Benchmark class in src/examples. > What do you think? -- This message is automatically generated by JIRA. 
[jira] [Commented] (LUCENE-4946) Refactor SorterTemplate
[ https://issues.apache.org/jira/browse/LUCENE-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648403#comment-13648403 ] Adrien Grand commented on LUCENE-4946: -- bq. make a Apache Commons projects out of it I just left an email on their dev@ mailing-list to get their opinion about it: http://markmail.org/message/if5cgarhavzuy45j. > Refactor SorterTemplate > --- > > Key: LUCENE-4946 > URL: https://issues.apache.org/jira/browse/LUCENE-4946 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand >Priority: Trivial > Attachments: LUCENE-4946.patch, LUCENE-4946.patch, LUCENE-4946.patch > > > When working on TimSort (LUCENE-4839), I was a little frustrated of not being > able to add galloping support because it would have required to add new > primitive operations in addition to compare and swap. > I started working on a prototype that uses inheritance to allow some sorting > algorithms to rely on additional primitive operations. You can have a look at > https://github.com/jpountz/sorts/tree/master/src/java/net/jpountz/sorts (but > beware it is a prototype and still misses proper documentation and good > tests). > I think it would offer several advantages: > - no more need to implement setPivot and comparePivot when using in-place > merge sort or insertion sort, > - the ability to use faster stable sorting algorithms at the cost of some > memory overhead (our in-place merge sort is very slow), > - the ability to implement properly algorithms that are useful on specific > datasets but require different primitive operations (such as TimSort for > partially-sorted data). > If you are interested in comparing these implementations with Arrays.sort, > there is a Benchmark class in src/examples. > What do you think? -- This message is automatically generated by JIRA. 
[jira] [Created] (LUCENE-4977) Forbidden-apis: avoid calls to Collections.sort
Adrien Grand created LUCENE-4977: Summary: Forbidden-apis: avoid calls to Collections.sort Key: LUCENE-4977 URL: https://issues.apache.org/jira/browse/LUCENE-4977 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Priority: Minor Collections.sort works by dumping its content into an array, sorting it with Arrays.sort and then getting the elements back into the list. By contrast, CollectionUtil can sort in place when the list supports random access, which is more memory-efficient and maybe even faster in some cases. We could use the forbidden-apis tool to prevent our code from calling Collections.sort. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
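The difference described above can be sketched as follows. copySort mimics the copy-out, sort, copy-back behavior the message attributes to Collections.sort (as in the JDKs of that time; newer JDKs may behave differently), while inPlaceSort works directly on a random-access list. Both method names are hypothetical, and the insertion sort is only a stand-in for the faster in-place algorithms in CollectionUtil.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.ListIterator;

public class ListSortSketch {

    /** Copy-based sort: dump into an array, sort the array, write elements back. */
    @SuppressWarnings("unchecked")
    static <T> void copySort(List<T> list, Comparator<? super T> cmp) {
        Object[] tmp = list.toArray();      // extra O(n) memory
        Arrays.sort((T[]) tmp, cmp);
        ListIterator<T> it = list.listIterator();
        for (Object o : tmp) {
            it.next();
            it.set((T) o);                  // overwrite each slot in order
        }
    }

    /** In-place sort: no temporary array, relies on cheap get/set by index. */
    static <T> void inPlaceSort(List<T> list, Comparator<? super T> cmp) {
        for (int i = 1; i < list.size(); i++) {
            for (int j = i; j > 0 && cmp.compare(list.get(j - 1), list.get(j)) > 0; j--) {
                Collections.swap(list, j - 1, j);
            }
        }
    }

    public static void main(String[] args) {
        List<String> a = new ArrayList<>(Arrays.asList("pear", "apple", "fig"));
        List<String> b = new ArrayList<>(a);
        copySort(a, Comparator.<String>naturalOrder());
        inPlaceSort(b, Comparator.<String>naturalOrder());
        System.out.println(a); // [apple, fig, pear]
        System.out.println(b); // [apple, fig, pear]
    }
}
```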
[jira] [Resolved] (LUCENE-4946) Refactor SorterTemplate
[ https://issues.apache.org/jira/browse/LUCENE-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4946. -- Resolution: Fixed Fix Version/s: 4.4 > Refactor SorterTemplate > --- > > Key: LUCENE-4946 > URL: https://issues.apache.org/jira/browse/LUCENE-4946 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand >Priority: Trivial > Fix For: 4.4 > > Attachments: LUCENE-4946.patch, LUCENE-4946.patch, LUCENE-4946.patch > > > When working on TimSort (LUCENE-4839), I was a little frustrated of not being > able to add galloping support because it would have required to add new > primitive operations in addition to compare and swap. > I started working on a prototype that uses inheritance to allow some sorting > algorithms to rely on additional primitive operations. You can have a look at > https://github.com/jpountz/sorts/tree/master/src/java/net/jpountz/sorts (but > beware it is a prototype and still misses proper documentation and good > tests). > I think it would offer several advantages: > - no more need to implement setPivot and comparePivot when using in-place > merge sort or insertion sort, > - the ability to use faster stable sorting algorithms at the cost of some > memory overhead (our in-place merge sort is very slow), > - the ability to implement properly algorithms that are useful on specific > datasets but require different primitive operations (such as TimSort for > partially-sorted data). > If you are interested in comparing these implementations with Arrays.sort, > there is a Benchmark class in src/examples. > What do you think? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4963) Deprecate broken TokenFilter constructors
[ https://issues.apache.org/jira/browse/LUCENE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648559#comment-13648559 ] Adrien Grand commented on LUCENE-4963: -- I'll commit this soon unless someone objects. > Deprecate broken TokenFilter constructors > - > > Key: LUCENE-4963 > URL: https://issues.apache.org/jira/browse/LUCENE-4963 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand > Fix For: 4.4 > > Attachments: LUCENE-4963.patch > > > We have some TokenFilters which are only broken with specific options. This > includes: > * TrimFilter when updateOffsets=true > * StopFilter, JapanesePartOfSpeechStopFilter, KeepWordFilter, LengthFilter, > TypeTokenFilter when enablePositionIncrements=false > I think we should deprecate these behaviors in 4.4 and remove them in trunk. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4963) Deprecate broken TokenFilter constructors
[ https://issues.apache.org/jira/browse/LUCENE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-4963. -- Resolution: Fixed Thank you Uwe! > Deprecate broken TokenFilter constructors > - > > Key: LUCENE-4963 > URL: https://issues.apache.org/jira/browse/LUCENE-4963 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand > Fix For: 4.4 > > Attachments: LUCENE-4963.patch > > > We have some TokenFilters which are only broken with specific options. This > includes: > * TrimFilter when updateOffsets=true > * StopFilter, JapanesePartOfSpeechStopFilter, KeepWordFilter, LengthFilter, > TypeTokenFilter when enablePositionIncrements=false > I think we should deprecate these behaviors in 4.4 and remove them in trunk. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813164#comment-16813164 ] Adrien Grand commented on LUCENE-8753: -- bq. BlockTree and UniformSplit had the same QPS for Term and Phrase queries. I didn't understand why a different behavior between a small and a large index. I think this is expected. Query processing needs to look up the term in the terms dict and then process documents that contain this term. When the index gets larger, postings usually grow more quickly than the terms dictionary, so processing postings takes relatively more time than looking up the term in the terms dictionary. Term dictionary lookup performance only really matters for queries that have few matches (which you somehow simulated by running the benchmark on wikimedium500k) and updates, which are simulated by the PKLookup task. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. 
> For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
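The seek-floor idea from the quoted description can be illustrated with a small sketch. It rests on several assumptions: a TreeMap stands in for the FST, blocks have a fixed size, and each block is keyed by its first term, whereas the proposal keys blocks by a minimal distinguishing prefix and lets block sizes vary. buildBlocks and contains are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SeekFloorSketch {

    /** Splits sorted terms into fixed-size blocks, keyed by the first term of each block. */
    static TreeMap<String, List<String>> buildBlocks(List<String> sortedTerms, int blockSize) {
        TreeMap<String, List<String>> blocks = new TreeMap<>();
        for (int start = 0; start < sortedTerms.size(); start += blockSize) {
            int end = Math.min(start + blockSize, sortedTerms.size());
            blocks.put(sortedTerms.get(start), sortedTerms.subList(start, end));
        }
        return blocks;
    }

    /** Seek-floor lookup: find the block with the greatest key <= term, then scan it. */
    static boolean contains(TreeMap<String, List<String>> blocks, String term) {
        Map.Entry<String, List<String>> block = blocks.floorEntry(term);
        return block != null && block.getValue().contains(term);
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apple", "banana", "cherry", "date", "fig", "grape");
        TreeMap<String, List<String>> blocks = buildBlocks(terms, 2);
        System.out.println(blocks.keySet());             // [apple, cherry, fig]
        System.out.println(contains(blocks, "date"));    // true
        System.out.println(contains(blocks, "coconut")); // false
    }
}
```

The index only needs one key per block, which is why it can stay small; the cost is a short scan inside the selected block.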
[jira] [Commented] (LUCENE-8708) Can we simplify conjunctions of range queries automatically?
[ https://issues.apache.org/jira/browse/LUCENE-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813199#comment-16813199 ] Adrien Grand commented on LUCENE-8708: -- Thanks Atri for giving it a try! This change is a bit too invasive for my taste given that this is only a nice feature to have. That said I don't really have ideas about how to make it better... > Can we simplify conjunctions of range queries automatically? > > > Key: LUCENE-8708 > URL: https://issues.apache.org/jira/browse/LUCENE-8708 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Attachments: interval_range_clauses_merging0704.patch > > > BooleanQuery#rewrite already has some logic to make queries more efficient, > such as deduplicating filters or rewriting boolean queries that wrap a single > positive clause to that clause. > It would be nice to also simplify conjunctions of range queries, so that eg. > {{foo: [5 TO *] AND foo:[* TO 20]}} would be rewritten to {{foo:[5 TO 20]}}. > When constructing queries manually or via the classic query parser, it feels > unnecessary as this is something that the user can fix easily. However if you > want to implement a query parser that only allows specifying one bound at > once, such as Gmail ({{after:2018-12-31}} > https://support.google.com/mail/answer/7190?hl=en) or GitHub > ({{updated:>=2018-12-31}} > https://help.github.com/en/articles/searching-issues-and-pull-requests#search-by-when-an-issue-or-pull-request-was-created-or-last-updated) > then you might end up with inefficient queries if the end user specifies > both an upper and a lower bound. It would be nice if we optimized those > automatically. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
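As an illustration of the rewrite described in the issue, here is a minimal sketch that intersects two inclusive ranges on the same field. The Range class is hypothetical and unrelated to Lucene's query classes or to the attached patch; a null bound stands for an open bound (*).

```java
public class RangeMergeSketch {

    /** Hypothetical inclusive range on a field; null means the bound is open. */
    static final class Range {
        final String field;
        final Long lower;
        final Long upper;

        Range(String field, Long lower, Long upper) {
            this.field = field;
            this.lower = lower;
            this.upper = upper;
        }

        /** Intersection of two ranges on the same field: max of lowers, min of uppers. */
        Range intersect(Range other) {
            if (!field.equals(other.field)) {
                throw new IllegalArgumentException("ranges are on different fields");
            }
            Long lo = lower;
            if (other.lower != null && (lo == null || other.lower > lo)) {
                lo = other.lower;
            }
            Long hi = upper;
            if (other.upper != null && (hi == null || other.upper < hi)) {
                hi = other.upper;
            }
            return new Range(field, lo, hi);
        }

        @Override
        public String toString() {
            String lo = lower == null ? "*" : String.valueOf(lower);
            String hi = upper == null ? "*" : String.valueOf(upper);
            return field + ":[" + lo + " TO " + hi + "]";
        }
    }

    public static void main(String[] args) {
        Range after = new Range("foo", 5L, null);   // foo:[5 TO *]
        Range before = new Range("foo", null, 20L); // foo:[* TO 20]
        System.out.println(after.intersect(before)); // foo:[5 TO 20]
    }
}
```

The sketch does not handle an empty intersection (lower greater than upper), which a real rewrite would presumably turn into a query that matches nothing.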
[jira] [Commented] (LUCENE-7386) Flatten nested disjunctions
[ https://issues.apache.org/jira/browse/LUCENE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813211#comment-16813211 ] Adrien Grand commented on LUCENE-7386: -- For the record I had to disable the verification of scores for this run of the benchmark since this change removes intermediate casts to float which trigger slight changes in the produced scores. > Flatten nested disjunctions > --- > > Key: LUCENE-7386 > URL: https://issues.apache.org/jira/browse/LUCENE-7386 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand >Priority: Minor > Attachments: LUCENE-7386.patch, LUCENE-7386.patch, LUCENE-7386.patch > > > Now that coords are gone it became easier to flatten nested disjunctions. It > might sound weird to write nested disjunctions in the first place, but > disjunctions can be created implicitly by other queries such as > more-like-this, LatLonPoint.newBoxQuery, non-scoring synonym queries, etc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
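Flattening nested disjunctions can be sketched on a toy query tree. Query, Term and Or below are hypothetical types, not Lucene's BooleanQuery; the sketch ignores scoring and options such as a minimum number of matching clauses, which the real change has to consider.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlattenSketch {

    interface Query {}

    /** Hypothetical leaf query. */
    static final class Term implements Query {
        final String text;

        Term(String text) {
            this.text = text;
        }

        @Override
        public String toString() {
            return text;
        }
    }

    /** Hypothetical pure disjunction: matches if any clause matches. */
    static final class Or implements Query {
        final List<Query> clauses;

        Or(Query... clauses) {
            this.clauses = Arrays.asList(clauses);
        }

        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder("(");
            for (int i = 0; i < clauses.size(); i++) {
                if (i > 0) {
                    sb.append(' ');
                }
                sb.append(clauses.get(i));
            }
            return sb.append(')').toString();
        }
    }

    /** Returns an equivalent disjunction whose clauses contain no nested Or. */
    static Or flatten(Or or) {
        List<Query> out = new ArrayList<>();
        collect(or, out);
        return new Or(out.toArray(new Query[0]));
    }

    private static void collect(Or or, List<Query> out) {
        for (Query q : or.clauses) {
            if (q instanceof Or) {
                collect((Or) q, out); // inline the inner disjunction's clauses
            } else {
                out.add(q);
            }
        }
    }

    public static void main(String[] args) {
        Or nested = new Or(new Term("a"), new Or(new Term("b"), new Or(new Term("c"), new Term("d"))));
        System.out.println(nested);          // (a (b (c d)))
        System.out.println(flatten(nested)); // (a b c d)
    }
}
```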
[jira] [Commented] (LUCENE-8738) Bump minimum Java version requirement to 11
[ https://issues.apache.org/jira/browse/LUCENE-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813346#comment-16813346 ] Adrien Grand commented on LUCENE-8738: -- There seem to be issues with links to the standard API. I wonder whether it might be related to the move from package-list to element-list. > Bump minimum Java version requirement to 11 > --- > > Key: LUCENE-8738 > URL: https://issues.apache.org/jira/browse/LUCENE-8738 > Project: Lucene - Core > Issue Type: Improvement > Components: general/build >Reporter: Adrien Grand >Priority: Minor > Labels: Java11 > Fix For: master (9.0) > > > See vote thread for reference: https://markmail.org/message/q6ubdycqscpl43aq. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8738) Bump minimum Java version requirement to 11
[ https://issues.apache.org/jira/browse/LUCENE-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813445#comment-16813445 ] Adrien Grand commented on LUCENE-8738: -- Apparently the issue can be worked around by calling the file package-list locally, even though it is supposed to be called element-list with the move to modules. I'll push a fix shortly. > Bump minimum Java version requirement to 11 > --- > > Key: LUCENE-8738 > URL: https://issues.apache.org/jira/browse/LUCENE-8738 > Project: Lucene - Core > Issue Type: Improvement > Components: general/build >Reporter: Adrien Grand >Priority: Minor > Labels: Java11 > Fix For: master (9.0) > > > See vote thread for reference: https://markmail.org/message/q6ubdycqscpl43aq. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8738) Bump minimum Java version requirement to 11
[ https://issues.apache.org/jira/browse/LUCENE-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813565#comment-16813565 ] Adrien Grand commented on LUCENE-8738: -- Sorry Uwe, I don't understand what you are suggesting. > Bump minimum Java version requirement to 11 > --- > > Key: LUCENE-8738 > URL: https://issues.apache.org/jira/browse/LUCENE-8738 > Project: Lucene - Core > Issue Type: Improvement > Components: general/build >Reporter: Adrien Grand >Priority: Minor > Labels: Java11 > Fix For: master (9.0) > > > See vote thread for reference: https://markmail.org/message/q6ubdycqscpl43aq. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-8619) Decrease I/O pressure of OfflineSorter
[ https://issues.apache.org/jira/browse/LUCENE-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-8619. -- Resolution: Not A Problem This isn't a problem anymore now that Ignacio rewrote the merging of BKD trees as a selection problem rather than a sorting problem. > Decrease I/O pressure of OfflineSorter > -- > > Key: LUCENE-8619 > URL: https://issues.apache.org/jira/browse/LUCENE-8619 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > > OfflineSorter is likely I/O bound, yet it doesn't really try to relieve I/O. > For instance it always writes the length on 2 bytes, which is wasteful when > used by BKDWriter since all byte[] arrays have exactly the same length. For > LatLonPoint, this is a 25% space overhead that we could remove. > Doing lightweight compression on the fly might also help. > As a data point, Ignacio told me that after indexing 60M shapes with > LatLonShape (1.65B triangles), the index directory was about 265GB and > dropped to 57GB when merging was over. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-8759) BlockMaxConjunctionScorer's simplified way of computing max scores hurts performance
Adrien Grand created LUCENE-8759:

Summary: BlockMaxConjunctionScorer's simplified way of computing max scores hurts performance
Key: LUCENE-8759
URL: https://issues.apache.org/jira/browse/LUCENE-8759
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand

BlockMaxConjunctionScorer computes the minimum value that the score should have after each scorer in order to be able to interrupt scoring as soon as possible. For instance say scorers A, B and C produce maximum scores that are equal to 4, 2 and 1. If the minimum competitive score is X, then the score after scoring A, B and C must be at least X, the score after scoring A and B must be at least X-1 and the score after scoring A must be at least X-1-2.

However this is made a bit more complex than that due to floating-point numbers and the fact that intermediate score values are doubles which only get cast to a float after all values have been summed up. In order to keep things simple, BlockMaxConjunctionScorer has the following comment and code

{code}
// Also compute the minimum required scores for a hit to be competitive
// A double that is less than 'score' might still be converted to 'score'
// when casted to a float, so we go to the previous float to avoid this issue
minScores[minScores.length - 1] = minScore > 0 ? Math.nextDown(minScore) : 0;
{code}

It simplifies the problem by calling Math.nextDown(minScore). However this is problematic because it defeats the fact that TopScoreDocCollector calls setMinCompetitiveScore on the float value that is immediately greater than the k-th greatest hit so far. nextDown(minScore) is not the value that we need. The value that we need is the smallest double that converts to minScore when cast to a float, which would be half-way between nextDown(minScore) and minScore. In some cases this would help get better performance out of conjunctions, especially if some clauses produce constant scores.

MaxScoreSumPropagator#setMinCompetitiveScore has the same issue.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
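The two points made in LUCENE-8759 can be checked with a small sketch. minPartialScores is a hypothetical helper that reproduces the X, X-1, X-1-2 arithmetic from the example; it is not the code of BlockMaxConjunctionScorer. The second part shows that a double strictly below a float value can still be narrowed to that float, which is the situation the quoted comment guards against.

```java
import java.util.Arrays;

public class MinScoreSketch {

    /**
     * minScores[i] is the lowest partial sum over scorers 0..i that can still
     * reach minScore, given the maximum score each later scorer can add.
     */
    static double[] minPartialScores(double[] maxScores, double minScore) {
        double[] minScores = new double[maxScores.length];
        double required = minScore;
        for (int i = maxScores.length - 1; i >= 0; i--) {
            minScores[i] = required;
            required -= maxScores[i]; // scorer i can contribute at most maxScores[i]
        }
        return minScores;
    }

    public static void main(String[] args) {
        // Scorers A, B, C with maximum scores 4, 2, 1 and minimum competitive score X = 5.
        double[] minScores = minPartialScores(new double[] {4, 2, 1}, 5);
        System.out.println(Arrays.toString(minScores)); // [2.0, 4.0, 5.0]

        // A double just below 1.0 still rounds to 1.0f when narrowed to float.
        double justBelow = Math.nextDown(1.0d);
        System.out.println(justBelow < 1.0d);           // true
        System.out.println((float) justBelow == 1.0f);  // true

        // The previous float is a much looser bound than that double.
        float previousFloat = Math.nextDown(1.0f);
        System.out.println(previousFloat < justBelow);  // true
    }
}
```

The gap between previousFloat and justBelow is the slack the issue is about: a threshold placed at the previous float accepts partial sums that can no longer round up to the minimum competitive score.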
[jira] [Created] (LUCENE-8760) Reconsider the best way to encode postings now that we can skip non-competitive hits
Adrien Grand created LUCENE-8760:

Summary: Reconsider the best way to encode postings now that we can skip non-competitive hits
Key: LUCENE-8760
URL: https://issues.apache.org/jira/browse/LUCENE-8760
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand

The fact that we now skip non-competitive hits has some implications for our postings:
- we are now more likely to call advance vs. nextDoc
- we are less likely to read term frequency for a given doc, since we only do that if the maximum score reported by impacts is competitive
- we are less likely to read positions for a given doc, since exact phrase queries first check the maximum score that would be obtained with a phrase freq equal to the minimum of all term freqs

It might be a good opportunity to re-explore the best way to encode postings.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-8762) Lucene50PostingsReader should specialize reading docs+freqs with impacts
Adrien Grand created LUCENE-8762: Summary: Lucene50PostingsReader should specialize reading docs+freqs with impacts Key: LUCENE-8762 URL: https://issues.apache.org/jira/browse/LUCENE-8762 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Currently if you ask for impacts, we only have one implementation that is able to expose everything: docs, freqs, positions and offsets. In contrast, if you don't need impacts, we have specialization for docs+freqs, docs+freqs+positions and docs+freqs+positions+offsets. Maybe we should add specialization for the docs+freqs case with impacts, which should be the most common case, and remove specialization for docs+freqs+positions when impacts are not requested? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8738) Bump minimum Java version requirement to 11
[ https://issues.apache.org/jira/browse/LUCENE-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814577#comment-16814577 ] Adrien Grand commented on LUCENE-8738: -- [~thetaphi] Do you know what still needs to be done before merging back to master? When we are done, or close to being done, I plan to send an email to the list to ask for some more eyes on changes that I did before merging, especially the Observable/Observer removal. > Bump minimum Java version requirement to 11 > --- > > Key: LUCENE-8738 > URL: https://issues.apache.org/jira/browse/LUCENE-8738 > Project: Lucene - Core > Issue Type: Improvement > Components: general/build >Reporter: Adrien Grand >Priority: Minor > Labels: Java11 > Fix For: master (9.0) > > > See vote thread for reference: https://markmail.org/message/q6ubdycqscpl43aq. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8738) Bump minimum Java version requirement to 11
[ https://issues.apache.org/jira/browse/LUCENE-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814823#comment-16814823 ] Adrien Grand commented on LUCENE-8738: -- [~thetaphi] I tested Eclipse indeed. I only had an issue with MockInitialContextFactory: Eclipse complains that it tries to access classes from a module it doesn't have access to. > Bump minimum Java version requirement to 11 > --- > > Key: LUCENE-8738 > URL: https://issues.apache.org/jira/browse/LUCENE-8738 > Project: Lucene - Core > Issue Type: Improvement > Components: general/build >Reporter: Adrien Grand >Priority: Minor > Labels: Java11 > Fix For: master (9.0) > > > See vote thread for reference: https://markmail.org/message/q6ubdycqscpl43aq. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8725) Make TermsQuery.SeekingTermSetTermsEnum public
[ https://issues.apache.org/jira/browse/LUCENE-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16815143#comment-16815143 ] Adrien Grand commented on LUCENE-8725: -- +1 to the patch, let's maybe make it internal rather than experimental? > Make TermsQuery.SeekingTermSetTermsEnum public > -- > > Key: LUCENE-8725 > URL: https://issues.apache.org/jira/browse/LUCENE-8725 > Project: Lucene - Core > Issue Type: Wish >Reporter: Noble Paul >Assignee: Noble Paul >Priority: Trivial > Fix For: 8.1 > > Attachments: LUCENE-8725.patch > > > I have come across use-cases where directly accessing {{TermsQuery}} can > help. If there is no objection I would like to make it public -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org