[jira] [Commented] (LUCENE-4902) Add a FilterDirectoryReader

2013-04-05 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13623748#comment-13623748
 ] 

Adrien Grand commented on LUCENE-4902:
--

+1

> Add a FilterDirectoryReader
> ---
>
> Key: LUCENE-4902
> URL: https://issues.apache.org/jira/browse/LUCENE-4902
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Minor
> Attachments: LUCENE-4902.patch, LUCENE-4902.patch
>
>
> A FilterDirectoryReader would allow you to easily wrap all subreaders of a 
> DirectoryReader with FilterAtomicReaders.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4903) Add AssertingScorer

2013-04-05 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4903:
-

Attachment: LUCENE-4903.patch

Patch

 * checks for in-order scoring when applicable

 * checks score values (not INFINITY or NaN)

 * checks that Scorer.score() is not called before iteration started or after 
it finished

 * reuses assertions of DocsEnum on Scorer

 * makes sure that nextDoc() and advance(target) are not called directly on 
"top scorers" (only from score(Collector)).

 * makes more tests use LuceneTestCase.newSearcher (most of the patch size)
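
As a rough illustration of the kind of checks listed above, here is a minimal, 
self-contained sketch. SimpleScorer and AssertingSimpleScorer are hypothetical 
stand-ins written for this example; they are not Lucene's Scorer API and not 
the code in the patch. The wrapper rejects out-of-order docs, invalid score 
values, and calls to score() made before iteration starts or after it finishes.

```java
// Hypothetical, simplified scorer interface; not Lucene's actual Scorer API.
interface SimpleScorer {
  int NO_MORE_DOCS = Integer.MAX_VALUE;

  int nextDoc();

  float score();
}

// Wraps another scorer and checks how it is used and what it returns.
final class AssertingSimpleScorer implements SimpleScorer {
  private final SimpleScorer in;
  private int doc = -1; // -1 means iteration has not started yet

  AssertingSimpleScorer(SimpleScorer in) {
    this.in = in;
  }

  @Override
  public int nextDoc() {
    if (doc == NO_MORE_DOCS) {
      throw new AssertionError("nextDoc() called after iteration finished");
    }
    int next = in.nextDoc();
    if (next <= doc) {
      throw new AssertionError("docs out of order: " + next + " after " + doc);
    }
    doc = next;
    return doc;
  }

  @Override
  public float score() {
    if (doc == -1) {
      throw new AssertionError("score() called before iteration started");
    }
    if (doc == NO_MORE_DOCS) {
      throw new AssertionError("score() called after iteration finished");
    }
    float s = in.score();
    if (Float.isNaN(s) || Float.isInfinite(s)) {
      throw new AssertionError("invalid score: " + s);
    }
    return s;
  }
}
```

Because the wrapper only delegates, it can be placed around any scorer in 
tests without changing what is being scored.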

> Add AssertingScorer
> ---
>
> Key: LUCENE-4903
> URL: https://issues.apache.org/jira/browse/LUCENE-4903
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-4903.patch
>
>
> I think we would benefit from having an AssertingScorer that would assert 
> that scorers are advanced correctly, return valid scores (eg. not NaN), ...




[jira] [Commented] (SOLR-4676) Share a Lucene FieldType instance instead of creating on each call to createField()

2013-04-05 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13623762#comment-13623762
 ] 

Adrien Grand commented on SOLR-4676:


{quote}
I agree with both of these statements. Can we remove createField() and 
eliminate this trap?
DocumentBuilder only calls createFields(), and that's... the only thing that 
should be calling this method?
{quote}

+1

> Share a Lucene FieldType instance instead of creating on each call to 
> createField()
> ---
>
> Key: SOLR-4676
> URL: https://issues.apache.org/jira/browse/SOLR-4676
> Project: Solr
>  Issue Type: Improvement
>  Components: Schema and Analysis
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Minor
> Attachments: SOLR-4676_Share_Lucene_FieldType_in_SchemaField.patch
>
>
> I think the Lucene FieldType instances should be cached on Solr's SchemaField 
> so that they don't have to be needlessly re-created for each indexed value 
> that runs through Solr in SchemaField.createField(). The only obstacle I see 
> to this is that getIndexOptions(field,val) takes the value, and if that value 
> were to alter the logic then the FieldType can't be shared. This is a 
> protected method and I don't see anything that overrides it, and the default 
> implementation doesn't use the value. So I think it can be removed.  Patch in 
> progress...




[jira] [Updated] (LUCENE-4858) Early termination with SortingMergePolicy

2013-04-06 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4858:
-

Attachment: LUCENE-4858.patch

Thanks Shai, this looks good! I modified your patch a bit to fix the collector 
constructor visibility (from protected to public) and added some documentation. 
I'd like to discuss whether we should actually add the name of the Sorter class 
in the "sorter" property of the diagnostics. I would rather remove it so that 
renaming a Sorter class doesn't break compatibility. What do you think?

> Early termination with SortingMergePolicy
> -
>
> Key: LUCENE-4858
> URL: https://issues.apache.org/jira/browse/LUCENE-4858
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: 4.3
>
> Attachments: LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch, 
> LUCENE-4858.patch, LUCENE-4858.patch
>
>
> Spin-off of LUCENE-4752, see 
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565
>  and 
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282
> When an index is sorted per-segment, queries that sort according to the index 
> sort order could be early terminated.
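
The idea in the description can be sketched as follows, under simplifying 
assumptions: one segment, documents already stored in the order the query 
sorts by, and a hypothetical topN helper that is not part of Lucene. Once n 
matching documents have been collected, no later document in the segment can 
rank better, so collection stops.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;

final class EarlyTerminationSketch {
  // valuesInIndexOrder[doc] is the sort value of doc; the segment is assumed
  // to be sorted by that value already, so doc order equals result order.
  static List<Integer> topN(long[] valuesInIndexOrder, LongPredicate matches, int n) {
    List<Integer> hits = new ArrayList<>();
    // The second loop condition is the early termination: stop at n hits.
    for (int doc = 0; doc < valuesInIndexOrder.length && hits.size() < n; doc++) {
      if (matches.test(valuesInIndexOrder[doc])) {
        hits.add(doc);
      }
    }
    return hits;
  }
}
```

Without the sort guarantee, every matching document would have to be visited 
and compared before the top n could be known.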




[jira] [Commented] (LUCENE-4903) Add AssertingScorer

2013-04-06 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13624419#comment-13624419
 ] 

Adrien Grand commented on LUCENE-4903:
--

The problem is that scorers are hard to track: scoring usually happens by 
calling Scorer.score(Collector), which itself calls 
Collector.setScorer(Scorer). Since the asserting scorer delegates to the 
wrapped one, the asserting scorer gets lost. This is why Collector.setScorer 
tries to get it back by using a weak hash map.

I'm not totally happy with it either and would really like to make 
Scorer.score(Collector) use methods from the asserting scorer directly. We 
can't rely on Scorer.score(Collector)'s default implementation since it relies 
on Scorer.nextDoc and some scorers such as BooleanScorer don't implement this 
method.
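
To make the weak hash map idea concrete, here is a minimal sketch. 
WrapperRegistry and its method names are hypothetical and only illustrate the 
lookup; this is not the code from the patch. Values are held through weak 
references as well, since a wrapper strongly references the object it wraps 
and would otherwise keep its own key alive.

```java
import java.lang.ref.WeakReference;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical helper: remembers which wrapper was created for which object.
final class WrapperRegistry {
  // Weak keys: an entry goes away once the wrapped object is unreachable.
  private static final Map<Object, WeakReference<Object>> WRAPPERS =
      Collections.synchronizedMap(new WeakHashMap<Object, WeakReference<Object>>());

  static void register(Object wrapped, Object wrapper) {
    WRAPPERS.put(wrapped, new WeakReference<Object>(wrapper));
  }

  // Returns the registered wrapper if it is still alive, else the argument.
  static Object wrapperFor(Object maybeWrapped) {
    WeakReference<Object> ref = WRAPPERS.get(maybeWrapped);
    Object wrapper = ref == null ? null : ref.get();
    return wrapper != null ? wrapper : maybeWrapped;
  }
}
```

A collector that receives the inner object can then call wrapperFor(...) to 
get the checking wrapper back, at the cost of a global map lookup.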

> Add AssertingScorer
> ---
>
> Key: LUCENE-4903
> URL: https://issues.apache.org/jira/browse/LUCENE-4903
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-4903.patch
>
>
> I think we would benefit from having an AssertingScorer that would assert 
> that scorers are advanced correctly, return valid scores (eg. not NaN), ...




[jira] [Resolved] (LUCENE-4911) Missing word "cela" in conf/lang/stopwords_fr.txt

2013-04-06 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4911.
--

Resolution: Fixed

Pierre, I just applied your patch to Lucene's stop list 
(http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/french_stop.txt?view=diff&r1=1465255&r2=1465256&pathrev=1465256).
 Thank you! This fix should be available in Lucene/Solr 4.3.

I also sent an email to snowball-discuss to mention this improvement: 
http://lists.tartarus.org/mailman/private/snowball-discuss/2013-April/001462.html

> Missing word "cela" in conf/lang/stopwords_fr.txt
> -
>
> Key: LUCENE-4911
> URL: https://issues.apache.org/jira/browse/LUCENE-4911
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 4.2
>Reporter: Pierre Kobylanski
>Assignee: Adrien Grand
>Priority: Trivial
> Attachments: stopwords_fr.txt.patch
>
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> NB: Not sure this defect is assigned to the right component.
> In file example/solr/collection1/conf/lang/stopwords_fr.txt,
> there is the word "celà". Though incorrect in French (cf 
> http://fr.wiktionary.org/wiki/cel%C3%A0), it's common, but we may also add 
> the correct spelling (i.e. "cela", without accent) to that stopwords list.
> Another thing: I noticed that "celà" is the only word of the list followed by 
> a non-breaking space. Is that wanted?




[jira] [Commented] (LUCENE-4858) Early termination with SortingMergePolicy

2013-04-09 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626429#comment-13626429
 ] 

Adrien Grand commented on LUCENE-4858:
--

Thanks for updating the patch, Shai.

bq. Adrien, do we have anything else to do here, or are we ready to go? If so, 
I'll add a CHANGES entry and commit later.

The patch looks good to me. Maybe NumericDocValuesSorter.getID() could just 
return 'fieldName'? I don't think it is necessary to describe the doc values 
type, since doc values types are mutually exclusive for a given field and doc 
values are the natural way to sort documents by field values in Lucene. 
Otherwise +1.

> Early termination with SortingMergePolicy
> -
>
> Key: LUCENE-4858
> URL: https://issues.apache.org/jira/browse/LUCENE-4858
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: 4.3
>
> Attachments: LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch, 
> LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch
>
>
> Spin-off of LUCENE-4752, see 
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565
>  and 
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282
> When an index is sorted per-segment, queries that sort according to the index 
> sort order could be early terminated.




[jira] [Commented] (LUCENE-4858) Early termination with SortingMergePolicy

2013-04-09 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626529#comment-13626529
 ] 

Adrien Grand commented on LUCENE-4858:
--

bq. The reason I did that is in case someone wants to sort by a stored 
field and a numeric field that have the same name.

A Sorter which sorts by stored field values would indeed need to add more 
information to its ID (at least to say that it is a stored field).

bq. "numericdv_field" is really unique, as you cannot have two numeric DV 
fields with the same name, but different meaning.

Since doc values types are exclusive, could we then just say that these are doc 
values without mentioning the type? I think this would help keep up with 
changes to doc values types (for example, there used to be BYTES_FIXED_SORTED 
and BYTES_VAR_SORTED, which have been merged into SORTED) and/or additions 
(SORTED_SET). I would also prefer having something even more human-readable 
(like "DocValues(fieldName=$fieldName,order=asc|desc)").
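
For illustration, an ID in the format suggested above could be built as in 
the sketch below. SorterIds and docValuesId are hypothetical names chosen for 
this example; the real getID() signature and final format are whatever the 
patch settles on.

```java
final class SorterIds {
  // Builds an ID that names the field and the order but not the doc values type.
  static String docValuesId(String fieldName, boolean ascending) {
    return "DocValues(fieldName=" + fieldName + ",order=" + (ascending ? "asc" : "desc") + ")";
  }
}
```

Leaving the concrete doc values type out of the string means the ID would not 
have to change if types are merged or added later.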



> Early termination with SortingMergePolicy
> -
>
> Key: LUCENE-4858
> URL: https://issues.apache.org/jira/browse/LUCENE-4858
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: 4.3
>
> Attachments: LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch, 
> LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch
>
>
> Spin-off of LUCENE-4752, see 
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565
>  and 
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282
> When an index is sorted per-segment, queries that sort according to the index 
> sort order could be early terminated.




[jira] [Commented] (LUCENE-4858) Early termination with SortingMergePolicy

2013-04-09 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626548#comment-13626548
 ] 

Adrien Grand commented on LUCENE-4858:
--

Sounds good to me!

> Early termination with SortingMergePolicy
> -
>
> Key: LUCENE-4858
> URL: https://issues.apache.org/jira/browse/LUCENE-4858
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: 4.3
>
> Attachments: LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch, 
> LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch
>
>
> Spin-off of LUCENE-4752, see 
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565
>  and 
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282
> When an index is sorted per-segment, queries that sort according to the index 
> sort order could be early terminated.




[jira] [Commented] (LUCENE-4858) Early termination with SortingMergePolicy

2013-04-09 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626637#comment-13626637
 ] 

Adrien Grand commented on LUCENE-4858:
--

+1

> Early termination with SortingMergePolicy
> -
>
> Key: LUCENE-4858
> URL: https://issues.apache.org/jira/browse/LUCENE-4858
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: 4.3
>
> Attachments: LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch, 
> LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch, LUCENE-4858.patch
>
>
> Spin-off of LUCENE-4752, see 
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13606565&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13606565
>  and 
> https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13607282&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13607282
> When an index is sorted per-segment, queries that sort according to the index 
> sort order could be early terminated.




[jira] [Commented] (LUCENE-4903) Add AssertingScorer

2013-04-09 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626638#comment-13626638
 ] 

Adrien Grand commented on LUCENE-4903:
--

This is a good idea, I didn't know of this class. I'll update the patch!

> Add AssertingScorer
> ---
>
> Key: LUCENE-4903
> URL: https://issues.apache.org/jira/browse/LUCENE-4903
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-4903.patch
>
>
> I think we would benefit from having an AssertingScorer that would assert 
> that scorers are advanced correctly, return valid scores (eg. not NaN), ...




[jira] [Commented] (SOLR-4581) sort-order of facet-counts depends on facet.mincount

2013-04-09 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626772#comment-13626772
 ] 

Adrien Grand commented on SOLR-4581:


Thanks for fixing the bug Yonik!

> sort-order of facet-counts depends on facet.mincount
> 
>
> Key: SOLR-4581
> URL: https://issues.apache.org/jira/browse/SOLR-4581
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.2
>Reporter: Alexander Buhr
>Assignee: Yonik Seeley
> Fix For: 4.3, 5.0
>
> Attachments: SOLR-4581.patch, SOLR-4581.patch
>
>
> I just upgraded to Solr 4.2 and cannot explain the following behaviour:
> I am using a solr.TrieDoubleField named 'ListPrice_EUR_INV' as a facet-field. 
> The solr-response for the query 
> {noformat}'solr/Products/select?q=*%3A*&wt=xml&indent=true&facet=true&facet.field=ListPrice_EUR_INV&f.ListPrice_EUR_INV.facet.sort=index'{noformat}
> includes the following facet-counts:
> {noformat}
>   1
>   1
>   1
> {noformat}
> If I also set the parameter *'facet.mincount=1'* in the query, the order of 
> the facet-counts is reversed.
> {noformat}
>   1
>   1
>   1
> {noformat}
> I would have expected the sort order of the facet counts not to be 
> affected by the facet.mincount parameter, as was the case in Solr 4.1.
> Is this related to SOLR-2850? 




[jira] [Created] (LUCENE-4921) Create a DocValuesFormat for sparse doc values

2013-04-09 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4921:


 Summary: Create a DocValuesFormat for sparse doc values
 Key: LUCENE-4921
 URL: https://issues.apache.org/jira/browse/LUCENE-4921
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/codecs
Reporter: Adrien Grand
Priority: Trivial


We could have a special DocValuesFormat in lucene/codecs to better handle 
sparse doc values.

See http://search-lucene.com/m/HUeYW1RlEtc




[jira] [Commented] (LUCENE-4904) Sorter API: Make NumericDocValuesSorter able to sort in reverse order

2013-04-09 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626982#comment-13626982
 ] 

Adrien Grand commented on LUCENE-4904:
--

We can add this ReverseOrderSorter, but as far as NumericDocValuesSorter is 
concerned, I would prefer to have the abstraction at the level of the 
DocComparator rather than the Sorter. This would allow 
{{Sorter.sort(int,DocComparator)}} to quickly return null without allocating 
(potentially lots of) memory for the doc maps if the reader is already sorted. 
Additionally, this would allow for more readable diagnostics (such as 
"DocValues(fieldName,desc)" instead of "Reverse(DocValues(fieldName,asc))").

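A minimal sketch of reversing at the comparator level, under stated 
assumptions: DocCmp and SortSketch are hypothetical stand-ins for this 
example, not the Sorter API from the patch. The point it illustrates is that 
the sort routine can check sortedness against whichever comparator it is 
given (plain or reversed) and return null before allocating any doc map.

```java
import java.util.Arrays;

// Hypothetical stand-in for a per-document comparator; not Lucene's class.
interface DocCmp {
  int compare(int docID1, int docID2);
}

final class SortSketch {
  // Reversal lives in the comparator, so the sort routine stays unchanged.
  static DocCmp reverse(DocCmp cmp) {
    return (a, b) -> cmp.compare(b, a);
  }

  // Returns null when docs 0..maxDoc-1 are already in order,
  // else a map from new doc ID to old doc ID.
  static int[] sort(int maxDoc, DocCmp cmp) {
    boolean sorted = true;
    for (int doc = 1; doc < maxDoc && sorted; doc++) {
      sorted = cmp.compare(doc - 1, doc) <= 0;
    }
    if (sorted) {
      return null; // fast path: no doc map is allocated
    }
    Integer[] docs = new Integer[maxDoc];
    for (int doc = 0; doc < maxDoc; doc++) {
      docs[doc] = doc;
    }
    Arrays.sort(docs, (a, b) -> cmp.compare(a, b)); // stable for object arrays
    int[] newToOld = new int[maxDoc];
    for (int i = 0; i < maxDoc; i++) {
      newToOld[i] = docs[i];
    }
    return newToOld;
  }
}
```

With a wrapper at the Sorter level instead, the inner sorter would first build 
its doc map and the wrapper would then invert it, so an index that is already 
in descending order could not take the null fast path as easily.
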

> Sorter API: Make NumericDocValuesSorter able to sort in reverse order
> -
>
> Key: LUCENE-4904
> URL: https://issues.apache.org/jira/browse/LUCENE-4904
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Trivial
>  Labels: newdev
> Fix For: 4.3
>
> Attachments: LUCENE-4904.patch, LUCENE-4904.patch, LUCENE-4904.patch
>
>
> Today it is only able to sort in ascending order.




[jira] [Updated] (LUCENE-4903) Add AssertingScorer

2013-04-09 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4903:
-

Attachment: LUCENE-4903.patch

New patch:

 * borrows Robert's idea to not delegate if the method has not been overridden,

 * AssertingScorer.score(Collector) either calls score(Collector) or 
score(Collector, NO_MORE_DOCS, nextDoc()) depending on random().nextBoolean()

 * modifies some join scorers so that nextDoc throws UOE instead of iterating 
out of order

 * adds an assertion to Scorer.score(Collector) to make sure that iteration has 
not started before this method is called

 * adds an assertion to Scorer.score(Collector, int, int) to make sure that 
docID() == firstDocID
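
The randomized choice between the two scoring entry points, together with the 
two new assertions, can be sketched as below. MiniScorer is a hypothetical, 
simplified class that collects doc IDs into a list; it only mirrors the shape 
of score(Collector) and score(Collector, int, int) and is not the patch's code.

```java
import java.util.List;
import java.util.Random;

// Hypothetical, simplified scorer over a fixed list of matching docs.
class MiniScorer {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;
  private final int[] docs; // matching doc IDs, ascending
  private int pos = -1;

  MiniScorer(int... docs) {
    this.docs = docs;
  }

  int docID() {
    if (pos < 0) {
      return -1; // not positioned yet
    }
    return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
  }

  int nextDoc() {
    if (pos < docs.length) {
      pos++;
    }
    return docID();
  }

  // Collects every match; iteration must not have started.
  void score(List<Integer> collected) {
    if (docID() != -1) {
      throw new AssertionError("iteration already started");
    }
    for (int d = nextDoc(); d != NO_MORE_DOCS; d = nextDoc()) {
      collected.add(d);
    }
  }

  // Collects matches in [firstDocID, max); must be positioned on firstDocID.
  boolean score(List<Integer> collected, int max, int firstDocID) {
    if (docID() != firstDocID) {
      throw new AssertionError("docID() != firstDocID");
    }
    int d;
    for (d = firstDocID; d < max; d = nextDoc()) {
      collected.add(d);
    }
    return d != NO_MORE_DOCS;
  }

  // Randomly exercises one of the two equivalent code paths.
  static void scoreAll(MiniScorer s, List<Integer> collected, Random random) {
    if (random.nextBoolean()) {
      s.score(collected);
    } else {
      s.score(collected, NO_MORE_DOCS, s.nextDoc());
    }
  }
}
```

Both branches should collect the same documents, so picking one at random 
gives test coverage of both paths over many runs.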

> Add AssertingScorer
> ---
>
> Key: LUCENE-4903
> URL: https://issues.apache.org/jira/browse/LUCENE-4903
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-4903.patch, LUCENE-4903.patch
>
>
> I think we would benefit from having an AssertingScorer that would assert 
> that scorers are advanced correctly, return valid scores (eg. not NaN), ...




[jira] [Comment Edited] (LUCENE-4903) Add AssertingScorer

2013-04-09 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627291#comment-13627291
 ] 

Adrien Grand edited comment on LUCENE-4903 at 4/10/13 12:05 AM:


New patch:

 * borrows Robert's idea to not delegate if the method has not been overridden,

 * AssertingScorer.score(Collector) either calls score(Collector) or 
score(Collector, NO_MORE_DOCS, nextDoc()) depending on random().nextBoolean()

 * modifies some join scorers so that nextDoc throws UOE instead of iterating 
out of order

 * adds an assertion to Scorer.score(Collector) to make sure that iteration has 
not started before this method is called

 * adds an assertion to Scorer.score(Collector, int, int) to make sure that 
docID() == firstDocID

  was (Author: jpountz):
New patch:

 * borrows Robert's idea to no delegate if the method has not been overridden,

 * AssertingScorer.score(Collector) either calls score(Collector) or 
score(Collector, NO_MORE_DOCS, nextDoc()) depending on random().nextBoolean()

 * modifies some join scorers so that nextDoc throws UOE instead of iterating 
out of order

 * adds an assertion to Scorer.score(Collector) to make sure that iteration has 
not started before this method is called

 * adds an assertion to Scorer.score(Collector, int, int) to make sure that 
docID() == firstDocID
  
> Add AssertingScorer
> ---
>
> Key: LUCENE-4903
> URL: https://issues.apache.org/jira/browse/LUCENE-4903
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-4903.patch, LUCENE-4903.patch
>
>
> I think we would benefit from having an AssertingScorer that would assert 
> that scorers are advanced correctly, return valid scores (eg. not NaN), ...




[jira] [Commented] (LUCENE-4903) Add AssertingScorer

2013-04-10 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627594#comment-13627594
 ] 

Adrien Grand commented on LUCENE-4903:
--

bq. So we don't need the weak map anymore right?

I think it could still be useful to Scorers that override {{score(Collector 
collector)}} and call {{collector.setScorer(this)}} in the body of this 
method.

bq. maybe AssertingWeight's scorer() method should create a new 
Random(random.nextLong()) to pass to the AssertingScorer when it creates it?

Good point. I'll update the patch.
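
A small sketch of why deriving a private Random helps, assuming only 
java.util.Random (DerivedRandom is a hypothetical helper name): the parent 
advances by exactly one nextLong() call per derived instance, no matter how 
many values the child draws afterwards, so the rest of a seeded test run 
stays reproducible.

```java
import java.util.Random;

final class DerivedRandom {
  // Draws exactly one long from the parent, whatever the child does later.
  static Random derive(Random parent) {
    return new Random(parent.nextLong());
  }
}
```

If the scorer shared the parent Random directly, the number of values it 
consumed would depend on how many documents it iterated over, which would 
shift every later random decision in the test.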

> Add AssertingScorer
> ---
>
> Key: LUCENE-4903
> URL: https://issues.apache.org/jira/browse/LUCENE-4903
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-4903.patch, LUCENE-4903.patch
>
>
> I think we would benefit from having an AssertingScorer that would assert 
> that scorers are advanced correctly, return valid scores (eg. not NaN), ...




[jira] [Commented] (LUCENE-4904) Sorter API: Make NumericDocValuesSorter able to sort in reverse order

2013-04-10 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627619#comment-13627619
 ] 

Adrien Grand commented on LUCENE-4904:
--

bq. This got me thinking if ascending/descending should be on the Sorter.sort 
API

I think it shouldn't for the reasons you mentioned.

The patch looks good to me, +1 to commit!

> Sorter API: Make NumericDocValuesSorter able to sort in reverse order
> -
>
> Key: LUCENE-4904
> URL: https://issues.apache.org/jira/browse/LUCENE-4904
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Trivial
>  Labels: newdev
> Fix For: 4.3
>
> Attachments: LUCENE-4904.patch, LUCENE-4904.patch, LUCENE-4904.patch, 
> LUCENE-4904.patch
>
>
> Today it is only able to sort in ascending order.




[jira] [Commented] (LUCENE-4904) Sorter API: Make NumericDocValuesSorter able to sort in reverse order

2013-04-10 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627653#comment-13627653
 ] 

Adrien Grand commented on LUCENE-4904:
--

That is OK with me.

> Sorter API: Make NumericDocValuesSorter able to sort in reverse order
> -
>
> Key: LUCENE-4904
> URL: https://issues.apache.org/jira/browse/LUCENE-4904
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Trivial
>  Labels: newdev
> Fix For: 4.3
>
> Attachments: LUCENE-4904.patch, LUCENE-4904.patch, LUCENE-4904.patch, 
> LUCENE-4904.patch
>
>
> Today it is only able to sort in ascending order.




[jira] [Created] (LUCENE-4924) Make DocIdSetIterator.docID() return -1 when not positioned

2013-04-10 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4924:


 Summary: Make DocIdSetIterator.docID() return -1 when not 
positioned
 Key: LUCENE-4924
 URL: https://issues.apache.org/jira/browse/LUCENE-4924
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Priority: Minor
 Fix For: 5.0


Today DocIdSetIterator.docID() can either return -1 or NO_MORE_DOCS when the 
enum is not positioned. I would like to only allow it to return -1 so that we 
can have better assertions.

(This proposal is for trunk only.)
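
To show what "better assertions" could mean here, the sketch below classifies 
an iterator's state from docID() alone. DocIdState is a hypothetical helper, 
not part of Lucene, and it assumes the proposed contract where -1 is the only 
value returned before the first call to nextDoc() or advance().

```java
final class DocIdState {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  enum State { NOT_STARTED, POSITIONED, EXHAUSTED }

  // Under the proposed contract each docID() value maps to exactly one state.
  static State of(int docID) {
    if (docID == -1) {
      return State.NOT_STARTED;
    }
    if (docID == NO_MORE_DOCS) {
      return State.EXHAUSTED;
    }
    if (docID < 0) {
      throw new IllegalArgumentException("invalid docID: " + docID);
    }
    return State.POSITIONED;
  }
}
```

When NO_MORE_DOCS is also allowed for an unpositioned iterator, a check like 
this cannot tell "not started" from "exhausted", which is the ambiguity the 
proposal removes.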




[jira] [Assigned] (LUCENE-4924) Make DocIdSetIterator.docID() return -1 when not positioned

2013-04-10 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand reassigned LUCENE-4924:


Assignee: Adrien Grand

> Make DocIdSetIterator.docID() return -1 when not positioned
> ---
>
> Key: LUCENE-4924
> URL: https://issues.apache.org/jira/browse/LUCENE-4924
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: 5.0
>
>
> Today DocIdSetIterator.docID() can either return -1 or NO_MORE_DOCS when the 
> enum is not positioned. I would like to only allow it to return -1 so that we 
> can have better assertions.
> (This proposal is for trunk only.)




[jira] [Created] (LUCENE-4925) IndexSearcher.search is broken when IndexSearcher.executor != null and the sort contains SortField.FIELD_SCORE

2013-04-10 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4925:


 Summary: IndexSearcher.search is broken when 
IndexSearcher.executor != null and the sort contains SortField.FIELD_SCORE
 Key: LUCENE-4925
 URL: https://issues.apache.org/jira/browse/LUCENE-4925
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.2.1
Reporter: Adrien Grand
Assignee: Adrien Grand
 Fix For: 4.3


When executor != null, IndexSearcher performs two passes to compute the top 
docs. This doesn't work when the sort contains SortField.FIELD_SCORE because 
the second pass doesn't have access to scores computed in the first pass.  
Since search(...) doesn't compute scores when there is a sort, they are all 
Float.NaN.
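
The effect can be sketched without Lucene. The code below is only an illustration of why a score-based merge cannot work when every score is NaN; the Hit class and mergeByScore method are hypothetical and are not IndexSearcher code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class NaNScoreMergeSketch {
  /** Hypothetical hit: a doc ID plus the score visible to the merging pass. */
  static final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
      this.doc = doc;
      this.score = score;
    }
  }

  /** Concatenates the per-slice hits and sorts them by descending score (stable sort). */
  static List<Hit> mergeByScore(List<List<Hit>> slices) {
    List<Hit> all = new ArrayList<>();
    for (List<Hit> slice : slices) {
      all.addAll(slice);
    }
    all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
    return all;
  }

  /** Extracts the doc IDs in merged order. */
  static List<Integer> docs(List<Hit> hits) {
    List<Integer> ids = new ArrayList<>();
    for (Hit h : hits) {
      ids.add(h.doc);
    }
    return ids;
  }

  public static void main(String[] args) {
    // Scores available: the merge orders hits across slices.
    List<List<Hit>> scored = Arrays.asList(
        Arrays.asList(new Hit(0, 0.2f), new Hit(1, 0.9f)),
        Arrays.asList(new Hit(2, 0.5f)));
    System.out.println(docs(mergeByScore(scored)));

    // Scores never computed: every NaN compares as equal, so input order is kept.
    List<List<Hit>> unscored = Arrays.asList(
        Arrays.asList(new Hit(0, Float.NaN), new Hit(1, Float.NaN)),
        Arrays.asList(new Hit(2, Float.NaN)));
    System.out.println(docs(mergeByScore(unscored)));
  }
}
```

In the second case the comparator sees only ties, so the result reflects slice order rather than relevance.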




[jira] [Updated] (LUCENE-4925) IndexSearcher.search is broken when IndexSearcher.executor != null and the sort contains SortField.FIELD_SCORE

2013-04-10 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4925:
-

Attachment: LUCENE-4925.patch

Patch. Without the patch applied, the new test in TestSort fails whenever 
LuceneTestCase.newSearcher returns a searcher that collects segments in 
parallel.

> IndexSearcher.search is broken when IndexSearcher.executor != null and the 
> sort contains SortField.FIELD_SCORE
> --
>
> Key: LUCENE-4925
> URL: https://issues.apache.org/jira/browse/LUCENE-4925
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 4.2.1
>Reporter: Adrien Grand
>Assignee: Adrien Grand
> Fix For: 4.3
>
> Attachments: LUCENE-4925.patch
>
>
> When executor != null, IndexSearcher performs two passes to compute the top 
> docs. This doesn't work when the sort contains SortField.FIELD_SCORE because 
> the second pass doesn't have access to scores computed in the first pass.  
> Since search(...) doesn't compute scores when there is a sort, they are all 
> Float.NaN.




[jira] [Resolved] (LUCENE-4925) IndexSearcher.search is broken when IndexSearcher.executor != null and the sort contains SortField.FIELD_SCORE

2013-04-10 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4925.
--

Resolution: Fixed

> IndexSearcher.search is broken when IndexSearcher.executor != null and the 
> sort contains SortField.FIELD_SCORE
> --
>
> Key: LUCENE-4925
> URL: https://issues.apache.org/jira/browse/LUCENE-4925
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 4.2.1
>Reporter: Adrien Grand
>Assignee: Adrien Grand
> Fix For: 4.3
>
> Attachments: LUCENE-4925.patch
>
>
> When executor != null, IndexSearcher performs two passes to compute the top 
> docs. This doesn't work when the sort contains SortField.FIELD_SCORE because 
> the second pass doesn't have access to scores computed in the first pass.  
> Since search(...) doesn't compute scores when there is a sort, they are all 
> Float.NaN.




[jira] [Resolved] (LUCENE-4903) Add AssertingScorer

2013-04-10 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4903.
--

Resolution: Fixed

I just committed. Hopefully this will find bugs in Scorers!

> Add AssertingScorer
> ---
>
> Key: LUCENE-4903
> URL: https://issues.apache.org/jira/browse/LUCENE-4903
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-4903.patch, LUCENE-4903.patch
>
>
> I think we would benefit from having an AssertingScorer that would assert 
> that scorers are advanced correctly, return valid scores (eg. not NaN), ...
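
The kind of checks described here can be sketched as a wrapper around a scorer. This is not the committed patch: SimpleScorer, CheckedScorer and CheckedScorerDemo are hypothetical names, the interface is far smaller than Lucene's Scorer, and exceptions are used instead of assert so the sketch behaves the same with or without -ea.

```java
/** Hypothetical minimal scorer API, far smaller than Lucene's Scorer. */
interface SimpleScorer {
  int NO_MORE_DOCS = Integer.MAX_VALUE;

  int docID();   // -1 before iteration starts, NO_MORE_DOCS once exhausted
  int nextDoc();
  float score();
}

/** Wraps another scorer and fails fast on misuse or invalid scores. */
final class CheckedScorer implements SimpleScorer {
  private final SimpleScorer in;

  CheckedScorer(SimpleScorer in) {
    this.in = in;
  }

  @Override
  public int docID() {
    return in.docID();
  }

  @Override
  public int nextDoc() {
    int previous = in.docID();
    if (previous == NO_MORE_DOCS) {
      throw new IllegalStateException("nextDoc() called after exhaustion");
    }
    int doc = in.nextDoc();
    if (doc <= previous) {
      throw new IllegalStateException("doc IDs must increase: " + previous + " -> " + doc);
    }
    return doc;
  }

  @Override
  public float score() {
    int doc = in.docID();
    if (doc == -1 || doc == NO_MORE_DOCS) {
      throw new IllegalStateException("score() called while not positioned");
    }
    float score = in.score();
    if (Float.isNaN(score) || Float.isInfinite(score)) {
      throw new IllegalStateException("invalid score: " + score);
    }
    return score;
  }
}

final class CheckedScorerDemo {
  /** Test double that visits the given doc IDs and always returns the same score. */
  static SimpleScorer constant(final int[] docs, final float score) {
    return new SimpleScorer() {
      private int index = -1;

      @Override
      public int docID() {
        if (index < 0) {
          return -1;
        }
        return index < docs.length ? docs[index] : NO_MORE_DOCS;
      }

      @Override
      public int nextDoc() {
        if (index < docs.length) {
          index++;
        }
        return docID();
      }

      @Override
      public float score() {
        return score;
      }
    };
  }

  public static void main(String[] args) {
    SimpleScorer scorer = new CheckedScorer(constant(new int[] {1, 4}, Float.NaN));
    scorer.nextDoc();
    try {
      scorer.score();
    } catch (IllegalStateException e) {
      System.out.println("caught: " + e.getMessage());
    }
  }
}
```

Wrapping every scorer handed out by a test searcher in such a checker turns silent misuse into an immediate failure at the offending call.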




[jira] [Commented] (LUCENE-4911) Missing word "cela" in conf/lang/stopwords_fr.txt

2013-04-11 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629459#comment-13629459
 ] 

Adrien Grand commented on LUCENE-4911:
--

For your information, Martin Porter (himself!) added cela to the upstream stop 
list 
(http://lists.tartarus.org/mailman/private/snowball-discuss/2013-April/001466.html).

> Missing word "cela" in conf/lang/stopwords_fr.txt
> -
>
> Key: LUCENE-4911
> URL: https://issues.apache.org/jira/browse/LUCENE-4911
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 4.2
>Reporter: Pierre Kobylanski
>Assignee: Adrien Grand
>Priority: Trivial
> Attachments: stopwords_fr.txt.patch
>
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> NB: Not sure this defect is assigned to the right component.
> In file example/solr/collection1/conf/lang/stopwords_fr.txt,
> there is the word "celà". Although this spelling is incorrect in French (cf 
> http://fr.wiktionary.org/wiki/cel%C3%A0), it is common, so we may also add 
> the correct spelling ("cela", without accent) to that stopword list.
> Another thing: I noticed that "celà" is the only word of the list followed by 
> a non-breaking space. Is that intended?
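
Whether the trailing non-breaking space matters depends on how the stopword file is parsed, which this thread does not say. As a minimal sketch of the pitfall, the code below shows that Java's String.trim() leaves U+00A0 in place; the StopwordCleanup class and its clean method are hypothetical helpers, not Lucene or Solr code.

```java
final class StopwordCleanup {
  /** Replaces no-break spaces (U+00A0) with regular spaces, then trims the entry. */
  static String clean(String entry) {
    return entry.replace('\u00A0', ' ').trim();
  }

  public static void main(String[] args) {
    String raw = "cel\u00E0\u00A0"; // "celà" followed by a no-break space

    // trim() only strips characters up to U+0020, so the no-break space survives.
    System.out.println(raw.trim().length());

    // After the explicit replacement the entry is just the four-letter word.
    System.out.println(clean(raw).length());
    System.out.println(clean(raw).equals("cel\u00E0"));
  }
}
```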




[jira] [Created] (LUCENE-4928) Compressed stored fields: make the maximum number of docs in a chunk configurable

2013-04-11 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4928:


 Summary: Compressed stored fields: make the maximum number of docs 
in a chunk configurable
 Key: LUCENE-4928
 URL: https://issues.apache.org/jira/browse/LUCENE-4928
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Fix For: 4.3


When documents are very small (a few bytes), there can be so many of them in a 
single chunk that merging can become very slow. Making the maximum number of 
documents per chunk configurable could help.
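
A minimal sketch of the idea, assuming a writer that buffers documents and flushes a chunk when a size threshold is reached. The ChunkFlushPolicy class, the 16384-byte threshold and the 128-document cap are illustrative assumptions, not values taken from the patch.

```java
/** Hypothetical flush policy: a chunk ends on a byte threshold or a doc-count cap. */
final class ChunkFlushPolicy {
  private final int maxChunkBytes;
  private final int maxDocsPerChunk;
  private int bufferedBytes;
  private int bufferedDocs;
  private int flushedChunks;

  ChunkFlushPolicy(int maxChunkBytes, int maxDocsPerChunk) {
    this.maxChunkBytes = maxChunkBytes;
    this.maxDocsPerChunk = maxDocsPerChunk;
  }

  /** Buffers one document and flushes as soon as either limit is reached. */
  void addDocument(int docBytes) {
    bufferedBytes += docBytes;
    bufferedDocs++;
    if (bufferedBytes >= maxChunkBytes || bufferedDocs >= maxDocsPerChunk) {
      flush();
    }
  }

  /** Closes the current chunk if it holds at least one document. */
  void flush() {
    if (bufferedDocs > 0) {
      flushedChunks++;
      bufferedBytes = 0;
      bufferedDocs = 0;
    }
  }

  int flushedChunks() {
    return flushedChunks;
  }

  int bufferedDocs() {
    return bufferedDocs;
  }

  public static void main(String[] args) {
    // 10,000 four-byte documents with only the byte threshold in effect.
    ChunkFlushPolicy sizeOnly = new ChunkFlushPolicy(16384, Integer.MAX_VALUE);
    // Same input with an additional cap of 128 documents per chunk.
    ChunkFlushPolicy capped = new ChunkFlushPolicy(16384, 128);
    for (int i = 0; i < 10000; i++) {
      sizeOnly.addDocument(4);
      capped.addDocument(4);
    }
    System.out.println(sizeOnly.flushedChunks() + " chunks without a doc cap");
    System.out.println(capped.flushedChunks() + " chunks with a doc cap");
  }
}
```

With tiny documents the byte threshold alone packs thousands of documents into each chunk, while the cap keeps the per-chunk document count bounded.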




[jira] [Commented] (LUCENE-4928) Compressed stored fields: make the maximum number of docs in a chunk configurable

2013-04-11 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629604#comment-13629604
 ] 

Adrien Grand commented on LUCENE-4928:
--

I'm looking at the term vectors format, and it can't support a configurable 
number of documents per chunk without a format change: it would need to store 
the maximum number of documents per chunk so that, at merge time, it could 
decide whether the next chunk can be bulk-merged. So for now I think we can 
just have a hard limit and make it configurable later if the need arises.

> Compressed stored fields: make the maximum number of docs in a chunk 
> configurable
> -
>
> Key: LUCENE-4928
> URL: https://issues.apache.org/jira/browse/LUCENE-4928
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: 4.3
>
>
> When documents are very small (a few bytes), there can be so many of them in 
> a single chunk that merging can become very slow. Making the maximum number 
> of documents per chunk configurable could help.




[jira] [Updated] (LUCENE-4928) Compressed stored fields: make the maximum number of docs in a chunk configurable

2013-04-11 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4928:
-

Attachment: LUCENE-4928.patch

Proposed patch.

> Compressed stored fields: make the maximum number of docs in a chunk 
> configurable
> -
>
> Key: LUCENE-4928
> URL: https://issues.apache.org/jira/browse/LUCENE-4928
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: 4.3
>
> Attachments: LUCENE-4928.patch
>
>
> When documents are very small (a few bytes), there can be so many of them in 
> a single chunk that merging can become very slow. Making the maximum number 
> of documents per chunk configurable could help.




[jira] [Assigned] (SOLR-4706) LZ4.decompress() throws ArrayIndexOutOfBoundsException

2013-04-12 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-4706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand reassigned SOLR-4706:
--

Assignee: Adrien Grand

> LZ4.decompress()  throws ArrayIndexOutOfBoundsException
> ---
>
> Key: SOLR-4706
> URL: https://issues.apache.org/jira/browse/SOLR-4706
> Project: Solr
>  Issue Type: Bug
>  Components: search, SearchComponents - other
>Affects Versions: 4.2, 4.2.1
>Reporter: Victor Ruiz
>Assignee: Adrien Grand
>
> The exception is thrown for all components I'm using: RealTimeGetHandler, 
> TermVectorComponent, MoreLikeThis, SearchHandler.
> Here are 2 error traces:
> http://localhost:8984/solr/osr/mlt?q=itemid:76069564&mlt.boost=true&fq=domainid:13554&fq=
>  date_i:[NOW/DAY-30DAY TO NOW/DAY+1DAY]&fq=category:(kunst_und_kultur schweiz 
> literatur)&rows=250
> {quote}
> {"response":{"numFound":70253,"start":0,"maxScore":1.311772,"docs":[{"itemid":"116987750","score":1.311772},{"itemid":"77298475","score":1.2506518},
> {"itemid":"78497083","score":0.48435652},{"itemid":"101957016","score":0.4811761},{"itemid":"76771601","score":0.4811761},{"itemid":"90468738","score":0.4811761},{"itemid":"79075873","score":0.4811761},{"itemid":"76837622","score":0.48091167},{"itemid":"77206876","sco
> {"error":{"trace":"java.lang.ArrayIndexOutOfBoundsException
>   at org.apache.lucene.codecs.compressing.LZ4.decompress(LZ4.java:132)
>   at org.apache.lucene.codecs.compressing.CompressionMode$4.decompress(CompressionMode.java:135)
>   at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:258)
>   at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:139)
>   at org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:116)
>   at org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:643)
>   at org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:270)
>   at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:177)
>   at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)
>   at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)
>   at org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:95)
>   at org.apache.solr.response.JSONResponseWriter.write(JSONResponseWriter.java:60)
>   at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:627)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:358)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
>   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
>   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
>   at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>   at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>   at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
>   at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>   at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>   at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>   at org.mortbay.jetty.Server.handle(Server.java:326)
>   at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>   at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:926)
>   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>   at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
>   at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> ","code":500}}
> {quote}
> http://localhost:8984/solr/osr/get?id=105266867
> {quote}
> {"responseHeader":{"status":500,"QTime":1},"response":{"numFound":1,"start":0,"docs":[{"itemid":"105266867","text":"exklusiver kann man kaum würzen  safran ist das teuerste gewürz der welt handverlesen und in mühevoller kleinstarbeit hergestellt ist safran sehr selten und wird in winzigen mengen gehandelt und verwendet","title":"safran","domainid":4287,"date_i":"2012-11-21T17:01:23Z","date":"2012-11-21T17:01:09Z","category":["kultur","literatur","gesellschaft","umwelt","trinken","essen"]}]},"termVectors":["uniqueKe

[jira] [Commented] (SOLR-4706) LZ4.decompress() throws ArrayIndexOutOfBoundsException

2013-04-12 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629951#comment-13629951
 ] 

Adrien Grand commented on SOLR-4706:


Thanks for reporting the issue, Victor. Can you reproduce it if you reindex 
your documents? I'd be happy to take a look at the index too if you can share 
it with us.

> LZ4.decompress()  throws ArrayIndexOutOfBoundsException
> ---
>
> Key: SOLR-4706
> URL: https://issues.apache.org/jira/browse/SOLR-4706
> Project: Solr
>  Issue Type: Bug
>  Components: search, SearchComponents - other
>Affects Versions: 4.2, 4.2.1
>Reporter: Victor Ruiz
>
> The exception is thrown for all components I'm using: RealTimeGetHandler, 
> TermVectorComponent, MoreLikeThis, SearchHandler.
> Here are 2 error traces:
> http://localhost:8984/solr/osr/mlt?q=itemid:76069564&mlt.boost=true&fq=domainid:13554&fq=
>  date_i:[NOW/DAY-30DAY TO NOW/DAY+1DAY]&fq=category:(kunst_und_kultur schweiz 
> literatur)&rows=250
> {quote}
> {"response":{"numFound":70253,"start":0,"maxScore":1.311772,"docs":[{"itemid":"116987750","score":1.311772},{"itemid":"77298475","score":1.2506518},
> {"itemid":"78497083","score":0.48435652},{"itemid":"101957016","score":0.4811761},{"itemid":"76771601","score":0.4811761},{"itemid":"90468738","score":0.4811761},{"itemid":"79075873","score":0.4811761},{"itemid":"76837622","score":0.48091167},{"itemid":"77206876","sco
> {"error":{"trace":"java.lang.ArrayIndexOutOfBoundsException
>   at org.apache.lucene.codecs.compressing.LZ4.decompress(LZ4.java:132)
>   at org.apache.lucene.codecs.compressing.CompressionMode$4.decompress(CompressionMode.java:135)
>   at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:258)
>   at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:139)
>   at org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:116)
>   at org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:643)
>   at org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:270)
>   at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:177)
>   at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)
>   at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)
>   at org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:95)
>   at org.apache.solr.response.JSONResponseWriter.write(JSONResponseWriter.java:60)
>   at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:627)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:358)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
>   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
>   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
>   at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>   at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>   at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
>   at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>   at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>   at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>   at org.mortbay.jetty.Server.handle(Server.java:326)
>   at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>   at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:926)
>   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>   at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
>   at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> ","code":500}}
> {quote}
> http://localhost:8984/solr/osr/get?id=105266867
> {quote}
> {"responseHeader":{"status":500,"QTime":1},"response":{"numFound":1,"start":0,"docs":[{"itemid":"105266867","text":"exklusiver kann man kaum würzen  safran ist das teuerste gewürz der welt handverlesen und in mühevoller kleinstarbeit hergestellt ist safran sehr selten und wird in winzigen mengen gehandelt und verwendet","title":"safran","domainid

[jira] [Updated] (SOLR-4707) LZ4.decompress() throws ArrayIndexOutOfBoundsException

2013-04-12 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-4707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated SOLR-4707:
---

Assignee: (was: Adrien Grand)

> LZ4.decompress()  throws ArrayIndexOutOfBoundsException
> ---
>
> Key: SOLR-4707
> URL: https://issues.apache.org/jira/browse/SOLR-4707
> Project: Solr
>  Issue Type: Bug
>  Components: replication (java)
>Affects Versions: 4.2, 4.2.1
>Reporter: Victor Ruiz
>
> The exception is thrown for all components I'm using: RealTimeGetHandler, 
> TermVectorComponent, MoreLikeThis, SearchHandler.
> Here are 2 error traces:
> http://localhost:8984/solr/osr/mlt?q=itemid:76069564&mlt.boost=true&fq=domainid:13554&fq=
>  date_i:[NOW/DAY-30DAY TO NOW/DAY+1DAY]&fq=category:(kunst_und_kultur schweiz 
> literatur)&rows=250
> {quote}
> {"response":{"numFound":70253,"start":0,"maxScore":1.311772,"docs":[{"itemid":"116987750","score":1.311772},{"itemid":"77298475","score":1.2506518},
> {"itemid":"78497083","score":0.48435652},{"itemid":"101957016","score":0.4811761},{"itemid":"76771601","score":0.4811761},{"itemid":"90468738","score":0.4811761},{"itemid":"79075873","score":0.4811761},{"itemid":"76837622","score":0.48091167},{"itemid":"77206876","sco
> {"error":{"trace":"java.lang.ArrayIndexOutOfBoundsException
>   at org.apache.lucene.codecs.compressing.LZ4.decompress(LZ4.java:132)
>   at org.apache.lucene.codecs.compressing.CompressionMode$4.decompress(CompressionMode.java:135)
>   at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:258)
>   at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:139)
>   at org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:116)
>   at org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:643)
>   at org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:270)
>   at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:177)
>   at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)
>   at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)
>   at org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:95)
>   at org.apache.solr.response.JSONResponseWriter.write(JSONResponseWriter.java:60)
>   at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:627)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:358)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
>   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
>   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
>   at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>   at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>   at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
>   at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>   at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>   at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>   at org.mortbay.jetty.Server.handle(Server.java:326)
>   at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>   at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:926)
>   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>   at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
>   at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> ","code":500}}
> {quote}
> http://localhost:8984/solr/osr/tv?q=itemid:105266867
> {quote}
> {"responseHeader":{"status":500,"QTime":1},"response":{"numFound":1,"start":0,"docs":[{"itemid":"105266867","text":"exklusiver kann man kaum würzen  safran ist das teuerste gewürz der welt handverlesen und in mühevoller kleinstarbeit hergestellt ist safran sehr selten und wird in winzigen mengen gehandelt und verwendet","title":"safran","domainid":4287,"date_i":"2012-11-21T17:01:23Z","date":"2012-11-21T17:01:09Z","category":["kultur","literatur","gesellschaft","umwelt","trinken","essen"]}]},"termVectors":["uniqueKeyFieldName","itemid","105266867",["uniqu

[jira] [Updated] (LUCENE-4924) Make DocIdSetIterator.docID() return -1 when not positioned

2013-04-12 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4924:
-

Attachment: LUCENE-4924.patch

Patch.

> Make DocIdSetIterator.docID() return -1 when not positioned
> ---
>
> Key: LUCENE-4924
> URL: https://issues.apache.org/jira/browse/LUCENE-4924
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-4924.patch
>
>
> Today DocIdSetIterator.docID() can either return -1 or NO_MORE_DOCS when the 
> enum is not positioned. I would like to only allow it to return -1 so that we 
> can have better assertions.
> (This proposal is for trunk only.)




[jira] [Updated] (LUCENE-4924) Make DocIdSetIterator.docID() return -1 when not positioned

2013-04-12 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4924:
-

Attachment: LUCENE-4924.patch

Thanks Robert, I ran the Lucene tests and they all passed. I updated the patch 
to make the CHANGES entry clearer.

> Make DocIdSetIterator.docID() return -1 when not positioned
> ---
>
> Key: LUCENE-4924
> URL: https://issues.apache.org/jira/browse/LUCENE-4924
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-4924.patch, LUCENE-4924.patch, LUCENE-4924.patch
>
>
> Today DocIdSetIterator.docID() can either return -1 or NO_MORE_DOCS when the 
> enum is not positioned. I would like to only allow it to return -1 so that we 
> can have better assertions.
> (This proposal is for trunk only.)




[jira] [Commented] (LUCENE-4924) Make DocIdSetIterator.docID() return -1 when not positioned

2013-04-15 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631697#comment-13631697
 ] 

Adrien Grand commented on LUCENE-4924:
--

I plan to commit soon and backport everything to 4.x except the CHANGES entry 
and the DocIdSetIterator.docID() javadoc change.

> Make DocIdSetIterator.docID() return -1 when not positioned
> ---
>
> Key: LUCENE-4924
> URL: https://issues.apache.org/jira/browse/LUCENE-4924
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-4924.patch, LUCENE-4924.patch, LUCENE-4924.patch
>
>
> Today DocIdSetIterator.docID() can either return -1 or NO_MORE_DOCS when the 
> enum is not positioned. I would like to only allow it to return -1 so that we 
> can have better assertions.
> (This proposal is for trunk only.)




[jira] [Resolved] (LUCENE-4928) Compressed stored fields: make the maximum number of docs in a chunk configurable

2013-04-15 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4928.
--

Resolution: Fixed

> Compressed stored fields: make the maximum number of docs in a chunk 
> configurable
> -
>
> Key: LUCENE-4928
> URL: https://issues.apache.org/jira/browse/LUCENE-4928
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: 4.3
>
> Attachments: LUCENE-4928.patch
>
>
> When documents are very small (a few bytes), there can be so many of them in 
> a single chunk that merging can become very slow. Making the maximum number 
> of documents per chunk configurable could help.




[jira] [Resolved] (LUCENE-4924) Make DocIdSetIterator.docID() return -1 when not positioned

2013-04-15 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4924.
--

Resolution: Fixed

Thank you Robert and Yonik!

> Make DocIdSetIterator.docID() return -1 when not positioned
> ---
>
> Key: LUCENE-4924
> URL: https://issues.apache.org/jira/browse/LUCENE-4924
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Fix For: 5.0
>
> Attachments: LUCENE-4924.patch, LUCENE-4924.patch, LUCENE-4924.patch
>
>
> Today DocIdSetIterator.docID() can either return -1 or NO_MORE_DOCS when the 
> enum is not positioned. I would like to only allow it to return -1 so that we 
> can have better assertions.
> (This proposal is for trunk only.)
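
To illustrate the contract proposed above, here is a minimal sketch of an iterator whose docID() is -1 until it is positioned. It is not Lucene's DocIdSetIterator: the class name DocIdIteratorSketch and its array-backed implementation are assumptions made up for this example; only the -1 / NO_MORE_DOCS convention comes from the issue description.

```java
final class DocIdIteratorSketch {

  /** Sentinel returned once the iterator is exhausted. */
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final int[] docs; // assumed to be sorted in ascending order
  private int index = -1;
  private int doc = -1;     // -1 means "not positioned yet"

  DocIdIteratorSketch(int[] sortedDocs) {
    this.docs = sortedDocs;
  }

  /** Returns -1 before the first call to nextDoc(), never NO_MORE_DOCS. */
  int docID() {
    return doc;
  }

  /** Moves to the next document, or to NO_MORE_DOCS when there is none left. */
  int nextDoc() {
    if (doc == NO_MORE_DOCS) {
      return doc; // already exhausted, stay there
    }
    index++;
    doc = index < docs.length ? docs[index] : NO_MORE_DOCS;
    return doc;
  }

  public static void main(String[] args) {
    DocIdIteratorSketch it = new DocIdIteratorSketch(new int[] {2, 5});
    System.out.println("before iteration: " + it.docID());
    while (it.nextDoc() != NO_MORE_DOCS) {
      System.out.println("doc: " + it.docID());
    }
  }
}
```

With a single allowed "unpositioned" value, an assertion such as "docID() == -1 before the first nextDoc()" becomes possible.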




[jira] [Commented] (LUCENE-4934) AssertingIndexSearcher should do basic QueryUtils/etc checks on every query

2013-04-15 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631824#comment-13631824
 ] 

Adrien Grand commented on LUCENE-4934:
--

+1

> AssertingIndexSearcher should do basic QueryUtils/etc checks on every query
> ---
>
> Key: LUCENE-4934
> URL: https://issues.apache.org/jira/browse/LUCENE-4934
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Robert Muir
>
> We can start with QueryUtils.check(query): which does some basic 
> hashcode/equals checks.
> Ideally we'd strengthen the checks as we fix problems: e.g. add explanations 
> verifications (checkExplanations) and then finally the more intense check() 
> that does more verifications with deleted docs/next/advance.




[jira] [Assigned] (LUCENE-4936) docvalues date compression

2013-04-16 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand reassigned LUCENE-4936:


Assignee: Adrien Grand

> docvalues date compression
> --
>
> Key: LUCENE-4936
> URL: https://issues.apache.org/jira/browse/LUCENE-4936
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
>Assignee: Adrien Grand
> Attachments: LUCENE-4936.patch
>
>
> DocValues fields can be very wasteful if you are storing dates (like solr's 
> TrieDateField does if you enable docvalues) and don't actually need all the 
> precision: e.g. "date-only" fields like date of birth with no time component, 
> time fields without milliseconds precision, and so on.
> Ideally we'd compute GCD of all the values to save space 
> (numberOfTrailingZeros is not really enough here), but i think we should at 
> least look for values like 8640, 360, and 1000 to be practical.




[jira] [Commented] (LUCENE-4937) sort order different in branch_4x than trunk

2013-04-17 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13634033#comment-13634033
 ] 

Adrien Grand commented on LUCENE-4937:
--

Thanks Uwe!

> sort order different in branch_4x than trunk
> 
>
> Key: LUCENE-4937
> URL: https://issues.apache.org/jira/browse/LUCENE-4937
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Assignee: Uwe Schindler
> Fix For: 4.3
>
> Attachments: LUCENE-4937.patch, LUCENE-4937.patch, 
> LUCENE-4937_test.patch, SOLR-4723_test.patch
>
>
> I will buy a beer to whoever figures out why +0 sorts before -0 in branch_4x, 
> but works correctly in trunk :)




[jira] [Updated] (LUCENE-4936) docvalues date compression

2013-04-18 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4936:
-

Attachment: LUCENE-4936.patch

Patch:

 * Adds MathUtil.gcd(long, long)

 * Adds "GCD compression" to Lucene42, Disk and CheapBastard.

 * Improves BaseDocValuesFormatTest which almost only tested "TABLE_COMPRESSED" 
with Lucene42DVF

 * No more attempts to compress storage when the values are known to be dense, 
such as SORTED ords.

I measured how much slower doc values indexing is with these new checks: the 
overhead is completely unnoticeable with random or dense values since the GCD 
quickly reaches 1. When the GCD is larger, it only made indexing 2% slower 
(every doc has a single field which is a NumericDocValuesField). So I think 
it's fine.
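
For readers following along, here is a minimal sketch of the GCD idea. It is not the code from the patch: the class GcdCompressionSketch and its methods are hypothetical names, and only the Java standard library is used. It also shows why the check is cheap on random values, since the loop stops as soon as the GCD reaches 1.

```java
final class GcdCompressionSketch {

  /** Euclid's algorithm on magnitudes; assumes neither argument is Long.MIN_VALUE. */
  static long gcd(long a, long b) {
    a = Math.abs(a);
    b = Math.abs(b);
    while (b != 0) {
      long t = a % b;
      a = b;
      b = t;
    }
    return a;
  }

  /** GCD of all values, stopping early once it reaches 1. */
  static long commonDivisor(long[] values) {
    long g = 0; // gcd(0, x) == |x|, so 0 is a neutral starting point
    for (long v : values) {
      g = gcd(g, v);
      if (g == 1) {
        break; // nothing left to gain from dividing
      }
    }
    return g;
  }

  /** Number of bits needed to store a non-negative value up to maxValue. */
  static int bitsRequired(long maxValue) {
    return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
  }

  public static void main(String[] args) {
    long day = 24L * 60 * 60 * 1000; // milliseconds per day
    long[] dates = {3 * day, 10 * day, 365 * day};
    long g = commonDivisor(dates);
    System.out.println("gcd = " + g);
    System.out.println("bits per raw value      = " + bitsRequired(365 * day));
    System.out.println("bits per divided value  = " + bitsRequired(365 * day / g));
  }
}
```

Storing value / gcd instead of value is what saves space for "date-only" fields: the divisor is written once and each document only needs the small quotient.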

> docvalues date compression
> --
>
> Key: LUCENE-4936
> URL: https://issues.apache.org/jira/browse/LUCENE-4936
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
>Assignee: Adrien Grand
> Attachments: LUCENE-4936.patch, LUCENE-4936.patch
>
>
> DocValues fields can be very wasteful if you are storing dates (like solr's 
> TrieDateField does if you enable docvalues) and don't actually need all the 
> precision: e.g. "date-only" fields like date of birth with no time component, 
> time fields without milliseconds precision, and so on.
> Ideally we'd compute GCD of all the values to save space 
> (numberOfTrailingZeros is not really enough here), but i think we should at 
> least look for values like 8640, 360, and 1000 to be practical.




[jira] [Updated] (LUCENE-4936) docvalues date compression

2013-04-19 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4936:
-

Attachment: LUCENE-4936.patch

New patch:

 * Computes the GCD based on deltas in order to be able to compress non-UTC 
dates.

 * Adds support for TABLE_COMPRESSED to DiskDVF.

 * Adds tests that ensure that these new compression methods are actually used 
whenever applicable.

 * Adds a quick description of the compression method to Lucene42DVF javadocs.
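
A small sketch of why deltas help with non-UTC dates, under the assumption that values are encoded as minValue + gcd * quotient. The class DeltaGcdSketch and its methods are hypothetical names for illustration, not the patch itself, and the sketch assumes a non-empty array whose deltas do not overflow a long.

```java
final class DeltaGcdSketch {

  /** Euclid's algorithm on magnitudes; assumes neither argument is Long.MIN_VALUE. */
  static long gcd(long a, long b) {
    a = Math.abs(a);
    b = Math.abs(b);
    while (b != 0) {
      long t = a % b;
      a = b;
      b = t;
    }
    return a;
  }

  static long min(long[] values) {
    long min = Long.MAX_VALUE;
    for (long v : values) {
      min = Math.min(min, v);
    }
    return min;
  }

  /** GCD of (v - minValue) over all values; 0 if all values are equal. */
  static long gcdOfDeltas(long[] values) {
    long min = min(values);
    long g = 0;
    for (long v : values) {
      g = gcd(g, v - min);
    }
    return g;
  }

  /** Quotients such that values[i] == min + quotient[i] * gcd. */
  static long[] encode(long[] values) {
    long min = min(values);
    long g = gcdOfDeltas(values);
    long[] quotients = new long[values.length];
    for (int i = 0; i < values.length; i++) {
      quotients[i] = g == 0 ? 0 : (values[i] - min) / g;
    }
    return quotients;
  }

  static long decode(long quotient, long min, long g) {
    return min + quotient * g;
  }

  public static void main(String[] args) {
    long hour = 60L * 60 * 1000;
    // Local midnights in a UTC+2 zone, expressed as UTC milliseconds.
    long[] dates = {22 * hour, 46 * hour, 118 * hour};
    long rawGcd = gcd(gcd(dates[0], dates[1]), dates[2]);
    System.out.println("gcd of raw values = " + rawGcd);
    System.out.println("gcd of deltas     = " + gcdOfDeltas(dates));
  }
}
```

In this example the raw values only share a two-hour divisor because of the time-zone offset, whereas the deltas from the minimum share a full day.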

> docvalues date compression
> --
>
> Key: LUCENE-4936
> URL: https://issues.apache.org/jira/browse/LUCENE-4936
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
>Assignee: Adrien Grand
> Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch
>
>
> DocValues fields can be very wasteful if you are storing dates (like solr's 
> TrieDateField does if you enable docvalues) and don't actually need all the 
> precision: e.g. "date-only" fields like date of birth with no time component, 
> time fields without milliseconds precision, and so on.
> Ideally we'd compute GCD of all the values to save space 
> (numberOfTrailingZeros is not really enough here), but i think we should at 
> least look for values like 8640, 360, and 1000 to be practical.




[jira] [Commented] (LUCENE-4936) docvalues date compression

2013-04-19 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636406#comment-13636406
 ] 

Adrien Grand commented on LUCENE-4936:
--

Thank you Uwe! Unfortunately, I just figured out that the patch is broken when 
v - minValue overflows (in Consumer.addNumericField). I need to think about a 
way to fix it...

> docvalues date compression
> --
>
> Key: LUCENE-4936
> URL: https://issues.apache.org/jira/browse/LUCENE-4936
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
>Assignee: Adrien Grand
> Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch
>
>
> DocValues fields can be very wasteful if you are storing dates (like solr's 
> TrieDateField does if you enable docvalues) and don't actually need all the 
> precision: e.g. "date-only" fields like date of birth with no time component, 
> time fields without milliseconds precision, and so on.
> Ideally we'd compute GCD of all the values to save space 
> (numberOfTrailingZeros is not really enough here), but i think we should at 
> least look for values like 8640, 360, and 1000 to be practical.




[jira] [Updated] (LUCENE-4936) docvalues date compression

2013-04-19 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4936:
-

Fix Version/s: 4.4

> docvalues date compression
> --
>
> Key: LUCENE-4936
> URL: https://issues.apache.org/jira/browse/LUCENE-4936
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
>Assignee: Adrien Grand
> Fix For: 4.4
>
> Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch
>
>
> DocValues fields can be very wasteful if you are storing dates (like solr's 
> TrieDateField does if you enable docvalues) and don't actually need all the 
> precision: e.g. "date-only" fields like date of birth with no time component, 
> time fields without milliseconds precision, and so on.
> Ideally we'd compute GCD of all the values to save space 
> (numberOfTrailingZeros is not really enough here), but i think we should at 
> least look for values like 8640, 360, and 1000 to be practical.




[jira] [Updated] (LUCENE-4936) docvalues date compression

2013-04-19 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4936:
-

Attachment: LUCENE-4936.patch

Here is a work-around for the issue: the consumer stops trying to perform GCD 
compression as soon as it encounters a value outside the [ -MAX_VALUE/2 - 
MAX_VALUE/2 ] range. This prevents overflows from happening and I can't think of 
a reasonable use-case that would benefit from GCD compression and have values 
outside of this range?
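
The work-around can be sketched as follows. This is an illustration under stated assumptions, not the patch: GcdOverflowGuardSketch is a hypothetical class, a returned GCD of 1 stands for "do not use GCD compression", and deltas are taken against the first value rather than the minimum (the GCD of the differences is the same for any fixed reference value).

```java
final class GcdOverflowGuardSketch {

  static final long LIMIT = Long.MAX_VALUE / 2;

  /** True if v lies in [-MAX_VALUE/2, MAX_VALUE/2]. */
  static boolean inSafeRange(long v) {
    return v >= -LIMIT && v <= LIMIT;
  }

  /** Euclid's algorithm on magnitudes; callers guarantee no Long.MIN_VALUE input. */
  static long gcd(long a, long b) {
    a = Math.abs(a);
    b = Math.abs(b);
    while (b != 0) {
      long t = a % b;
      a = b;
      b = t;
    }
    return a;
  }

  /**
   * Single pass over the values. Returns 1 (no compression) as soon as a value
   * falls outside the safe range, so that "v - first" can never overflow.
   */
  static long gcdOfDeltas(long[] values) {
    if (values.length == 0) {
      return 0;
    }
    long first = values[0];
    long g = 0;
    for (long v : values) {
      if (!inSafeRange(v)) {
        return 1; // give up: the subtraction below could overflow
      }
      g = gcd(g, v - first); // both operands are in range, so |v - first| <= MAX_VALUE
      if (g == 1) {
        return 1;
      }
    }
    return g;
  }

  public static void main(String[] args) {
    System.out.println(gcdOfDeltas(new long[] {1000, 4000, 7000}));
    System.out.println(gcdOfDeltas(new long[] {0, Long.MAX_VALUE}));
  }
}
```

Since both operands of the subtraction are at most MAX_VALUE/2 in magnitude, their difference always fits in a long.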

> docvalues date compression
> --
>
> Key: LUCENE-4936
> URL: https://issues.apache.org/jira/browse/LUCENE-4936
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
>Assignee: Adrien Grand
> Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch
>
>
> DocValues fields can be very wasteful if you are storing dates (like solr's 
> TrieDateField does if you enable docvalues) and don't actually need all the 
> precision: e.g. "date-only" fields like date of birth with no time component, 
> time fields without milliseconds precision, and so on.
> Ideally we'd compute GCD of all the values to save space 
> (numberOfTrailingZeros is not really enough here), but i think we should at 
> least look for values like 8640, 360, and 1000 to be practical.




[jira] [Comment Edited] (LUCENE-4936) docvalues date compression

2013-04-19 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636488#comment-13636488
 ] 

Adrien Grand edited comment on LUCENE-4936 at 4/19/13 3:31 PM:
---

Here is a work-around for the issue: the consumer stops trying to perform GCD 
compression as soon as it encounters a value outside the [ -MAX_VALUE/2 , 
MAX_VALUE/2 ] range. This prevents overflows from happening and I can't think of 
a reasonable use-case that would benefit from GCD compression and have values 
outside of this range?

  was (Author: jpountz):
Here is a work-around for the issue: the consumer stops trying to perform 
GCD compression as soon as it encounters a value outside the [ -MAX_VALUE/2 - 
MAX_VALUE/2 ] range. This prevents overflows from happening and I can't think of 
a reasonable use-case that would benefit from GCD compression and have values 
outside of this range?
  
> docvalues date compression
> --
>
> Key: LUCENE-4936
> URL: https://issues.apache.org/jira/browse/LUCENE-4936
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
>Assignee: Adrien Grand
> Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch
>
>
> DocValues fields can be very wasteful if you are storing dates (like solr's 
> TrieDateField does if you enable docvalues) and don't actually need all the 
> precision: e.g. "date-only" fields like date of birth with no time component, 
> time fields without milliseconds precision, and so on.
> Ideally we'd compute GCD of all the values to save space 
> (numberOfTrailingZeros is not really enough here), but i think we should at 
> least look for values like 8640, 360, and 1000 to be practical.




[jira] [Commented] (LUCENE-4936) docvalues date compression

2013-04-19 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636503#comment-13636503
 ] 

Adrien Grand commented on LUCENE-4936:
--

Thank you Robert, I'd love to have a review to make sure the patch is correct, 
especially for MathUtil.gcd and the DVConsumer.addNumericField logic.

> docvalues date compression
> --
>
> Key: LUCENE-4936
> URL: https://issues.apache.org/jira/browse/LUCENE-4936
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
>Assignee: Adrien Grand
> Fix For: 4.4
>
> Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch
>
>
> DocValues fields can be very wasteful if you are storing dates (like solr's 
> TrieDateField does if you enable docvalues) and don't actually need all the 
> precision: e.g. "date-only" fields like date of birth with no time component, 
> time fields without milliseconds precision, and so on.
> Ideally we'd compute GCD of all the values to save space 
> (numberOfTrailingZeros is not really enough here), but i think we should at 
> least look for values like 8640, 360, and 1000 to be practical.




[jira] [Updated] (LUCENE-4936) docvalues date compression

2013-04-19 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4936:
-

Attachment: LUCENE-4936.patch

Simple ideas are often the best ones: the new patch has a single loop! Thanks 
Robert!

> docvalues date compression
> --
>
> Key: LUCENE-4936
> URL: https://issues.apache.org/jira/browse/LUCENE-4936
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
>Assignee: Adrien Grand
> Fix For: 4.4
>
> Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch, LUCENE-4936.patch
>
>
> DocValues fields can be very wasteful if you are storing dates (like solr's 
> TrieDateField does if you enable docvalues) and don't actually need all the 
> precision: e.g. "date-only" fields like date of birth with no time component, 
> time fields without milliseconds precision, and so on.
> Ideally we'd compute GCD of all the values to save space 
> (numberOfTrailingZeros is not really enough here), but i think we should at 
> least look for values like 8640, 360, and 1000 to be practical.




[jira] [Updated] (LUCENE-4936) docvalues date compression

2013-04-21 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4936:
-

Attachment: LUCENE-4936.patch

+1 to the proposed changes!

Here is an updated patch that fixes the DVProducer constructors to open the 
data file and check the header in a try/finally block (so that data files are 
closed even if the header check fails).
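
The pattern described above can be sketched with plain java.io, as follows. This is not the DVProducer code: HeaderCheckedReaderSketch, TrackingStream, MAGIC and tryOpen are hypothetical names invented for the example, and the "header" is just a single int.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

final class HeaderCheckedReaderSketch implements Closeable {

  /** Arbitrary magic number standing in for a real file header. */
  static final int MAGIC = 0x4C554345;

  private final DataInputStream data;

  HeaderCheckedReaderSketch(InputStream in) throws IOException {
    DataInputStream d = new DataInputStream(in);
    boolean success = false;
    try {
      int header = d.readInt();
      if (header != MAGIC) {
        throw new IOException("bad header: 0x" + Integer.toHexString(header));
      }
      success = true;
    } finally {
      if (!success) {
        // The constructor is about to fail: nobody else can close the stream.
        try {
          d.close();
        } catch (IOException ignored) {
          // keep the original exception
        }
      }
    }
    this.data = d;
  }

  @Override
  public void close() throws IOException {
    data.close();
  }

  /** In-memory stream that remembers whether close() was called. */
  static final class TrackingStream extends ByteArrayInputStream {
    boolean closed;

    TrackingStream(byte[] bytes) {
      super(bytes);
    }

    @Override
    public void close() throws IOException {
      closed = true;
      super.close();
    }
  }

  /** Returns true if the header check passed; the reader is closed before returning. */
  static boolean tryOpen(InputStream in) {
    try (HeaderCheckedReaderSketch reader = new HeaderCheckedReaderSketch(in)) {
      return true;
    } catch (IOException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    TrackingStream bad = new TrackingStream(new byte[] {0, 0, 0, 1});
    System.out.println("opened: " + tryOpen(bad) + ", closed: " + bad.closed);
  }
}
```

Without the finally block, a failed header check would leave the stream open because the caller never receives an object it could close.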

> docvalues date compression
> --
>
> Key: LUCENE-4936
> URL: https://issues.apache.org/jira/browse/LUCENE-4936
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
>Assignee: Adrien Grand
> Fix For: 4.4
>
> Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch
>
>
> DocValues fields can be very wasteful if you are storing dates (like solr's 
> TrieDateField does if you enable docvalues) and don't actually need all the 
> precision: e.g. "date-only" fields like date of birth with no time component, 
> time fields without milliseconds precision, and so on.
> Ideally we'd compute GCD of all the values to save space 
> (numberOfTrailingZeros is not really enough here), but i think we should at 
> least look for values like 8640, 360, and 1000 to be practical.




[jira] [Updated] (LUCENE-4936) docvalues date compression

2013-04-21 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4936:
-

Attachment: LUCENE-4936.patch

+1 to the proposed changes!

Here is an updated patch that fixes the DVProducer constructors to open the 
data file and check the header in a try/finally block (so that data files are 
closed even if the header check fails).

> docvalues date compression
> --
>
> Key: LUCENE-4936
> URL: https://issues.apache.org/jira/browse/LUCENE-4936
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
>Assignee: Adrien Grand
> Fix For: 4.4
>
> Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch, LUCENE-4936.patch
>
>
> DocValues fields can be very wasteful if you are storing dates (like solr's 
> TrieDateField does if you enable docvalues) and don't actually need all the 
> precision: e.g. "date-only" fields like date of birth with no time component, 
> time fields without milliseconds precision, and so on.
> Ideally we'd compute GCD of all the values to save space 
> (numberOfTrailingZeros is not really enough here), but i think we should at 
> least look for values like 8640, 360, and 1000 to be practical.




[jira] [Updated] (LUCENE-4936) docvalues date compression

2013-04-21 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4936:
-

Attachment: (was: LUCENE-4936.patch)

> docvalues date compression
> --
>
> Key: LUCENE-4936
> URL: https://issues.apache.org/jira/browse/LUCENE-4936
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
>Assignee: Adrien Grand
> Fix For: 4.4
>
> Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch
>
>
> DocValues fields can be very wasteful if you are storing dates (like solr's 
> TrieDateField does if you enable docvalues) and don't actually need all the 
> precision: e.g. "date-only" fields like date of birth with no time component, 
> time fields without milliseconds precision, and so on.
> Ideally we'd compute GCD of all the values to save space 
> (numberOfTrailingZeros is not really enough here), but i think we should at 
> least look for values like 8640, 360, and 1000 to be practical.




[jira] [Issue Comment Deleted] (LUCENE-4936) docvalues date compression

2013-04-21 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4936:
-

Comment: was deleted

(was: +1 to the proposed changes!

Here is an updated patch that fixes the DVProducer constructors to open the 
data file and check the header in a try/finally block (so that data files are 
closed even if the header check fails).)

> docvalues date compression
> --
>
> Key: LUCENE-4936
> URL: https://issues.apache.org/jira/browse/LUCENE-4936
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
>Assignee: Adrien Grand
> Fix For: 4.4
>
> Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch
>
>
> DocValues fields can be very wasteful if you are storing dates (like solr's 
> TrieDateField does if you enable docvalues) and don't actually need all the 
> precision: e.g. "date-only" fields like date of birth with no time component, 
> time fields without milliseconds precision, and so on.
> Ideally we'd compute GCD of all the values to save space 
> (numberOfTrailingZeros is not really enough here), but i think we should at 
> least look for values like 8640, 360, and 1000 to be practical.




[jira] [Created] (LUCENE-4946) Refactor SorterTemplate

2013-04-21 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4946:


 Summary: Refactor SorterTemplate
 Key: LUCENE-4946
 URL: https://issues.apache.org/jira/browse/LUCENE-4946
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial


When working on TimSort (LUCENE-4839), I was a little frustrated at not being 
able to add galloping support, because it would have required adding new 
primitive operations in addition to compare and swap.

I started working on a prototype that uses inheritance to allow some sorting 
algorithms to rely on additional primitive operations. You can have a look at 
https://github.com/jpountz/sorts/tree/master/src/java/net/jpountz/sorts (but 
beware it is a prototype and still lacks proper documentation and good tests).

I think it would offer several advantages:
 - no more need to implement setPivot and comparePivot when using in-place 
merge sort or insertion sort,
 - the ability to use faster stable sorting algorithms at the cost of some 
memory overhead (our in-place merge sort is very slow),
 - the ability to properly implement algorithms that are useful on specific 
datasets but require different primitive operations (such as TimSort for 
partially-sorted data).

If you are interested in comparing these implementations with Arrays.sort, 
there is a Benchmark class in src/examples.

What do you think?
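
To make the proposal more concrete, here is a rough sketch of what such a hierarchy could look like. It is not the prototype linked above nor the existing SorterTemplate: every class and method name below is hypothetical, and quicksort merely stands in for "an algorithm that needs extra primitives".

```java
import java.util.Arrays;

final class SorterHierarchySketch {

  /** Base contract: only compare and swap are required. */
  abstract static class Sorter {
    protected abstract int compare(int i, int j);

    protected abstract void swap(int i, int j);

    /** Sorts the slots in [from, to). */
    abstract void sort(int from, int to);
  }

  /** Needs nothing beyond compare and swap. */
  abstract static class InsertionSorter extends Sorter {
    @Override
    void sort(int from, int to) {
      for (int i = from + 1; i < to; i++) {
        for (int j = i; j > from && compare(j - 1, j) > 0; j--) {
          swap(j - 1, j);
        }
      }
    }
  }

  /** Adds pivot primitives only for the algorithms that use them. */
  abstract static class PivotSorter extends Sorter {
    /** Saves the value in slot i as the pivot. */
    protected abstract void setPivot(int i);

    /** Compares the saved pivot with the value in slot j. */
    protected abstract int comparePivot(int j);

    @Override
    void sort(int from, int to) {
      if (to - from < 2) {
        return;
      }
      setPivot((from + to) >>> 1);
      int left = from;
      int right = to - 1;
      while (left <= right) {
        while (comparePivot(left) > 0) {
          left++; // slot is smaller than the pivot
        }
        while (comparePivot(right) < 0) {
          right--; // slot is larger than the pivot
        }
        if (left <= right) {
          swap(left, right);
          left++;
          right--;
        }
      }
      sort(from, right + 1);
      sort(left, to);
    }
  }

  static Sorter insertionOn(final int[] a) {
    return new InsertionSorter() {
      @Override protected int compare(int i, int j) { return Integer.compare(a[i], a[j]); }
      @Override protected void swap(int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
    };
  }

  static Sorter quickOn(final int[] a) {
    return new PivotSorter() {
      private int pivot;
      @Override protected int compare(int i, int j) { return Integer.compare(a[i], a[j]); }
      @Override protected void swap(int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
      @Override protected void setPivot(int i) { pivot = a[i]; }
      @Override protected int comparePivot(int j) { return Integer.compare(pivot, a[j]); }
    };
  }

  public static void main(String[] args) {
    int[] a = {5, 3, 9, 1, 3};
    insertionOn(a).sort(0, a.length);
    System.out.println(Arrays.toString(a));
  }
}
```

The point of the design is visible in insertionOn: callers of the simple algorithms implement two methods, while the pivot methods are only required from those who pick a PivotSorter.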




[jira] [Commented] (LUCENE-4936) docvalues date compression

2013-04-22 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13638090#comment-13638090
 ] 

Adrien Grand commented on LUCENE-4936:
--

I guess the point was to avoid one level of indirection in case all values can 
be stored using a single byte. Maybe "(maxValue - minValue) > 256" should be 
replaced with "(maxValue - minValue) >= uniqueValues.size()"? This would ensure 
that table compression isn't used if values are already dense?
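
A sketch of the suggested condition, to make the intent easier to review. The class TableCompressionChoiceSketch, the method useTable and the maxTableSize parameter are assumptions for illustration and do not reflect how the consumer is actually structured; the sketch also assumes maxValue - minValue does not overflow.

```java
import java.util.HashSet;
import java.util.Set;

final class TableCompressionChoiceSketch {

  /**
   * Returns true if a lookup table looks worthwhile: few unique values
   * (at most maxTableSize) that are NOT already dense.
   */
  static boolean useTable(long[] values, int maxTableSize) {
    if (values.length == 0) {
      return false;
    }
    Set<Long> uniqueValues = new HashSet<>();
    long minValue = Long.MAX_VALUE;
    long maxValue = Long.MIN_VALUE;
    for (long v : values) {
      minValue = Math.min(minValue, v);
      maxValue = Math.max(maxValue, v);
      uniqueValues.add(v);
      if (uniqueValues.size() > maxTableSize) {
        return false; // too many distinct values for a table
      }
    }
    // Dense values (range smaller than the number of unique values) can be
    // stored as deltas from minValue without the extra indirection.
    return (maxValue - minValue) >= uniqueValues.size();
  }

  public static void main(String[] args) {
    System.out.println(useTable(new long[] {10, 11, 12, 13}, 256));
    System.out.println(useTable(new long[] {0, 1000, 1000000, 1000}, 256));
  }
}
```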

> docvalues date compression
> --
>
> Key: LUCENE-4936
> URL: https://issues.apache.org/jira/browse/LUCENE-4936
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
>Assignee: Adrien Grand
> Fix For: 4.4
>
> Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch
>
>
> DocValues fields can be very wasteful if you are storing dates (like solr's 
> TrieDateField does if you enable docvalues) and don't actually need all the 
> precision: e.g. "date-only" fields like date of birth with no time component, 
> time fields without milliseconds precision, and so on.
> Ideally we'd compute GCD of all the values to save space 
> (numberOfTrailingZeros is not really enough here), but i think we should at 
> least look for values like 8640, 360, and 1000 to be practical.




[jira] [Commented] (LUCENE-4936) docvalues date compression

2013-04-22 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13638114#comment-13638114
 ] 

Adrien Grand commented on LUCENE-4936:
--

One advantage of DELTA_COMPRESSED is that it uses different numbers of bits per 
value per block. Even if max-min=200, it could still happen that most blocks 
only require 6 or 7 bits per value. If there are many blocks, this could save 
substantial disk/memory.
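
A back-of-the-envelope sketch of that argument, assuming each block stores deltas from its own minimum. The class BlockBitsSketch and its methods are hypothetical, per-block header overhead is ignored, and deltas are assumed not to overflow.

```java
final class BlockBitsSketch {

  /** Bits needed for a non-negative delta; 0 bits when all values are equal. */
  static int bitsRequired(long maxDelta) {
    return maxDelta == 0 ? 0 : 64 - Long.numberOfLeadingZeros(maxDelta);
  }

  /** Total bits when every block uses its own (max - min) range. */
  static long totalBitsPerBlock(long[] values, int blockSize) {
    long total = 0;
    for (int start = 0; start < values.length; start += blockSize) {
      int end = Math.min(values.length, start + blockSize);
      long min = Long.MAX_VALUE;
      long max = Long.MIN_VALUE;
      for (int i = start; i < end; i++) {
        min = Math.min(min, values[i]);
        max = Math.max(max, values[i]);
      }
      total += (long) bitsRequired(max - min) * (end - start);
    }
    return total;
  }

  /** Total bits when a single global (max - min) range is used for all values. */
  static long totalBitsGlobal(long[] values) {
    if (values.length == 0) {
      return 0;
    }
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    for (long v : values) {
      min = Math.min(min, v);
      max = Math.max(max, v);
    }
    return (long) bitsRequired(max - min) * values.length;
  }

  public static void main(String[] args) {
    long[] values = {0, 1, 2, 3, 200, 199, 198, 197};
    System.out.println("global:    " + totalBitsGlobal(values) + " bits");
    System.out.println("per block: " + totalBitsPerBlock(values, 4) + " bits");
  }
}
```

Here max - min is 200 overall, yet each block of four values spans a range of only 3, so the per-block encoding needs far fewer bits per value.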

> docvalues date compression
> --
>
> Key: LUCENE-4936
> URL: https://issues.apache.org/jira/browse/LUCENE-4936
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
>Assignee: Adrien Grand
> Fix For: 4.4
>
> Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch
>
>
> DocValues fields can be very wasteful if you are storing dates (like solr's 
> TrieDateField does if you enable docvalues) and don't actually need all the 
> precision: e.g. "date-only" fields like date of birth with no time component, 
> time fields without milliseconds precision, and so on.
> Ideally we'd compute GCD of all the values to save space 
> (numberOfTrailingZeros is not really enough here), but i think we should at 
> least look for values like 8640, 360, and 1000 to be practical.




[jira] [Commented] (LUCENE-4936) docvalues date compression

2013-04-22 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13638117#comment-13638117
 ] 

Adrien Grand commented on LUCENE-4936:
--

bq.  In this case should we just take bitsRequired on both sides?

Yes, this makes sense!

> docvalues date compression
> --
>
> Key: LUCENE-4936
> URL: https://issues.apache.org/jira/browse/LUCENE-4936
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
>Assignee: Adrien Grand
> Fix For: 4.4
>
> Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch
>
>
> DocValues fields can be very wasteful if you are storing dates (like solr's 
> TrieDateField does if you enable docvalues) and don't actually need all the 
> precision: e.g. "date-only" fields like date of birth with no time component, 
> time fields without milliseconds precision, and so on.
> Ideally we'd compute the GCD of all the values to save space 
> (numberOfTrailingZeros is not really enough here), but I think we should at 
> least look for values like 86400000, 3600000, and 1000 to be practical.
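
A rough sketch of the GCD idea from the description above, assuming the values are plain millisecond timestamps. The class and method names are hypothetical and this is not Lucene's actual doc values encoding; it only illustrates why dividing by a common divisor can save more bits than shifting out trailing zero bits.

```java
import java.util.Arrays;

// Hypothetical sketch, not Lucene's doc values format.
final class GcdCompressionSketch {

  /** Greatest common divisor of two non-negative longs (Euclid's algorithm). */
  static long gcd(long a, long b) {
    while (b != 0) {
      long t = a % b;
      a = b;
      b = t;
    }
    return a;
  }

  /** GCD of (value - min) over all values; 0 when all values are equal.
   *  Assumes value - min does not overflow a long. */
  static long commonDivisor(long[] values) {
    long min = Arrays.stream(values).min().getAsLong();
    long g = 0;
    for (long v : values) {
      g = gcd(g, v - min);
    }
    return g;
  }

  /** Number of bits needed to store an unsigned value (at least 1). */
  static int bitsRequired(long maxValue) {
    return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
  }

  public static void main(String[] args) {
    long day = 86_400_000L; // milliseconds per day
    long[] dates = {0L, 3 * day, 10 * day, 365 * day};

    long divisor = commonDivisor(dates);
    long maxDelta = 365 * day;

    System.out.println("common divisor:      " + divisor);
    System.out.println("bits, raw deltas:    " + bitsRequired(maxDelta));
    System.out.println("bits, shifted only:  "
        + bitsRequired(maxDelta >>> Long.numberOfTrailingZeros(divisor)));
    System.out.println("bits, divided by gcd: " + bitsRequired(maxDelta / divisor));
  }
}
```

For day-granular timestamps the common divisor is 86,400,000, which has only 10 trailing zero bits, so shifting alone would recover a small part of the possible savings.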




[jira] [Commented] (LUCENE-4955) NGramTokenFilter increments positions for each gram

2013-04-25 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641567#comment-13641567
 ] 

Adrien Grand commented on LUCENE-4955:
--

Given that offsets can't go backwards and that tokens in the same position must 
have the same start offset, I think the only way to get NGramTokenFilter 
out of TestRandomChains' exclusion list (LUCENE-4641) is to fix position 
increments (this issue), change the order in which tokens are emitted 
(LUCENE-3920), and stop modifying offsets. I know some people rely on the 
current behavior, but I think it's more important to get this filter out of 
TestRandomChains' exclusions, since it causes highlighting bugs and makes the 
term vector files unnecessarily large.

> NGramTokenFilter increments positions for each gram
> ---
>
> Key: LUCENE-4955
> URL: https://issues.apache.org/jira/browse/LUCENE-4955
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 4.3
>Reporter: Simon Willnauer
> Fix For: 5.0, 4.4
>
> Attachments: highlighter-test.patch, LUCENE-4955.patch
>
>
> NGramTokenFilter increments positions for each gram rather than for the actual 
> token, which can lead to rather funny problems, especially with highlighting. 
> Whether this filter should be used for highlighting is a different story, but 
> today it seems to be common practice in many situations to highlight sub-term 
> matches.
> I have a test for highlighting that uses ngram failing with a StringIOOB 
> since tokens are sorted by position which causes offsets to be mixed up due 
> to ngram token filter.




[jira] [Commented] (LUCENE-4955) NGramTokenFilter increments positions for each gram

2013-04-25 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641706#comment-13641706
 ] 

Adrien Grand commented on LUCENE-4955:
--

+1

I'll work on fixing NGramTokenizer and NGramTokenFilter.

> NGramTokenFilter increments positions for each gram
> ---
>
> Key: LUCENE-4955
> URL: https://issues.apache.org/jira/browse/LUCENE-4955
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 4.3
>Reporter: Simon Willnauer
> Fix For: 5.0, 4.4
>
> Attachments: highlighter-test.patch, highlighter-test.patch, 
> LUCENE-4955.patch
>
>
> NGramTokenFilter increments positions for each gram rather than for the actual 
> token, which can lead to rather funny problems, especially with highlighting. 
> Whether this filter should be used for highlighting is a different story, but 
> today it seems to be common practice in many situations to highlight sub-term 
> matches.
> I have a test for highlighting that uses ngram failing with a StringIOOB 
> since tokens are sorted by position which causes offsets to be mixed up due 
> to ngram token filter.




[jira] [Assigned] (LUCENE-4959) Incorrect return value from SimpleNaiveBayesClassifier.assignClass

2013-04-25 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand reassigned LUCENE-4959:


Assignee: Adrien Grand

> Incorrect return value from SimpleNaiveBayesClassifier.assignClass 
> ---
>
> Key: LUCENE-4959
> URL: https://issues.apache.org/jira/browse/LUCENE-4959
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 5.0, 4.2.1
>Reporter: Alexey Kutin
>Assignee: Adrien Grand
>  Labels: classification
>
> The local copy of BytesRef referenced by foundClass is affected by subsequent 
> TermsEnum.iterator.next() calls as the shared BytesRef.bytes changes. 
> If a term "test" gives a good match and a next term in the terms collection 
> is "classification" with a lower match score then the return result will be 
> "clas"
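
The aliasing problem described above can be reproduced without Lucene. The sketch below uses hypothetical `Ref` and `ReusingEnum` classes as stand-ins for BytesRef and TermsEnum (they are not the real API), and assumes the enumerator reuses a single byte buffer for every term it returns. A copy that keeps its own length but shares the byte array is corrupted by the next call, while a deep copy is not.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of the shared-buffer pitfall; not Lucene code.
final class SharedBufferSketch {

  /** Stand-in for a reference to a slice of bytes. */
  static final class Ref {
    byte[] bytes;
    int length;

    String utf8() {
      return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }
  }

  /** Stand-in enumerator that overwrites one shared buffer on every call. */
  static final class ReusingEnum {
    private final String[] terms;
    private final Ref shared = new Ref();
    private int index = 0;

    ReusingEnum(String... terms) {
      this.terms = terms;
      shared.bytes = new byte[64]; // assumed large enough for this demo
    }

    /** Returns the same Ref instance each time, or null when exhausted. */
    Ref next() {
      if (index == terms.length) {
        return null;
      }
      byte[] utf8 = terms[index++].getBytes(StandardCharsets.UTF_8);
      System.arraycopy(utf8, 0, shared.bytes, 0, utf8.length);
      shared.length = utf8.length;
      return shared;
    }
  }

  /** Copies the length but keeps pointing at the shared array (the bug). */
  static Ref shallowCopy(Ref ref) {
    Ref copy = new Ref();
    copy.bytes = ref.bytes;
    copy.length = ref.length;
    return copy;
  }

  /** Copies the bytes too, so later calls to next() cannot affect the result. */
  static Ref deepCopy(Ref ref) {
    Ref copy = new Ref();
    copy.bytes = Arrays.copyOf(ref.bytes, ref.length);
    copy.length = ref.length;
    return copy;
  }

  public static void main(String[] args) {
    ReusingEnum termsEnum = new ReusingEnum("test", "classification");
    Ref best = termsEnum.next();
    Ref shallow = shallowCopy(best);
    Ref deep = deepCopy(best);
    termsEnum.next(); // overwrites the shared buffer with "classification"

    System.out.println(shallow.utf8()); // prints "clas"
    System.out.println(deep.utf8());    // prints "test"
  }
}
```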




[jira] [Commented] (LUCENE-4957) Stop IndexWriter from writing broken term vector offset data in 5.0

2013-04-25 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642201#comment-13642201
 ] 

Adrien Grand commented on LUCENE-4957:
--

+1

> Stop IndexWriter from writing broken term vector offset data in 5.0
> ---
>
> Key: LUCENE-4957
> URL: https://issues.apache.org/jira/browse/LUCENE-4957
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>
> Today we allow this in (some analyzers are broken), and only reject them if 
> someone is indexing offsets into the postings lists.
> But we should also ban this when term vectors are enabled. It's time to stop 
> writing this broken data and let broken analyzers be broken.




[jira] [Updated] (LUCENE-4955) NGramTokenFilter increments positions for each gram

2013-04-25 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4955:
-

Attachment: LUCENE-4955.patch

I tried to iterate on Simon's patch:

 * NGramTokenFilter doesn't modify offsets and emits all n-grams of a single 
term at the same position

 * NGramTokenizer uses a sliding window.

 * NGramTokenizer and NGramTokenFilter removed from TestRandomChains exclusions.

It was very hard to add the compatibility version support to NGramTokenizer so 
there are now two distinct classes and the factory picks the right one 
depending on the Lucene match version.

Simon's highlighting test now fails because the highlighted content is 
different, but not because of a broken token stream.
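
To illustrate the first bullet above, here is a minimal sketch of a filter-like function that emits every n-gram of a term at the term's own position and leaves the offsets untouched. The `Token` class and the emission order (by start index, then by length) are assumptions made for the example; this is not the patched NGramTokenFilter, and it works on Java chars rather than code points.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Lucene implementation.
final class NGramPositionSketch {

  /** Minimal token: term text, position increment, and offsets into the input. */
  static final class Token {
    final String term;
    final int positionIncrement;
    final int startOffset;
    final int endOffset;

    Token(String term, int positionIncrement, int startOffset, int endOffset) {
      this.term = term;
      this.positionIncrement = positionIncrement;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
    }

    @Override
    public String toString() {
      return term + "(posInc=" + positionIncrement
          + ", offsets=" + startOffset + "-" + endOffset + ")";
    }
  }

  /** Emits all grams of the token at the token's position, keeping its offsets. */
  static List<Token> grams(Token token, int minGram, int maxGram) {
    List<Token> out = new ArrayList<>();
    int termLength = token.term.length();
    for (int start = 0; start < termLength; start++) {
      for (int len = minGram; len <= maxGram && start + len <= termLength; len++) {
        // Only the first gram advances the position; the others are stacked on it.
        int posInc = out.isEmpty() ? token.positionIncrement : 0;
        out.add(new Token(token.term.substring(start, start + len),
            posInc, token.startOffset, token.endOffset));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Token original = new Token("abcd", 1, 5, 9);
    for (Token gram : grams(original, 2, 3)) {
      System.out.println(gram);
    }
  }
}
```

Because every gram reports the offsets of the original token, offsets can never go backwards within the stream, which is the property TestRandomChains checks for.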

> NGramTokenFilter increments positions for each gram
> ---
>
> Key: LUCENE-4955
> URL: https://issues.apache.org/jira/browse/LUCENE-4955
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 4.3
>Reporter: Simon Willnauer
> Fix For: 5.0, 4.4
>
> Attachments: highlighter-test.patch, highlighter-test.patch, 
> LUCENE-4955.patch, LUCENE-4955.patch
>
>
> NGramTokenFilter increments positions for each gram rather than for the actual 
> token, which can lead to rather funny problems, especially with highlighting. 
> Whether this filter should be used for highlighting is a different story, but 
> today it seems to be common practice in many situations to highlight sub-term 
> matches.
> I have a test for highlighting that uses ngram failing with a StringIOOB 
> since tokens are sorted by position which causes offsets to be mixed up due 
> to ngram token filter.




[jira] [Resolved] (LUCENE-4955) NGramTokenFilter increments positions for each gram

2013-04-26 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4955.
--

Resolution: Fixed

> NGramTokenFilter increments positions for each gram
> ---
>
> Key: LUCENE-4955
> URL: https://issues.apache.org/jira/browse/LUCENE-4955
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 4.3
>Reporter: Simon Willnauer
> Fix For: 5.0, 4.4
>
> Attachments: highlighter-test.patch, highlighter-test.patch, 
> LUCENE-4955.patch, LUCENE-4955.patch
>
>
> NGramTokenFilter increments positions for each gram rather than for the actual 
> token, which can lead to rather funny problems, especially with highlighting. 
> Whether this filter should be used for highlighting is a different story, but 
> today it seems to be common practice in many situations to highlight sub-term 
> matches.
> I have a test for highlighting that uses ngram failing with a StringIOOB 
> since tokens are sorted by position which causes offsets to be mixed up due 
> to ngram token filter.




[jira] [Assigned] (LUCENE-4955) NGramTokenFilter increments positions for each gram

2013-04-26 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand reassigned LUCENE-4955:


Assignee: Adrien Grand

> NGramTokenFilter increments positions for each gram
> ---
>
> Key: LUCENE-4955
> URL: https://issues.apache.org/jira/browse/LUCENE-4955
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 4.3
>Reporter: Simon Willnauer
>Assignee: Adrien Grand
> Fix For: 5.0, 4.4
>
> Attachments: highlighter-test.patch, highlighter-test.patch, 
> LUCENE-4955.patch, LUCENE-4955.patch
>
>
> NGramTokenFilter increments positions for each gram rather than for the actual 
> token, which can lead to rather funny problems, especially with highlighting. 
> Whether this filter should be used for highlighting is a different story, but 
> today it seems to be common practice in many situations to highlight sub-term 
> matches.
> I have a test for highlighting that uses ngram failing with a StringIOOB 
> since tokens are sorted by position which causes offsets to be mixed up due 
> to ngram token filter.




[jira] [Resolved] (LUCENE-3920) ngram tokenizer/filters create nonsense offsets if followed by a word combiner

2013-04-26 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-3920.
--

Resolution: Fixed
  Assignee: Adrien Grand

Fixed by LUCENE-4955.

> ngram tokenizer/filters create nonsense offsets if followed by a word combiner
> --
>
> Key: LUCENE-3920
> URL: https://issues.apache.org/jira/browse/LUCENE-3920
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 3.6, 4.0-ALPHA
>Reporter: Robert Muir
>Assignee: Adrien Grand
> Attachments: LUCENE-3920_test.patch
>
>
> It seems like it may be applying the offsets from the wrong token, 
> because after shingling, the resulting token has a startOffset that's after 
> the endOffset.




[jira] [Resolved] (LUCENE-1227) NGramTokenizer to handle more than 1024 chars

2013-04-26 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-1227.
--

Resolution: Fixed

LUCENE-4955 fixed this issue.

> NGramTokenizer to handle more than 1024 chars
> -
>
> Key: LUCENE-1227
> URL: https://issues.apache.org/jira/browse/LUCENE-1227
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Hiroaki Kawai
>Priority: Minor
> Attachments: LUCENE-1227.patch, NGramTokenizer.patch, 
> NGramTokenizer.patch
>
>
> The current NGramTokenizer can't handle a character stream that is longer than 
> 1024 chars. This is too short for non-whitespace-separated languages.
> I created a patch for this issue.




[jira] [Resolved] (LUCENE-2947) NGramTokenizer shouldn't trim whitespace

2013-04-26 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-2947.
--

Resolution: Fixed

NGramTokenizer doesn't trim whitespace anymore (LUCENE-4955).

> NGramTokenizer shouldn't trim whitespace
> 
>
> Key: LUCENE-2947
> URL: https://issues.apache.org/jira/browse/LUCENE-2947
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 3.0.3
>Reporter: David Byrne
>Priority: Minor
> Attachments: LUCENE-2947.patch, NGramTokenizerTest.java
>
>
> Before I tokenize my strings, I am padding them with white space:
> String foobar = " " + foo + " " + bar + " ";
> When constructing term vectors from ngrams, this strategy has a couple 
> benefits.  First, it places special emphasis on the starting and ending of a 
> word.  Second, it improves the similarity between phrases with swapped words. 
>  " foo bar " matches " bar foo " more closely than "foo bar" matches "bar 
> foo".
> The problem is that Lucene's NGramTokenizer trims whitespace.  This forces me 
> to do some preprocessing on my strings before I can tokenize them:
> foobar.replaceAll(" ","$"); //arbitrary char not in my data
> This is undocumented, so users won't realize their strings are being 
> trim()'ed, unless they look through the source, or examine the tokens 
> manually.
> I am proposing NGramTokenizer should be changed to respect whitespace.  Is 
> there a compelling reason against this?
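
The padding argument in the description above can be checked with a small sketch. It assumes character trigrams and uses Jaccard similarity over the sets of distinct trigrams, which is a simplification chosen for the example rather than the term-vector comparison the reporter used. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: effect of whitespace padding on n-gram overlap.
final class PaddedNGramSketch {

  /** Distinct character trigrams of s; whitespace is kept, never trimmed. */
  static Set<String> trigrams(String s) {
    Set<String> grams = new HashSet<>();
    for (int i = 0; i + 3 <= s.length(); i++) {
      grams.add(s.substring(i, i + 3));
    }
    return grams;
  }

  /** Jaccard similarity: |intersection| / |union|, or 0 for two empty sets. */
  static double jaccard(Set<String> a, Set<String> b) {
    Set<String> intersection = new HashSet<>(a);
    intersection.retainAll(b);
    Set<String> union = new HashSet<>(a);
    union.addAll(b);
    return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
  }

  static double similarity(String a, String b) {
    return jaccard(trigrams(a), trigrams(b));
  }

  public static void main(String[] args) {
    System.out.println(similarity("foo bar", "bar foo"));     // prints 0.25
    System.out.println(similarity(" foo bar ", " bar foo ")); // prints 0.75
  }
}
```

With padding, the grams that straddle a word boundary (such as " fo" and "ar ") are shared between the two phrases, so swapping the words changes only the single gram that spans the middle space.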




[jira] [Resolved] (LUCENE-1224) NGramTokenFilter creates bad TokenStream

2013-04-26 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-1224.
--

Resolution: Fixed

All n-grams now have the same position and offsets as the original token 
(LUCENE-4955).

> NGramTokenFilter creates bad TokenStream
> 
>
> Key: LUCENE-1224
> URL: https://issues.apache.org/jira/browse/LUCENE-1224
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Reporter: Hiroaki Kawai
>Priority: Minor
> Fix For: 4.3
>
> Attachments: LUCENE-1224.patch, NGramTokenFilter.patch, 
> NGramTokenFilter.patch
>
>
> With the current trunk NGramTokenFilter(min=2,max=4), I index the string 
> "abcdef", but I can't query it with "abc". If I query with "ab", I 
> get a hit.
> The reason is that the NGramTokenFilter generates a badly ordered TokenStream. 
> Queries are based on the Token order in the TokenStream: how stemming or 
> phrases should be analyzed is based on that order (Token.positionIncrement).
> With the current filter, the query string "abc" is tokenized to: ab bc abc, 
> meaning "query a string that has ab bc abc in this order".
> The expected filter would generate: ab abc(positionIncrement=0) bc, 
> meaning "query a string that has (ab|abc) bc in this order".
> I'd like to submit a patch for this issue. :-)




[jira] [Commented] (LUCENE-1227) NGramTokenizer to handle more than 1024 chars

2013-04-26 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643326#comment-13643326
 ] 

Adrien Grand commented on LUCENE-1227:
--

David, sorry, I didn't know about your patch and happened to fix this issue as 
part of LUCENE-4955. Your patch seems to operate very similarly and adds 
support for whitespace collapsing, is that correct? Don't hesitate to tell me 
if you think the current implementation needs improvement.

> NGramTokenizer to handle more than 1024 chars
> -
>
> Key: LUCENE-1227
> URL: https://issues.apache.org/jira/browse/LUCENE-1227
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Hiroaki Kawai
>Priority: Minor
> Attachments: LUCENE-1227.patch, NGramTokenizer.patch, 
> NGramTokenizer.patch
>
>
> The current NGramTokenizer can't handle a character stream that is longer than 
> 1024 chars. This is too short for non-whitespace-separated languages.
> I created a patch for this issue.




[jira] [Created] (LUCENE-4963) Deprecate broken TokenFilter constructors

2013-04-27 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4963:


 Summary: Deprecate broken TokenFilter constructors
 Key: LUCENE-4963
 URL: https://issues.apache.org/jira/browse/LUCENE-4963
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
 Fix For: 4.4


We have some TokenFilters which are only broken with specific options. This 
includes:

 * TrimFilter when updateOffsets=true
 * StopFilter, JapanesePartOfSpeechStopFilter, KeepWordFilter, LengthFilter, 
TypeTokenFilter when enablePositionIncrements=false

I think we should deprecate these behaviors in 4.4 and remove them in trunk.




[jira] [Resolved] (LUCENE-4959) Incorrect return value from SimpleNaiveBayesClassifier.assignClass

2013-04-27 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4959.
--

Resolution: Fixed

Thanks Alexey!

> Incorrect return value from SimpleNaiveBayesClassifier.assignClass 
> ---
>
> Key: LUCENE-4959
> URL: https://issues.apache.org/jira/browse/LUCENE-4959
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 5.0, 4.2.1
>Reporter: Alexey Kutin
>Assignee: Adrien Grand
>  Labels: classification
> Attachments: LUCENE-4959.patch
>
>
> The local copy of BytesRef referenced by foundClass is affected by subsequent 
> TermsEnum.iterator.next() calls as the shared BytesRef.bytes changes. 
> If a term "test" gives a good match and a next term in the terms collection 
> is "classification" with a lower match score then the return result will be 
> "clas"




[jira] [Commented] (LUCENE-4966) Add CachingWrapperFilter.sizeInBytes()

2013-04-29 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644512#comment-13644512
 ] 

Adrien Grand commented on LUCENE-4966:
--

+1 I wish we had such methods for the terms index, norms/doc values, stored 
fields/term vectors index, etc. too in order to get a better understanding of 
how Lucene uses memory. 

> Add CachingWrapperFilter.sizeInBytes()
> --
>
> Key: LUCENE-4966
> URL: https://issues.apache.org/jira/browse/LUCENE-4966
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 5.0, 4.4
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Attachments: LUCENE-4966.patch
>
>
> I think it's useful to be able to check how much RAM a given CWF is using ...




[jira] [Resolved] (LUCENE-4936) docvalues date compression

2013-04-29 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4936.
--

Resolution: Fixed

> docvalues date compression
> --
>
> Key: LUCENE-4936
> URL: https://issues.apache.org/jira/browse/LUCENE-4936
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
>Assignee: Adrien Grand
> Fix For: 4.4
>
> Attachments: LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, LUCENE-4936.patch, 
> LUCENE-4936.patch, LUCENE-4936.patch
>
>
> DocValues fields can be very wasteful if you are storing dates (like solr's 
> TrieDateField does if you enable docvalues) and don't actually need all the 
> precision: e.g. "date-only" fields like date of birth with no time component, 
> time fields without milliseconds precision, and so on.
> Ideally we'd compute the GCD of all the values to save space 
> (numberOfTrailingZeros is not really enough here), but I think we should at 
> least look for values like 86400000, 3600000, and 1000 to be practical.




[jira] [Updated] (LUCENE-4963) Deprecate broken TokenFilter constructors

2013-04-29 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4963:
-

Attachment: LUCENE-4963.patch

Thanks Uwe for the advice. Here is a first patch:

 * Deprecate constructors that expose broken options and make them throw an 
IllegalArgumentException when the lucene match version is >= 4.4

 * Remove the same constructors from TestRandomChains' exclusion list.

 * Since enablePositionIncrements=false was used by the Analyzing and Fuzzy 
suggesters to ignore position holes, I had to make it an option in the 
suggesters themselves instead of the token streams.

 * More documentation in the oal.analysis package: PositionLengthAttribute and 
guidelines on writing non-corrupt token streams.
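
The first bullet above describes a version-gated deprecation. A minimal sketch of that pattern follows, using a hypothetical `MatchVersion` enum and `TrimSketch` class instead of the real Lucene types. It assumes the constructor rejects only the broken option value once the match version reaches 4.4; the actual patch may gate the constructor differently.

```java
// Hypothetical sketch of a version-gated deprecated option; not Lucene code.
final class VersionGatedOptionSketch {

  /** Stand-in for the match version, ordered from oldest to newest. */
  enum MatchVersion { LUCENE_43, LUCENE_44 }

  /** Stand-in for a filter whose updateOffsets=true mode is considered broken. */
  static final class TrimSketch {
    final boolean updateOffsets;

    /** @deprecated updateOffsets=true is only accepted for pre-4.4 compatibility. */
    @Deprecated
    TrimSketch(MatchVersion matchVersion, boolean updateOffsets) {
      if (updateOffsets && matchVersion.compareTo(MatchVersion.LUCENE_44) >= 0) {
        throw new IllegalArgumentException(
            "updateOffsets=true is not supported as of 4.4");
      }
      this.updateOffsets = updateOffsets;
    }

    /** Preferred constructor: never enables the broken option. */
    TrimSketch(MatchVersion matchVersion) {
      this(matchVersion, false);
    }
  }

  /** Returns true if the combination is accepted, false if it is rejected. */
  static boolean accepts(MatchVersion matchVersion, boolean updateOffsets) {
    try {
      new TrimSketch(matchVersion, updateOffsets);
      return true;
    } catch (IllegalArgumentException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(accepts(MatchVersion.LUCENE_43, true));
    System.out.println(accepts(MatchVersion.LUCENE_44, true));
    System.out.println(accepts(MatchVersion.LUCENE_44, false));
  }
}
```

Gating on the match version keeps existing indexes and configurations working while preventing new code from opting into the broken behavior.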

> Deprecate broken TokenFilter constructors
> -
>
> Key: LUCENE-4963
> URL: https://issues.apache.org/jira/browse/LUCENE-4963
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
> Fix For: 4.4
>
> Attachments: LUCENE-4963.patch
>
>
> We have some TokenFilters which are only broken with specific options. This 
> includes:
>  * TrimFilter when updateOffsets=true
>  * StopFilter, JapanesePartOfSpeechStopFilter, KeepWordFilter, LengthFilter, 
> TypeTokenFilter when enablePositionIncrements=false
> I think we should deprecate these behaviors in 4.4 and remove them in trunk.




[jira] [Commented] (LUCENE-4963) Deprecate broken TokenFilter constructors

2013-04-30 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645378#comment-13645378
 ] 

Adrien Grand commented on LUCENE-4963:
--

Hi Uwe, thanks for doing the review! The patch applies to trunk and I plan to 
remove deprecations in a second step. Is it OK with you?

> Deprecate broken TokenFilter constructors
> -
>
> Key: LUCENE-4963
> URL: https://issues.apache.org/jira/browse/LUCENE-4963
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
> Fix For: 4.4
>
> Attachments: LUCENE-4963.patch
>
>
> We have some TokenFilters which are only broken with specific options. This 
> includes:
>  * TrimFilter when updateOffsets=true
>  * StopFilter, JapanesePartOfSpeechStopFilter, KeepWordFilter, LengthFilter, 
> TypeTokenFilter when enablePositionIncrements=false
> I think we should deprecate these behaviors in 4.4 and remove them in trunk.




[jira] [Commented] (LUCENE-4970) NGramPhraseQuery is not boosted.

2013-04-30 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645442#comment-13645442
 ] 

Adrien Grand commented on LUCENE-4970:
--

Hi Shingo, you are right. NGramPhraseQuery.rewrite should propagate the boost 
to the rewritten query. Would you like to submit a patch? (see 
http://wiki.apache.org/lucene-java/HowToContribute)

> NGramPhraseQuery is not boosted.
> 
>
> Key: LUCENE-4970
> URL: https://issues.apache.org/jira/browse/LUCENE-4970
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 4.1
>Reporter: Shingo Sasaki
>
> If I call setBoost() on an NGramPhraseQuery, the score does not change.
> I think the boost is forgotten after the query is optimized in the rewrite() method.




[jira] [Assigned] (LUCENE-4970) NGramPhraseQuery is not boosted.

2013-04-30 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand reassigned LUCENE-4970:


Assignee: Adrien Grand

> NGramPhraseQuery is not boosted.
> 
>
> Key: LUCENE-4970
> URL: https://issues.apache.org/jira/browse/LUCENE-4970
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 4.1
>Reporter: Shingo Sasaki
>Assignee: Adrien Grand
>
> If I call setBoost() on an NGramPhraseQuery, the score does not change.
> I think the boost is forgotten after the query is optimized in the rewrite() method.




[jira] [Resolved] (LUCENE-4970) NGramPhraseQuery is not boosted.

2013-05-01 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4970.
--

Resolution: Fixed

Committed, thank you Shingo!

> NGramPhraseQuery is not boosted.
> 
>
> Key: LUCENE-4970
> URL: https://issues.apache.org/jira/browse/LUCENE-4970
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 4.1
>Reporter: Shingo Sasaki
>Assignee: Adrien Grand
> Attachments: LUCENE-4970.patch
>
>
> If I call setBoost() on an NGramPhraseQuery, the score does not change.
> I think the boost is forgotten after the query is optimized in the rewrite() method.




[jira] [Updated] (LUCENE-4970) NGramPhraseQuery is not boosted.

2013-05-01 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4970:
-

Fix Version/s: 4.4

> NGramPhraseQuery is not boosted.
> 
>
> Key: LUCENE-4970
> URL: https://issues.apache.org/jira/browse/LUCENE-4970
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 4.1
>Reporter: Shingo Sasaki
>Assignee: Adrien Grand
> Fix For: 4.4
>
> Attachments: LUCENE-4970.patch
>
>
> If I call setBoost() on an NGramPhraseQuery, the score does not change.
> I think the boost is forgotten after the query is optimized in the rewrite() method.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4946) Refactor SorterTemplate

2013-05-02 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4946:
-

Attachment: LUCENE-4946.patch

This patch contains one base class, Sorter, and 3 implementations:
 * IntroSorter (an improved quicksort like we had before, but I think the name is 
better since it makes it clear that the worst-case complexity is O(n ln(n)) 
instead of O(n^2) as with traditional quicksort),
 * InPlaceMergeSort, the merge sort we had before,
 * TimSort, an improved version of the previous implementation that can gallop 
to make sorting even faster on partially-sorted data.

One major difference is that the end offsets are now exclusive. I tend to find 
it less confusing since you would now call {{sort(0, array.length)}} instead of 
{{sort(0, array.length - 1)}}.
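
To illustrate the exclusive end offset, here is a minimal sketch. RangeSorter and ExclusiveEndDemo are hypothetical names made up for this example, not the classes from the patch, and the insertion sort is only there to keep the code short. It assumes nothing beyond the compare/swap primitives mentioned in this issue.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a compare/swap based sorter; not the Sorter class from the patch. */
abstract class RangeSorter {
  /** Compares the elements at slots i and j. */
  protected abstract int compare(int i, int j);

  /** Swaps the elements at slots i and j. */
  protected abstract void swap(int i, int j);

  /** Sorts slots in [from, to): 'from' is inclusive, 'to' is exclusive. */
  public final void sort(int from, int to) {
    // Plain insertion sort, chosen only to keep the sketch short.
    for (int i = from + 1; i < to; i++) {
      for (int j = i; j > from && compare(j - 1, j) > 0; j--) {
        swap(j - 1, j);
      }
    }
  }
}

final class ExclusiveEndDemo {
  /** Returns a copy of 'values' with the slots in [from, to) sorted. */
  static int[] sortRange(int[] values, int from, int to) {
    final int[] copy = values.clone();
    new RangeSorter() {
      @Override protected int compare(int i, int j) {
        return Integer.compare(copy[i], copy[j]);
      }
      @Override protected void swap(int i, int j) {
        int tmp = copy[i];
        copy[i] = copy[j];
        copy[j] = tmp;
      }
    }.sort(from, to);
    return copy;
  }

  public static void main(String[] args) {
    int[] data = {5, 3, 9, 1, 7};
    // Whole array: the end offset is simply data.length.
    System.out.println(Arrays.toString(sortRange(data, 0, data.length)));
    // Only slots 1, 2 and 3; slot 4 is excluded.
    System.out.println(Arrays.toString(sortRange(data, 1, 4)));
  }
}
```

With an exclusive end, the number of sorted slots is always to - from, so no "- 1" adjustment is needed at the call site.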

Please let me know if you would like to review the patch!

> Refactor SorterTemplate
> ---
>
> Key: LUCENE-4946
> URL: https://issues.apache.org/jira/browse/LUCENE-4946
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Trivial
> Attachments: LUCENE-4946.patch
>
>
> When working on TimSort (LUCENE-4839), I was a little frustrated at not being 
> able to add galloping support because it would have required adding new 
> primitive operations in addition to compare and swap.
> I started working on a prototype that uses inheritance to allow some sorting 
> algorithms to rely on additional primitive operations. You can have a look at 
> https://github.com/jpountz/sorts/tree/master/src/java/net/jpountz/sorts (but 
> beware it is a prototype and still lacks proper documentation and good 
> tests).
> I think it would offer several advantages:
>  - no more need to implement setPivot and comparePivot when using in-place 
> merge sort or insertion sort,
>  - the ability to use faster stable sorting algorithms at the cost of some 
> memory overhead (our in-place merge sort is very slow),
>  - the ability to properly implement algorithms that are useful on specific 
> datasets but require different primitive operations (such as TimSort for 
> partially-sorted data).
> If you are interested in comparing these implementations with Arrays.sort, 
> there is a Benchmark class in src/examples.
> What do you think?




[jira] [Updated] (LUCENE-4946) Refactor SorterTemplate

2013-05-02 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4946:
-

Attachment: LUCENE-4946.patch

Add missing @lucene.internal.

> Refactor SorterTemplate
> ---
>
> Key: LUCENE-4946
> URL: https://issues.apache.org/jira/browse/LUCENE-4946
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Trivial
> Attachments: LUCENE-4946.patch, LUCENE-4946.patch



[jira] [Commented] (LUCENE-4946) Refactor SorterTemplate

2013-05-03 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648271#comment-13648271
 ] 

Adrien Grand commented on LUCENE-4946:
--

bq. It's also useful for other projects, so it's maybe a good idea to make an 
Apache Commons project out of it.

Why not? Or maybe use an existing Commons project such as Commons 
Collections? I'll look into that...

bq. I found some code duplication

I'll fix that. The reason is that I modified ArrayUtil and CollectionUtil which 
have their own private Sorter implementations and then I added tests which 
required me to have concrete implementations in src/test. I'll merge them.

bq. We should remove the following from NOTICE.txt

I'll fix that too.

bq. Perhaps the best way to change it would be to give (startIndex, 
elementsCount) which still reads (0, array.length) in most cases and does not 
have the problems mentioned above...

I have no strong opinion about that. I think the reason I like the (from,to) 
option better is that List.subList and Arrays.copyOfRange take the same 
arguments. For example someone who wants to sort a sub-list with the JDK would 
do {{Collections.sort(list.subList(from,to))}}. So I think it'd be nice to make 
this directly translatable to {{new InPlaceMergeSorter() \{ compare/swap 
\}.sort(from, to)}}.
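
For reference, a small sketch of the JDK convention mentioned above. SubListSortDemo is a made-up class name; the JDK calls themselves (List.subList, Collections.sort, Arrays.copyOfRange) are standard and all treat 'from' as inclusive and 'to' as exclusive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SubListSortDemo {
  /** Returns a copy of 'input' whose elements in [from, to) are sorted. */
  static List<Integer> sortSlice(List<Integer> input, int from, int to) {
    List<Integer> list = new ArrayList<>(input);
    // subList returns a view, so sorting the view reorders 'list' itself.
    Collections.sort(list.subList(from, to));
    return list;
  }

  /** Returns the elements of 'values' in [from, to) as a new array. */
  static int[] copySlice(int[] values, int from, int to) {
    return Arrays.copyOfRange(values, from, to);
  }

  public static void main(String[] args) {
    System.out.println(sortSlice(Arrays.asList(5, 3, 9, 1, 7), 1, 4));
    System.out.println(Arrays.toString(copySlice(new int[] {5, 3, 9, 1, 7}, 1, 4)));
  }
}
```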


> Refactor SorterTemplate
> ---
>
> Key: LUCENE-4946
> URL: https://issues.apache.org/jira/browse/LUCENE-4946
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Trivial
> Attachments: LUCENE-4946.patch, LUCENE-4946.patch



[jira] [Updated] (LUCENE-4946) Refactor SorterTemplate

2013-05-03 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4946:
-

Attachment: LUCENE-4946.patch

New Patch:

 * no more code duplication between ArrayUtil and the test classes

 * ArrayUtil exposes a NATURAL_COMPARATOR to sort arrays based on the natural 
order (for objects that implement Comparable)

 * Removed references to CGlib in the NOTICE.

> Refactor SorterTemplate
> ---
>
> Key: LUCENE-4946
> URL: https://issues.apache.org/jira/browse/LUCENE-4946
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Trivial
> Attachments: LUCENE-4946.patch, LUCENE-4946.patch, LUCENE-4946.patch



[jira] [Commented] (LUCENE-4946) Refactor SorterTemplate

2013-05-03 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648403#comment-13648403
 ] 

Adrien Grand commented on LUCENE-4946:
--

bq. make an Apache Commons project out of it

I just sent an email to their dev@ mailing list to get their opinion about it: 
http://markmail.org/message/if5cgarhavzuy45j.

> Refactor SorterTemplate
> ---
>
> Key: LUCENE-4946
> URL: https://issues.apache.org/jira/browse/LUCENE-4946
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Trivial
> Attachments: LUCENE-4946.patch, LUCENE-4946.patch, LUCENE-4946.patch



[jira] [Created] (LUCENE-4977) Forbidden-apis: avoid calls to Collections.sort

2013-05-03 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-4977:


 Summary: Forbidden-apis: avoid calls to Collections.sort
 Key: LUCENE-4977
 URL: https://issues.apache.org/jira/browse/LUCENE-4977
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Priority: Minor


Collections.sort works by dumping the list's content into an array, sorting it 
with Arrays.sort and then copying the elements back into the list. In contrast, 
CollectionUtil can sort in place when the list supports random access, which is 
more memory-efficient and maybe even faster in some cases.
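
The difference can be sketched as follows. This is only an illustration of the two strategies described above: ListSortSketch is a hypothetical class, sortViaArray mimics the copy-out/copy-back behaviour, and sortInPlace uses a simple insertion sort rather than whatever algorithm CollectionUtil actually uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.ListIterator;
import java.util.RandomAccess;

final class ListSortSketch {
  /** Copy out, sort, copy back: needs a temporary array of list.size() elements. */
  static <T extends Comparable<? super T>> void sortViaArray(List<T> list) {
    Object[] tmp = list.toArray();
    Arrays.sort(tmp);
    ListIterator<T> it = list.listIterator();
    for (Object o : tmp) {
      it.next();
      @SuppressWarnings("unchecked")
      T element = (T) o;
      it.set(element);
    }
  }

  /** Sorts through get/set only, so no temporary array is allocated. */
  static <T extends Comparable<? super T>> void sortInPlace(List<T> list) {
    if (!(list instanceof RandomAccess)) {
      throw new IllegalArgumentException("in-place sorting needs a random-access list");
    }
    // Insertion sort, used here only because it is short.
    for (int i = 1; i < list.size(); i++) {
      T current = list.get(i);
      int j = i - 1;
      while (j >= 0 && list.get(j).compareTo(current) > 0) {
        list.set(j + 1, list.get(j));
        j--;
      }
      list.set(j + 1, current);
    }
  }

  public static void main(String[] args) {
    List<String> a = new ArrayList<>(Arrays.asList("pear", "apple", "fig"));
    List<String> b = new ArrayList<>(a);
    sortViaArray(a);
    sortInPlace(b);
    System.out.println(a + " " + b);
  }
}
```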

We could use the forbidden-apis tool to prevent our code from calling 
Collections.sort.




[jira] [Resolved] (LUCENE-4946) Refactor SorterTemplate

2013-05-03 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4946.
--

   Resolution: Fixed
Fix Version/s: 4.4

> Refactor SorterTemplate
> ---
>
> Key: LUCENE-4946
> URL: https://issues.apache.org/jira/browse/LUCENE-4946
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Trivial
> Fix For: 4.4
>
> Attachments: LUCENE-4946.patch, LUCENE-4946.patch, LUCENE-4946.patch



[jira] [Commented] (LUCENE-4963) Deprecate broken TokenFilter constructors

2013-05-03 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648559#comment-13648559
 ] 

Adrien Grand commented on LUCENE-4963:
--

I'll commit this soon unless someone objects.

> Deprecate broken TokenFilter constructors
> -
>
> Key: LUCENE-4963
> URL: https://issues.apache.org/jira/browse/LUCENE-4963
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
> Fix For: 4.4
>
> Attachments: LUCENE-4963.patch
>
>
> We have some TokenFilters which are only broken with specific options. This 
> includes:
>  * TrimFilter when updateOffsets=true
>  * StopFilter, JapanesePartOfSpeechStopFilter, KeepWordFilter, LengthFilter, 
> TypeTokenFilter when enablePositionIncrements=false
> I think we should deprecate these behaviors in 4.4 and remove them in trunk.




[jira] [Resolved] (LUCENE-4963) Deprecate broken TokenFilter constructors

2013-05-04 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-4963.
--

Resolution: Fixed

Thank you Uwe!

> Deprecate broken TokenFilter constructors
> -
>
> Key: LUCENE-4963
> URL: https://issues.apache.org/jira/browse/LUCENE-4963
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
> Fix For: 4.4
>
> Attachments: LUCENE-4963.patch



[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-09 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813164#comment-16813164
 ] 

Adrien Grand commented on LUCENE-8753:
--

bq. BlockTree and UniformSplit had the same QPS for Term and Phrase queries. I 
didn't understand why the behavior differs between a small and a large index.

I think this is expected. Query processing needs to look up the term in the 
terms dict and then process documents that contain this term. When the index 
gets larger, postings usually grow more quickly than the terms dictionary, so 
processing postings takes relatively more time than looking up the term in the 
terms dictionary. Terms dictionary lookup performance only really matters for 
queries that have few matches (which you somehow simulated by running the 
benchmark on wikimedium500k) and for updates, which are simulated by the 
PKLookup task.

> New PostingFormat - UniformSplit
> 
>
> Key: LUCENE-8753
> URL: https://issues.apache.org/jira/browse/LUCENE-8753
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>Reporter: Bruno Roustant
>Priority: Major
> Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
> objectives:
>  - Clear design and simple code.
>  - Easily extensible, for both the logic and the index format.
>  - Light memory usage with a very compact FST.
>  - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (the attached PDF explains the technique visually in more detail)
>  The principle is to split the list of terms into blocks and use an FST to 
> access the block, but not as a prefix trie, rather with a seek-floor pattern. 
> For the selection of the blocks, there is a target average block size (number 
> of terms), with an allowed delta variation (10%) to compare the terms and 
> select the one with the minimal distinguishing prefix.
>  There are also several optimizations inside the block to make it more 
> compact and speed up the loading/scanning.
> The performance obtained is interesting with the luceneutil benchmark, 
> comparing UniformSplit with BlockTree. Find it in the first comment and also 
> attached for better formatting.
> Although the precise percentages vary between runs, three main points:
>  - TermQuery and PhraseQuery are improved.
>  - PrefixQuery and WildcardQuery are ok.
>  - Fuzzy queries are clearly less performant, because BlockTree is so 
> optimized for them.
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time 
> is reduced by 20%. So this PostingsFormat scales to lots of docs, as 
> BlockTree.
> This initial version passes all Lucene tests. Use “ant test 
> -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity. And we 
> have already exercised this PostingsFormat extensibility to create a 
> different flavor for our own use-case.
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8708) Can we simplify conjunctions of range queries automatically?

2019-04-09 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813199#comment-16813199
 ] 

Adrien Grand commented on LUCENE-8708:
--

Thanks Atri for giving it a try! This change is a bit too invasive for my taste 
given that this is only a nice feature to have. That said I don't really have 
ideas for how to make it better... 

> Can we simplify conjunctions of range queries automatically?
> 
>
> Key: LUCENE-8708
> URL: https://issues.apache.org/jira/browse/LUCENE-8708
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: interval_range_clauses_merging0704.patch
>
>
> BooleanQuery#rewrite already has some logic to make queries more efficient, 
> such as deduplicating filters or rewriting boolean queries that wrap a single 
> positive clause to that clause.
> It would be nice to also simplify conjunctions of range queries, so that eg. 
> {{foo: [5 TO *] AND foo:[* TO 20]}} would be rewritten to {{foo:[5 TO 20]}}. 
> When constructing queries manually or via the classic query parser, it feels 
> unnecessary as this is something that the user can fix easily. However if you 
> want to implement a query parser that only allows specifying one bound at 
> once, such as Gmail ({{after:2018-12-31}} 
> https://support.google.com/mail/answer/7190?hl=en) or GitHub 
> ({{updated:>=2018-12-31}} 
> https://help.github.com/en/articles/searching-issues-and-pull-requests#search-by-when-an-issue-or-pull-request-was-created-or-last-updated)
>  then you might end up with inefficient queries if the end user specifies 
> both an upper and a lower bound. It would be nice if we optimized those 
> automatically.
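
As a rough sketch of the simplification described in the issue, the intersection of two range clauses on the same field comes down to taking the larger lower bound and the smaller upper bound. FieldRange below is a hypothetical helper written for this illustration; it is not a Lucene class and not what the attached patch does.

```java
/** Hypothetical inclusive range over a single numeric field; not a Lucene class. */
final class FieldRange {
  static final long UNBOUNDED_MIN = Long.MIN_VALUE; // stands for "*" as a lower bound
  static final long UNBOUNDED_MAX = Long.MAX_VALUE; // stands for "*" as an upper bound

  final String field;
  final long min;
  final long max;

  FieldRange(String field, long min, long max) {
    this.field = field;
    this.min = min;
    this.max = max;
  }

  /** Returns the single range equivalent to "this AND other". */
  FieldRange intersect(FieldRange other) {
    if (!field.equals(other.field)) {
      throw new IllegalArgumentException("ranges must target the same field");
    }
    return new FieldRange(field, Math.max(min, other.min), Math.min(max, other.max));
  }

  /** True when the bounds cross, meaning that the conjunction cannot match anything. */
  boolean matchesNothing() {
    return min > max;
  }

  @Override
  public String toString() {
    String lo = min == UNBOUNDED_MIN ? "*" : Long.toString(min);
    String hi = max == UNBOUNDED_MAX ? "*" : Long.toString(max);
    return field + ":[" + lo + " TO " + hi + "]";
  }

  public static void main(String[] args) {
    FieldRange lower = new FieldRange("foo", 5, UNBOUNDED_MAX);
    FieldRange upper = new FieldRange("foo", UNBOUNDED_MIN, 20);
    System.out.println(lower + " AND " + upper + " => " + lower.intersect(upper));
  }
}
```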






[jira] [Commented] (LUCENE-7386) Flatten nested disjunctions

2019-04-09 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813211#comment-16813211
 ] 

Adrien Grand commented on LUCENE-7386:
--

For the record I had to disable the verification of scores for this run of the 
benchmark since this change removes intermediate casts to float which trigger 
slight changes in the produced scores.
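
A small sketch of why removing intermediate casts can change scores. It assumes, as described above, that a nested disjunction casts its partial sum to a float before the outer disjunction adds it to its own clauses, while a flattened disjunction sums everything as doubles and casts once. IntermediateCastSketch and the score values are made up for the illustration.

```java
final class IntermediateCastSketch {
  /** Sums as doubles and casts to float once, like a single flat disjunction. */
  static float flatSum(float outer, float[] inner) {
    double sum = outer;
    for (float score : inner) {
      sum += score;
    }
    return (float) sum;
  }

  /** Casts the inner sum to float first, like a nested disjunction returning a float score. */
  static float nestedSum(float outer, float[] inner) {
    double innerSum = 0;
    for (float score : inner) {
      innerSum += score;
    }
    float innerScore = (float) innerSum; // intermediate cast loses low-order bits
    return (float) ((double) outer + innerScore);
  }

  public static void main(String[] args) {
    // Floats near 1e8 are spaced 8 apart, so small addends are easily rounded away.
    float[] inner = {100000000f, 1f, 1f, 1f};
    System.out.println(nestedSum(2f, inner));
    System.out.println(flatSum(2f, inner));
  }
}
```

The magnitudes here are exaggerated to make the effect visible; with real scores the differences are typically in the last bits of the float.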

> Flatten nested disjunctions
> ---
>
> Key: LUCENE-7386
> URL: https://issues.apache.org/jira/browse/LUCENE-7386
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7386.patch, LUCENE-7386.patch, LUCENE-7386.patch
>
>
> Now that coords are gone it became easier to flatten nested disjunctions. It 
> might sound weird to write nested disjunctions in the first place, but 
> disjunctions can be created implicitly by other queries such as 
> more-like-this, LatLonPoint.newBoxQuery, non-scoring synonym queries, etc.






[jira] [Commented] (LUCENE-8738) Bump minimum Java version requirement to 11

2019-04-09 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813346#comment-16813346
 ] 

Adrien Grand commented on LUCENE-8738:
--

There seem to be issues with links to the standard API. I wonder whether it 
might be related to the move from package-list to element-list.

> Bump minimum Java version requirement to 11
> ---
>
> Key: LUCENE-8738
> URL: https://issues.apache.org/jira/browse/LUCENE-8738
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: general/build
>Reporter: Adrien Grand
>Priority: Minor
>  Labels: Java11
> Fix For: master (9.0)
>
>
> See vote thread for reference: https://markmail.org/message/q6ubdycqscpl43aq.






[jira] [Commented] (LUCENE-8738) Bump minimum Java version requirement to 11

2019-04-09 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813445#comment-16813445
 ] 

Adrien Grand commented on LUCENE-8738:
--

Apparently the issue can be worked around by calling the file package-list 
locally, even though it is supposed to be called element-list with the move to 
modules. I'll push a fix shortly.

> Bump minimum Java version requirement to 11
> ---
>
> Key: LUCENE-8738
> URL: https://issues.apache.org/jira/browse/LUCENE-8738
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: general/build
>Reporter: Adrien Grand
>Priority: Minor
>  Labels: Java11
> Fix For: master (9.0)



[jira] [Commented] (LUCENE-8738) Bump minimum Java version requirement to 11

2019-04-09 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813565#comment-16813565
 ] 

Adrien Grand commented on LUCENE-8738:
--

Sorry Uwe, I don't understand what you are suggesting.

> Bump minimum Java version requirement to 11
> ---
>
> Key: LUCENE-8738
> URL: https://issues.apache.org/jira/browse/LUCENE-8738
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: general/build
>Reporter: Adrien Grand
>Priority: Minor
>  Labels: Java11
> Fix For: master (9.0)
>


[jira] [Resolved] (LUCENE-8619) Decrease I/O pressure of OfflineSorter

2019-04-10 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-8619.
--
Resolution: Not A Problem

This isn't a problem anymore now that Ignacio rewrote the merging of BKD trees 
as a selection problem rather than a sorting problem.

> Decrease I/O pressure of OfflineSorter
> --
>
> Key: LUCENE-8619
> URL: https://issues.apache.org/jira/browse/LUCENE-8619
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> OfflineSorter is likely I/O bound, yet it doesn't really try to relieve I/O. 
> For instance it always writes the length using 2 bytes, which is wasteful when 
> used by BKDWriter since all byte[] arrays have exactly the same length. For 
> LatLonPoint, this is a 25% space overhead that we could remove.
> Doing lightweight compression on the fly might also help.
> As a data point, Ignacio told me that after indexing 60M shapes with 
> LatLonShape (1.65B triangles), the index directory was about 265GB and 
> dropped to 57GB when merging was over.






[jira] [Created] (LUCENE-8759) BlockMaxConjunctionScorer's simplified way of computing max scores hurts performance

2019-04-10 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-8759:


 Summary: BlockMaxConjunctionScorer's simplified way of computing 
max scores hurts performance
 Key: LUCENE-8759
 URL: https://issues.apache.org/jira/browse/LUCENE-8759
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


BlockMaxConjunctionScorer computes the minimum value that the score should have 
after each scorer in order to be able to stop scoring a document as soon as 
possible. For instance, say scorers A, B and C produce maximum scores that are 
equal to 4, 2 and 1. If the minimum competitive score is X, then the score 
after scoring A, B and C must be at least X, the score after scoring A and B 
must be at least X-1 and the score after scoring A must be at least X-1-2.
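
The computation in this example can be sketched as below. MinScoresSketch is a hypothetical class that only reproduces the arithmetic of the A, B, C example and ignores the floating-point concerns discussed next.

```java
import java.util.Arrays;

final class MinScoresSketch {
  /**
   * For each scorer i, returns the minimum partial score needed after scoring
   * scorers 0..i so that the document can still reach minCompetitiveScore.
   */
  static double[] minScoresAfterEachScorer(double[] maxScores, double minCompetitiveScore) {
    double[] minScores = new double[maxScores.length];
    double required = minCompetitiveScore;
    // Walk backwards: whatever the remaining scorers can add at most is subtracted.
    for (int i = maxScores.length - 1; i >= 0; i--) {
      minScores[i] = required;
      required -= maxScores[i];
    }
    return minScores;
  }

  public static void main(String[] args) {
    // Scorers A, B and C with maximum scores 4, 2 and 1, and X = 6.
    double[] minScores = minScoresAfterEachScorer(new double[] {4, 2, 1}, 6);
    System.out.println(Arrays.toString(minScores));
  }
}
```

With X = 6, a document whose score from A alone is below 3 can be skipped without evaluating B or C.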

However this is a bit more complex than that due to floating-point numbers 
and the fact that intermediate score values are doubles which only get cast 
to a float after all values have been summed up. In order to keep things 
simple, BlockMaxConjunctionScorer has the following comment and code:

{code}
// Also compute the minimum required scores for a hit to be competitive
// A double that is less than 'score' might still be converted to 'score'
// when casted to a float, so we go to the previous float to avoid this issue
minScores[minScores.length - 1] = minScore > 0 ? Math.nextDown(minScore) : 0;
{code}

It simplifies the problem by calling Math.nextDown(minScore). However this is 
problematic because it defeats the fact that TopScoreDocCollector calls 
setMinCompetitiveScore on the float value that is immediately greater than the 
k-th greatest hit so far.

nextDown(minScore) is not the value that we need. The value that we need is the 
smallest double that converts to minScore when cast to a float, which would 
be half-way between nextDown(minScore) and minScore. In some cases this would 
help get better performance out of conjunctions, especially if some clauses 
produce constant scores.
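
A minimal sketch of how that boundary could be computed, assuming minScore is 
a positive, finite float. The names below are hypothetical and this is not the 
fix that was committed. Because a double-to-float cast rounds ties to the even 
mantissa, the exact midpoint may round down instead of up, in which case the 
next double above it is the one we want.

```java
class CastBoundarySketch {
    /**
     * Returns the smallest double d such that (float) d == f.
     * Assumes f is positive and finite. Hypothetical helper, not Lucene code.
     */
    static double smallestDoubleCastingTo(float f) {
        // Both neighbours have 24-bit significands, so their midpoint is
        // exactly representable as a double.
        double mid = ((double) Math.nextDown(f) + (double) f) / 2;
        // Ties round to even: the midpoint rounds up to f only when f has an
        // even mantissa. Otherwise take the next double above the midpoint.
        return (float) mid == f ? mid : Math.nextUp(mid);
    }

    public static void main(String[] args) {
        float minScore = 1.0f;
        double simplified = Math.nextDown(minScore); // what the current code uses
        double boundary = smallestDoubleCastingTo(minScore);
        // The boundary is a tighter (larger) threshold than nextDown(minScore).
        System.out.println(simplified < boundary);
        System.out.println((float) boundary == minScore);
    }
}
```

Any double partial sum below the boundary casts to a float that is less than 
minScore, so it can be rejected, while nextDown(minScore) lets some of those 
sums through.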

MaxScoreSumPropagator#setMinCompetitiveScore has the same issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Created] (LUCENE-8760) Reconsider the best way to encode postings now that we can skip non-competitive hits

2019-04-10 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-8760:


 Summary: Reconsider the best way to encode postings now that we 
can skip non-competitive hits
 Key: LUCENE-8760
 URL: https://issues.apache.org/jira/browse/LUCENE-8760
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


The fact that we now skip non-competitive hits has some implications for our 
postings:
 - we are now more likely to call advance vs. nextDoc
 - we are less likely to read term frequency for a given doc, since we only do 
that if the maximum score reported by impacts is competitive
 - we are less likely to read positions for a given doc, since exact phrase 
queries first check the maximum score that would be obtained with a phrase freq 
equal to the minimum of all term freqs
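
The last point can be sketched as follows. This is only an illustration under 
assumptions: the class, the method names and the scoring function are 
hypothetical, the score is assumed to be non-decreasing in the frequency, and 
at least one term is assumed.

```java
import java.util.function.IntToDoubleFunction;

class PhraseFreqBoundSketch {
    /** A phrase cannot occur more often than its least frequent term. */
    static int maxPhraseFreq(int[] termFreqs) {
        int min = Integer.MAX_VALUE;
        for (int freq : termFreqs) {
            min = Math.min(min, freq);
        }
        return min;
    }

    /**
     * Positions only need to be read when the score obtained with the
     * upper bound on the phrase freq is still competitive.
     */
    static boolean needsPositions(
            int[] termFreqs, IntToDoubleFunction scoreForFreq, double minCompetitiveScore) {
        return scoreForFreq.applyAsDouble(maxPhraseFreq(termFreqs)) >= minCompetitiveScore;
    }

    public static void main(String[] args) {
        // Hypothetical saturating score, standing in for a real similarity.
        IntToDoubleFunction score = freq -> freq / (freq + 1.0);
        System.out.println(needsPositions(new int[] {3, 1, 5}, score, 0.6)); // prints false
        System.out.println(needsPositions(new int[] {3, 2, 5}, score, 0.6)); // prints true
    }
}
```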

It might be a good opportunity to re-explore the best way to encode postings.

 






[jira] [Created] (LUCENE-8762) Lucene50PostingsReader should specialize reading docs+freqs with impacts

2019-04-10 Thread Adrien Grand (JIRA)
Adrien Grand created LUCENE-8762:


 Summary: Lucene50PostingsReader should specialize reading 
docs+freqs with impacts
 Key: LUCENE-8762
 URL: https://issues.apache.org/jira/browse/LUCENE-8762
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


Currently, if you ask for impacts, we only have one implementation that is able 
to expose everything: docs, freqs, positions and offsets. In contrast, if you 
don't need impacts, we have specializations for docs+freqs, 
docs+freqs+positions and docs+freqs+positions+offsets.

Maybe we should add a specialization for the docs+freqs case with impacts, 
which should be the most common case, and remove the specialization for 
docs+freqs+positions when impacts are not requested?






[jira] [Commented] (LUCENE-8738) Bump minimum Java version requirement to 11

2019-04-10 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814577#comment-16814577
 ] 

Adrien Grand commented on LUCENE-8738:
--

[~thetaphi] Do you know what still needs to be done before merging back to 
master? When we are done, or close to being done, I plan to send an email to 
the list to ask for some more eyes on changes that I made before merging, 
especially the Observable/Observer removal.

> Bump minimum Java version requirement to 11
> ---
>
> Key: LUCENE-8738
> URL: https://issues.apache.org/jira/browse/LUCENE-8738
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: general/build
>Reporter: Adrien Grand
>Priority: Minor
>  Labels: Java11
> Fix For: master (9.0)
>
>
> See vote thread for reference: https://markmail.org/message/q6ubdycqscpl43aq.






[jira] [Commented] (LUCENE-8738) Bump minimum Java version requirement to 11

2019-04-10 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814823#comment-16814823
 ] 

Adrien Grand commented on LUCENE-8738:
--

[~thetaphi] I tested Eclipse indeed. I only had an issue with 
MockInitialContextFactory: Eclipse complains that it tries to access classes 
from a module it doesn't have access to.

> Bump minimum Java version requirement to 11
> ---
>
> Key: LUCENE-8738
> URL: https://issues.apache.org/jira/browse/LUCENE-8738
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: general/build
>Reporter: Adrien Grand
>Priority: Minor
>  Labels: Java11
> Fix For: master (9.0)
>
>
> See vote thread for reference: https://markmail.org/message/q6ubdycqscpl43aq.






[jira] [Commented] (LUCENE-8725) Make TermsQuery.SeekingTermSetTermsEnum public

2019-04-11 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16815143#comment-16815143
 ] 

Adrien Grand commented on LUCENE-8725:
--

+1 to the patch, let's maybe make it internal rather than experimental?

> Make TermsQuery.SeekingTermSetTermsEnum public
> --
>
> Key: LUCENE-8725
> URL: https://issues.apache.org/jira/browse/LUCENE-8725
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Trivial
> Fix For: 8.1
>
> Attachments: LUCENE-8725.patch
>
>
> I have come across use-cases where directly accessing {{TermsQuery}} can 
> help. If there is no objection I would like to make it public





