[jira] [Updated] (LUCENE-7905) Optimizations for OrdinalMap

2017-07-13 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-7905:
---
Attachment: LUCENE-7905.patch

New patch, adding a warning NOTE to {{OrdinalMap}}'s javadocs.  I think it's 
ready.

> Optimizations for OrdinalMap
> 
>
> Key: LUCENE-7905
> URL: https://issues.apache.org/jira/browse/LUCENE-7905
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 7.1
>
> Attachments: LUCENE-7905.patch, LUCENE-7905.patch, LUCENE-7905.patch, 
> LUCENE-7905-specialized.patch
>
>
> {{OrdinalMap}} is a useful class to quickly map per-segment ordinals to 
> global space, but it's fairly costly to build, which must typically be done 
> on every NRT refresh.
> I'm using it quite heavily in two different places, one for 
> {{SortedSetDocValuesFacetCounts}}, and another custom usage, and I found some 
> small optimizations to improve its construction time.
> I switched it to use a simple priority queue to merge the terms instead of 
> the more general {{MultiTermsEnum}}, which does extra work since it must also 
> provide postings, implement seekExact, etc.
> I also pulled {{OrdinalMap}} out into its own oal.index class.
> When testing construction time for my case the patch is ~16% faster (159.9s 
> -> 134.2s) in one case with 91.4 M terms and ~9% faster (115.6s -> 105.7s) in 
> another case with 26.6 M terms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7905) Optimizations for OrdinalMap

2017-07-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-7905:
---
Attachment: LUCENE-7905.patch

New patch, folding in Adrien's suggestion to fully decouple {{MultiTermsEnum}} 
and {{OrdinalMap}}, and adding a couple new TODOs.  I think it's ready.

> Optimizations for OrdinalMap
> 
>
> Key: LUCENE-7905
> URL: https://issues.apache.org/jira/browse/LUCENE-7905
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 7.1
>
> Attachments: LUCENE-7905.patch, LUCENE-7905.patch, 
> LUCENE-7905-specialized.patch
>
>
> {{OrdinalMap}} is a useful class to quickly map per-segment ordinals to 
> global space, but it's fairly costly to build, which must typically be done 
> on every NRT refresh.
> I'm using it quite heavily in two different places, one for 
> {{SortedSetDocValuesFacetCounts}}, and another custom usage, and I found some 
> small optimizations to improve its construction time.
> I switched it to use a simple priority queue to merge the terms instead of 
> the more general {{MultiTermsEnum}}, which does extra work since it must also 
> provide postings, implement seekExact, etc.
> I also pulled {{OrdinalMap}} out into its own oal.index class.
> When testing construction time for my case the patch is ~16% faster (159.9s 
> -> 134.2s) in one case with 91.4 M terms and ~9% faster (115.6s -> 105.7s) in 
> another case with 26.6 M terms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7905) Optimizations for OrdinalMap

2017-07-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-7905:
---
Attachment: LUCENE-7905-specialized.patch

I tried specializing the PQ (patch attached) but it was only a small 
improvement so I don't plan to pursue it any more.  It could be I should have 
also specialized the logic to look for all subs sharing the same term?  But I 
ran out of time I have to play with this so much ;)

I also ran YourKit and found some silly slow code in 
{{DefaultSortedSetDocValuesReaderState}}, where it does a linear scan of all 
ord -> BytesRef to find the ord range of each dimension; we could do this much 
more efficiently with binary searching instead.  I'll put a TODO but don't plan 
to tackle that now/here.  That was 12% of the time.

Other hot areas were calling {{next()}} on the doc values (~25%) and 
{{updateTop}} in the tem queue (~46%).

I do feel like how we compare terms in the PQ is inefficient, and we should be 
able to do something like what radix sort does, because at any given time, the 
terms in the queue likely share long common prefixes yet we keep inefficiently 
re-comparing those long common prefixes.  It seems like there should be a 
powerful optimization here, where if we could somehow efficiently know that 
e.g. right now all terms have a common prefix of length=5, and then only do our 
comparisons starting from the 6th digit, or something ... but I don't see how 
to do it.


> Optimizations for OrdinalMap
> 
>
> Key: LUCENE-7905
> URL: https://issues.apache.org/jira/browse/LUCENE-7905
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 7.1
>
> Attachments: LUCENE-7905.patch, LUCENE-7905-specialized.patch
>
>
> {{OrdinalMap}} is a useful class to quickly map per-segment ordinals to 
> global space, but it's fairly costly to build, which must typically be done 
> on every NRT refresh.
> I'm using it quite heavily in two different places, one for 
> {{SortedSetDocValuesFacetCounts}}, and another custom usage, and I found some 
> small optimizations to improve its construction time.
> I switched it to use a simple priority queue to merge the terms instead of 
> the more general {{MultiTermsEnum}}, which does extra work since it must also 
> provide postings, implement seekExact, etc.
> I also pulled {{OrdinalMap}} out into its own oal.index class.
> When testing construction time for my case the patch is ~16% faster (159.9s 
> -> 134.2s) in one case with 91.4 M terms and ~9% faster (115.6s -> 105.7s) in 
> another case with 26.6 M terms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7905) Optimizations for OrdinalMap

2017-07-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-7905:
---
Attachment: LUCENE-7905.patch

Patch; I think it's ready.

> Optimizations for OrdinalMap
> 
>
> Key: LUCENE-7905
> URL: https://issues.apache.org/jira/browse/LUCENE-7905
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 7.1
>
> Attachments: LUCENE-7905.patch
>
>
> {{OrdinalMap}} is a useful class to quickly map per-segment ordinals to 
> global space, but it's fairly costly to build, which must typically be done 
> on every NRT refresh.
> I'm using it quite heavily in two different places, one for 
> {{SortedSetDocValuesFacetCounts}}, and another custom usage, and I found some 
> small optimizations to improve its construction time.
> I switched it to use a simple priority queue to merge the terms instead of 
> the more general {{MultiTermsEnum}}, which does extra work since it must also 
> provide postings, implement seekExact, etc.
> I also pulled {{OrdinalMap}} out into its own oal.index class.
> When testing construction time for my case the patch is ~16% faster (159.9s 
> -> 134.2s) in one case with 91.4 M terms and ~9% faster (115.6s -> 105.7s) in 
> another case with 26.6 M terms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org