[jira] [Commented] (LUCENE-8041) All Fields.terms(fld) impls should be O(1) not O(log(N))

David Smiley (JIRA) Tue, 12 Jun 2018 12:46:07 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16510092#comment-16510092
 ]


David Smiley commented on LUCENE-8041:
--------------------------------------

bq. I do wonder if an intermediate step would be to have a map + a 
Iterable<String> so we don't need to do this sorting over and over again?

Absolutely; there's no need to re-sort.  I'm working with 
[[email protected]] who came up with this tidy solution.  In 
BlockTreeTermsReader, the "fields" field becomes a HashMap.  Then, in the 
constructor after it's populated, there's a few lines to build up the list:
{code:java}
ArrayList<String> fieldsNames = new ArrayList<String>(fields.keySet());
Collections.sort(fieldsNames);
fieldsNamesIterable = Collections.unmodifiableCollection(fieldsNames);
{code}
{{fieldsNamesIterable}} is a new declared field.
Very similar logic goes in PerFieldsPostingsFormat.

I think it's possible to avoid the sort() in BlockTreeTermsReader if you know 
you're reading data written pre-sorted.

> All Fields.terms(fld) impls should be O(1) not O(log(N))
> --------------------------------------------------------
>
>                 Key: LUCENE-8041
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8041
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: David Smiley
>            Priority: Major
>         Attachments: LUCENE-8041.patch
>
>
> I've seen apps that have a good number of fields -- hundreds.  The O(log(N)) 
> of TreeMap definitely shows up in a profiler; sometimes 20% of search time, 
> if I recall.  There are many Field implementations that are impacted... in 
> part because Fields is the base class of FieldsProducer.  
> As an aside, I hope Fields to go away some day; FieldsProducer should be 
> TermsProducer and not have an iterator of fields. If DocValuesProducer 
> doesn't have this then why should the terms index part of our API have it?  
> If we did this then the issue here would be a simple transition to a HashMap.
> Or maybe we can switch to HashMap and relax the definition of Fields.iterator 
> to not necessarily be sorted?
> Perhaps the fix can be a relatively simple conversion over to LinkedHashMap 
> in many cases if we can assume when we initialize these internal maps that we 
> consume them in sorted order to begin with.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-8041) All Fields.terms(fld) impls should be O(1) not O(log(N))

Reply via email to