[jira] [Commented] (LUCENE-8041) All Fields.terms(fld) impls should be O(1) not O(log(N))

Uwe Schindler (JIRA) Thu, 16 May 2019 01:44:34 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841123#comment-16841123
 ]


Uwe Schindler commented on LUCENE-8041:
---------------------------------------

I have a suggestion that may be better than the current approach, which is:
- Build a TreeMap like before and name it sortedField()
- Then rebuild it as a LinkedHashMap

The problem with that approach is that we have to build a possibly huge map 
just to copy it into a LinkedHashMap, and HotSpot can't easily optimize away 
the intermediate allocation on the heap. The end result is fine (as said 
before): a LinkedHashMap needs a bit more heap space than a plain HashMap, but 
should be almost identical in size to a TreeMap (which uses a lot of small 
inner object instances).
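
For reference, the two-step approach described above might look roughly like 
the following sketch. The class and method names (TwoStepCopy, copySorted) and 
the Integer values are stand-ins for illustration only, not Lucene code:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class TwoStepCopy {
  // Hypothetical helper: sort via a temporary TreeMap, then copy into a
  // LinkedHashMap so lookups are O(1) while iteration stays in key order.
  static <V> Map<String, V> copySorted(Map<String, V> unsorted) {
    TreeMap<String, V> sorted = new TreeMap<>(unsorted); // temporary, sorted by key
    return new LinkedHashMap<>(sorted);                  // preserves the TreeMap's iteration order
  }

  public static void main(String[] args) {
    Map<String, Integer> input = new HashMap<>();
    input.put("title", 1);
    input.put("body", 2);
    input.put("author", 3);
    System.out.println(copySorted(input).keySet()); // prints [author, body, title]
  }
}
```

The temporary TreeMap is exactly the extra allocation that the suggestion 
below tries to avoid.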

My suggestion would be this: currently you just iterate over the fields, do 
some transformations, and finally add them to a TreeMap. Rewrite that to use 
Java Streams. It may not work for all cases (especially if the iteration 
involves I/O and the order is important), but for direct fields, for example, 
it may work (the IOException possibly needs to be wrapped):

{code:java}
     public DirectFields(SegmentReadState state, Fields fields, int minSkipCount, int lowFreqCutoff) throws IOException {
       this.fields = StreamSupport.stream(fields.spliterator(), false)
        .sorted()   // <== that's the trick
        .collect(Collectors.toMap(
            Function.identity(),
            field -> {
              try {
                return new DirectField(state, field, fields.terms(field), minSkipCount, lowFreqCutoff);
              } catch (IOException e) {
                throw new UncheckedIOException(e);   // lambdas cannot throw checked exceptions
              }
            },
            (u, v) -> { throw new IllegalArgumentException("Duplicate field name"); },   // a throw needs a block body
            LinkedHashMap::new));
     }
{code}

This is totally untested, just an idea! The merge function there is a dummy 
that only throws; it should never be called, as there should be no duplicate 
keys. This is actually a bit stricter than before: it fails if a Fields 
implementation is buggy and returns the same field name multiple times from 
its iterator.
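
To try the idea outside of Lucene, here is a minimal self-contained sketch of 
the same sort-then-collect pattern. It assumes plain String field names, and 
the class SortedStreamSketch and its describe() helper are hypothetical 
stand-ins for DirectFields and the DirectField constructor:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

public class SortedStreamSketch {
  // Hypothetical stand-in for building a per-field object such as DirectField.
  static String describe(String field) {
    return "terms(" + field + ")";
  }

  // Sorts the names in the stream and collects them straight into a
  // LinkedHashMap, so no intermediate TreeMap is built.
  static Map<String, String> build(Iterable<String> fieldNames) {
    return StreamSupport.stream(fieldNames.spliterator(), false)
        .sorted()
        .collect(Collectors.toMap(
            Function.identity(),
            SortedStreamSketch::describe,
            (u, v) -> { throw new IllegalArgumentException("Duplicate field name"); },
            LinkedHashMap::new));
  }

  public static void main(String[] args) {
    // Iteration order is the sorted order, lookups are hash-based.
    System.out.println(build(Arrays.asList("title", "body", "author")).keySet());
    // prints [author, body, title]

    try {
      build(Arrays.asList("body", "body"));
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage()); // prints Duplicate field name
    }
  }
}
```

Note that sorted() still has to buffer all names before the collector sees 
the first one, so this only removes the intermediate map, not the sorting 
cost itself.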

> All Fields.terms(fld) impls should be O(1) not O(log(N))
> --------------------------------------------------------
>
>                 Key: LUCENE-8041
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8041
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: David Smiley
>            Priority: Major
>         Attachments: LUCENE-8041-LinkedHashMap.patch, LUCENE-8041.patch
>
>
> I've seen apps that have a good number of fields -- hundreds.  The O(log(N)) 
> of TreeMap definitely shows up in a profiler; sometimes 20% of search time, 
> if I recall.  There are many Field implementations that are impacted... in 
> part because Fields is the base class of FieldsProducer.  
> As an aside, I hope Fields to go away some day; FieldsProducer should be 
> TermsProducer and not have an iterator of fields. If DocValuesProducer 
> doesn't have this then why should the terms index part of our API have it?  
> If we did this then the issue here would be a simple transition to a HashMap.
> Or maybe we can switch to HashMap and relax the definition of Fields.iterator 
> to not necessarily be sorted?
> Perhaps the fix can be a relatively simple conversion over to LinkedHashMap 
> in many cases if we can assume when we initialize these internal maps that we 
> consume them in sorted order to begin with.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

