[
https://issues.apache.org/jira/browse/LUCENE-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841123#comment-16841123
]
Uwe Schindler commented on LUCENE-8041:
---------------------------------------
I have a suggestion that may be better than the current approach of:
- Build a TreeMap like before and name it sortedField()
- Then rebuild it as LinkedHashMap
The problem here is that we have to build a possibly huge map, just to copy it
to a LinkedHashMap, and HotSpot can't easily optimize away the intermediate
allocation on the heap. The result is fine (as said before): the LinkedHashMap
needs a bit more heap space than a plain HashMap, but should be of almost
identical size to a TreeMap (which uses a lot of small inner object instances).
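For reference, the two-step approach described above can be sketched without
any Lucene classes. This is only an illustration: the class and method names
are hypothetical, and plain String/Integer entries stand in for the real field
data. It relies on the documented behavior that new LinkedHashMap(map) keeps
the iteration order of the source map.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class TreeMapCopySketch {

  // Step 1: sort by key in a TreeMap. Step 2: copy into a LinkedHashMap,
  // which keeps the sorted iteration order but offers hash-based lookups.
  // The TreeMap is a temporary that exists only to establish the order.
  static Map<String, Integer> sortedThenLinked(Map<String, Integer> unsorted) {
    Map<String, Integer> sorted = new TreeMap<>(unsorted);
    return new LinkedHashMap<>(sorted);
  }

  public static void main(String[] args) {
    Map<String, Integer> in = new HashMap<>();
    in.put("title", 1);
    in.put("body", 2);
    in.put("id", 3);
    System.out.println(sortedThenLinked(in).keySet()); // prints [body, id, title]
  }
}
```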
My suggestion would be this: the current code just iterates over the fields,
does some transformations and finally adds them to a TreeMap. Rewrite that to
use Java Streams. It may not work for all cases (especially if the iteration
has I/O involved and the order is important), but e.g. for direct fields it may
work (the IOException needs to be wrapped, as lambdas cannot throw checked
exceptions):
{code:java}
public DirectFields(SegmentReadState state, Fields fields, int minSkipCount, int lowFreqCutoff) throws IOException {
  this.fields = StreamSupport.stream(fields.spliterator(), false)
      .sorted() // <== that's the trick
      .collect(Collectors.toMap(
          Function.identity(),
          field -> {
            try {
              return new DirectField(state, field, fields.terms(field), minSkipCount, lowFreqCutoff);
            } catch (IOException e) {
              // a lambda cannot throw a checked exception, so wrap it
              throw new UncheckedIOException(e);
            }
          },
          // a lambda body that throws must be a block, not a bare expression
          (u, v) -> { throw new IllegalArgumentException("Duplicate field name"); },
          LinkedHashMap::new));
}
{code}
This is totally untested, just an idea! The merge function there is just
stupid, but it's never called as there should be no duplicate keys. This is
actually a bit stricter than before: it fails if a Fields implementation is
broken and returns the same field name multiple times from its iterator.
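The pattern can be tried in isolation. The following is a minimal sketch, not
the Lucene code: a plain Iterable<String> stands in for Fields, and the
hypothetical helper loadTerms() stands in for fields.terms(field) so that the
IOException wrapping is exercised. Unwrapping the UncheckedIOException at the
end is one possible way to keep the "throws IOException" contract.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

class SortedCollectSketch {

  // Hypothetical stand-in for fields.terms(field); it is declared to throw
  // IOException only so that the wrapping below is needed.
  static Integer loadTerms(String field) throws IOException {
    return field.length();
  }

  static Map<String, Integer> build(Iterable<String> fieldNames) throws IOException {
    try {
      return StreamSupport.stream(fieldNames.spliterator(), false)
          .sorted() // keys reach the collector in sorted order
          .collect(Collectors.toMap(
              Function.identity(),
              field -> {
                try {
                  return loadTerms(field);
                } catch (IOException e) {
                  throw new UncheckedIOException(e);
                }
              },
              // only called when the same key shows up twice
              (u, v) -> { throw new IllegalArgumentException("Duplicate field name"); },
              LinkedHashMap::new)); // insertion order == sorted order
    } catch (UncheckedIOException e) {
      throw e.getCause(); // rethrow the original IOException
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(build(Arrays.asList("title", "body", "id")).keySet());
    // prints [body, id, title]
  }
}
```

Passing a list with a repeated name, e.g. Arrays.asList("a", "a"), triggers the
merge function and throws the IllegalArgumentException.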
> All Fields.terms(fld) impls should be O(1) not O(log(N))
> --------------------------------------------------------
>
> Key: LUCENE-8041
> URL: https://issues.apache.org/jira/browse/LUCENE-8041
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: David Smiley
> Priority: Major
> Attachments: LUCENE-8041-LinkedHashMap.patch, LUCENE-8041.patch
>
>
> I've seen apps that have a good number of fields -- hundreds. The O(log(N))
> of TreeMap definitely shows up in a profiler; sometimes 20% of search time,
> if I recall. There are many Fields implementations that are impacted... in
> part because Fields is the base class of FieldsProducer.
> As an aside, I hope Fields goes away some day; FieldsProducer should be
> TermsProducer and not have an iterator of fields. If DocValuesProducer
> doesn't have this then why should the terms index part of our API have it?
> If we did this then the issue here would be a simple transition to a HashMap.
> Or maybe we can switch to HashMap and relax the definition of Fields.iterator
> to not necessarily be sorted?
> Perhaps the fix can be a relatively simple conversion over to LinkedHashMap
> in many cases if we can assume when we initialize these internal maps that we
> consume them in sorted order to begin with.