[
https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722610#comment-15722610
]
Michael McCandless commented on LUCENE-7579:
--------------------------------------------
This is a nice approach! Basically, the codec remains unaware index
sorting is happening, which is a the right way to do it. Instead, the
indexing chain takes care of it. And to build the doc comparators you take
advantage of the in-heap buffered doc values.
I like that to sort stored fields, you are still just using the codec
APIs, writing to temp files, then using the codec to read the stored
fields back for sorting.
I also like how you were able to re-use the {{SortingXXX}} from
{{SortingLeafReader}}. Later on we can maybe optimize some of these;
e.g. {{SortingFields}} and {{CachedXXXDVs}} should be able to take
advantage of the fact that the things they are sorting are all already
in heap (the indexing buffer), the way you did with
{{MutableSortingPointValues}} (cool).
Can we rename {{freezed}} to {{frozen}} in {{BinaryDocValuesWriter}}?
But: why would {{freezed}} ever be true when we call {{flush}}?
Shouldn't it only be called once, even in the sorting case?
I think the 6.x back port here is going to be especially tricky :)
Can we block creating a {{SortingLeafReader}} now (make its
constructor private)? We only now ever use its inner classes I think?
And it is a dangerous class in the first place... if we can do that,
maybe we rename it {{SortingCodecUtils}} or something, just for its
inner classes.
Do any of the exceptions tests for IndexWriter get angry? Seems like
if we hit an {{IOException}} e.g. during the renaming that
{{SortingStoredFieldsConsumer.flush}} does we may leave undeleted
files? Hmm or perhaps IW takes care of that by wrapping the directory
itself...
Can't you just pass {{sortMap::newToOld}} directly (method reference)
instead of making the lambda here?:
{noformat}
writer.sort(state.segmentInfo.maxDoc(), mergeReader, state.fieldInfos,
(docID) -> (sortMap.newToOld(docID)));
{noformat}
> Sorting on flushed segment
> --------------------------
>
> Key: LUCENE-7579
> URL: https://issues.apache.org/jira/browse/LUCENE-7579
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Ferenczi Jim
>
> Today flushed segments built by an index writer with an index sort specified
> are not sorted. The merge is responsible of sorting these segments
> potentially with others that are already sorted (resulted from another
> merge).
> I'd like to investigate the cost of sorting the segment directly during the
> flush. This could make the merge faster since they are some cheap
> optimizations that can be done only if all segments to be merged are sorted.
> For instance the merge of the points could use the bulk merge instead of
> rebuilding the points from scratch.
> I made a small prototype which sort the segment on flush here:
> https://github.com/apache/lucene-solr/compare/master...jimczi:flush_sort
> The idea is simple, for points, norms, docvalues and terms I use the
> SortingLeafReader implementation to translate the values that we have in RAM
> in a sorted enumeration for the writers.
> For stored fields I use a two pass scheme where the documents are first
> written to disk unsorted and then copied to another file with the correct
> sorting. I use the same stored field format for the two steps and just remove
> the file produced by the first pass at the end of the process.
> This prototype has no implementation for index sorting that use term vectors
> yet. I'll add this later if the tests are good enough.
> Speaking of testing, I tried this branch on [~mikemccand] benchmark scripts
> and compared master with index sorting against my branch with index sorting
> on flush. I tried with sparsetaxis and wikipedia and the first results are
> weird. When I use the SerialScheduler and only one thread to write the docs,
> index sorting on flush is slower. But when I use two threads the sorting on
> flush is much faster even with the SerialScheduler. I'll continue to run the
> tests in order to be able to share something more meaningful.
> The tests are passing except one about concurrent DV updates. I don't know
> this part at all so I did not fix the test yet. I don't even know if we can
> make it work with index sorting ;).
> [~mikemccand] I would love to have your feedback about the prototype. Could
> you please take a look ? I am sure there are plenty of bugs, ... but I think
> it's a good start to evaluate the feasibility of this feature.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]