[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15713678#comment-15713678 ]
Michael McCandless commented on LUCENE-7579:
--------------------------------------------

Thanks [~jim.ferenczi], I also see comparable speedups on the taxis benchmark.

I'll have a look at the change! It looks like a doozie :)

> Sorting on flushed segment
> --------------------------
>
>                 Key: LUCENE-7579
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7579
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Ferenczi Jim
>
> Today flushed segments built by an index writer with an index sort specified
> are not sorted. The merge is responsible for sorting these segments,
> potentially with others that are already sorted (resulting from another
> merge).
> I'd like to investigate the cost of sorting the segment directly during the
> flush. This could make the merge faster since there are some cheap
> optimizations that can be done only if all segments to be merged are sorted.
> For instance the merge of the points could use the bulk merge instead of
> rebuilding the points from scratch.
> I made a small prototype which sorts the segment on flush here:
> https://github.com/apache/lucene-solr/compare/master...jimczi:flush_sort
> The idea is simple: for points, norms, docvalues and terms I use the
> SortingLeafReader implementation to translate the values that we have in RAM
> into a sorted enumeration for the writers.
> For stored fields I use a two-pass scheme where the documents are first
> written to disk unsorted and then copied to another file with the correct
> sorting. I use the same stored field format for the two steps and just remove
> the file produced by the first pass at the end of the process.
> This prototype has no implementation for index sorting that uses term vectors
> yet. I'll add this later if the tests are good enough.
> Speaking of testing, I tried this branch on [~mikemccand]'s benchmark scripts
> and compared master with index sorting against my branch with index sorting
> on flush.
> I tried with sparsetaxis and wikipedia and the first results are
> weird. When I use the SerialScheduler and only one thread to write the docs,
> index sorting on flush is slower. But when I use two threads, sorting on
> flush is much faster, even with the SerialScheduler. I'll continue to run the
> tests in order to be able to share something more meaningful.
> The tests are passing except one about concurrent DV updates. I don't know
> this part at all so I did not fix the test yet. I don't even know if we can
> make it work with index sorting ;).
> [~mikemccand] I would love to have your feedback about the prototype. Could
> you please take a look? I am sure there are plenty of bugs, ... but I think
> it's a good start to evaluate the feasibility of this feature.
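
For readers following the thread, the flush-time sort described in the issue can be sketched roughly as below. This is only an illustration of the idea and not the prototype's code: the class and method names (FlushSortSketch, sortMap, copySorted) are hypothetical, a long[] stands in for the sort-field values buffered in RAM, and a List<String> stands in for the unsorted stored-fields file written by the first pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch; none of these names are Lucene APIs. */
final class FlushSortSketch {

    /**
     * Builds a new-to-old doc map: slot i of the sorted segment holds the
     * document that was buffered at index newToOld[i]. Arrays.sort on an
     * object array is stable, so documents with equal keys keep their
     * original (insertion) order.
     */
    static int[] sortMap(long[] sortKeys) {
        Integer[] order = new Integer[sortKeys.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingLong(i -> sortKeys[i]));

        int[] newToOld = new int[order.length];
        for (int i = 0; i < order.length; i++) {
            newToOld[i] = order[i];
        }
        return newToOld;
    }

    /**
     * Second pass of the two-pass stored-fields scheme: reads the documents
     * written unsorted by the first pass and copies them out in sorted order.
     * The caller would then delete the first-pass output.
     */
    static List<String> copySorted(List<String> unsorted, int[] newToOld) {
        List<String> sorted = new ArrayList<>(unsorted.size());
        for (int oldDoc : newToOld) {
            sorted.add(unsorted.get(oldDoc));
        }
        return sorted;
    }
}
```

For example, with sort keys {30, 10, 20, 10} the doc map is [1, 3, 2, 0], and copying the buffered documents [a, b, c, d] through it yields [b, d, c, a]. The same map could drive the sorted enumeration of points, norms, docvalues and terms, which is the role the issue assigns to SortingLeafReader.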