[jira] [Commented] (LUCENE-7579) Sorting on flushed segment

Michael McCandless (JIRA) Tue, 17 Jan 2017 03:51:54 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15825926#comment-15825926
 ]


Michael McCandless commented on LUCENE-7579:
--------------------------------------------

bq. this backport makes me realize how much better master is by taking doc 
values APIs in its consumers rather than iterables of numbers or BytesRefs!

++

bq. I know we let it through on master, but now that I look at them again, I 
don't like the catch Throwable blocks we have around abort(), can we get rid of 
them?

Let's be sure to fix this (and other feedback here) in master too?

Can you upgrade this {{assert}} in {{IndexWriter.java}} to instead throw a 
{{CorruptIndexException}}?

{noformat}
+        } else if (segmentIndexSort == null) {
+          // Flushed segments are not sorted if they were built with a version prior to 6.4.0
+          assert info.info.getVersion().onOrAfter(Version.LUCENE_6_4_0) == false;
{noformat}

Maybe that's overly paranoid, but I want to make sure we can safely assume this 
going forward: no segment should ever be unsorted if you are using an index 
sort.
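
To illustrate the difference: an {{assert}} only fires when the JVM runs with 
assertions enabled ({{-ea}}), while an exception is always thrown. Below is a 
minimal, self-contained sketch of the check written as a hard failure. It is 
not the actual patch: {{UnsortedSegmentCheck}}, {{checkSegmentSort}}, its 
boolean parameters and the {{CorruptSegmentException}} stand-in are all 
hypothetical names used so the example compiles without Lucene on the 
classpath. In {{IndexWriter.java}} the check would throw Lucene's own 
{{CorruptIndexException}} instead.

```java
import java.io.IOException;

public class UnsortedSegmentCheck {

  /** Hypothetical stand-in for Lucene's CorruptIndexException, for illustration only. */
  static class CorruptSegmentException extends IOException {
    CorruptSegmentException(String message) {
      super(message);
    }
  }

  /**
   * Rejects a segment that has no index sort although it was written by a
   * version that sorts on flush. The two booleans are assumptions standing in
   * for "segmentIndexSort != null" and "getVersion().onOrAfter(LUCENE_6_4_0)".
   */
  static void checkSegmentSort(String segmentName,
                               boolean segmentHasSort,
                               boolean writtenOnOrAfter640) throws CorruptSegmentException {
    if (segmentHasSort == false && writtenOnOrAfter640) {
      // Unlike an assert, this runs even when assertions are disabled.
      throw new CorruptSegmentException(
          "segment " + segmentName + " is not sorted but was written by version 6.4.0 or later");
    }
  }

  public static void main(String[] args) throws IOException {
    checkSegmentSort("_0", false, false); // unsorted, pre-6.4.0: allowed
    checkSegmentSort("_1", true, true);   // sorted: allowed
    try {
      checkSegmentSort("_2", false, true); // unsorted, 6.4.0 or later: rejected
    } catch (CorruptSegmentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

The point of the sketch is only that the unsorted-but-recent case fails loudly 
in production, not just in test runs.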

In {{SortingLeafReader.java}} a small typo ({{fo BWC}} -> {{for BWC}}):

{noformat}
* {@link Sort}. This is package private and is only used by Lucene fo BWC when it needs to merge
{noformat}

Otherwise this looks great!  It's a big change ... let's push it for jenkins to 
chew on!  Thank you [~jim.ferenczi].

> Sorting on flushed segment
> --------------------------
>
>                 Key: LUCENE-7579
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7579
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Jim Ferenczi
>
> Today flushed segments built by an index writer with an index sort specified 
> are not sorted. The merge is responsible for sorting these segments, 
> potentially with others that are already sorted (resulting from another 
> merge). 
> I'd like to investigate the cost of sorting the segment directly during the 
> flush. This could make the merge faster since there are some cheap 
> optimizations that can be done only if all segments to be merged are sorted.
>  For instance the merge of the points could use the bulk merge instead of 
> rebuilding the points from scratch.
> I made a small prototype which sorts the segment on flush here:
> https://github.com/apache/lucene-solr/compare/master...jimczi:flush_sort
> The idea is simple: for points, norms, docvalues and terms I use the 
> SortingLeafReader implementation to translate the values that we have in RAM 
> into a sorted enumeration for the writers.
> For stored fields I use a two pass scheme where the documents are first 
> written to disk unsorted and then copied to another file with the correct 
> sorting. I use the same stored field format for the two steps and just remove 
> the file produced by the first pass at the end of the process.
> This prototype has no implementation for index sorting that uses term vectors 
> yet. I'll add this later if the tests are good enough.
> Speaking of testing, I tried this branch on [~mikemccand]'s benchmark scripts 
> and compared master with index sorting against my branch with index sorting 
> on flush. I tried with sparsetaxis and wikipedia and the first results are 
> weird. When I use the SerialScheduler and only one thread to write the docs,  
> index sorting on flush is slower. But when I use two threads the sorting on 
> flush is much faster even with the SerialScheduler. I'll continue to run the 
> tests in order to be able to share something more meaningful.
> The tests are passing except one about concurrent DV updates. I don't know 
> this part at all so I did not fix the test yet. I don't even know if we can 
> make it work with index sorting ;).
>  [~mikemccand] I would love to have your feedback about the prototype. Could 
> you please take a look? I am sure there are plenty of bugs, ... but I think 
> it's a good start to evaluate the feasibility of this feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
