[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826148#comment-15826148 ]

ASF subversion and git services commented on LUCENE-7579:
---------------------------------------------------------

Commit d73e3fb05c917739bcf0899171a024897d1b0269 in lucene-solr's branch refs/heads/branch_6x from [~jim.ferenczi]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d73e3fb ]

LUCENE-7579: fix 6.x backport compilation errors

> Sorting on flushed segment
> --------------------------
>
>                 Key: LUCENE-7579
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7579
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Jim Ferenczi
>
> Today flushed segments built by an index writer with an index sort specified
> are not sorted. The merge is responsible for sorting these segments,
> potentially with others that are already sorted (resulting from another
> merge).
> I'd like to investigate the cost of sorting the segment directly during the
> flush. This could make the merge faster since there are some cheap
> optimizations that can be done only if all segments to be merged are sorted.
> For instance the merge of the points could use the bulk merge instead of
> rebuilding the points from scratch.
> I made a small prototype which sorts the segment on flush here:
> https://github.com/apache/lucene-solr/compare/master...jimczi:flush_sort
> The idea is simple: for points, norms, docvalues and terms I use the
> SortingLeafReader implementation to translate the values that we have in RAM
> into a sorted enumeration for the writers.
> For stored fields I use a two-pass scheme where the documents are first
> written to disk unsorted and then copied to another file with the correct
> sorting. I use the same stored field format for the two steps and just remove
> the file produced by the first pass at the end of the process.
> This prototype has no implementation for index sorting that uses term vectors
> yet. I'll add this later if the tests are good enough.
> Speaking of testing, I tried this branch on [~mikemccand] benchmark scripts
> and compared master with index sorting against my branch with index sorting
> on flush. I tried with sparsetaxis and wikipedia and the first results are
> weird. When I use the SerialScheduler and only one thread to write the docs,
> index sorting on flush is slower. But when I use two threads the sorting on
> flush is much faster even with the SerialScheduler. I'll continue to run the
> tests in order to be able to share something more meaningful.
> The tests are passing except one about concurrent DV updates. I don't know
> this part at all so I did not fix the test yet. I don't even know if we can
> make it work with index sorting ;).
> [~mikemccand] I would love to have your feedback about the prototype. Could
> you please take a look? I am sure there are plenty of bugs, ... but I think
> it's a good start to evaluate the feasibility of this feature.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
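
As a rough illustration of the flush-time sort and the two-pass stored fields copy described in the issue text, here is a minimal plain-Java sketch. It does not use Lucene at all: the class name FlushSortSketch, the methods sortedOrder and copySorted, and the use of a long[] of sort keys and a List<String> of "stored documents" are all assumptions made for this example, not the names or data structures used in the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not Lucene code: models an in-RAM segment as an array
// of per-document sort keys plus a list of stored documents.
final class FlushSortSketch {

    // Computes the sorted document order: position i of the sorted segment
    // holds the document that had id newToOld[i] in the unsorted segment.
    static int[] sortedOrder(long[] sortKeys) {
        Integer[] docs = new Integer[sortKeys.length];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = i;
        }
        // Arrays.sort on an object array is stable, so documents with equal
        // keys keep their original relative order.
        Arrays.sort(docs, Comparator.comparingLong((Integer d) -> sortKeys[d]));
        int[] newToOld = new int[docs.length];
        for (int i = 0; i < docs.length; i++) {
            newToOld[i] = docs[i];
        }
        return newToOld;
    }

    // Second pass of the two-pass scheme: the documents were first written
    // unsorted (the "unsorted" list stands in for the temporary file) and are
    // now copied to a new container in sorted order.
    static List<String> copySorted(List<String> unsorted, int[] newToOld) {
        List<String> sorted = new ArrayList<>(unsorted.size());
        for (int oldDoc : newToOld) {
            sorted.add(unsorted.get(oldDoc));
        }
        return sorted;
    }
}
```

With sort keys {30, 10, 20}, sortedOrder returns {1, 2, 0}, and copying the documents "a", "b", "c" through that order yields "b", "c", "a". In the scheme described above the first-pass file would then be deleted; the sketch leaves that step out.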
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826002#comment-15826002 ]

ASF subversion and git services commented on LUCENE-7579:
---------------------------------------------------------

Commit 8f5b5a393d94500e6c7a8beff54e010c45c3b0e3 in lucene-solr's branch refs/heads/branch_6x from [~jim.ferenczi]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8f5b5a3 ]

LUCENE-7579: sort segments at flush too

Segments are now also sorted during flush, and merging on a sorted index is substantially faster by using some of the same bulk merge optimizations that non-sorted merging uses

(cherry picked from commit 4ccb9fb)
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826001#comment-15826001 ]

ASF subversion and git services commented on LUCENE-7579:
---------------------------------------------------------

Commit 7d96f9f7981dbadda837b5b2cacc3855d19f71aa in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7d96f9f ]

LUCENE-7579: sort segments at flush too

Segments are now also sorted during flush, and merging on a sorted index is substantially faster by using some of the same bulk merge optimizations that non-sorted merging uses
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825928#comment-15825928 ]

Jim Ferenczi commented on LUCENE-7579:
--------------------------------------

I've modified the version check for sorted segments:
{code}
onOrAfter(Version.LUCENE_6_5_0)
{code}
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825926#comment-15825926 ]

Michael McCandless commented on LUCENE-7579:
--------------------------------------------

bq. this backport makes me realize how much better master is by taking doc values APIs in its consumers rather than iterables of numbers or BytesRefs!

++

bq. I know we let it through on master, but now that I look at them again, I don't like the catch Throwable blocks we have around abort(), can we get rid of them?

Let's be sure to fix this (and other feedback here) in master too?

Can you upgrade this {{assert}} in {{IndexWriter.java}} to instead throw a {{CorruptIndexException}}?

{noformat}
+} else if (segmentIndexSort == null) {
+  // Flushed segments are not sorted if they were built with a version prior to 6.4.0
+  assert info.info.getVersion().onOrAfter(Version.LUCENE_6_4_0) == false;
{noformat}

Maybe that's overly paranoid, but I want to make sure we can safely assume this going forward: no segment should ever be unsorted if you are using an index sort.

In {{SortingLeafReader.java}} a small typo ({{fo BWC}} -> {{for BWC}}):

{noformat}
 * {@link Sort}. This is package private and is only used by Lucene fo BWC when it needs to merge
{noformat}

Otherwise this looks great! It's a big change ... let's push it for jenkins to chew on!

Thank you [~jim.ferenczi].
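
To make the difference between the {{assert}} and an explicit exception concrete, here is a minimal plain-Java sketch of the kind of check being asked for. It is only an illustration under stated assumptions: SortedSegmentCheckSketch, checkSegmentSort, CorruptSegmentSketchException and the two boolean parameters are hypothetical names for this example. The exception is a stand-in for Lucene's CorruptIndexException (which is likewise an IOException), and the real check in IndexWriter works on segment metadata and a Version comparison rather than on booleans.

```java
import java.io.IOException;

// Hypothetical sketch, not Lucene code.
final class SortedSegmentCheckSketch {

    // Stand-in for Lucene's CorruptIndexException.
    static final class CorruptSegmentSketchException extends IOException {
        CorruptSegmentSketchException(String message) {
            super(message);
        }
    }

    // segmentIsSorted: whether the segment records an index sort.
    // writtenWithFlushSorting: whether the segment was written by a version
    // that already sorts segments at flush (assumed to be known to the caller).
    static void checkSegmentSort(boolean segmentIsSorted,
                                 boolean writtenWithFlushSorting,
                                 String segmentName) throws CorruptSegmentSketchException {
        if (segmentIsSorted) {
            return; // nothing to verify
        }
        // An unsorted segment is only legitimate if it predates flush-time
        // sorting. Unlike an assert, this check also runs when the JVM is
        // started without -ea.
        if (writtenWithFlushSorting) {
            throw new CorruptSegmentSketchException(
                "segment " + segmentName + " is not sorted but was written by a version that sorts at flush");
        }
    }
}
```

The design point is that a Java assert is skipped unless assertions are enabled, so in a default production JVM an unexpectedly unsorted segment would pass silently, whereas the exception always fires.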
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825920#comment-15825920 ]

Jim Ferenczi commented on LUCENE-7579:
--------------------------------------

Thanks Mike ! Yes branch_6.x: 6.5 is the target.
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825907#comment-15825907 ]

Michael McCandless commented on LUCENE-7579:
--------------------------------------------

Oh yes I will have a look! Sorry for the delay!

And since 6.4 is now branched we should push this to 6.x for future 6.5.
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825856#comment-15825856 ]

Jim Ferenczi commented on LUCENE-7579:
--------------------------------------

[~mikemccand] I think the branch for the backport in the 6x branch is ready:
https://github.com/apache/lucene-solr/compare/branch_6x...jimczi:flush_sort_6x?expand=1

Can you take a look ?
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15781152#comment-15781152 ]

Jim Ferenczi commented on LUCENE-7579:
--------------------------------------

Thanks @jpountz !
I pushed another iteration that hopefully addresses your comments.
finalValues are now set in finish rather than in flush or getDocComparator.
The LongSelector has been replaced by a LongToIntFunction and the duplicated code is removed.
I've also removed the catch Throwable when we abort the stored fields consumer.
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15780873#comment-15780873 ]

Adrien Grand commented on LUCENE-7579:
--------------------------------------

Thanks Jim, I just had a look. It looks good overall, this backport makes me realize how much better master is by taking doc values APIs in its consumers rather than iterables of numbers or BytesRefs!

 - In NumericDocValuesWriter and SortedNumericDocValuesWriter, I think it'd be cleaner to set finalValues in {{finish}} than in {{flush}}
 - Do we really need the count method on LongSelector? On a related note, it seems to me that we could save some copy-pasting by extracting the logic to get a long value from a given doc id? Right now for all sort fields we duplicate the logic that first checks docsWithField to return the missing value twice.
 - OrdSelector looks unused

I know we let it through on master, but now that I look at them again, I don't like the {{catch Throwable}} blocks we have around abort(), can we get rid of them?
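
The suggestion to extract the "value for a doc id, or the missing value" logic can be sketched roughly as follows in plain Java. This is not the code from the patch: MissingValueSketch, valueOrMissing and compareDocs are hypothetical names, a java.util.BitSet stands in for docsWithField, and a long[] stands in for the per-document numeric values. The functional interface used here (IntToLongFunction) was picked for the example only.

```java
import java.util.BitSet;
import java.util.function.IntToLongFunction;

// Hypothetical sketch, not Lucene code.
final class MissingValueSketch {

    // Builds one doc-id-to-value function that hides the docsWithField check,
    // so each sort field does not have to repeat it.
    static IntToLongFunction valueOrMissing(long[] values, BitSet docsWithField, long missingValue) {
        return docId -> docsWithField.get(docId) ? values[docId] : missingValue;
    }

    // Compares two documents through the shared lookup; the missing-value
    // handling now lives in a single place instead of once per operand.
    static int compareDocs(IntToLongFunction valueOf, int docA, int docB) {
        return Long.compare(valueOf.applyAsLong(docA), valueOf.applyAsLong(docB));
    }
}
```

With a missing value of Long.MAX_VALUE, a document without the field sorts after every document that has a smaller value, and both sides of the comparison go through the same lookup.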
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15780577#comment-15780577 ]

Jim Ferenczi commented on LUCENE-7579:
--------------------------------------

I have a candidate branch for the backport to 6.x:
https://github.com/apache/lucene-solr/compare/branch_6x...jimczi:flush_sort_6x?expand=1

I had to adapt the code to use the random access DocValues, so it's more of a rewrite than a backport.
[~jpountz] [~mikemccand] could you please take a look?
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764963#comment-15764963 ]

Michael McCandless commented on LUCENE-7579:
--------------------------------------------

Wow this change gave a big jump in indexing throughput when index sorting is used:
https://home.apache.org/~mikemccand/lucenebench/sparseResults.html#index_throughput

You can see the flush time went up but the merge time went way down.
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764205#comment-15764205 ]

Ferenczi Jim commented on LUCENE-7579:
--------------------------------------

{quote}
Maybe you can work on the 6.x back port in the meantime
{quote}

I am on it !
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764013#comment-15764013 ]

Michael McCandless commented on LUCENE-7579:
--------------------------------------------

Thank you [~jim.ferenczi]! I just pushed your last (squashed) commits to master ... let's let it bake for a while.

Maybe you can work on the 6.x back port in the meantime :)
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764010#comment-15764010 ]

ASF subversion and git services commented on LUCENE-7579:
---------------------------------------------------------

Commit 4ccb9fbd2bbc3afd075aa4bc2b6118f845ea4726 in lucene-solr's branch refs/heads/master from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4ccb9fb ]

LUCENE-7579: sort segments at flush too
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764001#comment-15764001 ]

Ferenczi Jim commented on LUCENE-7579:
--------------------------------------

Thanks [~jpountz] and [~mikemccand] !
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15763899#comment-15763899 ]

Adrien Grand commented on LUCENE-7579:
--------------------------------------

bq. I think (later, separate issue) we could use a more naive stored fields (and term vectors) format for the temp files written at flush ... a format that does no compression, just writes bytes to disk, maybe has simple in-memory array pointing to offset in the file for each document ...

+1
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15763859#comment-15763859 ]

Michael McCandless commented on LUCENE-7579:
--------------------------------------------

Patch looks great to me!

I think (later, separate issue) we could use a more naive stored fields (and term vectors) format for the temp files written at flush ... a format that does no compression, just writes bytes to disk, and maybe has a simple in-memory array pointing to the offset in the file for each document ... this format would be package private to oal.index. Later! This patch is great, progress not perfection.

bq. I think we should look into forbidding that (in a different issue).

+1

I'll merge this soon to master so we can get it baking ...
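To make the "naive temp format" idea above a bit more concrete, here is a small sketch of what such a format could look like. It is only an illustration under stated assumptions and not the Lucene implementation: the class name {{NaiveTempStoredFields}} and its methods are hypothetical, a {{ByteArrayOutputStream}} stands in for the temp file on disk, and documents are treated as opaque byte arrays. The first pass appends uncompressed bytes and records each document's start offset in memory; the second pass reads documents back in sorted order.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NaiveTempStoredFields {
    // Stand-in for the temp file: raw, uncompressed document bytes appended back to back.
    private final ByteArrayOutputStream file = new ByteArrayOutputStream();
    // In-memory start offset of each document, indexed by original doc id.
    private final List<Integer> offsets = new ArrayList<>();

    /** First pass: append the document in indexing (unsorted) order. */
    public void add(byte[] doc) {
        offsets.add(file.size());
        file.write(doc, 0, doc.length);
    }

    /** Second pass: read documents back in sorted order (newToOld[i] = original doc id). */
    public List<byte[]> readSorted(int[] newToOld) {
        byte[] bytes = file.toByteArray();
        List<byte[]> sorted = new ArrayList<>();
        for (int oldDoc : newToOld) {
            int start = offsets.get(oldDoc);
            // A document ends where the next one starts, or at the end of the file.
            int end = oldDoc + 1 < offsets.size() ? offsets.get(oldDoc + 1) : bytes.length;
            sorted.add(Arrays.copyOfRange(bytes, start, end));
        }
        return sorted;
    }

    public static void main(String[] args) {
        NaiveTempStoredFields temp = new NaiveTempStoredFields();
        for (String doc : new String[] {"doc-c", "doc-a", "doc-b"}) {
            temp.add(doc.getBytes(StandardCharsets.UTF_8));
        }
        for (byte[] doc : temp.readSorted(new int[] {1, 2, 0})) {
            System.out.println(new String(doc, StandardCharsets.UTF_8));
        }
    }
}
```

The point of the offsets array is that the second pass can seek straight to any document without decompressing a block, which is where a compressed stored fields format would spend time on random-order reads.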
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15763808#comment-15763808 ]

Adrien Grand commented on LUCENE-7579:
--------------------------------------

Thanks, the diff looks good to me!

bq. IndexWriter.addIndexes uses the merge to add indices that could be unsorted

I think we should look into forbidding that (in a different issue).
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15760882#comment-15760882 ]

Ferenczi Jim commented on LUCENE-7579:
--------------------------------------

I pushed another commit that removes the specialized API for sorting a StoredFieldsWriter. This is now done directly in the StoredFieldsConsumer with a custom CopyVisitor (copied from MergeVisitor).

I've also added some asserts that check that any unsorted segment was built by a version prior to Lucene 7.0. We'll need to change the assert when this gets backported to 6.x.

I could not add the assert on maybeSortReaders because IndexWriter.addIndexes uses the merge to add indices that could be unsorted. I don't know whether this should be allowed or not, but we can revisit this later.

Other than that I think it's ready!
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15754026#comment-15754026 ]

Adrien Grand commented on LUCENE-7579:

+1

> Sorting on flushed segment
>
> Key: LUCENE-7579
> URL: https://issues.apache.org/jira/browse/LUCENE-7579
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Ferenczi Jim
>
> Today flushed segments built by an index writer with an index sort specified
> are not sorted. The merge is responsible for sorting these segments,
> potentially with others that are already sorted (resulting from another
> merge).
> I'd like to investigate the cost of sorting the segment directly during the
> flush. This could make the merge faster since there are some cheap
> optimizations that can be done only if all segments to be merged are sorted.
> For instance the merge of the points could use the bulk merge instead of
> rebuilding the points from scratch.
> I made a small prototype which sorts the segment on flush here:
> https://github.com/apache/lucene-solr/compare/master...jimczi:flush_sort
> The idea is simple: for points, norms, docvalues and terms I use the
> SortingLeafReader implementation to translate the values that we have in RAM
> into a sorted enumeration for the writers.
> For stored fields I use a two-pass scheme where the documents are first
> written to disk unsorted and then copied to another file with the correct
> sorting. I use the same stored field format for the two steps and just remove
> the file produced by the first pass at the end of the process.
> This prototype has no implementation for index sorting that uses term vectors
> yet. I'll add this later if the tests are good enough.
> Speaking of testing, I tried this branch on [~mikemccand]'s benchmark scripts
> and compared master with index sorting against my branch with index sorting
> on flush. I tried with sparsetaxis and wikipedia and the first results are
> weird. When I use the SerialScheduler and only one thread to write the docs,
> index sorting on flush is slower. But when I use two threads the sorting on
> flush is much faster, even with the SerialScheduler. I'll continue to run the
> tests in order to be able to share something more meaningful.
> The tests are passing except one about concurrent DV updates. I don't know
> this part at all so I did not fix the test yet. I don't even know if we can
> make it work with index sorting ;).
> [~mikemccand] I would love to have your feedback about the prototype. Could
> you please take a look? I am sure there are plenty of bugs, ... but I think
> it's a good start to evaluate the feasibility of this feature.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
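The two-pass scheme for stored fields described in the issue can be illustrated with a minimal sketch. This is not the prototype's code: the class and method names below are hypothetical, documents are plain strings, the "files" are in-memory lists, and the sort key is assumed to be a single long per document. Only the idea (write unsorted first, then copy in sorted order through a new-to-old doc ID mapping) comes from the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these names are Lucene APIs.
class TwoPassSortSketch {

    /** Returns newToOld: entry i is the arrival-order ID of the doc that becomes doc i after sorting. */
    static int[] newToOld(long[] sortKeys) {
        Integer[] ids = new Integer[sortKeys.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        // Arrays.sort on objects is stable, so docs with equal keys keep arrival order.
        Arrays.sort(ids, Comparator.comparingLong((Integer id) -> sortKeys[id]));
        int[] result = new int[ids.length];
        for (int i = 0; i < ids.length; i++) {
            result[i] = ids[i];
        }
        return result;
    }

    /** Second pass: copy docs from the unsorted first-pass "file" into a new one in sorted order. */
    static List<String> copySorted(List<String> firstPass, int[] newToOld) {
        List<String> secondPass = new ArrayList<>(firstPass.size());
        for (int oldId : newToOld) {
            secondPass.add(firstPass.get(oldId));
        }
        return secondPass;
    }

    public static void main(String[] args) {
        List<String> firstPass = Arrays.asList("doc-c", "doc-a", "doc-b"); // arrival order
        long[] sortKeys = {30L, 10L, 20L};                                   // one key per doc
        System.out.println(copySorted(firstPass, newToOld(sortKeys)));       // prints [doc-a, doc-b, doc-c]
    }
}
```

In this sketch the first-pass list plays the role of the temporary file that is removed once the sorted copy has been written.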
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753993#comment-15753993 ]

Ferenczi Jim commented on LUCENE-7579:

This new API may be a premature optimization that should not be part of this change. What about removing the API and rolling back to a non-optimized copy that "visits" each doc and copies it, like the StoredFieldsReader does? This way the function would be private to the StoredFieldsConsumer. We can still add the optimization you're describing later, but could it be confusing if the index writer's writes are not compressed the same way as the other stored fields writes?
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753957#comment-15753957 ]

Adrien Grand commented on LUCENE-7579:

bq. I am not happy that I had to add this new public API in the StoredFieldsReader but it's the only way to make this optimized for the compressing case.

I was thinking about it too, and I suspect the optimization does not bring much when blocks contain multiple documents (i.e. small docs): I would expect the bottleneck to be that sorting keeps decompressing a 16KB block of the stored fields format for every single document. Maybe we should not try to reuse the codec's stored fields format for the temporary stored fields, and instead do the buffering in memory or on disk with a custom format that has faster random access? I would expect it to be faster in many cases, and it would allow us to get rid of this new API.
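The concern about decompressing a block for every single document can be made concrete with a small counting sketch. It is an illustration under stated assumptions, not a measurement of the actual stored fields format: documents are grouped into fixed-size blocks, and only the most recently decompressed block is kept around (both are assumptions of this sketch, as are all names).

```java
// Hypothetical illustration only; this does not model the real compressing format.
class BlockAccessSketch {

    /**
     * Counts block decompressions for a given read order, assuming only the most
     * recently decompressed block is cached (an assumption of this sketch).
     */
    static int countBlockLoads(int[] readOrder, int docsPerBlock) {
        int loads = 0;
        int cachedBlock = -1;
        for (int docId : readOrder) {
            int block = docId / docsPerBlock;
            if (block != cachedBlock) {
                loads++;
                cachedBlock = block;
            }
        }
        return loads;
    }

    public static void main(String[] args) {
        int docsPerBlock = 4;
        int[] arrivalOrder = {0, 1, 2, 3, 4, 5, 6, 7};
        int[] sortedOrder = {0, 4, 1, 5, 2, 6, 3, 7}; // the sort interleaves the two blocks
        System.out.println(countBlockLoads(arrivalOrder, docsPerBlock)); // prints 2
        System.out.println(countBlockLoads(sortedOrder, docsPerBlock));  // prints 8
    }
}
```

Reading in arrival order touches each block once, while the interleaved order decompresses a block per document, which is the behaviour a format with faster random access would avoid.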
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753913#comment-15753913 ]

Ferenczi Jim commented on LUCENE-7579:

{quote}
CompressingStoredFieldsWriter.sort should always have a CompressingStoredFieldsReader as an input, since the codec cannot change in the middle of the flush, so I think we should be able to skip the instanceof check?
{quote}
That's true for the only call we make to this new API, but since it's public it could be called with a different fields reader in another use case. I am not happy that I had to add this new public API in the StoredFieldsReader, but it's the only way to make this optimized for the compressing case.

{quote}
It would personally help me to have comments eg. in MergeState.maybeSortReaders that the indexSort==null case may only happen for bwc reasons. Maybe we should also assert that if index sorting is configured, then the non-sorted segments can only have 6.2 or 6.3 as a version
{quote}
Agreed, I'll add an assert for the non-sorted case. I'll also add a comment to make it clear that indexSort==null is handled for BWC reasons in maybeSortReaders.

Thanks for having a look [~jpountz]
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15751527#comment-15751527 ]

Adrien Grand commented on LUCENE-7579:

Some questions/comments:
* CompressingStoredFieldsWriter.sort should always have a CompressingStoredFieldsReader as an input, since the codec cannot change in the middle of the flush, so I think we should be able to skip the instanceof check?
* It would personally help me to have comments eg. in MergeState.maybeSortReaders that the {{indexSort==null}} case may only happen for bwc reasons. Maybe we should also assert that if index sorting is configured, then the non-sorted segments can only have 6.2 or 6.3 as a version.

Thanks for working on this change!
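The assertion suggested in the second bullet can be sketched as a plain boolean check. This is only an illustration of the stated rule (if index sorting is configured, a non-sorted segment may only come from 6.2 or 6.3); the class name, method name, and parameters are hypothetical and do not correspond to Lucene's MergeState or version classes.

```java
// Hypothetical illustration only; not the check that was committed.
class BwcSortCheckSketch {

    /**
     * Returns true when a segment is acceptable under the rule from the review:
     * with index sorting configured, an unsorted segment must be from 6.2 or 6.3.
     */
    static boolean unsortedSegmentAllowed(boolean sortConfigured, boolean segmentSorted,
                                          int major, int minor) {
        if (!sortConfigured || segmentSorted) {
            return true; // the rule only constrains unsorted segments under a configured sort
        }
        return major == 6 && (minor == 2 || minor == 3);
    }

    public static void main(String[] args) {
        System.out.println(unsortedSegmentAllowed(true, false, 6, 3)); // prints true
        System.out.println(unsortedSegmentAllowed(true, false, 6, 4)); // prints false
        System.out.println(unsortedSegmentAllowed(true, true, 6, 4));  // prints true
    }
}
```

A real implementation would presumably wrap such a condition in an `assert` at the point where unsorted readers are wrapped for the merge.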
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750937#comment-15750937 ]

Ferenczi Jim commented on LUCENE-7579:

{quote}
We still need to wrap unsorted segments during the merge for BWC so SortingLeafReader should remain.
{quote}
We can still rewrite it as a SortingCodecReader and remove the SlowCodecReaderWrapper, but that's another issue ;)

{quote}
I think we should push first to master, and let that bake some, and in the mean time work out the challenging 6.x back port?
{quote}
Agreed. I'll create a branch for the backport in my repo.

{quote}
I'll wait a day or so before committing to give others a chance to review; it's a large change.
{quote}
That's awesome [~mikemccand]! Thanks for the review and testing.
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749534#comment-15749534 ]

Michael McCandless commented on LUCENE-7579:

Wow, this patch brings the "sparse sorted" indexing time down from 448.5 seconds on master as of a few days ago to 299.8 seconds, a 33% speedup! Nice.
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749502#comment-15749502 ]

Michael McCandless commented on LUCENE-7579:

bq. We still need to wrap unsorted segments during the merge for BWC so SortingLeafReader should remain.

OK I agree, oh well :)

The latest squashed commit looks great; it passes tests and precommit for me. I'll test with the sparse taxis benchmark too, but this looks ready!

I noticed you also optimized merging of stored fields in the sorted case, when field infos are congruent (the common case), by permuting the docIDs to the sort order but simply bulk-copying the already serialized bytes.

I'll wait a day or so before committing to give others a chance to review; it's a large change.

I think we should push first to master, and let that bake some, and in the meantime work out the challenging 6.x backport?
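The stored fields merge optimization mentioned above (reordering documents to the sort while copying their already serialized bytes untouched) can be sketched roughly as follows. This is a simplified illustration, not the committed code: it assumes two segments that are each already sorted by a single long key, and the `StoredDoc` type and all method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names are Lucene APIs.
class SortedBulkCopySketch {

    /** Hypothetical stored document: a sort key plus its already serialized bytes. */
    static final class StoredDoc {
        final long sortKey;
        final byte[] bytes;

        StoredDoc(long sortKey, byte[] bytes) {
            this.sortKey = sortKey;
            this.bytes = bytes;
        }
    }

    /** Merges two segments that are each sorted by sortKey; bytes are passed through, never decoded. */
    static List<byte[]> merge(List<StoredDoc> a, List<StoredDoc> b) {
        List<byte[]> out = new ArrayList<>(a.size() + b.size());
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            // On ties, the doc from the first segment goes first.
            if (a.get(i).sortKey <= b.get(j).sortKey) {
                out.add(a.get(i++).bytes);
            } else {
                out.add(b.get(j++).bytes);
            }
        }
        while (i < a.size()) {
            out.add(a.get(i++).bytes);
        }
        while (j < b.size()) {
            out.add(b.get(j++).bytes);
        }
        return out;
    }

    public static void main(String[] args) {
        List<StoredDoc> seg1 = Arrays.asList(new StoredDoc(1, new byte[] {10}), new StoredDoc(5, new byte[] {50}));
        List<StoredDoc> seg2 = Arrays.asList(new StoredDoc(3, new byte[] {30}));
        for (byte[] doc : merge(seg1, seg2)) {
            System.out.println(Arrays.toString(doc)); // prints [10], then [30], then [50]
        }
    }
}
```

The point of the sketch is that the merge only decides the output order; the serialized payloads are copied as opaque byte arrays.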
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745571#comment-15745571 ]

Michael McCandless commented on LUCENE-7579:

Thanks [~jim.ferenczi], I'll have a look.
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15744895#comment-15744895 ]

Ferenczi Jim commented on LUCENE-7579:

I pushed another iteration to https://github.com/apache/lucene-solr/compare/master...jimczi:flush_sort

I cleaned up the nocommit and added the implementation for sorting term vectors.

{quote}
Do any of the exceptions tests for IndexWriter get angry? Seems like if we hit an IOException e.g. during the renaming that SortingStoredFieldsConsumer.flush does we may leave undeleted files? Hmm or perhaps IW takes care of that by wrapping the directory itself...
{quote}
I added an abort method on the StoredFieldsWriter which deletes the remaining temporary files, and did the same for the SortingTermVectorsConsumer.

[~mikemccand] can you take a look?
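The cleanup concern raised above (temporary files left behind if an IOException interrupts the flush) can be illustrated with a small sketch using only java.nio.file. It is not the actual abort implementation: the class, its methods, and the file names are hypothetical, the sorting itself is omitted, and the failure is simulated with a flag.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the StoredFieldsWriter abort code.
class TempFileAbortSketch {
    private final List<Path> tempFiles = new ArrayList<>();

    /** First pass: write docs in arrival order to a temporary file and remember it. */
    Path writeUnsorted(Path dir, byte[] data) throws IOException {
        Path tmp = Files.createTempFile(dir, "unsorted", ".tmp");
        tempFiles.add(tmp);
        Files.write(tmp, data);
        return tmp;
    }

    /** Deletes every temporary file still tracked; safe to call more than once. */
    void abort() throws IOException {
        for (Path tmp : tempFiles) {
            Files.deleteIfExists(tmp);
        }
        tempFiles.clear();
    }

    /** Flush that removes the first-pass file whether it succeeds or fails. */
    void flush(Path dir, byte[] data, boolean failMidway) throws IOException {
        try {
            Path tmp = writeUnsorted(dir, data);
            if (failMidway) {
                throw new IOException("simulated failure during flush");
            }
            // Second pass (sorting omitted in this sketch): copy into the final file.
            Files.copy(tmp, dir.resolve("sorted.bin"));
        } finally {
            abort(); // the first-pass file is never left behind
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("flush-sketch");
        TempFileAbortSketch consumer = new TempFileAbortSketch();
        try {
            consumer.flush(dir, new byte[] {1, 2, 3}, true);
        } catch (IOException expected) {
            // the simulated failure propagates to the caller
        }
        System.out.println(dir.toFile().list().length); // prints 0
    }
}
```

Tracking temporary files in one place keeps the failure path and the normal end-of-flush cleanup identical.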
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723545#comment-15723545 ]

Ferenczi Jim commented on LUCENE-7579:

Thanks Mike,

{quote}
Can we rename freezed to frozen in BinaryDocValuesWriter? But: why would freezed ever be true when we call flush? Shouldn't it only be called once, even in the sorting case?
{quote}
This is a leftover that is not needed. The naming was wrong ;) and the flag is useless, so I removed it.

{quote}
I also like how you were able to re-use the SortingXXX from SortingLeafReader. Later on we can maybe optimize some of these; e.g. SortingFields and CachedXXXDVs should be able to take advantage of the fact that the things they are sorting are all already in heap (the indexing buffer), the way you did with MutableSortingPointValues (cool).
{quote}
Totally agree, we can revisit this later and see if we can optimize memory. I think it's already an optimization over master in terms of memory usage, since we only "sort" the segment being flushed instead of all the "unsorted" segments during the merge.

{quote}
Can we block creating a SortingLeafReader now (make its constructor private)? We only now ever use its inner classes I think? And it is a dangerous class in the first place... if we can do that, maybe we rename it SortingCodecUtils or something, just for its inner classes.
{quote}
We still need to wrap unsorted segments during the merge for backward compatibility (BWC), so SortingLeafReader should remain. I have no idea when we can remove it, since indices written by older versions should still be compatible with this new one.

{quote}
Do any of the exceptions tests for IndexWriter get angry? Seems like if we hit an IOException e.g. during the renaming that SortingStoredFieldsConsumer.flush does we may leave undeleted files? Hmm or perhaps IW takes care of that by wrapping the directory itself...
{quote}
Honestly I have no idea. I will dig.

{quote}
Can't you just pass sortMap::newToOld directly (method reference) instead of making the lambda here?:
{quote}
Indeed, thanks.

{quote}
I think the 6.x back port here is going to be especially tricky
{quote}
I bet, but as it stands the main part is done by reusing the SortingLeafReader inner classes, which exist in 6.x.

I've also removed a nocommit in the AssertingLiveDocsFormat, which now checks live docs even when they are sorted.
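For illustration, the two-pass stored-fields scheme and the leftover-file concern raised in this exchange can be sketched roughly as follows. This is a minimal sketch in plain Java and not the code from the patch: `TwoPassStoredFieldsSketch`, `flushSorted`, and the one-document-per-line file layout are hypothetical, and it uses `java.nio.file` instead of Lucene's directory and codec APIs. It only shows one possible way to make sure the first-pass file does not survive an `IOException`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not Lucene's SortingStoredFieldsConsumer.
final class TwoPassStoredFieldsSketch {

  // Writes docs (one per line, assumed to contain no line breaks) to target in the
  // order given by newToOld, where newToOld[newDoc] is the doc's position in docs.
  static void flushSorted(List<String> docs, int[] newToOld, Path target) throws IOException {
    Path dir = target.toAbsolutePath().getParent();
    Path tmp = Files.createTempFile(dir, "unsorted", ".tmp");
    boolean success = false;
    try {
      // Pass 1: write the documents in arrival (unsorted) order.
      Files.write(tmp, docs, StandardCharsets.UTF_8);
      // Pass 2: read them back and copy them out in sorted order.
      List<String> unsorted = Files.readAllLines(tmp, StandardCharsets.UTF_8);
      List<String> sorted = new ArrayList<>(newToOld.length);
      for (int oldDoc : newToOld) {
        sorted.add(unsorted.get(oldDoc));
      }
      Files.write(target, sorted, StandardCharsets.UTF_8);
      success = true;
    } finally {
      // Always remove the first-pass file; on failure also remove the partial output.
      Files.deleteIfExists(tmp);
      if (!success) {
        Files.deleteIfExists(target);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("flush-sort-sketch");
    Path target = dir.resolve("stored-fields.txt");
    flushSorted(Arrays.asList("doc-c", "doc-a", "doc-b"), new int[] {1, 2, 0}, target);
    System.out.println(Files.readAllLines(target, StandardCharsets.UTF_8));
  }
}
```

Whether IndexWriter already handles this cleanup by wrapping the directory, as suggested in the quoted comment, is left open in the thread; the `finally` block here is simply the most direct way to express the intent in a standalone example.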
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15722610#comment-15722610 ]

Michael McCandless commented on LUCENE-7579:

This is a nice approach! Basically, the codec remains unaware that index sorting is happening, which is the right way to do it. Instead, the indexing chain takes care of it. And to build the doc comparators you take advantage of the in-heap buffered doc values.

I like that to sort stored fields, you are still just using the codec APIs, writing to temp files, then using the codec to read the stored fields back for sorting.

I also like how you were able to re-use the {{SortingXXX}} from {{SortingLeafReader}}. Later on we can maybe optimize some of these; e.g. {{SortingFields}} and {{CachedXXXDVs}} should be able to take advantage of the fact that the things they are sorting are all already in heap (the indexing buffer), the way you did with {{MutableSortingPointValues}} (cool).

Can we rename {{freezed}} to {{frozen}} in {{BinaryDocValuesWriter}}? But: why would {{freezed}} ever be true when we call {{flush}}? Shouldn't it only be called once, even in the sorting case?

I think the 6.x back port here is going to be especially tricky :)

Can we block creating a {{SortingLeafReader}} now (make its constructor private)? We only now ever use its inner classes I think? And it is a dangerous class in the first place... if we can do that, maybe we rename it {{SortingCodecUtils}} or something, just for its inner classes.

Do any of the exception tests for IndexWriter get angry? Seems like if we hit an {{IOException}}, e.g. during the renaming that {{SortingStoredFieldsConsumer.flush}} does, we may leave undeleted files? Hmm, or perhaps IW takes care of that by wrapping the directory itself...

Can't you just pass {{sortMap::newToOld}} directly (method reference) instead of making the lambda here?:
{noformat}
writer.sort(state.segmentInfo.maxDoc(), mergeReader, state.fieldInfos, (docID) -> (sortMap.newToOld(docID)));
{noformat}
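To make the method-reference suggestion concrete, here is a small standalone sketch. It is not the patch's code: `SortMapSketch`, its nested `DocMap`, and `reorder` are hypothetical stand-ins written in plain Java, and the sort key is assumed to be a single `long` per document held in memory. It shows that a lambda wrapping `newToOld` and the method reference `sortMap::newToOld` are interchangeable wherever an int-to-int mapping function is expected.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

// Hypothetical illustration only; the names below are not Lucene classes.
final class SortMapSketch {

  // Maps doc IDs between buffered (old) order and sorted (new) order.
  static final class DocMap {
    private final int[] newToOld;
    private final int[] oldToNew;

    DocMap(long[] sortValues) {
      Integer[] order = new Integer[sortValues.length];
      for (int i = 0; i < order.length; i++) {
        order[i] = i;
      }
      // Arrays.sort on objects is stable, so ties keep their original order.
      Arrays.sort(order, (a, b) -> Long.compare(sortValues[a], sortValues[b]));
      newToOld = new int[order.length];
      oldToNew = new int[order.length];
      for (int newDoc = 0; newDoc < order.length; newDoc++) {
        newToOld[newDoc] = order[newDoc];
        oldToNew[order[newDoc]] = newDoc;
      }
    }

    int newToOld(int docID) {
      return newToOld[docID];
    }

    int oldToNew(int docID) {
      return oldToNew[docID];
    }
  }

  // Returns the values rearranged into sorted order using a newDoc -> oldDoc function.
  static long[] reorder(long[] values, IntUnaryOperator newToOld) {
    long[] out = new long[values.length];
    for (int newDoc = 0; newDoc < out.length; newDoc++) {
      out[newDoc] = values[newToOld.applyAsInt(newDoc)];
    }
    return out;
  }

  public static void main(String[] args) {
    long[] values = {30L, 10L, 20L};
    DocMap sortMap = new DocMap(values);
    // Lambda form, as in the snippet under review.
    long[] viaLambda = reorder(values, (docID) -> (sortMap.newToOld(docID)));
    // Method-reference form, as suggested.
    long[] viaReference = reorder(values, sortMap::newToOld);
    System.out.println(Arrays.equals(viaLambda, viaReference));
  }
}
```

The two forms behave identically; the method reference is just shorter and avoids the extra parentheses.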
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713678#comment-15713678 ]

Michael McCandless commented on LUCENE-7579:

Thanks [~jim.ferenczi], I also see comparable speedups on the taxis benchmark. I'll have a look at the change! It looks like a doozie :)
[jira] [Commented] (LUCENE-7579) Sorting on flushed segment
[ https://issues.apache.org/jira/browse/LUCENE-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712572#comment-15712572 ]

Ferenczi Jim commented on LUCENE-7579:

I ran the test from a clean state and I can see a nice improvement with the sparsetaxis use case. I use https://github.com/mikemccand/luceneutil/blob/master/src/python/sparsetaxis/runBenchmark.py and compare two checkouts of Lucene, one with my branch and the other with master.

For the master branch I have:
{noformat}
838.0 sec: 20.0 M docs; 23.9 K docs/sec
{noformat}
... vs the branch with the flush sort:
{noformat}
612.2 sec: 20.0 M docs; 32.7 K docs/sec
{noformat}
I see the same difference on each run :)
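As a quick sanity check on the two benchmark lines reported in this comment, the arithmetic can be reproduced with a few lines of Java. The class and method names (`ThroughputCheck`, `docsPerSec`, `speedup`) are made up for this example; only the 20.0 M docs, 838.0 sec and 612.2 sec figures come from the comment.

```java
import java.util.Locale;

// Hypothetical helper for checking the reported benchmark figures.
final class ThroughputCheck {

  // Indexing throughput in documents per second.
  static double docsPerSec(double docs, double seconds) {
    return docs / seconds;
  }

  // How many times faster the candidate run is than the baseline run.
  static double speedup(double baselineSeconds, double candidateSeconds) {
    return baselineSeconds / candidateSeconds;
  }

  public static void main(String[] args) {
    double docs = 20.0e6;       // 20.0 M docs, from the comment
    double masterSec = 838.0;   // master with index sorting
    double branchSec = 612.2;   // branch that sorts on flush
    System.out.println(String.format(Locale.ROOT, "master: %.1f K docs/sec",
        docsPerSec(docs, masterSec) / 1000.0));
    System.out.println(String.format(Locale.ROOT, "branch: %.1f K docs/sec",
        docsPerSec(docs, branchSec) / 1000.0));
    System.out.println(String.format(Locale.ROOT, "speedup: %.2fx",
        speedup(masterSec, branchSec)));
  }
}
```

The computed rates agree with the reported 23.9 K and 32.7 K docs/sec, and the ratio 838.0 / 612.2 works out to roughly 1.37, i.e. about 37% higher indexing throughput (or about 27% less wall-clock time) for the flush-sort branch on this run.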