[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16791049#comment-16791049 ] Toke Eskildsen commented on LUCENE-8585: I need to check my mail filters to discover why I did not see this before now. I apologize and thank [~romseygeek] and [~jpountz] for handling it. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Fix For: 8.0 > > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10.5h > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781536#comment-16781536 ] Alan Woodward commented on LUCENE-8585: --- Adrian and I had a look at this, and we think it's a test bug - BaseNormsFormatTestCase specifies NoMergePolicy, and this particular run of the test was using the HandleLimitFS which prevents more than 2048 file handles being opened, which doesn't sit well with lots of small segments being generated and never merged. There's no particular reason for it to not merge things, though. We should also change TestLucene70NormsFormat to test the correct Codec. I'll close this and open a new issue for these two fixes. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Fix For: 8.0 > > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10.5h > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766949#comment-16766949 ] Toke Eskildsen commented on LUCENE-8585: I have no outstanding issues, so resolving is fine. Further performance-oriented testing form my part has been delayed due to outside factors. There might well be further performance-oriented tweaks that would be beneficial, but as a first implementation, I consider LUCENE-8585 to be done. For the potential performance-tweaks, I will open another issue for later Lucene/solr releases. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Fix For: 8.0 > > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10.5h > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16752100#comment-16752100 ] Adrien Grand commented on LUCENE-8585: -- I just pushed the above patch. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10.5h > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748610#comment-16748610 ] Toke Eskildsen commented on LUCENE-8585: Skipping the method call makes sense, [~jpountz], thanks. The synthetic accessor on private methods is new to me (I have now read up on it and understand the problem), so thanks for the enlightment - there's some older code elsewhere I have to re-visit with that in mind. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10.5h > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16747950#comment-16747950 ] Adrien Grand commented on LUCENE-8585: -- [~toke] I was just playing with this new doc-value format and while it seems to perform generally as well as the previous format or better, it seems a bit slower when doing lots of calls to advanceExact by small gaps. I looked at how things get compiled and was able to fix the regression by moving early exits in rankSkip at the call site: because rankSkip is too large to be inlined, if you keep calling advanceExact with small gaps then you will always perform a method call only to figure out than rank skipping doesn't help. Moving the exit conditions to the call site avoids the method call. Here is what the patch looks like (I also removed private modifiers from members of this class as this would create synthetic accessors when used from the Method constants): {noformat} diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene80/IndexedDISI.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene80/IndexedDISI.java index 8ddb93e..520d1d4 100644 --- a/lucene/core/src/java/org/apache/lucene/codecs/lucene80/IndexedDISI.java +++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene80/IndexedDISI.java @@ -254,13 +254,15 @@ final class IndexedDISI extends DocIdSetIterator { return (short)blockCount; } + // Members are pkg-private to avoid synthetic accessors when accessed from the `Method` enum + /** The slice that stores the {@link DocIdSetIterator}. */ - private final IndexInput slice; - private final int jumpTableEntryCount; - private final byte denseRankPower; - private final RandomAccessInput jumpTable; // Skip blocks of 64K bits - private final byte[] denseRankTable; - private final long cost; + final IndexInput slice; + final int jumpTableEntryCount; + final byte denseRankPower; + final RandomAccessInput jumpTable; // Skip blocks of 64K bits + final byte[] denseRankTable; + final long cost; /** * This constructor always creates a new blockSlice and a new jumpTable from in, to ensure that operations are @@ -344,28 +346,28 @@ final class IndexedDISI extends DocIdSetIterator { } } - private int block = -1; - private long blockEnd; - private long denseBitmapOffset = -1; // Only used for DENSE blocks - private int nextBlockIndex = -1; + int block = -1; + long blockEnd; + long denseBitmapOffset = -1; // Only used for DENSE blocks + int nextBlockIndex = -1; Method method; - private int doc = -1; - private int index = -1; + int doc = -1; + int index = -1; // SPARSE variables - private boolean exists; + boolean exists; // DENSE variables - private long word; - private int wordIndex = -1; + long word; + int wordIndex = -1; // number of one bits encountered so far, including those of `word` - private int numberOfOnes; + int numberOfOnes; // Used with rank for jumps inside of DENSE as they are absolute instead of relative - private int denseOrigoIndex; + int denseOrigoIndex; // ALL variables - private int gap; + int gap; @Override public int docID() { @@ -514,7 +516,11 @@ final class IndexedDISI extends DocIdSetIterator { final int targetWordIndex = targetInBlock >>> 6; // If possible, skip ahead using the rank cache -rankSkip(disi, target); +// If the distance between the current position and the target is < rank-longs +// there is no sense in using rank +if (disi.denseRankPower != -1 && targetWordIndex - disi.wordIndex >= (1 << (disi.denseRankPower-6) )) { + rankSkip(disi, targetInBlock); +} for (int i = disi.wordIndex + 1; i <= targetWordIndex; ++i) { disi.word = disi.slice.readLong(); @@ -550,7 +556,12 @@ final class IndexedDISI extends DocIdSetIterator { final int targetInBlock = target & 0x; final int targetWordIndex = targetInBlock >>> 6; -rankSkip(disi, target); +// If possible, skip ahead using the rank cache +// If the distance between the current position and the target is < rank-longs +// there is no sense in using rank +if (disi.denseRankPower != -1 && targetWordIndex - disi.wordIndex >= (1 << (disi.denseRankPower-6) )) { + rankSkip(disi, targetInBlock); +} for (int i = disi.wordIndex + 1; i <= targetWordIndex; ++i) { disi.word = disi.slice.readLong(); @@ -594,23 +605,11 @@ final class IndexedDISI extends DocIdSetIterator { * Note: This does not guarantee a skip up to target, only up to nearest rank boundary. It is the * responsibility of the caller to iterate further to reach target. * @param disi standard DISI. - * @param target the wanted docID for which to calculate set-flag and index. + * @param targetInBlock lower 16 bits
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16747934#comment-16747934 ] Toke Eskildsen commented on LUCENE-8585: Thank you for cleaning up my mess, [~erickerickson]. I have cherry-picked it along with the original LUCENE-8585 commit from {{master}} to {{branch_8x}}, which I hope is the correct action. I consider LUCENE-8585 shipped and will experiment with different performance tests for larger-scale shards in the coming weeks. I will keep an eye on nightly benchmarks (Mike's and Elastic's) to check for regressions. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10.5h > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16747925#comment-16747925 ] ASF subversion and git services commented on LUCENE-8585: - Commit 30c971fb65845f4ae14ba2c4b9ab0c26387bf5a8 in lucene-solr's branch refs/heads/branch_8x from Erick Erickson [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=30c971f ] LUCENE-8585: fix precommit failure (cherry picked from 73d1b07f8e2304bcca06c682c2bbd8738f7b8522) > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10.5h > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16747924#comment-16747924 ] ASF subversion and git services commented on LUCENE-8585: - Commit 5bb6a5b42c406dcc072554527f59b10ff41f2a8c in lucene-solr's branch refs/heads/branch_8x from Toke Eskildsen [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5bb6a5b ] LUCENE-8585: Create jump-tables for DocValues at index-time (cherry picked from c13645bd4c65f052e7f35df8e33d3dbfe6d7cd58) > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10.5h > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746914#comment-16746914 ] Erick Erickson commented on LUCENE-8585: Fixed for master, file isn't present in 8x. Should be OK now. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10h 10m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746913#comment-16746913 ] ASF subversion and git services commented on LUCENE-8585: - Commit 73d1b07f8e2304bcca06c682c2bbd8738f7b8522 in lucene-solr's branch refs/heads/master from Erick Erickson [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=73d1b07 ] LUCENE-8585: fix precommit failure > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10h 10m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746904#comment-16746904 ] Erick Erickson commented on LUCENE-8585: This commit broke precommit, I'm fixing now. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10h 10m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746703#comment-16746703 ] ASF subversion and git services commented on LUCENE-8585: - Commit c13645bd4c65f052e7f35df8e33d3dbfe6d7cd58 in lucene-solr's branch refs/heads/master from Toke Eskildsen [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c13645b ] LUCENE-8585: Create jump-tables for DocValues at index-time > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10h 10m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746300#comment-16746300 ] Toke Eskildsen commented on LUCENE-8585: Doing it now with the current version, but it takes 16 hours to run an index upgrade with my fun-size shard, so it is a weekend project. My old benchmarks took advantage of being able to turn the jump-tables on & off on the fly; there is a bit more hand-holding with the always-on. Given plenty of time, I would like to spend a week of performance-testing and possibly tweaking DENSE-rank size before merging to {{master}}. As things are, I think it is better to push to {{master}} first and then do adjustments (if needed) based on my own and other benchmark data. I have created a patch, applied it to {{master}} and am unit-testing it right now. I plan to push directly to the GitHub-repo from my local check out and then close the pull-issue, as that ensures a single commit and keeps the review history in a usable format. If my logic is flawed, please advice on the correct procedure. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10h 10m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746256#comment-16746256 ] Adrien Grand commented on LUCENE-8585: -- bq. Considering your involvement in this, I suggest I set the credits in the CHANGES.txt to (Toke Eskildsen, Adrien Grand)? This sounds good to me. I'm curious if you have run your benchmarks again with this new format? > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10h 10m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16745915#comment-16745915 ] Alan Woodward commented on LUCENE-8585: --- We're still a way off a release candidate, so feel free to commit to master and 8x > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10h 10m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16745859#comment-16745859 ] Toke Eskildsen commented on LUCENE-8585: Ah yes, {{TestLucene80DocValuesFormat}}, thanks. I'll await the ping from [~romseygeek] and if positive, merge {{master}} in, do a last {{ant test}} and merge back to {{master}}. Considering your involvement in this, I suggest I set the credits in the {{CHANGES.txt}} to (Toke Eskildsen, Adrien Grand)? > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10h 10m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16745344#comment-16745344 ] Adrien Grand commented on LUCENE-8585: -- bq. BaseDocValuesFormatTestCase focuses on Variable Bits Per Value block jumps Did you mean TestLucene80DocValuesFormat? +1 to merge this PR bq. I need a bit of guidance on how to fit this into the 8.0-release. I haven't seen a call to cut a RC yet so I assume it's ok to get this change in [~romseygeek]? > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10h > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16745060#comment-16745060 ] Toke Eskildsen commented on LUCENE-8585: I am the one thanking for top-notch feedback. The current pull-request is way better than the first take. I have cleaned up the jump-tables relevant testing by pulling that code out of the general {{BaseDocValuesFormatTestCase}} to {{TestLucene80DocValuesFormat}} and expanded it a bit. {{TestIndexedDISI}} tests jump-tabls at {{IndexedDIDI}}-level, so {{BaseDocValuesFormatTestCase}} focuses on Variable Bits Per Value block jumps. This reduces the test time overhead significantly and should address the rest of the issues you have raised on the pull request. Our schedules seem to fit poorly, so there have been long delays in our communication. I'll prioritize to react quickly during the next week, in the hope that we can get this pushed to master. I need a bit of guidance on how to fit this into the 8.0-release. Shall I just merge the {{master}}-branch into the {{LUCENE-8585}}-branch to keep the pull-request current? > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 9h 40m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16740841#comment-16740841 ] Adrien Grand commented on LUCENE-8585: -- Thanks Toke for all the iterations. There are still some comments with minor suggestions that would be good to get in, but it looks good to me in general. bq. there are already block-spanning tests in place for the lucene 80 codec, so this is "just" about coverage Do these tests also exercise the logic that only uses jump tables if the target doc ID is sufficiently far forward? Let's remove commented out tests in the base test case? > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 9h 40m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738843#comment-16738843 ] Adrien Grand commented on LUCENE-8585: -- Thanks Toke. Regarding timing, I know [~romseygeek] wants to do cleanups of stuff that got deprecated in 7, which often helped uncover issues before cutting a release, plus we just set up Jenkins jobs. I'm traveling this week but I'll have another look as soon as I can, I should be able to perform another pass by Friday. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 9h 20m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737150#comment-16737150 ] Toke Eskildsen commented on LUCENE-8585: [~jpountz] I have implemented all the code changes you suggested on the pull request. Still pending/under discussion is the 200K test-docs in {{BaseDocValuesFormatTestCase.doTestNumericsVsStoredFields}} - there are already block-spanning tests in place for the lucene 80 codec, so this is "just" about coverage. I see that the 8.0-ball is beginning to roll. How do you see the status of LUCENE-8585 and how does it fit into the 8.0-process? > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 9h 20m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726857#comment-16726857 ] Adrien Grand commented on LUCENE-8585: -- I'm fine either way. Even if we don't add it today, it will be super easy to add in the future by checking the version when reading the meta to use a default value for indices that had been created before this was made configurable. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 5h 10m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726684#comment-16726684 ] Toke Eskildsen commented on LUCENE-8585: [~jpountz] A colleague suggested that we make the {{DENSE}} rank span adjustable or at least plan for it. Right now it is fixed as an entry for every 8 longs and as we discussed earlier, that is just a qualified guess as to what works best. If we record the span in {{meta}} (only accepting powers of 2 for clean shifts & masks) and hardcode it to 8 in the consumer for now, it is just a single {{short}} spend to allow for future tweaking, without having to change the underlying codec. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 5h 10m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726650#comment-16726650 ] Adrien Grand commented on LUCENE-8585: -- Thanks Toke, I'll have a look today. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 5h 10m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726304#comment-16726304 ] Toke Eskildsen commented on LUCENE-8585: [~jpountz] The problem was indeed the initial position in the slice. Index upgrade now works as expected and the fix is pushed to the GitHub pull request. Besides the unit-test coverage consideration, I have no other things on the to-do for LUCENE-8585. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 5h 10m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725907#comment-16725907 ] Toke Eskildsen commented on LUCENE-8585: Quick note, [~jpountz]: I cannot run and index upgrade without getting an exception. I have pin-pointed it to the re-use of previously created slices in {{Lucene80NormsProducer}} {{getDataInput}} & {{getDisiInput}} that is activated by {{merging}}. It works fine if I disable re-use there. Hopefully it is just a question of ignoring the current position of the given slice in the {{IndexedDISI}} constructor as the old code indicates that the {{IndexedDISI}}-data always starts at offset 0 in the slice. I hope to find the time to fix that soon, but things are a but busy the next week. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 5h 10m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724857#comment-16724857 ] Toke Eskildsen commented on LUCENE-8585: [~jpountz] I have implemented the rank pre-loading and pushed it to GitHub. It took just an hour or two, followed by a day of tedious debugging to discover that I shifted some bits the wrong way and that I had not checked that part even though I thought I had. Sigh. I'll experiment with index-upgrading. Still pending is how to handle the {{doTestNumericsVsStoredFields}} unit-test: It did help me find and pinpoint the shift-problem when I ran it with 20K documents and testing all jump-length possibilities, but doing that bumps the full {{TestLucene80DocValuesFormat}} to ~15 minutes, up from ~1 minute. I have noted the +1 and I have prioritized working on LUCENE-8585 for the next weeks (interrupted by hollidays). > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 4h 50m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724413#comment-16724413 ] Adrien Grand commented on LUCENE-8585: -- My understanding of [~romseygeek]'s proposal to cut a branch is to start cleaning up master rather than freezing 8.0. I'd expect another call for a feature freeze when the time comes. I'm also supportive of having it in 8.0 so that we won't have to keep the 7.x codec until 9.x. Please ping me when the PR is ready for another round of review. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 4h 50m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724407#comment-16724407 ] David Smiley commented on LUCENE-8585: -- +1 for this to be incorporated into 8.0. For some people, this is going to be a huge optimization. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 4h 50m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16723918#comment-16723918 ] Toke Eskildsen commented on LUCENE-8585: [~jpountz] how do you see the chances of getting this in 8.0? Freezing seems awfully close and a I would expect some baking time for a change such as LUCENE-8585. On the other hand, it seems a bit backwards to roll a 8.0 with a planned codec-update pending for 8.1. If you are in doubt (as I am), should we discuss this in the relevant thread on the mailing list? > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 4h 50m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719464#comment-16719464 ] Toke Eskildsen commented on LUCENE-8585: Just a quick note: Running {{ant test -Dtestcase=TestLucene80DocValuesFormat}} took 1:19 minutes with 300 documents and 7:30 minutes with 200K. There are about 120 tests in {{BaseDocValuesFormatTestCase}} and 14 of them calls {{doTestNumericsVsStoredFields}}. Back-of-the-envelope: Increasing to 200K documents adds half a minute for each of the 14 tests. They all pass BTW. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719252#comment-16719252 ] Toke Eskildsen commented on LUCENE-8585: Thank you for reviewing, [~jpountz]. I'll try and run an index-upgrade on one of our large shards and measure the difference. I'll also take another look at {{doTestNumericsVsStoredFields}} to see if anything can be done there. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719167#comment-16719167 ] Adrien Grand commented on LUCENE-8585: -- Thanks Toke, I'll give it a look by the end of the week. bq. I could make it switch from 300 to 200,000 when running Nightly or I could hand-pick some of the tests and increase documents for them, which would mean worse coverage but better speed? This trade-off is hard indeed. We should try to optimize coverage while keeping the test reasonably fast, I guess it should run in under 10 seconds or so. Maybe there are things that we can improve in tests that are dedicated for the sparse case, such as avoiding tiny flushes or too aggressive merge settings. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719050#comment-16719050 ] Toke Eskildsen commented on LUCENE-8585: I have cleaned up, moved old lucene70 codec classes and in general tried to finish the job. For a change of pace and ease of review, I have created a pull-request instead of a patch. I'll create a patch, should anyone want that instead. I have a single pending issue with unit-testing: The method {{BaseDocValuesformatTestCase.doTestNumericsVsStoredFields}} is used a lot and currently operates with 300 documents. This is far from enough when testing jumps. Upping it to 200,000 means that jumping can be implicitly tested for all the different test-cases in {{BaseDocValuesformatTestCase}}, but that increases processing time a lot. I could make it switch from 300 to 200,000 when running {{Nightly}} or I could hand-pick some of the tests and increase documents for them, which would mean worse coverage but better speed? > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > Time Spent: 10m > Remaining Estimate: 0h > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717136#comment-16717136 ] ASF subversion and git services commented on LUCENE-8585: - Commit 1da6d39b417148a3a5197931c92b6e8d409059bc in lucene-solr's branch refs/heads/master from [~toke] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1da6d39 ] Revert "LUCENE-8374 part 2/4: Reduce reads for sparse DocValues". LUCENE-8374 was committed without consensus and is expected to be superseded by LUCENE-8585. This reverts commit 7ad027627a179daa7d8d56be191d5b287dfec6f4. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717139#comment-16717139 ] ASF subversion and git services commented on LUCENE-8585: - Commit 8a20705b82272352ffcef8a18a7e8f96b2c05a7b in lucene-solr's branch refs/heads/master from [~toke] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8a20705 ] Revert "LUCENE-8374 part 1/4: Reduce reads for sparse DocValues". LUCENE-8374 was committed without consensus and is expected to be superseded by LUCENE-8585. This reverts commit 58a7a8ada5cebeb261060c56cd6d0a9446478bf6. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717129#comment-16717129 ] ASF subversion and git services commented on LUCENE-8585: - Commit 3158d0c485449400d35a0095db15acdce9f8db5c in lucene-solr's branch refs/heads/master from [~toke] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3158d0c ] Revert "LUCENE-8374 part 4/4: Reduce reads for sparse DocValues". LUCENE-8374 was committed without consensus and is expected to be superseded by LUCENE-8585. This reverts commit e356d793caf2a899f23261baba922d4a08b362ed. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717133#comment-16717133 ] ASF subversion and git services commented on LUCENE-8585: - Commit 6c5d87a505e4ed5803c41c900a374791bb033a0b in lucene-solr's branch refs/heads/master from [~toke] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6c5d87a ] Revert "LUCENE-8374 part 3/4: Reduce reads for sparse DocValues". LUCENE-8374 was committed without consensus and is expected to be superseded by LUCENE-8585. This reverts commit 7949b98f802c9ab3a588a33cdb1771b83c9fcafb. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717126#comment-16717126 ] ASF subversion and git services commented on LUCENE-8585: - Commit 870bb11cc85c961f9ca9fb638143d8189a5abd6a in lucene-solr's branch refs/heads/master from [~toke] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=870bb11 ] Revert "Pre-commit fixes for LUCENE-8374 (JavaDoc + arguments)". LUCENE-8374 was committed without consensus and is expected to be superseded by LUCENE-8585. This reverts commit 6c11168ceda0837f25a27e5b4549f2693457. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713081#comment-16713081 ] Adrien Grand commented on LUCENE-8585: -- bq. Since the norms-classes also uses IndexedDISI, I expect it would be best to upgrade them too. This would leave the core lucene70 folder empty of active code. +1 to improve norms at the same time > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, make_patch_lucene8585.sh > > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712878#comment-16712878 ] Toke Eskildsen commented on LUCENE-8585: Thank you for the clarifications, [~jpountz]. Regarding where to put the jump-data: {quote}If the access pattern is sequential, which I assume would be the case in both cases, then it's fine to keep them on storage. {quote} Well, that really depends on the access pattern from the outside ;). But as the jump-entries are stored sequentially then a request hitting a smaller subset of the documents in a manner that will benefit from jumps means that the jump-entries will be accessed in increasing order. They won't be used if the jumps are within the current block or to the block immediately following the current one. {quote}We can also move the 7.0 format to lucene/backward-codecs since lucene/core only keeps formats that are used for the current codec. {quote} Before I began there was a single file {{Lucene80Codec.java}} in the {{lucene80}} package, picking codec-parts from both 50, 60 and 70. After having implemented the jumps, I have not touched the {{Lucene70Norms*}}-part. I _guess_ I should move the {{Lucene70DocValues*}}-files from {{lucene70}} to {{backward-codecs}}, leaving the norms-classes? Since the norms-classes also uses {{IndexedDISI}}, I expect it would be best to upgrade them too. This would leave the core {{lucene70}} folder empty of active code. {quote}If you move the 7.0 format to lucene/backward-codecs, then you'll need to move it to lucene/backward-codecs/src/resources/META-INF/services/org.apache.lucene.codecs.DocValuesFormat. {quote} That makes sense, thanks! > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, make_patch_lucene8585.sh > > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712534#comment-16712534 ] Adrien Grand commented on LUCENE-8585: -- bq. unsure if it is best to load the the IndexedDISI and vBPV jump-tables to heap or keep them on storage If the access pattern is sequential, which I assume would be the case in both cases, then it's fine to keep them on storage. bq. I have copied LuceneDocValuesFormat, LuceneDocValuesConsumer, LuceneDocValuesProducer and IndexedDISI from lucene70 to lucene80 and made the changes there. The test-class TestIndexedDISI is specifically for the 7.0 version due to package privacy, so I copied that one too. Was that correct? It seems a bit clunky to have a copy of a class with only a few methods added, but on the other hand it does work well for separation of codec versions. This is the correct approach indeed. We can also move the 7.0 format to lucene/backward-codecs since lucene/core only keeps formats that are used for the current codec. bq. I kept the old org.apache.lucene.codecs.lucene70.Lucene70DocValuesFormat in the file. Was that the right way to do it? If you move the 7.0 format to lucene/backward-codecs, then you'll need to move it to lucene/backward-codecs/src/resources/META-INF/services/org.apache.lucene.codecs.DocValuesFormat. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > Labels: performance > Attachments: make_patch_lucene8585.sh > > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711062#comment-16711062 ] Toke Eskildsen commented on LUCENE-8585: [~jpountz] the confusion probably stems from the fact that I was (and a little bit still am) unsure if it is best to load the the {{IndexedDISI}} and {{vBPV}} jump-tables to heap or keep them on storage, from a pure performance perspective. Since I am leaning to keeping them on storage and you green light my first take (append to data), everything's great. I would appreciate a sanity check on how to change the codec: I have copied {{LuceneDocValuesFormat}}, {{LuceneDocValuesConsumer}}, {{LuceneDocValuesProducer}} and {{IndexedDISI}} from {{lucene70}} to {{lucene80}} and made the changes there. The test-class {{TestIndexedDISI}} is specifically for the 7.0 version due to package privacy, so I copied that one too. Was that correct? It seems a bit clunky to have a copy of a class with only a few methods added, but on the other hand it does work well for separation of codec versions. I added {{org.apache.lucene.codecs.lucene80.Lucene80DocValuesFormat}} to the {{org.apache.lucene.codecs.DocValuesFormat}} in {{META_INF/services}}, to get the Lucene 80 codec to discover the new DocValues. I kept the old {{org.apache.lucene.codecs.lucene70.Lucene70DocValuesFormat}} in the file. Was that the right way to do it? > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe,
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710725#comment-16710725 ] Adrien Grand commented on LUCENE-8585: -- bq. Thank you for your suggestions, Adrien Grand. I am a bit surprised that you advocate for having the jump-tables for IndexedDISI and vBPV on heap? Actually I had misunderstood your proposal. I was mistakenly assuming that you were trying to load a long[] like the change your did in LUCENE-7589. Storing offsets at the end of the data and reading them directly from the Directory via a RandomAccessSlice sounds good to me! > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710601#comment-16710601 ] Toke Eskildsen commented on LUCENE-8585: [~jim.ferenczi] you are off the hook :P. I did not plan on working on this, but it seems that it must be done in order to get DV-scaling in some form into Lucene for the release after 7.6. I have currently hacked {{IndexedDISI}} block jump-tables as well as DENSE block ranking into the codec, but as can be seen above, I might have stored it in the wrong slices. I hope to find the time/energy to look at variable bits per value blocks tomorrow, but I kind of marathoned the first parts today, so no promises. As I did not know how to approach this logistically, I just used my own fork. Should anyone be curious, it is on https://github.com/tokee/lucene-solr/tree/lucene8585 - work in progress and all that. Do you suggest that I make a pull-request from that one to the Lucene GitHub repository in order to benefit from its fine tools for reviews etc.? Apart from the ever-present need for reviewing at some point, some kind of performance testing would be great, should anyone feel like it. Unfortunately this is not a light task as an index of non-trivial size needs to be build (or index-upgraded?). > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710332#comment-16710332 ] Jim Ferenczi commented on LUCENE-8585: -- {quote} As this has the potential to be more of a collaborative project, what is the easiest way (ping to Jim Ferenczi)? Sending patches back and forth seems a bit heavy to me, so perhaps a git branch? Should I create one in the Apache git repository or on GitHub? {quote} Thanks for the ping [~toke]. I volunteered to port your patch in case you would/could not work on the index-time solution. If you can think of a way to split the work I can also help but at this point I'd certainly slow things down ;) (I am no expert in this area of the code either). Regarding the back and forth with patches, I think a pull request is appropriate here and should help to iterate ? > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710238#comment-16710238 ] Toke Eskildsen commented on LUCENE-8585: Thank you for your suggestions, [~jpountz]. I am a bit surprised that you advocate for having the jump-tables for {{IndexedDISI}} and {{vBPV}} on heap? They are not that big, but it still means heap + extra startup-time to fetch them!? It also means that it will be done for all fields, regardless of whether they are used or not, if I understand Entry-loading correctly. I have a hard time discerning what the principal difference is between this and search-time build of the jump-tables. The only difference I see is that the index-time version loads the data contiguously instead of having to visit every block, which boils down to segment-open performance. I am not opposed to pre-computing and having them on heap - if nothing else it will make lookups slightly faster - I just can't follow the logic. I tinkered a bit and got the {{IndexedDISI}} block jump-tables to work when stored as I described in the first comment (after the regular blocks, accessed as needed). The scary codec beast is not _that_ scary once you get to know it. It should be easy enough to flush the data to meta instead. As this has the potential to be more of a collaborative project, what is the easiest way (ping to [~jim.ferenczi])? Sending patches back and forth seems a bit heavy to me, so perhaps a git branch? Should I create one in the Apache git repository or on GitHub? > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. --
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709931#comment-16709931 ] Adrien Grand commented on LUCENE-8585: -- bq. It seems that the meta data for the DocValue entries are loaded one-off and that any subsequently needed data is fetched from the data-slices. This is correct. bq. In line with TermsDictEntry, the jump-tables for IndexedDISI-blocks and vBPV can be stored at the end of their respective data-slices, with only the offsets to the jump-tables being stored in meta. I don't think they are comparable: the terms dictionary is kept on disk while jump tables would be fully loaded in memory. Like you said, meta is about information that is loaded one-off, so this is where we should put jump tables. bq. The downside to this solution is that the full jump-tables needs to be kept in memory until the data has been written, for a worst-case temporary overhead of 2MB and 8MB when flushing 2 billion documents with values. Right. I'm a bit less worried about the write-time memory usage than I am about search-time: at write time we only write one field at once. On the other hand at search time we would have jump tables loaded in memory for every field at the same time. Maybe we could explore inlining skip data between DISI blocks similarly to postings alternatively, that would still require memory at index-time because of buffering, but almost nothing at search time. bq. DENSE ranks belongs naturally in their blocks in data +1 > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian
[jira] [Commented] (LUCENE-8585) Create jump-tables for DocValues at index-time
[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709234#comment-16709234 ] Toke Eskildsen commented on LUCENE-8585: Preliminary notes after inspecting the Lucene7DocValues-codec: It seems that the {{meta}} data for the DocValue entries are loaded one-off and that any subsequently needed data is fetched from the {{data}}-slices. In line with {{TermsDictEntry}}, the jump-tables for IndexedDISI-blocks and vBPV can be stored at the end of their respective {{data}}-slices, with only the offsets to the jump-tables being stored in {{meta}}. The downside to this solution is that the full jump-tables needs to be kept in memory until the data has been written, for a worst-case temporary overhead of 2MB and 8MB when flushing 2 billion documents with values. As mentioned in the description, DENSE ranks belongs naturally in their blocks in {{data}}. This makes their representation extremely simple and makes the rank & data be disk cached together. {{meta}} is not involved here. > Create jump-tables for DocValues at index-time > -- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: master (8.0) >Reporter: Toke Eskildsen >Priority: Minor > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org