[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13610420#comment-13610420 ]

Commit Tag Bot commented on LUCENE-4620:

[branch_4x commit] Shai Erera
http://svn.apache.org/viewvc?view=revision&revision=1432034
LUCENE-4620: IntEncoder/Decoder bulk API

> Explore IntEncoder/Decoder bulk API
> -----------------------------------
>
> Key: LUCENE-4620
> URL: https://issues.apache.org/jira/browse/LUCENE-4620
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Shai Erera
> Assignee: Shai Erera
> Fix For: 4.1, 5.0
>
> Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch,
> LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch
>
> Today, IntEncoder/Decoder offer a streaming API, where you can encode(int)
> and decode(int). Originally, we believed that this layer could be useful for
> other scenarios, but in practice it is used only for writing/reading the
> category ordinals from payload/DV.
> Therefore, Mike and I would like to explore a bulk API, something like
> encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder
> can still be streaming (as we don't know in advance how many ints will be
> written); we'll figure this out as we go.
> One thing to check is whether the bulk API can work with e.g. facet
> associations, which can write arbitrary byte[], so decoding to an IntsRef
> may not make sense. This too we'll figure out as we go. I don't rule out
> that associations will use a different bulk API.
> At the end of the day, the requirement is for someone to be able to
> configure how ordinals are written (i.e. different encoding schemes: VInt,
> PackedInts, etc.) and later read, with as little overhead as possible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
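To make the proposed shape concrete, here is a hedged sketch of what a bulk encode/decode pair might look like, with plain arrays standing in for Lucene's IntsRef/BytesRef. The class name (VIntBulk) and the exact byte layout are assumptions for illustration, not the actual patch:

```java
// Hypothetical sketch (names and byte layout assumed, not the actual patch):
// a bulk vint codec in the spirit of encode(IntsRef, BytesRef) /
// decode(BytesRef, IntsRef), with plain arrays standing in for the Ref types.
class VIntBulk {

  // Encode all values in one call instead of streaming encode(int).
  static byte[] encode(int[] values) {
    byte[] buf = new byte[values.length * 5];     // worst case: 5 bytes per int
    int upto = 0;
    for (int v : values) {
      while ((v & ~0x7F) != 0) {                  // 7 data bits per byte,
        buf[upto++] = (byte) ((v & 0x7F) | 0x80); // high bit = more bytes
        v >>>= 7;
      }
      buf[upto++] = (byte) v;
    }
    return java.util.Arrays.copyOf(buf, upto);
  }

  // Decode the whole byte[] back into an int[] in one call.
  static int[] decode(byte[] bytes) {
    int[] out = new int[bytes.length];            // a value occupies >= 1 byte
    int numValues = 0, pos = 0;
    while (pos < bytes.length) {
      int value = 0, shift = 0;
      byte b;
      do {
        b = bytes[pos++];
        value |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      out[numValues++] = value;
    }
    return java.util.Arrays.copyOf(out, numValues);
  }
}
```

The point of the bulk shape is that all the per-value state (position, shift) lives in locals inside one call, rather than in fields touched by repeated streaming calls.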
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555025#comment-13555025 ]

Commit Tag Bot commented on LUCENE-4620:

[branch_4x commit] Shai Erera
http://svn.apache.org/viewvc?view=revision&revision=1433929
LUCENE-4620: inline encoding/decoding
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555024#comment-13555024 ]

Commit Tag Bot commented on LUCENE-4620:

[trunk commit] Shai Erera
http://svn.apache.org/viewvc?view=revision&revision=1433926
LUCENE-4620: inline encoding/decoding
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555023#comment-13555023 ]

Shai Erera commented on LUCENE-4620:

Could be that DV helps some too. Also, not calling decode() + reset() + doDecode() every time must help some. Committed the changes to trunk, 4x and the 4.1 branch.
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555016#comment-13555016 ]

Michael McCandless commented on LUCENE-4620:

+1

It's much faster than I had tested before (maybe because of the DV cutover!?):

{noformat}
Task       QPS base  StdDev  QPS comp  StdDev  Pct diff
PKLookup     181.98  (1.2%)    182.20  (1.3%)     0.1% (  -2% -   2%)
LowTerm       77.95  (2.0%)     83.59  (2.8%)     7.2% (   2% -  12%)
MedTerm       26.60  (3.3%)     31.46  (1.4%)    18.3% (  13% -  23%)
HighTerm      15.83  (3.9%)     19.35  (1.3%)    22.2% (  16% -  28%)
{noformat}
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552978#comment-13552978 ]

Michael McCandless commented on LUCENE-4620:

I think we should just make a specialized accumulator/aggregator for the counts-only-dgap-vint case: that code wouldn't need to populate an IntsRef and then make a 2nd pass over the ords ... it'd just increment the count for each ord as it decodes. In previous issues I already tested that this gives a good gain ...
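For illustration, the specialized aggregator Mike describes could look roughly like this: undo the dgap and bump counts[ord] while decoding, with no intermediate IntsRef and no second pass. All names here are hypothetical, not the eventual Lucene code:

```java
// Hypothetical sketch of a specialized counts-only aggregator for the
// dgap+vint case: decode each ordinal and increment its count immediately,
// skipping the IntsRef population and the second pass over the ords.
class DGapVIntCounter {

  // bytes[0..length): dgap+vint encoded ordinals of one document.
  // counts: global per-ordinal counters, indexed by ordinal.
  static void countOrds(byte[] bytes, int length, int[] counts) {
    int ord = 0;                    // running sum undoes the delta encoding
    int pos = 0;
    while (pos < length) {
      int value = 0, shift = 0;
      byte b;
      do {                          // plain vint: 7 data bits per byte
        b = bytes[pos++];
        value |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      ord += value;                 // gap -> absolute ordinal
      counts[ord]++;                // aggregate in place, no 2nd pass
    }
  }
}
```

The design point is fusion: the decode loop and the counting loop become one loop, so the decoded ordinal never needs to be materialized into an array at all.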
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552668#comment-13552668 ]

Shai Erera commented on LUCENE-4620:

I see. I have two comments about the patch. This part is wrong:

{code}
+    int needed = upto - buf.offset;
+    if (values.length < needed) {
+      values.grow(needed);
+    }
{code}

It should be:

{code}
+    if (values.ints.length < buf.length) {
+      values.grow(buf.length);
+    }
{code}

Does it even run for you? Because {{values.length = 0}} at start.

Also, note how this way you check offset < upto on every byte read, while in the current code it's checked only once per integer read. Maybe you could do a while loop inside the loop, something like {{while (b < 0)}}.
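A sketch of the loop structure this review suggests: the offset < upto bound is checked once per integer, an inner while (b < 0) loop consumes each value's continuation bytes, and the output is sized up front from the byte length (a vint value occupies at least one byte). The msb-first byte order here is an assumption for illustration:

```java
// Hypothetical sketch of the suggested decode loop (class name and byte
// order assumed): one outer bound check per value, inner while (b < 0)
// for continuation bytes, and a single upfront sizing of the output.
class VInt8BulkDecode {
  // Decode vint values from buf[offset .. offset+length).
  static int[] decode(byte[] buf, int offset, int length) {
    int[] values = new int[length];   // sized once: >= number of values
    int upto = offset + length;
    int numValues = 0;
    while (offset < upto) {           // checked once per integer, not per byte
      byte b = buf[offset++];
      int value = b & 0x7F;
      while (b < 0) {                 // high bit set => more bytes follow
        b = buf[offset++];
        value = (value << 7) | (b & 0x7F);
      }
      values[numValues++] = value;
    }
    return java.util.Arrays.copyOf(values, numValues);
  }
}
```

The win is branch count: single-byte values (the common case for dgap'd ordinals) skip the inner loop entirely, paying only the one outer bound check.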
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552612#comment-13552612 ]

Shai Erera commented on LUCENE-4620:

I made this change to VInt8IntDecoder instead of checking inside the loop:

{code}
int numValues = buf.length; // a value occupies at least 1 byte
if (values.ints.length < numValues) {
  values.grow(numValues);
}
{code}

Ran EncodingSpeed again and compared the results. On average (4 datasets), VInt8 achieves a 0.69% speedup, DGap(VInt) 7.85% and Sorting(Unique(DGap(VInt))) 10.16%. The last one is the default Encoder, though its decoder is only DGap(VInt), so I'm not sure why there is a difference between that run and the previous one with 7.85%. However, it does look like it speeds things up ...
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552594#comment-13552594 ]

Shai Erera commented on LUCENE-4620:

I'm baffled too. There is some overhead with the bulk API, in that it needs to {{grow()}} the {{IntsBuffer}} (something it didn't need to do before). But I believe this growing should stabilize after a few docs (i.e. the array becomes large enough). Still, every iteration checks if the array is large enough, so perhaps if we grow the IntsRef upfront (even if by too much), we can remove the 'ifs'. SimpleIntDecoder can do it easily: it knows there are 4 bytes per value, so it should just grow by buf.length / 4. VInt is trickier, but to be on the safe side it can grow by buf.length, since at minimum each value occupies one byte. Some other decoders are trickier still, but they are not in effect in your test above.

But I must admit that I thought it was a no-brainer that replacing an iterator API with a bulk one would improve performance. And indeed, {{EncodingSpeed}} shows nice improvements already. Even if decoding values is not the major part of faceted search (which I doubt), we shouldn't see slowdowns? At most we shouldn't see big wins?
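The fixed-width case mentioned above is easy to sketch: a 4-bytes-per-int decoder can size its output exactly once from buf.length / 4 and drop the per-value bound checks entirely. Class name and big-endian byte order are assumed here for illustration:

```java
// Hypothetical sketch of the upfront-grow idea for a fixed-width decoder:
// exactly 4 big-endian bytes per value, so the output array is sized once
// and the loop body contains no 'if' at all.
class SimpleIntBulkDecode {
  static int[] decode(byte[] buf, int offset, int length) {
    int numValues = length / 4;        // known exactly in advance
    int[] values = new int[numValues]; // single upfront "grow"
    for (int i = 0; i < numValues; i++, offset += 4) {
      values[i] = ((buf[offset] & 0xFF) << 24)
                | ((buf[offset + 1] & 0xFF) << 16)
                | ((buf[offset + 2] & 0xFF) << 8)
                |  (buf[offset + 3] & 0xFF);
    }
    return values;
  }
}
```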
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552573#comment-13552573 ]

Michael McCandless commented on LUCENE-4620:

This change seemed to lose a bit of performance: look at 1/11/2013 on http://people.apache.org/~mikemccand/lucenebench/TermDateFacets.html

But that tests just one dimension (Date), with only 3 ords per doc, so I had assumed there just weren't enough ints being decoded to see the gains from this bulk decoding. So I modified luceneutil to have more facets per doc (avg ~25 ords per doc across 9 dimensions; 2.5M unique ords), and the results are still slower:

{noformat}
Task       QPS base  StdDev  QPS comp  StdDev  Pct diff
HighTerm       3.62  (2.5%)      3.24  (1.0%)   -10.5% ( -13% -  -7%)
MedTerm        7.34  (1.7%)      6.78  (0.9%)    -7.6% ( -10% -  -5%)
LowTerm       14.92  (1.6%)     14.32  (1.2%)    -4.0% (  -6% -  -1%)
PKLookup     181.47  (4.7%)    183.04  (5.3%)     0.9% (  -8% -  11%)
{noformat}

This is baffling ... not sure what's up. I would expect some gains given that the micro-benchmark showed sizable decode improvements. It must somehow be that decode cost is a minor part of facet counting? (Which is not a good sign: it should be a big part of it ...)
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13550863#comment-13550863 ]

Shai Erera commented on LUCENE-4620:

bq. I'll open an issue to take care of FacetFields reusability

Done. Opened LUCENE-4680.
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13550852#comment-13550852 ]

Shai Erera commented on LUCENE-4620:

bq. I think we should remove it

Ok, I will.

bq. It is unfortunate that the common case is often held back by the full flexibility/generality of the facet module

With LUCENE-4647, the common case suffers less from the full generality of the facets module. I'll open an issue to take care of FacetFields reusability, and there I hope to successfully tackle the reusability of BytesRefs for one as well as many CLPs. IMO though, having a single entry point for users to index facets, be it 1 facet per document or 2500 (a real case!), is important. We need to make sure, though, that the 1-facet case incurs the least overhead (e.g. using Collections.singletonMap, or the trick I've done in CountingListBuilder.OrdinalsEncoder (with/out partitions)).
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549975#comment-13549975 ]

Michael McCandless commented on LUCENE-4620:

Trunk:

{noformat}
Estimating ~1 Integers compression time by
Encoding/decoding facets' ID payload of docID = 3630 (unsorted, length of: 2430) 41152 times.

Encoder                                        Bits/Int  Encode Time  Encode Time        Decode Time  Decode Time
                                                         [ms]         [microsec / int]   [ms]         [microsec / int]
-----------------------------------------------------------------------------------------------------------------
VInt8                                           18.4955         4430          44.3003           1162          11.6201
Sorting (Unique (VInt8))                        18.4955         4344          43.4403           1105          11.0501
Sorting (Unique (DGap (VInt8)))                  8.5597         4481          44.8103            842           8.4201
Sorting (Unique (DGap (EightFlags (VInt8))))     4.9679         4636          46.3603           1021          10.2101
Sorting (Unique (DGap (FourFlags (VInt8))))      4.8198         4515          45.1503           1001          10.0101
Sorting (Unique (DGap (NOnes (3) (FourFlags (VInt8)))))  4.5794  4904         49.0403           1056          10.5601
Sorting (Unique (DGap (NOnes (4) (FourFlags (VInt8)))))  4.5794  4751         47.5103           1035          10.3501


Estimating ~1 Integers compression time by
Encoding/decoding facets' ID payload of docID = 9910 (unsorted, length of: 1489) 67159 times.

Encoder                                        Bits/Int  Encode Time  Encode Time        Decode Time  Decode Time
                                                         [ms]         [microsec / int]   [ms]         [microsec / int]
-----------------------------------------------------------------------------------------------------------------
VInt8                                           18.2673         1241          12.4100           1128          11.2800
Sorting (Unique (VInt8))                        18.2673         3488          34.8801            924           9.2400
Sorting (Unique (DGap (VInt8)))                  8.9456         3061          30.6101            660           6.6000
Sorting (Unique (DGap (EightFlags (VInt8))))     5.7542         3693          36.9301           1026          10.2600
Sorting (Unique (DGap (FourFlags (VInt8))))      5.5447         3462          34.6201            811           8.1100
Sorting (Unique (DGap (NOnes (3) (FourFlags (VInt8)))))  5.3566  3846         38.4601           1018          10.1800
Sorting (Unique (DGap (NOnes (4) (FourFlags (VInt8)))))  5.3996  3879         38.7901           1025          10.2500


Estimating ~1 Integers compression time by
Encoding/decoding facets' ID payload of docID = 1 (unsorted, length of: 18) 555 times.

Encoder                                        Bits/Int  Encode Time  Encode Time        Decode Time  Decode Time
                                                         [ms]         [microsec / int]   [ms]         [microsec / int]
-----------------------------------------------------------------------------------------------------------------
VInt8                                           20.8889         1179          11.7900           1114          11.1400
Sorting (Unique (VInt8))                        20.8889         2251          22.5100           1171          11.7100
Sorting (Unique (DGap (VInt8)))                 12.              2174         21.7400            848           8.4800
Sorting (Unique (DGap (EightFlags (VInt8))))    10.             2372         23.7200            1092          1
{noformat}
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549961#comment-13549961 ] Michael McCandless commented on LUCENE-4620: {quote} bq. Can we use Collections.singletonMap when there are no partitions? Done. Note though that BytesRef cannot be reused in the case of PerDimensionIndexingParams (i.e. multiple CLPs). This is not the common case, but it's not trivial to specialize it. Maybe as a second iteration. I did put a TODO in FacetFields to allow reuse. {quote} Well, we'd somehow need N BytesRefs to reuse (one per CLP) ... but I don't think we should worry about that now. It is unfortunate that the common case is often held back by the full flexibility/generality of the facet module ... sometimes I think we need a facet-light module. But maybe if we can get the specialization done we don't need facet-light ... {quote} bq. why do we have VInt8.bytesNeeded? Who uses that? Currently no one uses it, but it was there and I thought that it's a convenient API to keep. Why encode and then see how many bytes were occupied? Anyway, neither the encoders nor the decoders use it. I have no strong feelings for keeping/removing it, so if you feel like it should be removed, I can do it. {quote} I think we should remove it: it's a dangerous API because it can encourage consumers to do things like call bytesNeeded first (to know how much to grow their buffer, say) followed by encoding. The slow part of vInt encoding is all those ifs ... {quote} bq. Hmm, it's a little abusive how VInt8.decode changes the offset of the incoming BytesRef It is, but that's the result of Java's lack of pass by reference. I.e., decode needs to return the caller two values: the decoded number and how many bytes were read. Notice that in the previous byte[] variant, the method took a class Position, which is horrible. 
That's why I documented in decode() that it advances bytes.offset, so the caller can restore it at the end. For instance, IntDecoder restores the offset to the original one at the end. On LUCENE-4675 Robert gave me an idea to create a BytesRefIterator, and I started to play with it. I.e. it would wrap a BytesRef but add 'pos' and 'upto' indexes. The user can modify 'pos' freely, without touching bytes.offset. That introduces an object allocation though, and since I'd want to reuse that object wherever possible, I think I'll look at it after finishing this issue. It already contains too many changes. {quote} OK. {quote} bq. I guess this is why you want an upto No, I wanted upto because iterating up to bytes.length is incorrect. You need to iterate up to offset+length. BytesRefIterator.pos and BytesRefIterator.upto solve these cases for me. {quote} OK. {quote} bq. looks like things got a bit slower (or possibly it's noise) First, even if it's not noise, the slowdown IMO is worth the code simplification. {quote} +1 {quote} But, I do believe that we'll see gains when there are more than 3 integers to encode/decode. In fact, the facets test package has an EncodingSpeed class which measures the time it takes to encode/decode a large number of integers (a few thousand). When I compared the result to 4x (i.e. without the patch), the decode time seemed to be ~5x faster. {quote} Good! Would be nice to have a real-world biggish-number-of-facets benchmark ... I'll ponder how to do that w/ luceneutil. bq. In this patch I added an Ant task "run-encoding-benchmark" which runs this class. Want to give it a try on your beast machine? For 4x, you can just copy the target to lucene/facet/build.xml, I believe it will work without issues. OK I'll run it! 
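The offset-advancing pattern discussed above can be sketched as plain Java. The `Bytes` holder and `decode` method below are simplified stand-ins for illustration, not Lucene's actual BytesRef/VInt8 API:

```java
// Simplified sketch of a decode() that returns the value and, as a side
// effect, advances the holder's offset to report how many bytes it read.
// `Bytes` stands in for Lucene's BytesRef; none of this is the actual API.
public class VIntDecodeSketch {

  public static final class Bytes {
    public byte[] bytes;
    public int offset;
    public int length;
    public Bytes(byte[] b) { bytes = b; offset = 0; length = b.length; }
  }

  // Reads one VInt (most-significant 7-bit group first, high bit set on
  // all but the last byte) starting at in.offset, advancing the offset.
  public static int decode(Bytes in) {
    int value = 0;
    while (true) {
      byte b = in.bytes[in.offset++];
      if (b >= 0) {                      // high bit clear: last byte
        return (value << 7) | b;
      }
      value = (value << 7) | (b & 0x7F);
    }
  }

  public static void main(String[] args) {
    Bytes in = new Bytes(new byte[] { (byte) 0x82, 0x2C }); // 300 encoded
    int saved = in.offset;   // caller saves the offset ...
    int v = decode(in);      // ... decode advances it ...
    int read = in.offset - saved;
    in.offset = saved;       // ... and the caller restores it at the end
    System.out.println(v + " (" + read + " bytes)");
  }
}
```

The save/restore dance in `main` is exactly the burden the proposed BytesRefIterator's separate 'pos' index would remove.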
> Explore IntEncoder/Decoder bulk API > --- > > Key: LUCENE-4620 > URL: https://issues.apache.org/jira/browse/LUCENE-4620 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet > Reporter: Shai Erera > Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch > > > Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) > and decode(int). Originally, we believed that this layer can be useful for > other scenarios, but in practice it's used only for writing/reading the > category ordinals from payload/DV. > Therefore, Mike and I would like to explore a bulk API, something like > encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder > can still be streaming (as we don't know in advance how many ints will be > written), dunno. Will figure this out as we go. > One thing to check is whether the bulk API can work w/ e.g. facet > associations, which can write arbitrary byte[], and so maybe decoding to an > IntsRef won't make sense. This too we'll figure out as we go. I don't rule > out that associations will use a different bulk API. > At the end of the day, the requirement is for someone to be able to configure > how ordinals are written (i.e. different encoding schemes: VInt, PackedInts > etc.) and later read, with as little overhead as possible.
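The "all those ifs" in the hot path of vInt encoding, mentioned in the comment above, look roughly like the following. This is a hedged sketch of the general VInt technique, not Lucene's exact VInt8 code:

```java
public class VIntEncodeSketch {

  // Writes one int as a VInt: most-significant 7-bit group first, high
  // bit set on every byte except the last. Returns the number of bytes
  // written. The nested ifs are the branchy part referred to above --
  // one comparison per possible output length.
  public static int encode(int value, byte[] out, int offset) {
    int start = offset;
    if ((value & ~0x7F) != 0) {
      if ((value & ~0x3FFF) != 0) {
        if ((value & ~0x1FFFFF) != 0) {
          if ((value & ~0xFFFFFFF) != 0) {
            out[offset++] = (byte) (0x80 | ((value >>> 28) & 0x7F));
          }
          out[offset++] = (byte) (0x80 | ((value >>> 21) & 0x7F));
        }
        out[offset++] = (byte) (0x80 | ((value >>> 14) & 0x7F));
      }
      out[offset++] = (byte) (0x80 | ((value >>> 7) & 0x7F));
    }
    out[offset++] = (byte) (value & 0x7F);
    return offset - start;
  }

  public static void main(String[] args) {
    byte[] buf = new byte[5];
    System.out.println("300 -> " + encode(300, buf, 0) + " bytes");
  }
}
```

This also shows why a separate bytesNeeded() would duplicate the expensive part: computing the length requires the same chain of comparisons as encoding itself.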
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549759#comment-13549759 ] Michael McCandless commented on LUCENE-4620: Thanks Shai, that new patch worked! This patch looks great! It's a little disturbing that every doc must make a new HashMap at indexing time (seems like a lot of overhead/objects when the common case just needs to return a single BytesRef, which could be re-used). Can we use Collections.singletonMap when there are no partitions? The decode API (more important than encode) looks like it reuses the Bytes/IntsRef, so that's good. Hmm, why do we have VInt8.bytesNeeded? Who uses that? I think that's a dangerous API to have; it's better to simply encode and then see how many bytes it took. Hmm, it's a little abusive how VInt8.decode changes the offset of the incoming BytesRef ... I guess this is why you want an upto :) Net/net this is great progress over what we have today, so +1! I ran a quick 10M English Wikipedia test w/ just term queries:
{noformat}
Task      QPS base  StdDev  QPS comp  StdDev  Pct diff
HighTerm     12.79  (2.4%)     12.56  (1.2%)     -1.8% (-5% - 1%)
MedTerm      18.04  (1.8%)     17.77  (0.8%)     -1.5% (-4% - 1%)
LowTerm      47.69  (1.1%)     47.56  (1.0%)     -0.3% (-2% - 1%)
{noformat}
The test only has 3 ords per doc so it's not "typical" ... looks like things got a bit slower (or possibly it's noise).
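The Collections.singletonMap suggestion above (avoiding a fresh HashMap per document when there is only one category list and no partitions) can be sketched like this; the field name and byte[] payload type are illustrative stand-ins, not the facet module's actual signatures:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch of building the per-document "field -> encoded ordinals" map.
// In the common single-category-list case a singleton map avoids
// allocating a HashMap per indexed document; only the multi-list /
// partitioned case pays for a full HashMap.
public class PerDocFieldsSketch {

  public static Map<String, byte[]> encodedPerField(
      String field, byte[] payload, Map<String, byte[]> extraFields) {
    if (extraFields == null || extraFields.isEmpty()) {
      return Collections.singletonMap(field, payload);  // common case
    }
    Map<String, byte[]> m = new HashMap<>(extraFields); // rare case
    m.put(field, payload);
    return m;
  }

  public static void main(String[] args) {
    Map<String, byte[]> m =
        encodedPerField("$facets", new byte[] {1, 2, 3}, null);
    System.out.println(m.size() + " field(s)");
  }
}
```

Collections.singletonMap returns an immutable one-entry map with no backing table, so the only remaining per-document allocation in the common case is the map wrapper itself.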
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549568#comment-13549568 ] Michael McCandless commented on LUCENE-4620: Looks like there were some svn mv's, so the patch doesn't directly apply ... Can you regenerate the patch using 'svn diff --show-copies-as-adds' (assuming you're using svn 1.7+)? Either that or use dev-tools/scripts/diffSources.py ... thanks.
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13529966#comment-13529966 ] Shai Erera commented on LUCENE-4620: Also, today there are a few IntEncoders which are used during indexing only, e.g. SortingIntEncoder and UniqueIntEncoder, which together guarantee that each ordinal is written just once to the payload and that the ordinals are sorted, so that DGap can be computed afterwards. These do not have a matching Decoder, and they shouldn't have one, because at search time you don't care whether the ords are sorted or not, and you can assume they are unique. Another thing that I think we should do is move those encoders into the *.facet package. They are currently under the facet module, but in o.a.l.util, b/c again we thought at the time that they are a generic piece of code for encoding/decoding integers. Lucene has PackedInts and DataInput/Output for doing block and VInt encodings. Users can write Codecs for other encoding algorithms ... IntEncoder/Decoder are not that generic :).
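The index-time-only Sorting/Unique/DGap chain described above, and its one-sided search-time inverse, can be sketched as plain Java; the method names are illustrative, not the actual encoder classes:

```java
import java.util.Arrays;

// Sketch of the index-time transform: sort the ordinals, drop duplicates,
// then store deltas (DGap) so a following VInt step sees small numbers.
// The search-time side needs only the prefix sum below -- there is no
// "un-sort" or "re-duplicate" decoder, and none is needed.
public class OrdinalPipelineSketch {

  public static int[] sortUniqueDGap(int[] ords) {
    int[] sorted = ords.clone();
    Arrays.sort(sorted);
    int[] gaps = new int[sorted.length];
    int upto = 0, prev = 0;
    for (int ord : sorted) {
      if (upto > 0 && ord == prev) continue;  // Unique: skip duplicates
      gaps[upto++] = ord - prev;              // DGap: delta from previous
      prev = ord;
    }
    return Arrays.copyOf(gaps, upto);
  }

  // Search-time inverse of DGap: a running prefix sum over the gaps.
  public static int[] fromDGap(int[] gaps) {
    int[] ords = new int[gaps.length];
    int prev = 0;
    for (int i = 0; i < gaps.length; i++) {
      prev += gaps[i];
      ords[i] = prev;
    }
    return ords;
  }

  public static void main(String[] args) {
    // {9, 3, 7, 3} -> sorted unique {3, 7, 9} -> gaps {3, 4, 2}
    System.out.println(Arrays.toString(sortUniqueDGap(new int[] {9, 3, 7, 3})));
  }
}
```

The asymmetry is the point made above: sorting and de-duplicating are invariants the decoder may simply assume, which is also why the bits/int numbers in the benchmark drop once DGap enters the chain.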