[jira] [Updated] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-16 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-4620:
---

Attachment: LUCENE-4620.patch

Attached patch:

* Inlines VInt8 encode/decode in relevant encoders/decoders.

* Marks encoders/decoders final.

* Gets rid of the decode() + doDecode() split. It was nice while I wrote it, 
but I figure that this is hot code and every method call counts, especially 
since it is usually called for only a few values.

* Decoders no longer mess w/ bytes.offset (now that the decoding is inlined).

* Removed VInt8 class and test.
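
For readers following the thread, here is a minimal sketch of what an inlined bulk decode loop could look like. It is not the code from the patch: the class name is hypothetical, plain arrays stand in for BytesRef/IntsRef, and the byte layout (7 payload bits per byte, most significant group first, high bit set on every byte except the last) is an assumption. It only illustrates two of the points above: the decoding lives in one loop with no per-value method call, and the caller's offset is never modified.

```java
// Hypothetical sketch, not the code from the patch. Plain arrays stand in
// for Lucene's BytesRef/IntsRef so the example compiles on its own.
final class InlinedVIntDecoder {

  /**
   * Decodes every value stored in buf[offset .. offset+length) and returns
   * them in a new array. Assumed layout: 7 payload bits per byte, most
   * significant group first, high bit set on every byte except the last.
   */
  static int[] decode(byte[] buf, int offset, int length) {
    int[] out = new int[length];  // at most one value per input byte
    int count = 0;
    int value = 0;
    int upto = offset + length;   // exclusive end of the valid region
    // Only a local cursor advances; the caller's offset is left untouched.
    for (int pos = offset; pos < upto; pos++) {
      byte b = buf[pos];
      if (b >= 0) {               // high bit clear: last byte of this value
        out[count++] = (value << 7) | b;
        value = 0;
      } else {                    // high bit set: more bytes follow
        value = (value << 7) | (b & 0x7F);
      }
    }
    return java.util.Arrays.copyOf(out, count);
  }
}
```

The loop keeps its position in a local variable, so there is nothing to restore once decoding finishes.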

Mike, would you like to run luceneutil with this patch?

> Explore IntEncoder/Decoder bulk API
> ---
>
> Key: LUCENE-4620
> URL: https://issues.apache.org/jira/browse/LUCENE-4620
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Shai Erera
> Assignee: Shai Erera
> Fix For: 4.1, 5.0
>
> Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, 
> LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch
>
>
> Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) 
> and decode(int). Originally, we believed that this layer can be useful for 
> other scenarios, but in practice it's used only for writing/reading the 
> category ordinals from payload/DV.
> Therefore, Mike and I would like to explore a bulk API, something like 
> encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder 
> can still be streaming (as we don't know in advance how many ints will be 
> written), dunno. Will figure this out as we go.
> One thing to check is whether the bulk API can work w/ e.g. facet 
> associations, which can write arbitrary byte[], and so may decoding to an 
> IntsRef won't make sense. This too we'll figure out as we go. I don't rule 
> out that associations will use a different bulk API.
> At the end of the day, the requirement is for someone to be able to configure 
> how ordinals are written (i.e. different encoding schemes: VInt, PackedInts 
> etc.) and later read, with as little overhead as possible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-14 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-4620:
---

Attachment: LUCENE-4620.patch

Patch, fixing that bug Shai found.

Performance is better with this specialization:
{noformat}
Task        QPS base  StdDev   QPS comp  StdDev        Pct diff
PKLookup      192.61  (4.5%)     193.06  (4.2%)    0.2% (  -8% -    9%)
LowTerm        15.33  (1.6%)      15.44  (2.5%)    0.7% (  -3% -    4%)
MedTerm         7.60  (0.7%)       7.74  (1.8%)    1.9% (   0% -    4%)
HighTerm        3.85  (0.6%)       3.97  (1.2%)    3.1% (   1% -    4%)
{noformat}

I also tried unrolling the vInt loop, but perf was, strangely, quite a bit 
worse.

> Explore IntEncoder/Decoder bulk API
> ---
>
> Key: LUCENE-4620
> URL: https://issues.apache.org/jira/browse/LUCENE-4620
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Shai Erera
> Assignee: Shai Erera
> Fix For: 4.1, 5.0
>
> Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, 
> LUCENE-4620.patch, LUCENE-4620.patch
>
>



[jira] [Updated] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-14 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-4620:
---

Attachment: LUCENE-4620.patch

Maybe doing bulk-vInt-decode (see patch) will be faster (just make hotspot's 
job easier) ... I'll test.

> Explore IntEncoder/Decoder bulk API
> ---
>
> Key: LUCENE-4620
> URL: https://issues.apache.org/jira/browse/LUCENE-4620
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Shai Erera
> Assignee: Shai Erera
> Fix For: 4.1, 5.0
>
> Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, 
> LUCENE-4620.patch
>
>



[jira] [Updated] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-10 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-4620:
---

Attachment: LUCENE-4620.patch

bq. Can we use Collections.singletonMap when there are no partitions?

Done. Note though that BytesRef cannot be reused in the case of 
PerDimensionIndexingParams (i.e. multiple CLPs). This is not the common case, 
but it's not trivial to specialize it. Maybe as a second iteration. I did put a 
TODO in FacetFields to allow reuse.

bq. why do we have VInt8.bytesNeeded? Who uses that?

Currently no one uses it, but it was there and I thought it was a convenient 
API to keep. Why encode and then see how many bytes were occupied?
Anyway, neither the encoders nor the decoders use it. I have no strong feelings 
about keeping or removing it, so if you feel it should be removed, I can do it.

bq. Hmm, it's a little abusive how VInt8.decode changes the offset of the 
incoming BytesRef

It is, but that's a result of Java's lack of pass-by-reference: decode needs 
to return two values to the caller, the decoded number and how many bytes were 
read.
Notice that in the previous byte[] variant the method took a Position class, 
which is horrible. That's why I documented in decode() that it advances 
bytes.offset, so the caller can restore it at the end. For instance, IntDecoder 
restores the offset to the original one at the end.

On LUCENE-4675 Robert gave me an idea to create a BytesRefIterator, and I 
started to play with it. I.e. it would wrap a BytesRef but add 'pos' and 'upto' 
indexes.
The user can modify 'pos' freely, without touching bytes.offset. That 
introduces an object allocation though, and since I'd want to reuse that object 
wherever possible, I think I'll look at it after finishing this issue, which 
already contains too many changes.

bq. I guess this is why you want an upto

No, I wanted upto because iterating up to bytes.length is incorrect. You need 
to iterate up to offset+length. BytesRefIterator.pos and BytesRefIterator.upto 
solve these cases for me.
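
To make the offset+length point concrete, here is a small sketch of a cursor that carries its own 'pos' and 'upto'. Both classes are hypothetical stand-ins written for this illustration: ByteSlice mimics the shape of a BytesRef (array, offset, length), and SliceCursor only approximates the BytesRefIterator idea described above. Neither is Lucene code.

```java
// Hypothetical stand-ins written for this illustration; not Lucene classes.
final class ByteSlice {
  final byte[] bytes;
  final int offset;  // index of the first valid byte
  final int length;  // number of valid bytes, NOT an end index

  ByteSlice(byte[] bytes, int offset, int length) {
    this.bytes = bytes;
    this.offset = offset;
    this.length = length;
  }
}

final class SliceCursor {
  private final ByteSlice slice;
  int pos;           // next byte to read; the caller may move it freely
  final int upto;    // exclusive end of the valid region: offset + length

  SliceCursor(ByteSlice slice) {
    this.slice = slice;
    this.pos = slice.offset;
    this.upto = slice.offset + slice.length;
  }

  boolean hasNext() {
    return pos < upto;
  }

  byte next() {
    return slice.bytes[pos++];
  }

  /** Sums the valid bytes of a slice without modifying the slice itself. */
  static int sum(ByteSlice s) {
    int total = 0;
    for (SliceCursor c = new SliceCursor(s); c.hasNext(); ) {
      total += c.next();
    }
    return total;
  }
}
```

With a slice of offset 1 and length 3, the valid indexes are 1, 2 and 3. A loop that starts at the offset but stops at the length would visit only indexes 1 and 2, which is the bug 'upto' avoids.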

bq. looks like things got a bit slower (or possibly it's noise)

First, even if it's not noise, the slowdown IMO is worth the code 
simplification. But I do believe that we'll see gains when there are more than 
3 integers to encode/decode.
In fact, the facets test package has an EncodingSpeed class which measures the 
time it takes to encode/decode a large number of integers (a few thousand). 
When I compared the result to 4x (i.e. without the patch), the decode time 
seemed to be ~5x faster.

In this patch I added an Ant task "run-encoding-benchmark" which runs this 
class. Want to give it a try on your beast machine? For 4x, you can just copy 
the target to lucene/facet/build.xml, I believe it will work without issues.

> Explore IntEncoder/Decoder bulk API
> ---
>
> Key: LUCENE-4620
> URL: https://issues.apache.org/jira/browse/LUCENE-4620
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Shai Erera
> Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch
>
>



[jira] [Updated] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-10 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-4620:
---

Attachment: LUCENE-4620.patch

Sorry. Can you try now?

> Explore IntEncoder/Decoder bulk API
> ---
>
> Key: LUCENE-4620
> URL: https://issues.apache.org/jira/browse/LUCENE-4620
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Shai Erera
> Attachments: LUCENE-4620.patch, LUCENE-4620.patch
>
>



[jira] [Updated] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-10 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-4620:
---

Attachment: LUCENE-4620.patch

Patch makes the following changes:

* {{IntEncoder.encode()}} takes an {{IntsRef}} and {{BytesRef}} and encodes the 
integers from {{IntsRef}} to {{BytesRef}}. Similarly, {{IntDecoder.decode()}} 
takes a {{BytesRef}} and {{IntsRef}} and decodes the integers from the byte 
array to the integer array.

* {{CategoryListIterator}} and {{Aggregator}} were changed to do bulk handling 
of category ordinals as well.

* In the process I merged some methods such as {{PayloadIterator.setdoc}} and 
{{PayloadIterator.getPayload}}, as well as {{AssociationsPayloadIterator}}, to 
reduce even further the number of method calls that happen during search.

* Added a test which tests MultiCategoryListIterator (we didn't have one!) and 
improved EncodingTest to test a large number of random values.

All tests pass, and 'ant javadocs' passes too.
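
As a rough illustration of the shape of such a bulk encode call, here is a self-contained sketch. It is not the encoder from the patch: IntsSlice and BytesBuf are simplified, hypothetical stand-ins for IntsRef and BytesRef, and the variable-length layout (7 payload bits per byte, most significant group first, high bit set on every byte except the last) is an assumption chosen for the example.

```java
// Hypothetical sketch of a bulk encode call. IntsSlice and BytesBuf are
// simplified stand-ins, not Lucene's IntsRef/BytesRef.
final class IntsSlice {
  final int[] ints;
  final int offset;
  final int length;

  IntsSlice(int[] ints, int offset, int length) {
    this.ints = ints;
    this.offset = offset;
    this.length = length;
  }
}

final class BytesBuf {
  byte[] bytes = new byte[8];
  int offset;
  int length;
}

final class BulkVIntEncoder {

  /** Encodes all values of the slice into out, replacing its previous content. */
  static void encode(IntsSlice values, BytesBuf out) {
    out.offset = 0;
    out.length = 0;
    int upto = values.offset + values.length;
    for (int i = values.offset; i < upto; i++) {
      int v = values.ints[i];
      // A 32-bit int needs at most 5 bytes at 7 bits each; grow if needed.
      if (out.length + 5 > out.bytes.length) {
        out.bytes = java.util.Arrays.copyOf(out.bytes, out.bytes.length * 2);
      }
      // Skip leading all-zero 7-bit groups.
      int shift = 28;
      while (shift > 0 && (v >>> shift) == 0) {
        shift -= 7;
      }
      // Emit the higher groups with the continuation bit set.
      for (; shift > 0; shift -= 7) {
        out.bytes[out.length++] = (byte) (0x80 | ((v >>> shift) & 0x7F));
      }
      // The last byte carries the lowest 7 bits with the high bit clear.
      out.bytes[out.length++] = (byte) (v & 0x7F);
    }
  }
}
```

The caller hands over the whole batch in one call, so the per-value work is a plain loop body and the output buffer is reused between calls.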

> Explore IntEncoder/Decoder bulk API
> ---
>
> Key: LUCENE-4620
> URL: https://issues.apache.org/jira/browse/LUCENE-4620
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Shai Erera
> Attachments: LUCENE-4620.patch
>
>