[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-03-22 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13610420#comment-13610420
 ] 

Commit Tag Bot commented on LUCENE-4620:


[branch_4x commit] Shai Erera
http://svn.apache.org/viewvc?view=revision&revision=1432034

LUCENE-4620: IntEncoder/Decoder bulk API


> Explore IntEncoder/Decoder bulk API
> ---
>
> Key: LUCENE-4620
> URL: https://issues.apache.org/jira/browse/LUCENE-4620
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Shai Erera
>Assignee: Shai Erera
> Fix For: 4.1, 5.0
>
> Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, 
> LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch
>
>
> Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) 
> and decode(int). Originally, we believed that this layer can be useful for 
> other scenarios, but in practice it's used only for writing/reading the 
> category ordinals from payload/DV.
> Therefore, Mike and I would like to explore a bulk API, something like 
> encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder 
> can still be streaming (as we don't know in advance how many ints will be 
> written), dunno. Will figure this out as we go.
> One thing to check is whether the bulk API can work w/ e.g. facet 
> associations, which can write arbitrary byte[], and so decoding to an 
> IntsRef may not make sense. This too we'll figure out as we go. I don't rule 
> out that associations will use a different bulk API.
> At the end of the day, the requirement is for someone to be able to configure 
> how ordinals are written (i.e. different encoding schemes: VInt, PackedInts 
> etc.) and later read, with as little overhead as possible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-16 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555025#comment-13555025
 ] 

Commit Tag Bot commented on LUCENE-4620:


[branch_4x commit] Shai Erera
http://svn.apache.org/viewvc?view=revision&revision=1433929

LUCENE-4620: inline encoding/decoding





[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-16 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555024#comment-13555024
 ] 

Commit Tag Bot commented on LUCENE-4620:


[trunk commit] Shai Erera
http://svn.apache.org/viewvc?view=revision&revision=1433926

LUCENE-4620: inline encoding/decoding





[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-16 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555023#comment-13555023
 ] 

Shai Erera commented on LUCENE-4620:


It could be that DV helps some too. Not calling decode() + reset() + doDecode() 
every time must help as well.

Committed the changes to trunk, 4x and 4.1 branch.




[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555016#comment-13555016
 ] 

Michael McCandless commented on LUCENE-4620:


+1

It's much faster than I had tested before (maybe because of the DV cutover!?):

{noformat}
    Task    QPS base  StdDev    QPS comp  StdDev        Pct diff
PKLookup      181.98  (1.2%)      182.20  (1.3%)    0.1% (  -2% -    2%)
 LowTerm       77.95  (2.0%)       83.59  (2.8%)    7.2% (   2% -   12%)
 MedTerm       26.60  (3.3%)       31.46  (1.4%)   18.3% (  13% -   23%)
HighTerm       15.83  (3.9%)       19.35  (1.3%)   22.2% (  16% -   28%)
{noformat}




[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552978#comment-13552978
 ] 

Michael McCandless commented on LUCENE-4620:


I think we should just make a specialized accumulator/aggregator for the 
counts-only-dgap-vint case: that code wouldn't need to populate an IntsRef and 
then make a 2nd pass over the ords ... it'd just increment the count for each 
ord as it decodes it.  In previous issues I already tested that this gives a 
good gain ...
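
A minimal sketch of the single-pass idea described above, not the code from any patch on this issue. It uses plain arrays in place of BytesRef/IntsRef, the class and method names are hypothetical, and the byte format is an assumption made for illustration: each gap is written in 7-bit groups, most significant group first, with the high bit set on every byte except the last.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch: counts ordinals while decoding d-gap + VInt bytes. */
final class DGapVIntCounter {

  /** Writes one non-negative value in the assumed VInt layout. */
  static void writeVInt(ByteArrayOutputStream out, int value) {
    int shift = 28;
    // Skip leading 7-bit groups that are all zero.
    while (shift > 0 && (value >>> shift) == 0) {
      shift -= 7;
    }
    // Continuation bytes carry the high bit.
    while (shift > 0) {
      out.write(0x80 | ((value >>> shift) & 0x7F));
      shift -= 7;
    }
    // The last byte has the high bit clear.
    out.write(value & 0x7F);
  }

  /** Encodes sorted, unique ordinals as gaps from the previous ordinal. */
  static byte[] encode(int[] sortedOrds) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int ord : sortedOrds) {
      writeVInt(out, ord - prev);
      prev = ord;
    }
    return out.toByteArray();
  }

  /** Decodes and counts in one pass; no intermediate int buffer is filled. */
  static void count(byte[] buf, int offset, int length, int[] counts) {
    int upto = offset + length;
    int ord = 0;   // running ordinal, rebuilt from the gaps
    int value = 0; // gap currently being assembled
    while (offset < upto) {
      byte b = buf[offset++];
      if (b >= 0) {
        ord += (value << 7) | b;
        counts[ord]++;
        value = 0;
      } else {
        value = (value << 7) | (b & 0x7F);
      }
    }
  }
}
```

Calling count() once per matching document with that document's encoded bytes accumulates the per-ordinal counts directly in the counts array.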





[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552668#comment-13552668
 ] 

Shai Erera commented on LUCENE-4620:


I see. I have two comments about the patch. This part is wrong:

{code}
+int needed = upto - buf.offset;
+if (values.length < needed) {
+  values.grow(needed);
+}
{code}

should be

{code}
+if (values.ints.length < buf.length) {
+  values.grow(buf.length);
+}
{code}

Does it even run for you? Because {{values.length = 0}} at the start.

Also, note how this way you check offset < upto on every byte read, while in 
the current code it's checked only once per integer read. Maybe you could do a 
while loop inside the loop, something like {{while (b < 0)}}.




[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552612#comment-13552612
 ] 

Shai Erera commented on LUCENE-4620:


I made this change to VInt8IntDecoder instead of checking inside the loop:

{code}
int numValues = buf.length; // a value occupies at least 1 byte
if (values.ints.length < numValues) {
  values.grow(numValues);
}
{code}

Ran EncodingSpeed again and compared the results. On average (4 datasets), 
VInt8 achieves a 0.69% speedup, DGap(VInt) 7.85% and 
Sorting(Unique(DGap(VInt))) 10.16%. The last one is the default Encoder, 
though its decoder is only DGap(VInt), so I'm not sure why its result differs 
from the 7.85% measured for DGap(VInt).

However, it does look like it speeds things up...
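
For readers unfamiliar with the encoder names, here is an illustrative sketch of why a Sorting(Unique(DGap(...))) encoder can pair with a decoder that only undoes the gap step: sorting and de-duplication change the input once at encode time, and a running sum is enough to restore the sorted, unique ordinals. The class and method names are hypothetical and plain int arrays stand in for IntsRef; the VInt byte step is left out.

```java
import java.util.Arrays;

/** Hypothetical sketch of the sort/unique/gap pipeline and its decode side. */
final class DGapSketch {

  /** Encode side: sort, drop duplicates, then store gaps between neighbours. */
  static int[] sortUniqueGaps(int[] ords) {
    int[] sorted = Arrays.stream(ords).sorted().distinct().toArray();
    int prev = 0;
    for (int i = 0; i < sorted.length; i++) {
      int cur = sorted[i];
      sorted[i] = cur - prev; // gap from the previous ordinal
      prev = cur;
    }
    return sorted;
  }

  /** Decode side: a running sum restores the ordinals; no sort or dedupe. */
  static int[] fromGaps(int[] gaps) {
    int[] ords = new int[gaps.length];
    int prev = 0;
    for (int i = 0; i < gaps.length; i++) {
      prev += gaps[i];
      ords[i] = prev;
    }
    return ords;
  }
}
```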




[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552594#comment-13552594
 ] 

Shai Erera commented on LUCENE-4620:


I'm baffled too. There is some overhead with the bulk API, in that it needs to 
{{grow()}} the {{IntsRef}} (something it didn't need to do before). But I 
believe that this growing should stabilize after a few docs (i.e. once the 
array becomes large enough). Still, every iteration checks whether the array is 
large enough, so perhaps if we grow the IntsRef upfront (even if by too much), 
we can remove the 'ifs'.

SimpleIntDecoder can do it easily: it knows there are 4 bytes per value, so it 
should just grow to buf.length / 4. VInt is trickier, but to be on the safe 
side it can grow to buf.length, since each value occupies at least one byte. 
Some other decoders are trickier still, but they are not in effect in your test 
above.

But I must admit that I thought it was a no-brainer that replacing an iterator 
API with a bulk one would improve performance. And indeed, {{EncodingSpeed}} 
already shows nice improvements. And even if decoding values is not the major 
part of faceted search (which I doubt), we shouldn't see slowdowns? At worst, 
we just shouldn't see big wins?
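
As an illustration of the fixed-width case, the sketch below sizes the output once from the input length and then decodes with no capacity check inside the loop. It is a hypothetical stand-alone class over plain arrays, not the SimpleIntDecoder source, and the big-endian byte order is an assumption.

```java
/** Hypothetical sketch: fixed 4-bytes-per-value decode with upfront sizing. */
final class FixedWidthDecodeSketch {

  /** Decodes buf[offset, offset + length) as big-endian 32-bit ints. */
  static int[] decode(byte[] buf, int offset, int length) {
    // Exactly 4 bytes per value, so the output size is known in advance.
    int[] values = new int[length / 4];
    for (int i = 0; i < values.length; i++) {
      values[i] = ((buf[offset] & 0xFF) << 24)
          | ((buf[offset + 1] & 0xFF) << 16)
          | ((buf[offset + 2] & 0xFF) << 8)
          | (buf[offset + 3] & 0xFF);
      offset += 4;
    }
    return values;
  }
}
```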




[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552573#comment-13552573
 ] 

Michael McCandless commented on LUCENE-4620:


This change seemed to lose a bit of performance: look at 1/11/2013 on 
http://people.apache.org/~mikemccand/lucenebench/TermDateFacets.html

But, that tests just one dimension (Date), with only 3 ords per doc,
so I had assumed that this just wasn't enough ints being decoded to
see the gains from this bulk decoding.

So, I modified luceneutil to have more facets per doc (avg ~25 ords
per doc across 9 dimensions; 2.5M unique ords), and the results are
still slower:

{noformat}
    Task    QPS base  StdDev    QPS comp  StdDev        Pct diff
HighTerm        3.62  (2.5%)        3.24  (1.0%)  -10.5% ( -13% -   -7%)
 MedTerm        7.34  (1.7%)        6.78  (0.9%)   -7.6% ( -10% -   -5%)
 LowTerm       14.92  (1.6%)       14.32  (1.2%)   -4.0% (  -6% -   -1%)
PKLookup      181.47  (4.7%)      183.04  (5.3%)    0.9% (  -8% -   11%)
{noformat}

This is baffling ... not sure what's up.  I would expect some gains
given that the micro-benchmark showed sizable decode improvements.  It
must somehow be that decode cost is a minor part of facet counting?
(which is not a good sign!: it should be a big part of it...)





[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-10 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13550863#comment-13550863
 ] 

Shai Erera commented on LUCENE-4620:


bq. I'll open an issue to take care of FacetFields reusability

Done. Opened LUCENE-4680.




[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-10 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13550852#comment-13550852
 ] 

Shai Erera commented on LUCENE-4620:


bq. I think we should remove it

Ok I will.

bq. It is unfortunate that the common case is often held back by the full 
flexibility/generality of the facet module

With LUCENE-4647, the common case suffers less from the full generality of the 
facets module. I'll open an issue to take care of FacetFields reusability and 
there I hope I'll be able to tackle successfully the reusability of BytesRefs 
for one as well as many CLPs.

IMO though, having a single entry point for users to index facets, be it 1 
facet per document, or 2500 (a real case!), is important. We need to make sure 
though that the 1 facet case incurs the least overhead (e.g. using 
Collections.singletonMap, or the trick I've done in 
CountingListBuilder.OrdinalsEncoder (with/out partitions)).




[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549975#comment-13549975
 ] 

Michael McCandless commented on LUCENE-4620:


Trunk:
{noformat}
 [java] Estimating ~1 Integers compression time by
 [java] Encoding/decoding facets' ID payload of docID = 3630 (unsorted, length of: 2430) 41152 times.
 [java] 
 [java] Encoder                                                  Bits/Int  Encode Time     Encode Time          Decode Time     Decode Time
 [java]                                                                    [milliseconds]  [microsecond / int]  [milliseconds]  [microsecond / int]
 [java] -------------------------------------------------------------------------------------------------------------------------------------------
 [java] VInt8                                                    18.4955   4430            44.3003              1162            11.6201
 [java] Sorting (Unique (VInt8))                                 18.4955   4344            43.4403              1105            11.0501
 [java] Sorting (Unique (DGap (VInt8)))                          8.5597    4481            44.8103              842             8.4201
 [java] Sorting (Unique (DGap (EightFlags (VInt8))))             4.9679    4636            46.3603              1021            10.2101
 [java] Sorting (Unique (DGap (FourFlags (VInt8))))              4.8198    4515            45.1503              1001            10.0101
 [java] Sorting (Unique (DGap (NOnes (3) (FourFlags (VInt8)))))  4.5794    4904            49.0403              1056            10.5601
 [java] Sorting (Unique (DGap (NOnes (4) (FourFlags (VInt8)))))  4.5794    4751            47.5103              1035            10.3501
 [java] 
 [java] 
 [java] Estimating ~1 Integers compression time by
 [java] Encoding/decoding facets' ID payload of docID = 9910 (unsorted, length of: 1489) 67159 times.
 [java] 
 [java] Encoder                                                  Bits/Int  Encode Time     Encode Time          Decode Time     Decode Time
 [java]                                                                    [milliseconds]  [microsecond / int]  [milliseconds]  [microsecond / int]
 [java] -------------------------------------------------------------------------------------------------------------------------------------------
 [java] VInt8                                                    18.2673   1241            12.4100              1128            11.2800
 [java] Sorting (Unique (VInt8))                                 18.2673   3488            34.8801              924             9.2400
 [java] Sorting (Unique (DGap (VInt8)))                          8.9456    3061            30.6101              660             6.6000
 [java] Sorting (Unique (DGap (EightFlags (VInt8))))             5.7542    3693            36.9301              1026            10.2600
 [java] Sorting (Unique (DGap (FourFlags (VInt8))))              5.5447    3462            34.6201              811             8.1100
 [java] Sorting (Unique (DGap (NOnes (3) (FourFlags (VInt8)))))  5.3566    3846            38.4601              1018            10.1800
 [java] Sorting (Unique (DGap (NOnes (4) (FourFlags (VInt8)))))  5.3996    3879            38.7901              1025            10.2500
 [java] 
 [java] 
 [java] Estimating ~1 Integers compression time by
 [java] Encoding/decoding facets' ID payload of docID = 1 (unsorted, length of: 18) 555 times.
 [java] 
 [java] Encoder                                                  Bits/Int  Encode Time     Encode Time          Decode Time     Decode Time
 [java]                                                                    [milliseconds]  [microsecond / int]  [milliseconds]  [microsecond / int]
 [java] -------------------------------------------------------------------------------------------------------------------------------------------
 [java] VInt8                                                    20.8889   1179            11.7900              1114            11.1400
 [java] Sorting (Unique (VInt8))                                 20.8889   2251            22.5100              1171            11.7100
 [java] Sorting (Unique (DGap (VInt8)))                          12.       2174            21.7400              848             8.4800
 [java] Sorting (Unique (DGap (EightFlags (VInt8))))             10.       2372            23.7200              1092            1

[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549961#comment-13549961
 ] 

Michael McCandless commented on LUCENE-4620:


{quote}
bq. Can we use Collections.singletonMap when there are no partitions?

Done. Note though that BytesRef cannot be reused in the case of 
PerDimensionIndexingParams (i.e. multiple CLPs). This is not the common case, 
but it's not trivial to specialize it. Maybe as a second iteration. I did put a 
TODO in FacetFields to allow reuse.
{quote}

Well, we'd somehow need N BytesRefs to reuse (one per CLP) ... but I
don't think we should worry about that now.

It is unfortunate that the common case is often held back by the full
flexibility/generality of the facet module ... sometimes I think we
need a facet-light module.  But maybe if we can get the specialization
done we don't need facet-light ...

{quote}
bq. why do we have VInt8.bytesNeeded? Who uses that?

Currently no one uses it, but it was there and I thought that it's a convenient 
API to keep. Why encode and then see how many bytes were occupied?
Anyway, neither the encoders nor the decoders use it. I have no strong feelings 
for keeping/removing it, so if you feel like it should be removed, I can do it.
{quote}

I think we should remove it: it's a dangerous API because it can
encourage consumers to do things like call bytesNeeded first (to know
how much to grow their buffer, say) followed by encoding.  The slow
part of vInt encoding is all those ifs ...
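
To illustrate the point, here is a minimal sketch of encoding first and reading the byte count from the result, instead of pre-computing it. It assumes the common 7-bits-per-byte variable-length layout (low-order group first, high bit meaning "more bytes follow"). VIntSketch and its fields are hypothetical names, not the facet module's VInt8 class, and growing the buffer by the 5-byte worst case is an assumption about how a caller might size it.

```java
import java.util.Arrays;

/** Hypothetical sketch; not the facet module's VInt8 class. */
final class VIntSketch {
  /** Output buffer, grown on demand. */
  byte[] bytes = new byte[8];
  /** Number of valid bytes in the buffer. */
  int length = 0;

  /**
   * Appends value in 7-bit groups, low-order group first. The high bit of
   * each byte is set when more bytes follow. Returns the bytes written.
   */
  int encode(int value) {
    // An int never needs more than 5 bytes, so reserve the worst case
    // instead of computing the exact size up front.
    if (length + 5 > bytes.length) {
      bytes = Arrays.copyOf(bytes, Math.max(bytes.length * 2, length + 5));
    }
    int start = length;
    while ((value & ~0x7F) != 0) {
      bytes[length++] = (byte) ((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    bytes[length++] = (byte) value;
    return length - start; // known only after encoding
  }
}
```

The caller never asks how many bytes a value will need, so the value-dependent branching runs once per value, inside the encode loop.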

{quote}
bq. Hmm, it's a little abusive how VInt8.decode changes the offset of the 
incoming BytesRef

It is, but that's the result of Java's lack of pass by reference. I.e., decode 
needs to return the caller two values: the decoded number and how many bytes 
were read.
Notice that in the previous byte[] variant, the method took a class Position, 
which is horrible. That's why I documented in decode() that it advances 
bytes.offset, so
the caller can restore it in the end. For instance, IntDecoder restores the 
offset to the original one in the end.

On LUCENE-4675 Robert gave me an idea to create a BytesRefIterator, and I 
started to play with it. I.e. it would wrap a BytesRef but add 'pos' and 'upto' 
indexes.
The user can modify 'pos' freely, without touching bytes.offset. That 
introduces an object allocation though, and since I'd want to reuse that object 
wherever
possible, I think I'll look at it after finishing this issue. It already 
contains too many changes.
{quote}

OK.

{quote}
bq. I guess this is why you want an upto

No, I wanted upto because iterating up to bytes.length is incorrect. You need 
to iterate up to offset+length. BytesRefIterator.pos and BytesRefIterator.upto 
solve these cases for me.
{quote}

OK.
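
For illustration, a rough sketch of the 'pos'/'upto' idea discussed above: a small cursor over a byte range that decodes variable-length ints without modifying the caller's offset, and that stops at offset+length rather than at the end of the backing array. VIntCursorSketch is a hypothetical class, not the BytesRefIterator from LUCENE-4675, and the 7-bits-per-byte layout (low-order group first) is an assumption.

```java
/** Hypothetical sketch; not an actual Lucene class. */
final class VIntCursorSketch {
  final byte[] bytes;
  /** Next byte to read; advanced by decode(). */
  int pos;
  /** One past the last valid byte: offset + length, NOT bytes.length. */
  final int upto;

  VIntCursorSketch(byte[] bytes, int offset, int length) {
    this.bytes = bytes;
    this.pos = offset;
    this.upto = offset + length;
  }

  boolean hasNext() {
    return pos < upto;
  }

  /** Decodes one value and advances pos past the bytes it consumed. */
  int decode() {
    int value = 0;
    int shift = 0;
    byte b;
    do {
      b = bytes[pos++];
      value |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0); // high bit set: more bytes follow
    return value;
  }
}
```

Because the cursor keeps its own position, the caller's offset and length stay untouched and nothing has to be restored after decoding. The cost is the extra object, which is why reuse matters.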

{quote}
bq. looks like things got a bit slower (or possibly it's noise)

First, even if it's not noise, the slowdown IMO is worth the code 
simplification.
{quote}

+1

{quote}
But, I do believe that we'll see gains when there are more than 3 integers to 
encode/decode.
In fact, the facets test package has an EncodingSpeed class which measures the 
time it takes to encode/decode a large number of integers (a few thousands). 
When I compared the
result to 4x (i.e. without the patch), the decode time seemed to be ~x5 faster.
{quote}

Good!  Would be nice to have a real-world biggish-number-of-facets
benchmark ... I'll ponder how to do that w/ luceneutil.

bq. In this patch I added an Ant task "run-encoding-benchmark" which runs this 
class. Want to give it a try on your beast machine? For 4x, you can just copy 
the target to lucene/facet/build.xml, I believe it will work without issues.

OK I'll run it!



[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549759#comment-13549759
 ] 

Michael McCandless commented on LUCENE-4620:


Thanks Shai, that new patch worked!

This patch looks great!

It's a little disturbing that every doc must make a new
HashMap at indexing time (seems like a lot of
overhead/objects when the common case just needs to return a single
BytesRef, which could be re-used).  Can we use
Collections.singletonMap when there are no partitions?
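
A minimal sketch of that suggestion, assuming the per-document result is a map from category-list field name to its encoded bytes. CategoryListDataSketch, build and the byte[] values (standing in for BytesRef) are hypothetical, not the patch's actual code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; names do not come from the patch. */
final class CategoryListDataSketch {
  /**
   * Returns field -> payload. With a single category list (the common
   * case) this avoids allocating a HashMap for every document.
   */
  static Map<String, byte[]> build(String[] fields, byte[][] payloads) {
    if (fields.length == 1) {
      // Immutable single-entry map; no hash table is allocated.
      return Collections.singletonMap(fields[0], payloads[0]);
    }
    Map<String, byte[]> map = new HashMap<>();
    for (int i = 0; i < fields.length; i++) {
      map.put(fields[i], payloads[i]);
    }
    return map;
  }
}
```

Note that the map returned by Collections.singletonMap is immutable, so callers must not try to add entries to it.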

The decode API (more important than encode) looks like it reuses the
Bytes/IntsRef, so that's good.

Hmm why do we have VInt8.bytesNeeded?  Who uses that?  I think that's
a dangerous API to have ... it's better to simply encode and then see
how many bytes it took.

Hmm, it's a little abusive how VInt8.decode changes the offset of the
incoming BytesRef ... I guess this is why you want an upto :)

Net/net this is great progress over what we have today, so +1!

I ran a quick 10M English Wikipedia test w/ just term queries:
{noformat}
Task       QPS base  StdDev    QPS comp  StdDev    Pct diff
HighTerm      12.79  (2.4%)       12.56  (1.2%)    -1.8% (  -5% -    1%)
MedTerm       18.04  (1.8%)       17.77  (0.8%)    -1.5% (  -4% -    1%)
LowTerm       47.69  (1.1%)       47.56  (1.0%)    -0.3% (  -2% -    1%)
{noformat}

The test only has 3 ords per doc so it's not "typical" ... looks like things 
got a bit slower (or possibly it's noise).





[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549568#comment-13549568
 ] 

Michael McCandless commented on LUCENE-4620:


Looks like there were some svn mv's, so the patch doesn't directly apply ...

Can you regenerate the patch using 'svn diff --show-copies-as-adds' (assuming 
you're using svn 1.7+)?

Either that or use dev-tools/scripts/diffSources.py ... thanks.





[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2012-12-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13529966#comment-13529966
 ] 

Shai Erera commented on LUCENE-4620:


Also, today there are a few IntEncoders which are used during indexing only, e.g. 
SortingIntEncoder and UniqueIntEncoder, which guarantee that an ordinal will be 
written just once to the payload and sort the ordinals so that DGap can be 
computed afterwards. These do not have a matching Decoder, and they shouldn't 
have one, because at search time you don't care if the ords are sorted or not, 
and you can assume they are unique.
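
As a rough illustration of why only the gap step needs a decoder, here is a sketch of the indexing-side preparation (sort, drop duplicates, take differences) and the search-side inverse (running sum). OrdinalPrepSketch and its methods are hypothetical and work on plain int[] arrays, not on the real SortingIntEncoder/UniqueIntEncoder/DGap classes.

```java
import java.util.Arrays;

/** Hypothetical sketch; not the facet module's encoder classes. */
final class OrdinalPrepSketch {
  /** Indexing side: sort, remove duplicates, then store gaps. */
  static int[] sortUniqueDGap(int[] ordinals) {
    int[] sorted = ordinals.clone();
    Arrays.sort(sorted);
    // Compact in place, keeping each value only once.
    int n = 0;
    for (int i = 0; i < sorted.length; i++) {
      if (n == 0 || sorted[i] != sorted[n - 1]) {
        sorted[n++] = sorted[i];
      }
    }
    // Replace each value by its distance from the previous one.
    int[] gaps = new int[n];
    int prev = 0;
    for (int i = 0; i < n; i++) {
      gaps[i] = sorted[i] - prev;
      prev = sorted[i];
    }
    return gaps;
  }

  /** Search side: only the gaps need undoing, via a running sum. */
  static int[] undoDGap(int[] gaps) {
    int[] ordinals = new int[gaps.length];
    int sum = 0;
    for (int i = 0; i < gaps.length; i++) {
      sum += gaps[i];
      ordinals[i] = sum;
    }
    return ordinals;
  }
}
```

Sorting and de-duplication discard information (the original order and the repeats) that search does not need, so there is nothing to reverse for them. The gaps are the only transformation that must be undone.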

Another thing that I think we should do is move those encoders into the *.facet 
package. They are currently under the facet module, but in o.a.l.util, b/c again 
we thought at the time that they are a generic piece of code for 
encoding/decoding integers. Lucene has PackedInts and DataInput/Output for 
doing block and VInt encodings. Users can write Codecs for other encoding 
algorithms ... IntEncoder/Decoder are not that generic :).

