[ 
https://issues.apache.org/jira/browse/PIG-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286758#comment-13286758
 ] 

Jonathan Coveney commented on PIG-2638:
---------------------------------------

Ashutosh,

Given the way that we currently serialize values, there is actually no gain to 
using varint, because we are, in all cases, writing a byte value that specifies 
what is being serialized. In fact we can do better than varint... instead of 
needing a bit flag at the head of every byte, we can just have something like 
the following:

INT_1BYTE,
INT_2BYTE,
INT_3BYTE,
INT_4BYTE

and analogously for longs. Given that there is currently no way NOT to 
write that object identification byte, varint/varlong offers no gain, 
since we can encode more compactly (given what we do) anyway. However, 
as I work to remove that need (in SchemaTuple, for example), varint/varlong 
begins to make a lot more sense.
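
To make the idea concrete, here is a rough sketch of size-tagged int encoding. The class name, the tag values, and the big-endian byte order are assumptions for illustration only; this is not the actual BinInterSedes code or the attached patch.

```java
// Hypothetical sketch: the type byte that must be written anyway also
// records how many payload bytes follow, so no per-byte flag bit is needed.
class SizeTaggedInt {
    // Assumed tag values; the real constants may differ.
    static final byte INT_1BYTE = 1;
    static final byte INT_2BYTE = 2;
    static final byte INT_3BYTE = 3;
    static final byte INT_4BYTE = 4;

    // Returns tag byte + the smallest big-endian two's-complement payload.
    static byte[] encodeInt(int v) {
        if (v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE) {
            return new byte[] {INT_1BYTE, (byte) v};
        } else if (v >= Short.MIN_VALUE && v <= Short.MAX_VALUE) {
            return new byte[] {INT_2BYTE, (byte) (v >> 8), (byte) v};
        } else if (v >= -(1 << 23) && v < (1 << 23)) {
            return new byte[] {INT_3BYTE, (byte) (v >> 16), (byte) (v >> 8), (byte) v};
        } else {
            return new byte[] {INT_4BYTE, (byte) (v >> 24), (byte) (v >> 16),
                               (byte) (v >> 8), (byte) v};
        }
    }

    // The first payload byte is left signed so the value sign-extends.
    static int decodeInt(byte[] b) {
        switch (b[0]) {
            case INT_1BYTE:
                return b[1];
            case INT_2BYTE:
                return (b[1] << 8) | (b[2] & 0xFF);
            case INT_3BYTE:
                return (b[1] << 16) | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
            case INT_4BYTE:
                return (b[1] << 24) | ((b[2] & 0xFF) << 16)
                        | ((b[3] & 0xFF) << 8) | (b[4] & 0xFF);
            default:
                throw new IllegalArgumentException("unknown tag: " + b[0]);
        }
    }

    public static void main(String[] args) {
        int[] samples = {0, -1, 127, 128, 70000, Integer.MIN_VALUE};
        for (int v : samples) {
            byte[] enc = encodeInt(v);
            System.out.println(v + " -> " + enc.length + " bytes, decoded " + decodeInt(enc));
        }
    }
}
```

With the tag carrying the length, every payload byte holds a full 8 bits of data, whereas a varint spends one bit per byte on the continuation flag.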

I think this patch is some really easy low hanging fruit, and in the future I 
have some ideas around how to greatly improve serialization performance that 
will be more sweeping.

Would love your thoughts.
                
> Optimize BinInterSedes treatment of longs
> -----------------------------------------
>
>                 Key: PIG-2638
>                 URL: https://issues.apache.org/jira/browse/PIG-2638
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jonathan Coveney
>            Assignee: Jonathan Coveney
>             Fix For: 0.11, 0.10.1
>
>         Attachments: PIG-2638-0.patch, PIG-2638-1.patch
>
>
> During adventures in BinInterSedes, I noticed that Integers are written in an 
> optimized fashion, but longs are not. Given that, in the general case, we 
> have to write type information anyway, we might as well do the same 
> optimization for Longs. That is to say, given that most longs won't have 8 
> bytes of information in them, why should we waste the space of serializing 8 
> bytes?
> This patch takes its inspiration from varint encoding per these two sources:
> http://javasourcecode.org/html/open-source/mahout/mahout-0.5/org/apache/mahout/math/Varint.java.html
> https://developers.google.com/protocol-buffers/docs/encoding
> Though, nicely enough, we don't actually have to use varints. Since we HAVE 
> to write an 8-bit type header, we might as well include the number of bytes 
> we had to write. I use zig zag encoding so that in the case of negative 
> numbers, we see the benefit.
> This should decrease the amount of serialized long data by a good bit.
> Patch incoming. It passes test-commit in 0.11.
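
The zig zag step mentioned in the description above can be sketched roughly as follows. This is an illustration under stated assumptions, not the attached patch: the class and method names are hypothetical, and the leading byte here simply stores the payload length, standing in for whatever per-length type tags the real code uses.

```java
// Hypothetical sketch: zig-zag a long, then write only the bytes it needs.
class ZigZagLong {
    // Interleaves negatives and positives: 0->0, -1->1, 1->2, -2->3, ...
    static long zigZagEncode(long v) {
        return (v << 1) ^ (v >> 63);
    }

    static long zigZagDecode(long z) {
        return (z >>> 1) ^ -(z & 1);
    }

    // Number of bytes needed to hold z as an unsigned value (at least 1).
    static int bytesNeeded(long z) {
        if (z == 0) {
            return 1;
        }
        return (64 - Long.numberOfLeadingZeros(z) + 7) / 8;
    }

    // Layout (assumed): [length byte][big-endian zig-zag payload].
    static byte[] encode(long v) {
        long z = zigZagEncode(v);
        int n = bytesNeeded(z);
        byte[] out = new byte[n + 1];
        out[0] = (byte) n; // stand-in for a per-length type tag
        for (int i = 0; i < n; i++) {
            out[1 + i] = (byte) (z >>> (8 * (n - 1 - i)));
        }
        return out;
    }

    static long decode(byte[] b) {
        long z = 0;
        for (int i = 1; i <= b[0]; i++) {
            z = (z << 8) | (b[i] & 0xFFL);
        }
        return zigZagDecode(z);
    }
}
```

Without the zig-zag step, a small negative long such as -1 has all of its high bits set in two's complement and would look like it needs the full 8 bytes; after zig-zag it fits in a single payload byte.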

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
