Alan Gates commented on PIG-793:

The cost of storing the data raw is:

16 bytes for the tuple object
12 bytes for the byte array object
12 bytes + 2 bytes/field for a short[] to hold offsets into the byte[]
Then, as you say above, the bytes for the data itself, plus 1 byte per field 
to store type and nullness.

So our example tuple would take ~85 bytes.
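
As a rough sketch of that arithmetic (the class and method names are made up 
for illustration, and the field count and data size used in main are 
hypothetical inputs, not the example tuple discussed above):

```java
// Illustrative only: RawTupleCost and estimateBytes are hypothetical names,
// not part of Pig. The constants are the per-object costs listed above.
final class RawTupleCost {
    static final int TUPLE_OBJECT = 16;       // the tuple object itself
    static final int BYTE_ARRAY_OBJECT = 12;  // byte[] holding the field data
    static final int SHORT_ARRAY_OBJECT = 12; // short[] holding the offsets
    static final int OFFSET_PER_FIELD = 2;    // one short per field
    static final int TYPE_TAG_PER_FIELD = 1;  // type and nullness

    // Estimated size of one raw tuple with numFields fields whose
    // values occupy dataBytes bytes in total.
    static int estimateBytes(int numFields, int dataBytes) {
        return TUPLE_OBJECT + BYTE_ARRAY_OBJECT + SHORT_ARRAY_OBJECT
                + numFields * (OFFSET_PER_FIELD + TYPE_TAG_PER_FIELD)
                + dataBytes;
    }

    public static void main(String[] args) {
        // Hypothetical input: 3 fields holding 20 bytes of data in total.
        System.out.println(estimateBytes(3, 20)); // 40 + 3*3 + 20 = 69
    }
}
```

So the fixed overhead is 40 bytes per tuple plus 3 bytes per field, on top of 
the data itself.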

But in general, yes, you can do much better with raw bytes.  We experimented 
with this and found that the cost of Tuple.get/set goes up 10x because the 
bytes have to be turned into objects on every call.  In a typical query this 
added about 2x to the overall run time.  The solution would be to rewrite all 
the Pig operators to work on byte data instead of objects.  This is a large 
project, and it doesn't solve the problem for UDFs.  We could pay the 
performance penalty for UDFs, or we could change the UDFs to take byte data.  
Currently many of our users are asking for the ability to write UDFs in Python 
or other scripting languages.  If we instead go the other way and basically 
make them write C-style Java, I don't think that will be popular.
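
To make the get() cost concrete, here is a minimal sketch of a tuple laid out 
as described above (a byte[] for the data, a short[] of offsets, and one type 
byte per field).  This is not Pig's Tuple implementation: the class name, the 
type tags, and the encoding (big-endian ints, UTF-8 strings) are all 
assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not Pig's Tuple implementation.
final class RawTupleSketch {
    // Illustrative type tags (assumed, not Pig's DataType constants).
    static final byte NULL = 0, INT = 1, STRING = 2;

    private final byte[] data;     // all field values packed back to back
    private final short[] offsets; // start of each field within data
    private final byte[] types;    // one tag per field; also encodes nullness

    RawTupleSketch(byte[] data, short[] offsets, byte[] types) {
        this.data = data;
        this.offsets = offsets;
        this.types = types;
    }

    // Every call decodes bytes and allocates a new object. This is the
    // extra work an object-based tuple avoids by returning a reference.
    Object get(int i) {
        int start = offsets[i];
        int end = (i + 1 < offsets.length) ? offsets[i + 1] : data.length;
        switch (types[i]) {
            case NULL:
                return null;
            case INT:
                return ByteBuffer.wrap(data, start, 4).getInt();
            case STRING:
                return new String(data, start, end - start,
                        StandardCharsets.UTF_8);
            default:
                throw new IllegalStateException("unknown type tag " + types[i]);
        }
    }

    // Builds a three-field demo tuple: (42, "pig", null).
    static RawTupleSketch demo() {
        byte[] packed = ByteBuffer.allocate(7)
                .putInt(42)
                .put("pig".getBytes(StandardCharsets.UTF_8))
                .array();
        return new RawTupleSketch(packed,
                new short[] {0, 4, 7},
                new byte[] {INT, STRING, NULL});
    }
}
```

Operators that call get() on every field of every record pay this decode and 
allocation cost each time, which is why they would need to be rewritten to 
read the bytes directly.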

What we're experimenting with now (changing ArrayList<Object> to Object[] and 
String to Text) should give somewhere around 50% of the memory savings of 
going to fully raw data, for around 10% of the work.  I'm not ruling out 
storing everything in a byte[] in the future.  But I want to see whether a 
little work now can get us a decent amount of improvement.
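
A minimal sketch of the Object[] half of that change follows.  The class name 
is hypothetical and this is not the actual patch; the String to Text change is 
not shown because Text needs Hadoop on the classpath.

```java
// Hypothetical sketch, not Pig's Tuple class.
final class ArrayTupleSketch {
    // Sized exactly once: no separate ArrayList wrapper object and no
    // spare capacity beyond the number of fields.
    private final Object[] fields;

    ArrayTupleSketch(int size) {
        fields = new Object[size];
    }

    Object get(int i) {
        return fields[i]; // still returns a reference; nothing to decode
    }

    void set(int i, Object value) {
        fields[i] = value;
    }

    int size() {
        return fields.length;
    }
}
```

Fields are still ordinary objects, so get/set stay as cheap as they are today 
and UDFs see the same interface.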

> Improving memory efficiency of Tuple implementation
> ---------------------------------------------------
>                 Key: PIG-793
>                 URL: https://issues.apache.org/jira/browse/PIG-793
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Alan Gates
> Currently, our tuple is a real pig and uses a lot of extra memory. 
> There are several places where we can improve memory efficiency:
> (1) Laying out memory for the fields rather than using Java objects, since 
> each object for a numeric field takes 16 bytes
> (2) For the cases where we know the schema, using Java arrays rather than 
> ArrayList.
> There might be more.
