Alan Gates commented on PIG-793:

Using jmap, I've been toying around with our DefaultTuple implementation to see 
how much memory it takes.  For a tuple with 3 elements (one int, one double, 
and one 20-character string), I see it taking:

16 bytes for the Tuple object
24 bytes for the ArrayList<Object> in the tuple
~26 bytes for pointers in the ArrayList
16 bytes for the Integer
16 bytes for the Double
24 bytes for the String overhead
~52 bytes for the String data

Pointers in the ArrayList and character data in the String appear to be padded 
and vary somewhat depending on how I run the experiments.
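
To make the arithmetic explicit, here is a minimal sketch that just adds up the 
figures listed above.  The class and method names are made up for illustration, 
and the two "~" values are approximate, so the totals are approximate as well.

```java
// Sketch only: byte counts are the jmap figures quoted in the message above.
// The ArrayList pointer and String data values were reported as approximate.
class TupleFootprint {
    static final String[] PARTS = {
        "Tuple object", "ArrayList<Object>", "ArrayList pointers (~)",
        "Integer", "Double", "String overhead", "String data (~)"
    };
    static final int[] BYTES = {16, 24, 26, 16, 16, 24, 52};

    // Sum of every component for the 3-field example tuple.
    static int totalBytes() {
        int sum = 0;
        for (int b : BYTES) {
            sum += b;
        }
        return sum;
    }

    // Bytes spent on the container alone: Tuple + ArrayList + its pointers.
    static int containerBytes() {
        return BYTES[0] + BYTES[1] + BYTES[2];
    }

    public static void main(String[] args) {
        for (int i = 0; i < PARTS.length; i++) {
            System.out.printf("%-24s %3d%n", PARTS[i], BYTES[i]);
        }
        System.out.println("total:     " + totalBytes());
        System.out.println("container: " + containerBytes());
    }
}
```

With these numbers the tuple comes to roughly 174 bytes, of which about 66 are 
the tuple and list containers rather than the field values.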

I played with changing the ArrayList<Object> in DefaultTuple to an Object[].  
There are two advantages: the 24 bytes of ArrayList overhead shrink to 12 for 
the Object[], and because I wrote it so the Object[] is always exactly the 
right size, there is no padding cost.  The downside is that append becomes a 
more expensive operation because it grows the Object[] by one every time.  
However, after some investigation I believe that most places we use append can 
be changed to use set, thus alleviating this issue.  I'm working on a patch to 
change this.  Once I have that done I'll report on how that changes memory 
usage as well as any performance gains or losses.
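
The trade-off between append and set can be sketched as follows.  This is a 
hypothetical ArrayTuple class written for illustration, not the actual patch 
and not Pig's DefaultTuple; it only assumes an exactly-sized Object[] as 
described above.

```java
import java.util.Arrays;

// Hypothetical array-backed tuple; illustrates the idea, not Pig's real code.
class ArrayTuple {
    // Always exactly as long as the number of fields, so no spare capacity.
    private Object[] fields;

    ArrayTuple(int size) {
        fields = new Object[size];
    }

    int size() {
        return fields.length;
    }

    Object get(int index) {
        return fields[index];
    }

    // Cheap: overwrites a slot that was allocated up front.
    void set(int index, Object value) {
        fields[index] = value;
    }

    // Expensive: allocates a new array one slot longer and copies every
    // existing field into it on each call.
    void append(Object value) {
        fields = Arrays.copyOf(fields, fields.length + 1);
        fields[fields.length - 1] = value;
    }
}
```

A caller that knows the field count in advance can construct the tuple at full 
size and fill it with set, which avoids the per-append copy entirely:

```java
ArrayTuple t = new ArrayTuple(3);
t.set(0, 42);
t.set(1, 3.14);
t.set(2, "a 20 character str..");
```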

A related item I would like to look into is using Hadoop's Text instead of 
String to back chararray.  Text takes 16 bytes of overhead + 36 bytes for 
string data to store 20 characters, versus the 24 / 52 of String.  Obviously 
this would be a huge change and needs to have very impressive results to be 
considered.  I'll play with it and report results here.
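
For reference, the comparison above works out as follows.  This is only the 
arithmetic on the figures quoted in this comment (class and method names are 
illustrative); it does not measure anything and does not use Hadoop itself.

```java
// Sketch only: figures are the ones quoted above for a 20-character value.
class CharArrayBacking {
    static final int STRING_OVERHEAD = 24;
    static final int STRING_DATA = 52;
    static final int TEXT_OVERHEAD = 16;
    static final int TEXT_DATA = 36;

    static int stringTotal() {
        return STRING_OVERHEAD + STRING_DATA;
    }

    static int textTotal() {
        return TEXT_OVERHEAD + TEXT_DATA;
    }

    // Bytes saved per 20-character chararray if Text replaced String.
    static int savedPerValue() {
        return stringTotal() - textTotal();
    }

    public static void main(String[] args) {
        System.out.println("String: " + stringTotal() + " bytes");
        System.out.println("Text:   " + textTotal() + " bytes");
        System.out.println("Saved:  " + savedPerValue() + " bytes");
    }
}
```

That is 76 bytes for String versus 52 for Text, a difference of 24 bytes per 
20-character value on these numbers.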

> Improving memory efficiency of Tuple implementation
> ---------------------------------------------------
>                 Key: PIG-793
>                 URL: https://issues.apache.org/jira/browse/PIG-793
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Alan Gates
> Currently, our tuple is a real pig and uses a lot of extra memory. 
> There are several places where we can improve memory efficiency:
> (1) Laying out memory for the fields rather than using Java objects, since 
> each object for a numeric field takes 16 bytes
> (2) For the cases where we know the schema, using Java arrays rather than 
> ArrayList.
> There might be more.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
