Alan Gates commented on PIG-793:
Using jmap, I've been toying around with our DefaultTuple implementation to see
how much memory it takes. For a tuple with 3 elements (one int, one double,
and one 20-character string) I see it taking:
16 bytes for the Tuple object
24 bytes for the ArrayList<Object> in the tuple
~26 bytes for pointers in the ArrayList
16 bytes for the Integer
16 bytes for the Double
24 bytes for the String overhead
~52 bytes for the String data
Pointers in the ArrayList and character data in the String appear to be padded
and vary somewhat depending on how I run the experiments.
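
To put the figures above in perspective, here is a small tally. It is only a sketch: the class and constant names are made up for illustration, the two "~" values are approximate, and the payload figure assumes the string's characters are stored as 2-byte UTF-16 chars.

```java
// Tally of the per-object sizes listed above for a 3-field tuple
// (int, double, 20-character string). Names are hypothetical.
class TupleFootprint {
    static final int TUPLE_OBJECT = 16;
    static final int ARRAY_LIST = 24;
    static final int ARRAY_LIST_POINTERS = 26; // approximate, includes padding
    static final int INTEGER = 16;
    static final int DOUBLE = 16;
    static final int STRING_OVERHEAD = 24;
    static final int STRING_DATA = 52;         // approximate, includes padding

    // Assumption: raw payload is 4 bytes (int) + 8 bytes (double)
    // + 20 chars at 2 bytes each.
    static final int PAYLOAD = 4 + 8 + 20 * 2;

    // Sum of every measured component.
    static int total() {
        return TUPLE_OBJECT + ARRAY_LIST + ARRAY_LIST_POINTERS
             + INTEGER + DOUBLE + STRING_OVERHEAD + STRING_DATA;
    }

    // Bytes spent on something other than the field values themselves.
    static int overhead() {
        return total() - PAYLOAD;
    }

    public static void main(String[] args) {
        System.out.println("total bytes:    " + total());
        System.out.println("payload bytes:  " + PAYLOAD);
        System.out.println("overhead bytes: " + overhead());
    }
}
```

With these numbers the tuple takes roughly 174 bytes to hold about 52 bytes of field data.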
I played with changing the ArrayList<Object> in DefaultTuple to an Object[].
There are two advantages: the 24 bytes of ArrayList shrinks to 12 for the
Object[], and as I wrote it to always have the Object[] be exactly the right
size there is no padding cost. The downside to this is that append becomes a
more expensive operation because it grows the Object[] by one every time.
However, after some investigation I believe that most places we use append can
be changed to use set, thus alleviating this issue. I'm working on a patch to
change this. Once I have that done I'll report on how that changes memory
usage as well as any performance gains or losses.
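
A minimal sketch of the array-backed layout described above, to show why set is cheap and append is not. This is not the actual DefaultTuple code or the patch; the class name ArrayBackedTuple and its methods are hypothetical and only mirror the operations mentioned in this comment.

```java
import java.util.Arrays;

// Hypothetical illustration only; not Pig's DefaultTuple.
class ArrayBackedTuple {
    // Always sized to exactly the number of fields, so no spare capacity.
    private Object[] fields;

    ArrayBackedTuple(int size) {
        fields = new Object[size];
    }

    int size() {
        return fields.length;
    }

    Object get(int i) {
        return fields[i];
    }

    // Cheap: writes into a slot that already exists.
    void set(int i, Object val) {
        fields[i] = val;
    }

    // Expensive: allocates a new array one element larger and copies
    // every existing reference into it on each call.
    void append(Object val) {
        fields = Arrays.copyOf(fields, fields.length + 1);
        fields[fields.length - 1] = val;
    }

    public static void main(String[] args) {
        // When the field count is known up front, size once and use set.
        ArrayBackedTuple t = new ArrayBackedTuple(3);
        t.set(0, 1);
        t.set(1, 2.0);
        t.set(2, "a 20 character str.");
        System.out.println("fields: " + t.size());
    }
}
```

Building an n-field tuple with append alone copies the array n times, which is why moving callers that know the field count to set matters for this layout.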
A related item I would like to look into is using Hadoop's Text instead of
String to back chararray. Text takes 16 bytes of overhead + 36 bytes for
string data to store 20 characters, versus the 24 / 52 of String. Obviously
this would be a huge change and needs to have very impressive results to be
considered. I'll play with it and report results here.
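
One likely contributor to the difference is encoding: Text stores its contents as UTF-8 bytes, while String (on a JVM where it is backed by a char[]) uses 2-byte UTF-16 code units. The sketch below compares only the raw character data for the two encodings; it does not use Hadoop, the class name EncodingSize is hypothetical, and the measured figures above also include array headers and padding that this does not model.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper comparing raw character-data sizes only.
class EncodingSize {
    // Bytes needed when each char is a 2-byte UTF-16 code unit.
    static int utf16Bytes(String s) {
        return s.length() * 2;
    }

    // Bytes needed when the same text is encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "abcdefghijklmnopqrst"; // 20 ASCII characters
        System.out.println("UTF-16 bytes: " + utf16Bytes(s));
        System.out.println("UTF-8 bytes:  " + utf8Bytes(s));
    }
}
```

Using the measured figures quoted above, the totals for a 20-character value come to 16 + 36 = 52 bytes for Text versus 24 + 52 = 76 bytes for String. The UTF-8 advantage shrinks for text that is mostly non-ASCII.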
> Improving memory efficiency of Tuple implementation
> Key: PIG-793
> URL: https://issues.apache.org/jira/browse/PIG-793
> Project: Pig
> Issue Type: Improvement
> Reporter: Olga Natkovich
> Assignee: Alan Gates
> Currently, our tuple is a real pig and uses a lot of extra memory.
> There are several places where we can improve memory efficiency:
> (1) Laying out memory for the fields rather than using Java objects, since
> each object for a numeric field takes 16 bytes
> (2) For the cases where we know the schema using Java arrays rather than
> There might be more.