Alan Gates commented on PIG-793:
The cost for storing data raw is:
16 bytes for the tuple object
12 bytes for the byte array object
12 bytes + 2 bytes/field for a short to hold offsets into the byte array
Then, as you say above, the data itself, plus 1 byte per field to store the type
So our example tuple would take ~85 bytes.
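The arithmetic behind that estimate can be sketched as below. This is only an illustration of the per-object and per-field costs listed above; the class name, the method name, and the sample field count and payload size are made up for the example and are not the tuple discussed earlier in the thread.

```java
final class RawTupleCost {
    // Fixed overheads quoted in the comment above.
    static final int TUPLE_OBJECT_BYTES = 16;
    static final int BYTE_ARRAY_OBJECT_BYTES = 12;
    static final int OFFSET_ARRAY_OBJECT_BYTES = 12;
    // Per-field overheads: one short offset plus one type byte.
    static final int OFFSET_BYTES_PER_FIELD = 2;
    static final int TYPE_BYTES_PER_FIELD = 1;

    /** Estimated size of a raw tuple with numFields fields and dataBytes of field data. */
    public static int estimate(int numFields, int dataBytes) {
        return TUPLE_OBJECT_BYTES
                + BYTE_ARRAY_OBJECT_BYTES
                + OFFSET_ARRAY_OBJECT_BYTES
                + numFields * (OFFSET_BYTES_PER_FIELD + TYPE_BYTES_PER_FIELD)
                + dataBytes;
    }

    public static void main(String[] args) {
        // Hypothetical tuple: 3 fields carrying 20 bytes of data in total.
        System.out.println(estimate(3, 20));
    }
}
```

The fixed part is 40 bytes per tuple and the variable part is 3 bytes per field on top of the data, so the overhead matters most for tuples with few, small fields.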
But in general, yes you can do much better with raw bytes. We played with this
some and we found that the cost of Tuple.get/set goes up 10x because of the
need to turn the bytes into objects. In a typical query this added about 2x to
the overall run time. The solution to this would be to rewrite all the Pig
operators to work on byte data instead of objects. This is a large project,
and doesn't solve the UDFs. We could pay the performance penalty for UDFs, or
we could change the UDFs to take byte data. Currently many of our users are
asking for the ability to write UDFs in Python or other scripting languages.
If we instead go the other way and basically make them write C-style Java, I
don't think that will be popular.
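To make the get/set cost concrete, here is a minimal sketch of a tuple stored as raw bytes with a short offset per field and a type byte in front of each value. It is not Pig's code: the class name, the type tags, and the fixed 1024-byte buffer are assumptions chosen to keep the example short. The point is that every get() has to decode bytes and allocate a fresh object.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RawTupleSketch {
    // Hypothetical type tags, one byte per field.
    static final byte TYPE_INT = 1;
    static final byte TYPE_STRING = 2;

    private final byte[] data;     // per field: 1 type byte followed by the value bytes
    private final short[] offsets; // start of each field within data

    public RawTupleSketch(Object... fields) {
        ByteBuffer buf = ByteBuffer.allocate(1024); // fixed size, sketch only
        offsets = new short[fields.length];
        for (int i = 0; i < fields.length; i++) {
            offsets[i] = (short) buf.position();
            if (fields[i] instanceof Integer) {
                buf.put(TYPE_INT).putInt((Integer) fields[i]);
            } else {
                byte[] utf8 = fields[i].toString().getBytes(StandardCharsets.UTF_8);
                buf.put(TYPE_STRING).put(utf8);
            }
        }
        data = Arrays.copyOf(buf.array(), buf.position());
    }

    /** Decodes field i from the byte array; each call builds a new object. */
    public Object get(int i) {
        int start = offsets[i];
        int end = (i + 1 < offsets.length) ? offsets[i + 1] : data.length;
        if (data[start] == TYPE_INT) {
            return ByteBuffer.wrap(data, start + 1, 4).getInt();
        }
        return new String(data, start + 1, end - start - 1, StandardCharsets.UTF_8);
    }

    public int storedBytes() {
        return data.length;
    }
}
```

An operator that calls get() repeatedly on the same field pays the decoding cost each time, which is why operators and UDFs would need to work on the bytes directly to avoid the penalty.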
What we're playing with now (changing ArrayList<Object> to Object[] and String
to Text) will reap somewhere around 50% of the benefits in terms of memory
savings as going to fully raw data. But it's around 10% of the work. I'm not
excluding moving to storing everything in a byte[] in the future. But I want
to see if for a little work now we can get a decent amount of improvement.
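As a rough illustration of the cheaper change, a tuple can hold its fields in a plain array instead of an ArrayList<Object>. The class below is a hypothetical sketch, not the actual Pig Tuple implementation, and it leaves out the String-to-Text part because Text is a Hadoop class.

```java
final class ArrayTupleSketch {
    // A plain array avoids the separate ArrayList object and any unused capacity
    // in its backing array; fields are still ordinary Java objects.
    private final Object[] fields;

    public ArrayTupleSketch(int size) {
        fields = new Object[size];
    }

    public Object get(int i) {
        return fields[i];
    }

    public void set(int i, Object value) {
        fields[i] = value;
    }

    public int size() {
        return fields.length;
    }
}
```

Because get() and set() still hand back the stored objects, operators and UDFs keep working unchanged, which is what keeps the amount of work small compared with going to raw bytes.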
> Improving memory efficiency of Tuple implementation
> Key: PIG-793
> URL: https://issues.apache.org/jira/browse/PIG-793
> Project: Pig
> Issue Type: Improvement
> Reporter: Olga Natkovich
> Assignee: Alan Gates
> Currently, our tuple is a real pig and uses a lot of extra memory.
> There are several places where we can improve memory efficiency:
> (1) Laying out memory for the fields rather than using Java objects, since
> each object for a numeric field takes 16 bytes
> (2) For the cases where we know the schema using Java arrays rather than
> There might be more.