Re: A proposal for changing pig's memory management

Alan Gates Tue, 19 May 2009 16:32:30 -0700

We definitely do not want to follow the current design of keeping chararrays and bytearrays as separate objects. It is that overhead of an object for each field that we are trying to avoid.

The reason for constraining a tuple to store its data in one TupleBuffer is to limit the size of the Tuple object. If the data can span TupleBuffers, then a Tuple has to hold a TupleBuffer reference for each field, almost doubling the size of a Tuple.
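
As a rough illustration of the "almost doubling" point, here is a back-of-the-envelope sketch. The class name TupleSizeEstimate and the 4-byte sizes for an offset and for a reference are assumptions made for the example, not measurements of Pig; object headers and alignment are ignored.

```java
// Hypothetical estimate only: compares per-tuple bookkeeping for the two layouts.
final class TupleSizeEstimate {
    static final int OFFSET_BYTES = 4;     // assumed: one int offset per field
    static final int REFERENCE_BYTES = 4;  // assumed: 32-bit object reference

    private TupleSizeEstimate() {}

    // All fields live in one TupleBuffer: one reference plus one offset per field.
    static int singleBuffer(int fields) {
        return REFERENCE_BYTES + fields * OFFSET_BYTES;
    }

    // Fields may span TupleBuffers: each field needs its own reference and offset.
    static int perFieldBuffer(int fields) {
        return fields * (REFERENCE_BYTES + OFFSET_BYTES);
    }
}
```

Under these assumptions a 10-field tuple needs 44 bytes of bookkeeping in the single-buffer layout and 80 bytes when each field carries its own reference.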

It is worth noting that a number of pig operators (sort, filter, distinct) need not copy any data. Some have to copy (foreach, though this could be optimized not to in certain cases such as simple projections), and some could get away without copying if tuples could reference fields across buffers (join, union, cross).


I think we should do some experimentation with this and see:

1) What is the memory saving from moving from objects to buffers, i.e. is this worth it at all.
2) What is the additional memory cost of storing a TupleBuffer reference per field.
3) What is the performance penalty of copying data on joins, unions, etc.


Once these are known it should be easier to make trade-off decisions.

There is one other option. I had said that it would be better to have a few very large (on the order of 10M) buffers. The reason I considered that was that I didn't want so many buffers that managing them became a burden on the system, and that we have to somehow handle the case of chararray, bytearray, or map fields that won't fit in a single TupleBuffer (presumably by storing those as an object instead of in the buffer). The larger we make the buffers the less we encounter this issue.

Instead of using large buffers we could use smaller ones. If we capped the size of a buffer at 65K and the number of buffers a single tuple could reference at 65K, then a tuple could still see 4G of memory but still only use 4 bytes per field to point to the data. This way join operations could be done without copying the data. This option should be experimented with as well. It may be that using smaller buffers is better since the cost of reading and writing them on disk will be less.
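
To make the arithmetic concrete, here is a minimal sketch of how a 4-byte field pointer could be split into a 16-bit buffer index and a 16-bit offset. FieldPointer and its method names are hypothetical, not anything in Pig, and the sketch reads "65K" as 65,536 (2^16).

```java
// Hypothetical helper: packs a buffer index and an in-buffer offset into one int.
final class FieldPointer {
    static final int MAX_BUFFERS = 1 << 16;      // buffers a single tuple may reference
    static final int MAX_BUFFER_SIZE = 1 << 16;  // bytes per buffer

    private FieldPointer() {}

    // High 16 bits hold the buffer index, low 16 bits hold the offset.
    static int pack(int bufferIndex, int offset) {
        if (bufferIndex < 0 || bufferIndex >= MAX_BUFFERS) {
            throw new IllegalArgumentException("buffer index out of range: " + bufferIndex);
        }
        if (offset < 0 || offset >= MAX_BUFFER_SIZE) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        return (bufferIndex << 16) | offset;
    }

    // Unsigned shift so indexes of 32768 and above are not read back as negative.
    static int bufferIndex(int pointer) {
        return pointer >>> 16;
    }

    static int offset(int pointer) {
        return pointer & 0xFFFF;
    }

    // 65,536 buffers of 65,536 bytes each, i.e. 4G addressable per tuple.
    static long addressableBytes() {
        return (long) MAX_BUFFERS * MAX_BUFFER_SIZE;
    }
}
```

With this layout a join could build its output tuple by copying only the 4-byte pointers of both inputs, provided the output tuple can see the buffers of both; how buffer indexes would be remapped in that case is left open here.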


Alan.


On May 15, 2009, at 11:23 AM, Thejas Nair wrote:

With a constraint that all scalar values in a tuple should fit into a single
buffer, the values will always have to be copied whenever a tuple's contents
need to be copied to a new tuple after a relational operation.

The overhead of copying is not large for numeric types compared to the
existing implementation, because we already copy the object references. But
it can be large overhead for chararray/bytearray data types that are long
enough.

To avoid this performance penalty, we should not require these larger
datatypes to be stored in the same buffer, and maybe follow the design in
current implementation for those, i.e. store them in java objects.
To prevent the bloating issue when 8byte chars are stored in String objects,
we can delay their conversion into String objects and store them like
bytearray until some String operation needs to be done. For any memory
intensive operations like join, we can store them again as bytearray.
I assume that in the current design you would be doing something similar
(treating chararray the same way as bytearray) until String operations need
to be done.
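
The delayed conversion described above could look roughly like the sketch below. LazyChararray and its methods are hypothetical names, and treating the raw bytes as UTF-8 is an assumption made for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper: keeps a chararray as raw bytes until a String is needed.
final class LazyChararray {
    private final byte[] rawBytes;  // bytes as read from the input, assumed UTF-8
    private String decoded;         // created only on the first String operation

    LazyChararray(byte[] rawBytes) {
        this.rawBytes = rawBytes;
    }

    // Length in bytes is available without creating a String.
    int byteLength() {
        return rawBytes.length;
    }

    boolean isDecoded() {
        return decoded != null;
    }

    // Decode on demand and cache the result for later String operations.
    String asString() {
        if (decoded == null) {
            decoded = new String(rawBytes, StandardCharsets.UTF_8);
        }
        return decoded;
    }

    // Drop the cached String before a memory-intensive operation such as join,
    // so that only the byte form is held.
    void release() {
        decoded = null;
    }
}
```

Operators that only move or compare raw bytes would never call asString(), so the String object is never allocated for them.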

Thanks,
Thejas




On 5/14/09 5:33 PM, "Alan Gates" <ga...@yahoo-inc.com> wrote:
http://wiki.apache.org/pig/PigMemory

Alan.
