We definitely do not want to follow the current design of keeping chararrays and bytearrays as separate objects. It is that overhead of an object for each field that we are trying to avoid.

The reason for constraining a tuple to store its data in one TupleBuffer is to limit the size of the Tuple object. If the data can span TupleBuffers, then a Tuple has to a TupleBuffer reference for each field, almost doubling the size of a Tuple.

It is worth noting that a number of pig operators (sort, filter, distinct) need not copy any data. Some have to copy (foreach, though this could be optimized not to in certain cases such as simple projections), and some could get away without copying if tuples could reference fields across buffers (join, union, cross).

I think we should do some experimentation with this and see:

1) What is the memory saving from moving from objects to buffers, ie is this worth it at all. 2) What is the additional memory cost of storing a TupleBuffer reference per field. 3) What is the performance penalty of copying data on joins, unions, etc.

Once these are known it should be easier to make trade off decisions.

There is one other option. I had said that it would be better to have a few very large (on the order of 10M) buffers. The reasons I considered that was that I didn't want so many buffers themselves that managing them became a burden on the system, and that we have to somehow handle the case of chararray, bytearray, or map fields that won't fit in a single TupleBuffer (assumably by storing those as an object instead of in the buffer). The larger we make the buffers the less we encounter this issue.

Instead of using large buffers we could use smaller ones. If we capped the size of a buffer at 65K and the number of buffers a single tuple could reference at 65K, then a tuple could still see 4G of memory but still only use 4 bytes per field to point to the data. This way join operations could be done without copying the data. This option should be experimented with as well. It may be that using smaller buffers is better since the cost of reading and writing them on disk will be less.

Alan.


On May 15, 2009, at 11:23 AM, Thejas Nair wrote:

With a constraint that all scalar values in a tuple should fit into a single buffer, the values will always have to be copied whenever a tuple contents
need to be copied to a new tuple after a relational operation.

The overhead of copying is not large for numeric types compared to the
existing implementation, because we already copy the object references. But it can be large overhead for chararray/bytearray data types that are long
enough.

To avoid this performance penalty, we should not require these larger
datatypes to be stored in the same buffer, and maybe follow the design in
current implemenation for those, ie store them in java objects.
To prevent the bloating issue when 8byte chars are stored in String objects,
we can delay their conversion into String objects and store them like
bytearray until some String operation needs to be done. For any memory
intensive operations like join, we can store them again as bytearray.
I assume that in the current design you would be doing something similar (treating chararray the same way as bytearray) until String operations need
to be done.

Thanks,
Thejas




On 5/14/09 5:33 PM, "Alan Gates" <ga...@yahoo-inc.com> wrote:

http://wiki.apache.org/pig/PigMemory

Alan.


Reply via email to