We definitely do not want to follow the current design of keeping
chararrays and bytearrays as separate objects. It is that overhead of
an object for each field that we are trying to avoid.
The reason for constraining a tuple to store its data in one
TupleBuffer is to limit the size of the Tuple object. If the data can
span TupleBuffers, then a Tuple has to a TupleBuffer reference for
each field, almost doubling the size of a Tuple.
It is worth noting that a number of pig operators (sort, filter,
distinct) need not copy any data. Some have to copy (foreach, though
this could be optimized not to in certain cases such as simple
projections), and some could get away without copying if tuples could
reference fields across buffers (join, union, cross).
I think we should do some experimentation with this and see:
1) What is the memory saving from moving from objects to buffers, ie
is this worth it at all.
2) What is the additional memory cost of storing a TupleBuffer
reference per field.
3) What is the performance penalty of copying data on joins, unions,
Once these are known it should be easier to make trade off decisions.
There is one other option. I had said that it would be better to have
a few very large (on the order of 10M) buffers. The reasons I
considered that was that I didn't want so many buffers themselves that
managing them became a burden on the system, and that we have to
somehow handle the case of chararray, bytearray, or map fields that
won't fit in a single TupleBuffer (assumably by storing those as an
object instead of in the buffer). The larger we make the buffers the
less we encounter this issue.
Instead of using large buffers we could use smaller ones. If we
capped the size of a buffer at 65K and the number of buffers a single
tuple could reference at 65K, then a tuple could still see 4G of
memory but still only use 4 bytes per field to point to the data.
This way join operations could be done without copying the data. This
option should be experimented with as well. It may be that using
smaller buffers is better since the cost of reading and writing them
on disk will be less.
On May 15, 2009, at 11:23 AM, Thejas Nair wrote:
With a constraint that all scalar values in a tuple should fit into
buffer, the values will always have to be copied whenever a tuple
need to be copied to a new tuple after a relational operation.
The overhead of copying is not large for numeric types compared to the
existing implementation, because we already copy the object
it can be large overhead for chararray/bytearray data types that
To avoid this performance penalty, we should not require these larger
datatypes to be stored in the same buffer, and maybe follow the
current implemenation for those, ie store them in java objects.
To prevent the bloating issue when 8byte chars are stored in String
we can delay their conversion into String objects and store them like
bytearray until some String operation needs to be done. For any memory
intensive operations like join, we can store them again as bytearray.
I assume that in the current design you would be doing something
(treating chararray the same way as bytearray) until String
to be done.
On 5/14/09 5:33 PM, "Alan Gates" <ga...@yahoo-inc.com> wrote: