If all scalar values in a tuple are required to fit into a single buffer, the values will always have to be copied whenever a tuple's contents need to be copied to a new tuple after a relational operation.
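To make the copying cost concrete, here is a minimal sketch in Java. The class names (TupleCopySketch, BufferTuple, ObjectTuple) are hypothetical and are not taken from Pig or from the PigMemory proposal; they only contrast a tuple whose values live in one buffer with a tuple that holds object references.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TupleCopySketch {
    // Hypothetical single-buffer tuple: every field value lives in one byte[].
    static final class BufferTuple {
        final byte[] buf;
        BufferTuple(byte[] buf) { this.buf = buf; }

        // Copying the tuple duplicates every byte of every field,
        // so the cost grows with the size of the data.
        BufferTuple copy() { return new BufferTuple(Arrays.copyOf(buf, buf.length)); }
    }

    // Hypothetical reference-based tuple: each field is a Java object.
    static final class ObjectTuple {
        final Object[] fields;
        ObjectTuple(Object... fields) { this.fields = fields; }

        // Copying the tuple duplicates only the reference array;
        // the field objects themselves are shared, whatever their size.
        ObjectTuple copy() { return new ObjectTuple(fields.clone()); }
    }

    public static void main(String[] args) {
        byte[] longValue = "a long chararray value".getBytes(StandardCharsets.UTF_8);

        BufferTuple b = new BufferTuple(longValue);
        BufferTuple bCopy = b.copy();
        System.out.println("buffer shared after copy: " + (b.buf == bCopy.buf));

        ObjectTuple o = new ObjectTuple(42, longValue);
        ObjectTuple oCopy = o.copy();
        System.out.println("field shared after copy: " + (o.fields[1] == oCopy.fields[1]));
    }
}
```

For a 4- or 8-byte numeric field the two approaches cost about the same per copy, but for a long byte array the buffer copy scales with the value's length while the reference copy does not.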
The overhead of copying is not large for numeric types compared to the existing implementation, because we already copy the object references. But it can be a large overhead for chararray/bytearray values that are long enough. To avoid this performance penalty, we should not require these larger data types to be stored in the same buffer, and could instead follow the design in the current implementation for them, i.e. store them in Java objects.

To prevent the bloating issue when 8-bit chars are stored in String objects, we can delay their conversion into String objects and store them like bytearray until some String operation needs to be done. For any memory-intensive operation like join, we can store them as bytearray again. I assume that in the current design you would be doing something similar (treating chararray the same way as bytearray) until String operations need to be done.

Thanks,
Thejas

On 5/14/09 5:33 PM, "Alan Gates" <ga...@yahoo-inc.com> wrote:

> http://wiki.apache.org/pig/PigMemory
>
> Alan.