[
https://issues.apache.org/jira/browse/PIG-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745007#comment-13745007
]
Jonathan Packer commented on PIG-3429:
--------------------------------------
Hi, the current patch now seems to pass every unit test except those that
use Tuple's append() method, which breaks. I have an idea for fixing this, but
wanted to wait for feedback to make sure I'm going in the right direction. I
know this changes some important classes, but I think the memory
improvements could especially help make Pig local mode more viable for
general-purpose use, as memory is more of an issue on laptops than on clusters.
My idea for fixing append() is to give the specialized tuple impls
an extra field "Tuple promotedTuple". This is null by default, so it only adds
8 bytes of overhead (still much cheaper than the ArrayList when it is unused).
If someone needs to append to the specialized tuple, the existing fields are
copied into a new default tuple stored in the "promotedTuple" field, and all
later calls are delegated to it. So there is a small overhead vs. the default
tuple when append is used, but in most cases, where append is not used, you
retain the memory savings of the specialized tuples. Does this seem like a
workable idea?
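
To make the proposal concrete, here is a rough sketch of the promotion idea. It
is not code from the patch: MiniTuple, ListBackedTuple and PromotableLongTuple
are hypothetical stand-ins for Pig's Tuple interface, the default tuple, and a
specialized single-long tuple, kept minimal so the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the small part of the Tuple interface used here.
interface MiniTuple {
    int size();
    Object get(int fieldNum);
    void append(Object val);
}

// Hypothetical stand-in for the ArrayList-backed default tuple.
class ListBackedTuple implements MiniTuple {
    private final List<Object> fields = new ArrayList<>();

    public int size() { return fields.size(); }
    public Object get(int fieldNum) { return fields.get(fieldNum); }
    public void append(Object val) { fields.add(val); }
}

// Sketch of a specialized single-long tuple with lazy promotion.
class PromotableLongTuple implements MiniTuple {
    private final long value;
    // Stays null until append() is first called, so the common case
    // pays only for one extra reference field.
    private MiniTuple promotedTuple;

    PromotableLongTuple(long value) { this.value = value; }

    public int size() {
        return promotedTuple == null ? 1 : promotedTuple.size();
    }

    public Object get(int fieldNum) {
        if (promotedTuple != null) {
            return promotedTuple.get(fieldNum); // proxy once promoted
        }
        if (fieldNum != 0) {
            throw new IndexOutOfBoundsException("field " + fieldNum);
        }
        return value; // boxed to Long
    }

    public void append(Object val) {
        if (promotedTuple == null) {
            // First append: copy the existing field into a general tuple.
            MiniTuple t = new ListBackedTuple();
            t.append(value);
            promotedTuple = t;
        }
        promotedTuple.append(val);
    }

    // Helper for illustration only.
    boolean isPromoted() { return promotedTuple != null; }
}
```

A tuple that is never appended to never allocates the list; one that is
appended to behaves like the default tuple from then on, at the cost of an
extra indirection per call.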
> Reduce Pig memory footprint using specialized tuple classes (complementary to
> SchemaTuple)
> ------------------------------------------------------------------------------------------
>
> Key: PIG-3429
> URL: https://issues.apache.org/jira/browse/PIG-3429
> Project: Pig
> Issue Type: Improvement
> Components: data
> Affects Versions: 0.12
> Reporter: Jonathan Packer
> Assignee: Jonathan Packer
> Attachments: PIG-3429-v1.diff, PIG-3429-v2.diff
>
>
> Pig's default tuple implementation is very memory inefficient for small
> tuples, as the minimum size of an empty tuple is 96 bytes. This leads to bags
> being spilled more often than they need to. SchemaTuple addresses this, but
> is not fully integrated into the PhysicalPlan pipeline (and seems like it
> would be difficult to do so). Furthermore, it is likely that almost all UDFs
> do not use SchemaTuple.
> This patch therefore provides some basic optimizations to reduce memory
> footprint of tuples by having BinSedesTupleFactory construct specialized
> tuple implementations in certain circumstances. This way, anything using
> BinSedesTupleFactory will reap the benefits, and since SchemaTuple uses a
> different factory, it will not be interfered with.
> There is a long description below, because this patch might break stuff. I
> tried to think through possible implementation hazards which I will list.
> The specialized tuple implementations are as follows:
> EmptyTuple           // no fields, just an object header = 8 bytes
> NullWrapperTuple     // wraps a single null field, 8 bytes
> CountingTuple        // replaces (1L) as initial output of COUNT, 8 bytes
> IntegerWrapperTuple  // these all wrap a single primitive field
> LongWrapperTuple     // object header + rounded primitive size = 16 bytes
> FloatWrapperTuple
> DoubleWrapperTuple
> BinSedesTuple2       // these are pair/triples of fields with no ArrayList
> BinSedesTuple3       // 16/24 bytes of overhead as opposed to 80 from ArrayList
> The memory savings are greatest for the algebraic math functions COUNT, SUM,
> etc. For example, the size of an intermediate tuple for COUNT should go from
> 112 bytes to 8 bytes. The size of an intermediate tuple from SUM should go
> from 112 bytes to 16 bytes.
> I haven't finished running the full unit-tests, but TestAlgebraicEval passes
> so I'm hopeful it will be manageable to debug.
> The three concerns that I have are:
> 1) Since TupleFactory now sometimes outputs non-appendable tuples, the
> isFixedSize() method had to be removed. A file search didn't show it being
> used anywhere though. I think appending to tuples instead of finding out the
> requisite size ahead of time is bad practice as well (I changed POForeach to
> do the latter so it can take advantage of the special tuple impls).
> 2) Also since TupleFactory now has multiple tuple types, the tupleClass()
> method gets tricky. I made a superclass GenericBinSedesTuple that all the
> specialized classes inherit from, and it seems to work, but I'm not sure what
> the implications of this are. It breaks the inheritance tree of AbstractTuple
> <-- DefaultTuple <-- BinSedesTuple, so now "DefaultBinSedesTuple" inherits
> directly from GenericBinSedesTuple and DefaultTuple is left unused. In the
> patch, all the stuff for DefaultBinSedesTuple is just copied over from the
> old DefaultTuple.
> 3) I tried to be careful not to break BinInterSedesTupleRawComparator, but
> this will need verification.
> Finally,
> 4) For my personal use cases, I'd like to make custom tuple implementations
> like SparseMatrixTuple or FeatureVectorTuple. Would people be opposed to
> making some "hooks" in BinInterSedes for user-defined tuple types? I was
> thinking there could be some config which maps these hooks (data type bytes)
> to user-defined classes and uses reflection to instantiate and read them. Not
> sure if that would be performant though.
> Thanks for reading all that!