[
https://issues.apache.org/jira/browse/PIG-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743148#comment-13743148
]
Jonathan Packer commented on PIG-3429:
--------------------------------------
Posted to ReviewBoard: https://reviews.apache.org/r/13630/
> Reduce Pig memory footprint using specialized tuple classes (complementary to
> SchemaTuple)
> ------------------------------------------------------------------------------------------
>
> Key: PIG-3429
> URL: https://issues.apache.org/jira/browse/PIG-3429
> Project: Pig
> Issue Type: Improvement
> Components: data
> Affects Versions: 0.12
> Reporter: Jonathan Packer
> Attachments: PIG-3429-v1.diff
>
>
> Pig's default tuple implementation is very memory-inefficient for small
> tuples: the minimum size of an empty tuple is 96 bytes. This causes bags to
> be spilled more often than necessary. SchemaTuple addresses this, but it is
> not fully integrated into the PhysicalPlan pipeline (and fully integrating it
> looks difficult). Furthermore, it is likely that almost all UDFs do not use
> SchemaTuple.
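> As a quick way to see the per-tuple estimate Pig uses when sizing bags for
> spilling, one can ask a tuple for its own size. Tuple.getMemorySize() is the
> existing API; the exact number reported depends on the Pig version and JVM:
>
>     import org.apache.pig.data.Tuple;
>     import org.apache.pig.data.TupleFactory;
>
>     public class EmptyTupleSize {
>         public static void main(String[] args) {
>             // An empty tuple from the default factory is ArrayList-backed.
>             Tuple t = TupleFactory.getInstance().newTuple();
>             System.out.println("estimated bytes: " + t.getMemorySize());
>         }
>     }
>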
> This patch therefore provides some basic optimizations to reduce the memory
> footprint of tuples by having BinSedesTupleFactory construct specialized
> tuple implementations in certain circumstances. This way, anything using
> BinSedesTupleFactory reaps the benefits, while SchemaTuple, which uses a
> different factory, is left untouched.
> The description below is long because this patch might break things; I have
> tried to think through the possible implementation hazards and list them.
> The specialized tuple implementations are as follows:
> EmptyTuple // no fields, just an object header = 8 bytes
> NullWrapperTuple // wraps a single null field, 8 bytes
> CountingTuple // replaces (1L) as initial output of COUNT, 8 bytes
> IntegerWrapperTuple // these all wrap a single primitive field
> LongWrapperTuple // object header + rounded primitive size = 16 bytes
> FloatWrapperTuple
> DoubleWrapperTuple
> BinSedesTuple2 // these are pairs/triples of fields stored without an ArrayList
> BinSedesTuple3 // 16/24 bytes of overhead as opposed to 80 for the ArrayList
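> To make the layout difference concrete, here is a plain-Java illustration.
> These are not the patch classes (the real ones implement Pig's Tuple
> interface and BinSedes serialization); the names and the mini factory are
> ad hoc, but the field layouts match the idea described above:
>
>     import java.util.ArrayList;
>     import java.util.List;
>
>     interface MiniTuple {
>         Object get(int i);
>         int size();
>     }
>
>     // General case: a growable tuple pays for the ArrayList, its backing
>     // Object[], and one boxed object per field.
>     class ListBackedTuple implements MiniTuple {
>         private final List<Object> fields = new ArrayList<Object>();
>         void append(Object v)    { fields.add(v); }
>         public Object get(int i) { return fields.get(i); }
>         public int size()        { return fields.size(); }
>     }
>
>     // Specialized case: a single long stored inline, roughly one object
>     // header plus the rounded primitive (~16 bytes), with no boxing.
>     class LongWrapper implements MiniTuple {
>         private final long value;
>         LongWrapper(long value) { this.value = value; }
>         public Object get(int i) {
>             if (i != 0) throw new IndexOutOfBoundsException("field " + i);
>             return Long.valueOf(value);
>         }
>         public int size() { return 1; }
>     }
>
>     // A factory can pick the compact layout whenever the shape is known up
>     // front, which is roughly what the patched BinSedesTupleFactory does.
>     class MiniFactory {
>         static MiniTuple newSingleFieldTuple(Object datum) {
>             if (datum instanceof Long) {
>                 return new LongWrapper(((Long) datum).longValue());
>             }
>             ListBackedTuple t = new ListBackedTuple();
>             t.append(datum);
>             return t;
>         }
>     }
>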
> The memory savings are greatest for the algebraic math functions COUNT, SUM,
> etc. For example, an intermediate tuple for COUNT is currently a single boxed
> Long inside a default tuple (96 + 16 = 112 bytes); with CountingTuple it
> should drop from 112 bytes to 8 bytes. The intermediate tuple for SUM should
> likewise go from 112 bytes to 16 bytes with LongWrapperTuple.
> I haven't finished running the full unit tests, but TestAlgebraicEval passes,
> so I'm hopeful the rest will be manageable to debug.
> The three concerns that I have are:
> 1) Since TupleFactory now sometimes returns non-appendable tuples, the
> isFixedSize() method had to be removed. A file search didn't show it being
> used anywhere, though. I also think appending to tuples, rather than
> determining the required size ahead of time, is bad practice (I changed
> POForEach to size its tuples up front so it can take advantage of the
> special tuple implementations).
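> For concreteness, the two construction styles look like this (TupleFactory
> and Tuple are the existing org.apache.pig.data classes; the sized style is
> what lets the factory hand back a fixed-size specialized implementation):
>
>     import org.apache.pig.backend.executionengine.ExecException;
>     import org.apache.pig.data.Tuple;
>     import org.apache.pig.data.TupleFactory;
>
>     public class TupleConstructionStyles {
>         public static void demo(Object value) throws ExecException {
>             TupleFactory tf = TupleFactory.getInstance();
>
>             // Append style: the factory must return a growable,
>             // ArrayList-backed tuple because the final size is unknown.
>             Tuple grown = tf.newTuple();
>             grown.append(value);
>
>             // Sized style: the field count is known up front, so a
>             // non-appendable specialized tuple can be used.
>             Tuple sized = tf.newTuple(1);
>             sized.set(0, value);
>         }
>     }
>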
> 2) Also, since TupleFactory now produces multiple tuple types, the
> tupleClass() method gets tricky. I made a superclass, GenericBinSedesTuple,
> that all the specialized classes inherit from, and it seems to work, but I'm
> not sure what the implications are. It breaks the inheritance tree of
> AbstractTuple <-- DefaultTuple <-- BinSedesTuple: "DefaultBinSedesTuple" now
> inherits directly from GenericBinSedesTuple and DefaultTuple is left unused.
> In the patch, the code for DefaultBinSedesTuple is simply copied over from
> the old DefaultTuple.
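> To summarize the hierarchy change (names as used in the patch):
>
>     before:  AbstractTuple <-- DefaultTuple <-- BinSedesTuple
>     after:   GenericBinSedesTuple <-- DefaultBinSedesTuple
>              GenericBinSedesTuple <-- EmptyTuple, LongWrapperTuple,
>                                       BinSedesTuple2, ... (the specialized
>                                       classes listed above)
>              DefaultTuple is left unused
>
> tupleClass() on the factory can then return the one common superclass,
> GenericBinSedesTuple.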
> 3) I tried to be careful not to break BinInterSedesTupleRawComparator, but
> this will need verification.
> Finally,
> 4) For my personal use cases, I'd like to make custom tuple implementations
> like SparseMatrixTuple or FeatureVectorTuple. Would people be opposed to
> making some "hooks" in BinInterSedes for user-defined tuple types? I was
> thinking there could be a config setting which maps these hooks (data type
> bytes) to user-defined classes and uses reflection to instantiate and read
> them. I'm not sure whether that would be performant, though.
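> Purely as a sketch of what I mean (none of these names exist in Pig today;
> everything here is hypothetical), the hook could be a small registry keyed by
> the data type byte, with the class instantiated via reflection on read:
>
>     import java.util.HashMap;
>     import java.util.Map;
>     import org.apache.pig.data.Tuple;
>
>     // Hypothetical registry: maps a custom data-type byte to a user-supplied
>     // Tuple class, e.g. populated from a made-up config property such as
>     // pig.custom.tuple.types=0x70:com.example.SparseMatrixTuple
>     class CustomTupleRegistry {
>         private final Map<Byte, Class<? extends Tuple>> types =
>                 new HashMap<Byte, Class<? extends Tuple>>();
>
>         void register(byte typeByte, Class<? extends Tuple> cls) {
>             types.put(typeByte, cls);
>         }
>
>         Tuple newInstance(byte typeByte) throws Exception {
>             Class<? extends Tuple> cls = types.get(typeByte);
>             if (cls == null) {
>                 throw new IllegalArgumentException("unknown tuple type byte: " + typeByte);
>             }
>             // Reflection on every read is the performance worry; caching the
>             // Constructor (or requiring a factory interface) would help.
>             return cls.newInstance();
>         }
>     }
>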
> Thanks for reading all that!