[ 
https://issues.apache.org/jira/browse/PIG-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743148#comment-13743148
 ] 

Jonathan Packer commented on PIG-3429:
--------------------------------------

Posted to ReviewBoard: https://reviews.apache.org/r/13630/
                
> Reduce Pig memory footprint using specialized tuple classes (complementary to 
> SchemaTuple)
> ------------------------------------------------------------------------------------------
>
>                 Key: PIG-3429
>                 URL: https://issues.apache.org/jira/browse/PIG-3429
>             Project: Pig
>          Issue Type: Improvement
>          Components: data
>    Affects Versions: 0.12
>            Reporter: Jonathan Packer
>         Attachments: PIG-3429-v1.diff
>
>
> Pig's default tuple implementation is memory-inefficient for small tuples: 
> even an empty tuple takes at least 96 bytes. As a result, bags are spilled 
> more often than necessary. SchemaTuple addresses this, but it is not fully 
> integrated into the PhysicalPlan pipeline (and integrating it looks 
> difficult). Furthermore, it is likely that almost all UDFs do not use 
> SchemaTuple.
> This patch therefore provides some basic optimizations to reduce memory 
> footprint of tuples by having BinSedesTupleFactory construct specialized 
> tuple implementations in certain circumstances. This way, anything using 
> BinSedesTupleFactory will reap the benefits, and since SchemaTuple uses a 
> different factory, it will not be interfered with.
> The description below is long because this patch might break things. I 
> tried to think through the possible implementation hazards and list them 
> after the overview.
> The specialized tuple implementations are as follows:
> EmptyTuple          // no fields, just an object header = 8 bytes
> NullWrapperTuple    // wraps a single null field, 8 bytes
> CountingTuple       // replaces (1L) as initial output of COUNT, 8 bytes
> IntegerWrapperTuple // these all wrap a single primitive field
> LongWrapperTuple    // object header + rounded primitive size = 16 bytes
> FloatWrapperTuple
> DoubleWrapperTuple
> BinSedesTuple2      // these are pairs/triples of fields with no ArrayList
> BinSedesTuple3      // 16/24 bytes of overhead as opposed to 80 from ArrayList
> The memory savings are greatest for the algebraic math functions COUNT, SUM, 
> etc. For example, the size of an intermediate tuple for COUNT should go from 
> 112 bytes to 8 bytes. The size of an intermediate tuple from SUM should go 
> from 112 bytes to 16 bytes.
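
To illustrate the idea of a factory that hands out specialized tuples, here is a minimal, self-contained sketch. It is not taken from the patch: `TupleSketch`, `MiniTuple`, and the `*MiniTuple` classes are hypothetical stand-ins, not Pig's `Tuple`, `TupleFactory`, or `BinSedesTupleFactory`. It only shows the selection logic (empty, single long, pair, general fallback) and why a primitive field avoids both the boxed object and the backing list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for illustration; none of these are Pig classes.
final class TupleSketch {

    /** Minimal read-only tuple contract used only by this sketch. */
    interface MiniTuple {
        int size();
        Object get(int index);
    }

    /** No fields at all, so a single shared instance is enough. */
    static final class EmptyMiniTuple implements MiniTuple {
        static final EmptyMiniTuple INSTANCE = new EmptyMiniTuple();
        private EmptyMiniTuple() {}
        public int size() { return 0; }
        public Object get(int index) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }

    /** Holds one long as a primitive field instead of a boxed Long in a list. */
    static final class LongMiniTuple implements MiniTuple {
        private final long value;
        LongMiniTuple(long value) { this.value = value; }
        public int size() { return 1; }
        public Object get(int index) {
            if (index != 0) throw new IndexOutOfBoundsException("index " + index);
            return value; // boxed only when a caller asks for it
        }
    }

    /** Two fields held directly, with no backing ArrayList. */
    static final class PairMiniTuple implements MiniTuple {
        private final Object first;
        private final Object second;
        PairMiniTuple(Object first, Object second) {
            this.first = first;
            this.second = second;
        }
        public int size() { return 2; }
        public Object get(int index) {
            if (index == 0) return first;
            if (index == 1) return second;
            throw new IndexOutOfBoundsException("index " + index);
        }
    }

    /** General fallback backed by a list, for every other shape. */
    static final class ListMiniTuple implements MiniTuple {
        private final List<Object> fields;
        ListMiniTuple(List<?> fields) { this.fields = new ArrayList<>(fields); }
        public int size() { return fields.size(); }
        public Object get(int index) { return fields.get(index); }
    }

    /** Picks the most compact representation for the given fields. */
    static MiniTuple newTuple(List<?> fields) {
        if (fields.isEmpty()) {
            return EmptyMiniTuple.INSTANCE;
        }
        if (fields.size() == 1 && fields.get(0) instanceof Long) {
            return new LongMiniTuple((Long) fields.get(0));
        }
        if (fields.size() == 2) {
            return new PairMiniTuple(fields.get(0), fields.get(1));
        }
        return new ListMiniTuple(fields);
    }

    public static void main(String[] args) {
        System.out.println(newTuple(Collections.emptyList()).getClass().getSimpleName());
        System.out.println(newTuple(Arrays.asList(5L)).getClass().getSimpleName());
        System.out.println(newTuple(Arrays.asList("a", 2)).getClass().getSimpleName());
    }
}
```

Note that the specialized tuples in this sketch cannot grow, which is the same trade-off raised in concern 1 below: callers have to know the field count before asking the factory for a tuple.
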
> I haven't finished running the full unit test suite, but TestAlgebraicEval 
> passes, so I'm hopeful that any failures will be manageable to debug.
> The three concerns that I have are:
> 1) Since TupleFactory now sometimes outputs non-appendable tuples, the 
> isFixedSize() method had to be removed. A file search didn't show it being 
> used anywhere though. I think appending to tuples instead of finding out the 
> requisite size ahead of time is bad practice as well (I changed POForeach to 
> do the latter so it can take advantage of the special tuple impls).
> 2) Also since TupleFactory now has multiple tuple types, the tupleClass() 
> method gets tricky. I made a superclass GenericBinSedesTuple that all the 
> specialized classes inherit from, and it seems to work, but I'm not sure what 
> the implications of this are. It breaks the inheritance tree of AbstractTuple 
> <-- DefaultTuple <-- BinSedesTuple, so now "DefaultBinSedesTuple" inherits 
> directly from GenericBinSedesTuple and DefaultTuple is left unused. In the 
> patch, all the stuff for DefaultBinSedesTuple is just copied over from the 
> old DefaultTuple.
> 3) I tried to be careful not to break BinInterSedesTupleRawComparator, but 
> this will need verification.
> Finally,
> 4) For my personal use cases, I'd like to make custom tuple implementations 
> like SparseMatrixTuple or FeatureVectorTuple. Would people be opposed to 
> making some "hooks" in BinInterSedes for user-defined tuple types? I was 
> thinking there could be some config which maps these hooks (data type bytes) 
> to user-defined classes and uses reflection to instantiate and read them. Not 
> sure if that would be performant though.
> Thanks for reading all that!
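
The "hooks" idea in point 4 could look roughly like the following sketch, which maps a data type byte to a user-defined class and instantiates it reflectively. Everything here is an assumption made for illustration: `CustomTupleRegistry`, `ReadableTuple`, `FeatureVectorSketch`, and the wire layout (type byte, then length, then doubles) are hypothetical and are not part of BinInterSedes or the patch. The no-arg constructor is looked up once at registration and cached, so the per-record cost is a map lookup plus `Constructor.newInstance()`; whether that is fast enough for Pig would need measuring.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.lang.reflect.Constructor;
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry for illustration; not part of Pig's BinInterSedes.
final class CustomTupleRegistry {

    /** Contract a user-defined tuple type would implement in this sketch. */
    interface ReadableTuple {
        void readFields(DataInput in) throws IOException;
    }

    // Constructors are resolved once at registration time and reused per record.
    private final Map<Byte, Constructor<? extends ReadableTuple>> constructors =
            new HashMap<>();

    /** Associates a type byte with a class name, e.g. taken from configuration. */
    void register(byte typeByte, String className) throws ReflectiveOperationException {
        Class<? extends ReadableTuple> cls =
                Class.forName(className).asSubclass(ReadableTuple.class);
        constructors.put(typeByte, cls.getDeclaredConstructor());
    }

    /** Reads the type byte, instantiates the mapped class, and lets it read itself. */
    ReadableTuple read(DataInput in) throws IOException, ReflectiveOperationException {
        byte typeByte = in.readByte();
        Constructor<? extends ReadableTuple> ctor = constructors.get(typeByte);
        if (ctor == null) {
            throw new IOException("No tuple class registered for type byte " + typeByte);
        }
        ReadableTuple tuple = ctor.newInstance();
        tuple.readFields(in);
        return tuple;
    }

    /** Example user-defined type: a dense vector of doubles. */
    static final class FeatureVectorSketch implements ReadableTuple {
        double[] values = new double[0];

        public FeatureVectorSketch() {}

        @Override
        public void readFields(DataInput in) throws IOException {
            int n = in.readInt();
            values = new double[n];
            for (int i = 0; i < n; i++) {
                values[i] = in.readDouble();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        byte featureVectorType = 42; // arbitrary type byte chosen for this demo

        // Serialize: type byte, element count, then the elements.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeByte(featureVectorType);
        out.writeInt(2);
        out.writeDouble(1.5);
        out.writeDouble(2.5);
        out.flush();

        CustomTupleRegistry registry = new CustomTupleRegistry();
        registry.register(featureVectorType, FeatureVectorSketch.class.getName());

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        FeatureVectorSketch vector = (FeatureVectorSketch) registry.read(in);
        System.out.println(vector.values.length + " values read");
    }
}
```
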

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
