Alex Levenson created PARQUET-33:
------------------------------------

             Summary: Benchmark the assembly of thrift objects, and possibly create a more efficient ReplayingTProtocol
                 Key: PARQUET-33
                 URL: https://issues.apache.org/jira/browse/PARQUET-33
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-mr
            Reporter: Alex Levenson
            Priority: Minor


The current implementation of parquet thrift creates an instance of TProtocol 
for each value of each record and builds a stack of these events, which are 
then replayed back to the TBase.

I'd be curious to benchmark this, and if it's slow, try building a 
"ReplayingTProtocol" that instead of having a stack of TProtocol instances, 
contains a primitive array of each type. As events are fed into this replaying 
TProtocol, it would just add these primitives to its buffers, and then the 
TBase would drain them. This would effectively let us stream the values into 
the TBase without making an object allocation for each value.
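A minimal sketch of the buffering idea, to make it concrete. Everything here
is hypothetical: PrimitiveEventBuffer and its methods are illustration names,
not existing parquet-mr or Thrift classes, and only three value types are
shown. It assumes the TBase reads values in the same order the record assembly
wrote them, so one FIFO array per primitive type is enough to preserve
ordering.

```java
import java.util.Arrays;

// Hypothetical illustration only: not part of parquet-mr or Thrift.
// Values are appended to per-type primitive arrays instead of being
// wrapped in one object per value.
final class PrimitiveEventBuffer {
    private int[] ints = new int[16];
    private long[] longs = new long[16];
    private double[] doubles = new double[16];
    private int intWrite, intRead;
    private int longWrite, longRead;
    private int doubleWrite, doubleRead;

    // Producer side: called by the record assembly for each value.
    void writeI32(int v) {
        if (intWrite == ints.length) ints = Arrays.copyOf(ints, ints.length * 2);
        ints[intWrite++] = v;
    }

    void writeI64(long v) {
        if (longWrite == longs.length) longs = Arrays.copyOf(longs, longs.length * 2);
        longs[longWrite++] = v;
    }

    void writeDouble(double v) {
        if (doubleWrite == doubles.length) doubles = Arrays.copyOf(doubles, doubles.length * 2);
        doubles[doubleWrite++] = v;
    }

    // Consumer side: called while the TBase drains the buffered values.
    int readI32() {
        if (intRead == intWrite) throw new IllegalStateException("no buffered i32");
        return ints[intRead++];
    }

    long readI64() {
        if (longRead == longWrite) throw new IllegalStateException("no buffered i64");
        return longs[longRead++];
    }

    double readDouble() {
        if (doubleRead == doubleWrite) throw new IllegalStateException("no buffered double");
        return doubles[doubleRead++];
    }

    boolean hasRemaining() {
        return intRead < intWrite || longRead < longWrite || doubleRead < doubleWrite;
    }

    // Reuse the same arrays for the next record, so steady state allocates nothing.
    void reset() {
        intWrite = intRead = 0;
        longWrite = longRead = 0;
        doubleWrite = doubleRead = 0;
    }
}
```

A real ReplayingTProtocol would also need to buffer field headers, struct
boundaries, and binary/string values; those are left out of this sketch.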

The buffers could be set to a fixed size, and if they fill up (which they 
shouldn't in most cases), the TBase could begin draining the protocol until it 
is empty again, at which point the TProtocol can block the TBase from draining 
further while the Parquet record assembly feeds it more events.
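The fixed-size variant could be sketched as a small ring buffer. Again,
BoundedIntRing is a hypothetical name used only for illustration, and only the
int case is shown. Instead of truly blocking, this sketch has offer() return
false when the buffer is full and poll() fail when it is empty; how control
would actually pass between the record assembly and the TBase (a second thread,
or some pull-based callback) is an open design question that this does not
answer.

```java
// Hypothetical illustration only: a fixed-capacity FIFO of ints.
final class BoundedIntRing {
    private final int[] slots;
    private int head; // index of the next value to read
    private int size; // number of buffered values

    BoundedIntRing(int capacity) {
        if (capacity <= 0) throw new IllegalArgumentException("capacity must be positive");
        slots = new int[capacity];
    }

    // Producer side: returns false when full, signalling that the
    // consumer has to drain before more events can be fed in.
    boolean offer(int v) {
        if (size == slots.length) return false;
        slots[(head + size) % slots.length] = v;
        size++;
        return true;
    }

    // Consumer side: fails when empty, signalling that the producer
    // has to feed more events before draining can continue.
    int poll() {
        if (size == 0) throw new IllegalStateException("buffer is empty");
        int v = slots[head];
        head = (head + 1) % slots.length;
        size--;
        return v;
    }

    boolean isEmpty() { return size == 0; }

    boolean isFull() { return size == slots.length; }
}
```

With a capacity of 2, a third offer() is rejected until a poll() frees a slot,
which is the fill-then-drain handoff described above.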

This is all moot if it turns out not to be a bottleneck though :)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
