Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/1060
  
    Always a good idea to suggest an alternative in addition to identifying 
challenges. I wonder if the code can resolve the questions raised by taking a 
somewhat different approach:
    
    1. Rather than vectors pulling in data, create a new abstraction that does 
so. It loads a heap buffer with values until it either a) fills the buffer or 
b) reaches a defined value limit.
    2. Once all the staging buffers are full, make a pass over all columns to 
determine how many values will fit into the target vectors before overflow. 
(That is, if the buffer holds 1K values, but only 200 of column C fit before 
overflow, then 200 becomes the vector fill number.) Let's call this N.
    3. Make another pass, copying N values from each buffer into the target 
vectors. If that leaves values in the buffers, hold those values for the next 
pass (by shifting them down or implementing a circular buffer). A sketch of 
steps 2 and 3 appears after this list.
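
    Here is a minimal sketch of steps 2 and 3 in Java. Everything in it 
(ColumnBuffer, Vector, computeFillCount, copyAndRetain) is a hypothetical 
name invented for illustration, not an API from this PR or from Drill:

    ```java
    import java.util.List;

    // Hypothetical sketch of steps 2 and 3. ColumnBuffer stages values read
    // from the source; Vector stands in for a value vector. Assumes the two
    // lists are non-empty and parallel (one buffer per target vector).
    class OverflowAwareCopier {

      interface ColumnBuffer {
        int stagedValueCount();            // values currently staged
        long bytesForFirst(int n);         // total bytes of the first n values
        void copyTo(Vector target, int n); // copy the first n values out
        void compact(int n);               // shift remaining values down
      }

      interface Vector {
        long remainingCapacityBytes();
      }

      // Step 2: N is the largest count that fits in every target vector.
      static int computeFillCount(List<ColumnBuffer> buffers,
                                  List<Vector> targets) {
        int n = Integer.MAX_VALUE;
        for (int i = 0; i < buffers.size(); i++) {
          ColumnBuffer buf = buffers.get(i);
          long capacity = targets.get(i).remainingCapacityBytes();
          // Binary search for the largest staged prefix that fits.
          int lo = 0, hi = buf.stagedValueCount();
          while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (buf.bytesForFirst(mid) <= capacity) {
              lo = mid;
            } else {
              hi = mid - 1;
            }
          }
          n = Math.min(n, lo);
        }
        return n;
      }

      // Step 3: copy N values per column; retain overflow for the next pass.
      static void copyAndRetain(List<ColumnBuffer> buffers,
                                List<Vector> targets, int n) {
        for (int i = 0; i < buffers.size(); i++) {
          buffers.get(i).copyTo(targets.get(i), n);
          buffers.get(i).compact(n);
        }
      }
    }
    ```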
    
    The above handles vector size limits. It has the advantage of turning 
"Drill can't predict the future" (can't know value lengths until they are read) 
into an exercise in examining the past: which values were read into buffers.
    
    It is a bit harder to imagine how to honor batch size limits, since the 
limit is computed over all vectors, not just one. One crude-but-effective 
option is to refine step 2:
    
    2. In the pass above to compute N (the number of values to copy into 
vectors), go row by row, accumulating totals (as the "row set loader" does on 
actual writes). Once the limit is exceeded, back off by one row to get N (the 
last row count that fits completely within the limit).
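
    To make that row-by-row accounting concrete, here is a minimal sketch. It 
assumes a hypothetical helper, rowBytes(r), that reports the total byte size 
of staged row r summed across all columns:

    ```java
    import java.util.function.IntToLongFunction;

    // Hypothetical sketch of the row-by-row budget check. rowBytes is a
    // stand-in for per-row size accounting; it is not an actual Drill API.
    class BatchBudget {

      static int computeRowLimit(int stagedRows, long batchByteLimit,
                                 IntToLongFunction rowBytes) {
        long total = 0;
        for (int row = 0; row < stagedRows; row++) {
          total += rowBytes.applyAsLong(row);
          if (total > batchByteLimit) {
            return row;  // back off by one: rows [0, row) fit completely
          }
        }
        return stagedRows;  // everything staged fits within the limit
      }
    }
    ```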
    
    Now, since the above requires row-by-row calculations, it reintroduces the 
very work that this PR seeks to avoid (though without doing the actual data 
transfers).
    
    The question, then, is this: is there a refinement to the above algorithm 
that can be done non-iteratively (without a row-by-row check)?
    
    And, with all the extra machinery needed to honor limits, will we still see 
a significant speedup relative to a straightforward row-by-row read?
    


