Github user paul-rogers commented on the issue:
https://github.com/apache/drill/pull/1060
Always a good idea to suggest an alternative in addition to identifying
challenges. I wonder if the code can resolve the questions raised by taking a
somewhat different approach:
1. Rather than vectors pulling in data, create a new abstraction that does
so. It loads a heap buffer with values until either a) filling the buffer, or
b) reaching a defined value limit.
2. Once all vector buffers are full, make a pass over all columns to
determine how much will fit into the target vectors before overflow. (That is,
if the buffer holds 1K values, but only 200 of column C fits before overflow,
then 200 becomes the vector fill number.) Let's call this N.
3. Make another pass, copying N values from each buffer into the target
vectors. If that leaves values in the buffers, hold those values for the next
pass (by shifting them down or implementing a circular buffer.)
The above handles vector size limits. It has the advantage of turning
"Drill can't predict the future" (can't know value lengths until they are read)
into an exercise in examining the past: which values were read into buffers.
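To make steps 2 and 3 concrete, here is a minimal sketch in Java. It is not code from this PR or from Drill itself: `ColumnBuffer`, `valuesThatFit`, `computeN`, and `leftover` are hypothetical names, and each heap buffer is modeled only by the byte length of each buffered value. It assumes a single per-vector byte limit shared by all columns.

```java
import java.util.Arrays;
import java.util.List;

final class VectorFillSketch {

  /** Hypothetical stand-in for one column's heap buffer: only value lengths are tracked. */
  static final class ColumnBuffer {
    final int[] valueLengths;

    ColumnBuffer(int... valueLengths) {
      this.valueLengths = valueLengths;
    }

    /** Step 2, per column: how many leading values fit within vectorLimit bytes. */
    int valuesThatFit(int vectorLimit) {
      int total = 0;
      for (int i = 0; i < valueLengths.length; i++) {
        total += valueLengths[i];
        if (total > vectorLimit) {
          return i; // value i would overflow the vector, so only i values fit
        }
      }
      return valueLengths.length;
    }

    /** Step 3: lengths of the values held back for the next pass ("shifting them down"). */
    int[] leftover(int n) {
      return Arrays.copyOfRange(valueLengths, n, valueLengths.length);
    }
  }

  /** Step 2, across columns: N is the smallest per-column fit count. */
  static int computeN(List<ColumnBuffer> columns, int vectorLimit) {
    if (columns.isEmpty()) {
      return 0;
    }
    int n = Integer.MAX_VALUE;
    for (ColumnBuffer column : columns) {
      n = Math.min(n, column.valuesThatFit(vectorLimit));
    }
    return n;
  }
}
```

With a narrow column of four 4-byte values, a wide column of lengths 10, 20, 30, 40, and a 32-byte vector limit, the wide column fits only two values, so N is 2 and both columns carry two values forward to the next pass.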
It is a bit harder to imagine how to honor batch size limits since the
limit is computed over all vectors, not just one. There is the
crude-but-effective approach in step 2:
2. In the pass above to compute N (the number of values to copy into
vectors), go row by row, accumulating totals (as the "row set loader" does on
actual writes.) Once the limit is exceeded, back off by one to get N (the last
count that completely fits within a row.)
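As a rough illustration of that row-by-row pass, here is a sketch in Java. The name `rowsThatFit` and the `valueLengths[column][row]` layout are assumptions made for this example, not anything in the PR; the method only inspects buffered value lengths and moves no data.

```java
final class BatchLimitSketch {

  /**
   * Walks the buffered rows in order, accumulating the bytes each row adds
   * across all columns, and returns the number of complete rows that fit
   * within batchLimit. valueLengths is indexed as [column][row].
   */
  static int rowsThatFit(int[][] valueLengths, int rowCount, long batchLimit) {
    long total = 0;
    for (int row = 0; row < rowCount; row++) {
      for (int[] column : valueLengths) {
        total += column[row];
      }
      if (total > batchLimit) {
        return row; // this row overflowed; the rows before it are the ones that fit
      }
    }
    return rowCount;
  }
}
```

The cost is one addition per buffered value, which is exactly the per-row work in question, though without any copying into vectors.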
Now, since the above requires row-by-row calculations, it requires the very
work that this PR seeks to avoid (but does not do actual data transfers.)
The question, then, is this: is there a refinement to the above algorithm
that can be done non-iteratively (without doing a row-by-row check?)
And, with all the extra machinery needed to honor limits, will we still see
a significant speedup relative to a straightforward row-by-row read?
---