Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/1060

Always a good idea to suggest an alternative in addition to identifying challenges. I wonder if the code can resolve the questions raised by taking a somewhat different approach:

1. Rather than vectors pulling in data, create a new abstraction that does so. It loads a heap buffer with values until either a) filling the buffer, or b) reaching a defined value limit.
2. Once all the buffers are full, make a pass over all columns to determine how much will fit into the target vectors before overflow. (That is, if the buffer holds 1K values, but only 200 values of column C fit before overflow, then 200 becomes the vector fill number.) Let's call this N.
3. Make another pass, copying N values from each buffer into the target vectors. If that leaves values in the buffers, hold those values for the next pass (by shifting them down or implementing a circular buffer).

The above handles vector size limits. It has the advantage of turning "Drill can't predict the future" (can't know value lengths until they are read) into an exercise in examining the past: which values were read into buffers.

It is a bit harder to imagine how to honor batch size limits, since the limit is computed over all vectors, not just one. There is a crude-but-effective variation of step 2:

2. In the pass above to compute N (the number of values to copy into vectors), go row by row, accumulating totals (as the "row set loader" does on actual writes). Once the limit is reached, back off by one to get N (the last count for which every row fits completely).

Now, since the above requires row-by-row calculations, it requires the very work that this PR seeks to avoid (though it does no actual data transfers). The question, then, is this: is there a refinement to the above algorithm that can be done non-iteratively (without doing a row-by-row check)?
And, with all the extra machinery needed to honor limits, will we still see a significant speedup relative to a straightforward row-by-row read?
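
For concreteness, the counting passes described above could be sketched roughly as follows. This is only an illustration, not Drill code: the `FillPlanner` class and its method names are hypothetical, the buffered values are represented just by their byte lengths (`valueLengths[c][r]` is the length of buffered value `r` in column `c`), and both limits are assumed to be plain byte counts that ignore offset vectors and other overhead.

```java
import java.util.Arrays;

/** Hypothetical sketch of the buffer-then-copy planning passes; not Drill code. */
final class FillPlanner {

  /**
   * Step 2 (vector size limit): for each column, count how many buffered
   * values fit within vectorLimit bytes. N is the minimum across columns.
   */
  static int rowsWithinVectorLimit(int[][] valueLengths, int vectorLimit) {
    int n = Integer.MAX_VALUE;
    for (int[] column : valueLengths) {
      long used = 0;
      int fit = 0;
      while (fit < column.length && used + column[fit] <= vectorLimit) {
        used += column[fit];
        fit++;
      }
      n = Math.min(n, fit);
    }
    return n == Integer.MAX_VALUE ? 0 : n; // no columns: nothing to copy
  }

  /**
   * Step 2 variant (batch size limit): go row by row, accumulating the bytes
   * of every column, and stop at the first row that would exceed batchLimit.
   * maxRows is the N already computed from the per-vector pass.
   */
  static int rowsWithinBatchLimit(int[][] valueLengths, int maxRows, int batchLimit) {
    long total = 0;
    for (int row = 0; row < maxRows; row++) {
      long rowBytes = 0;
      for (int[] column : valueLengths) {
        rowBytes += column[row];
      }
      if (total + rowBytes > batchLimit) {
        return row; // the last row count that fits completely
      }
      total += rowBytes;
    }
    return maxRows;
  }

  /** Step 3 leftover: the values held for the next pass after copying n ("shift down"). */
  static int[] carryOver(int[] buffer, int n) {
    return Arrays.copyOfRange(buffer, n, buffer.length);
  }
}
```

With two columns whose buffered lengths are `{4, 4, 4, 4}` and `{10, 50, 200, 5}` and a vector limit of 100 bytes, the first pass gives N = 2, because the third value of the second column would overflow. The batch-limit pass is the row-by-row loop that the closing questions ask whether we can avoid; the sketch only inspects lengths and moves no data.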
---