paul-rogers commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585341741
 
 
   I'm a bit confused by the crash at the 16MB mark, and the problem description 
is vague. Is there a stack trace somewhere?
   
   EVF is designed to limit individual vectors to 16MB. Once you hit that size, 
EVF does an "overflow" move: it copies the last record (the one that does not 
fit) into a new batch, then tells you to return the now-full batch.
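   
   To make the overflow move concrete, here is a minimal toy model in Java. It 
is a sketch only: `ToyLoader`, `writeRow`, `harvest` and `readAll` are 
hypothetical names invented for illustration, not EVF classes, and string 
length stands in for vector size in bytes. It also assumes every single value 
fits within the limit.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the overflow behavior described above; NOT Drill's EVF code. */
class OverflowSketch {

    /** Hypothetical stand-in for a loader that owns one variable-width vector. */
    static final class ToyLoader {
        private final int limit;                       // stands in for the 16MB vector limit
        private List<String> current = new ArrayList<>();
        private List<String> lookAhead = new ArrayList<>();
        private int currentSize = 0;
        private boolean full = false;

        ToyLoader(int limit) { this.limit = limit; }

        boolean isFull() { return full; }

        /** Writes one row; a row that does not fit is moved to the look-ahead batch. */
        void writeRow(String value) {
            if (full) {
                throw new IllegalStateException("batch is full; harvest it first");
            }
            if (currentSize + value.length() > limit) {
                lookAhead.add(value);                  // the "overflow" row
                full = true;                           // tell the caller to return the batch
            } else {
                current.add(value);
                currentSize += value.length();
            }
        }

        /** Returns the finished batch and promotes the look-ahead batch to current. */
        List<String> harvest() {
            List<String> out = current;
            current = lookAhead;
            lookAhead = new ArrayList<>();
            currentSize = 0;
            for (String s : current) {
                currentSize += s.length();
            }
            full = false;
            return out;
        }
    }

    /** Reader-style loop: write rows until the loader reports a full batch. */
    static List<List<String>> readAll(List<String> input, int limit) {
        ToyLoader loader = new ToyLoader(limit);
        List<List<String>> batches = new ArrayList<>();
        for (String value : input) {
            loader.writeRow(value);
            if (loader.isFull()) {
                batches.add(loader.harvest());
            }
        }
        List<String> last = loader.harvest();
        if (!last.isEmpty()) {
            batches.add(last);
        }
        return batches;
    }
}
```

   The point of the sketch is that the reader never loses the row that 
triggered the overflow: it simply shows up as the first row of the next batch.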
   
   If you are seeing a crash, it could be that there is a bug in the overflow 
logic. (That logic is quite complex.) The proper fix, then, would be for me to 
find and fix that bug.
   
   Regarding projection: yes, EVF handles projection. You can ask for writers 
for all your columns; EVF gives you a "dummy" writer for those that are not 
projected. While top-level columns can be handled by a plugin easily (just set 
some flags, say), nested columns are very hard to implement in the plugin. EVF 
provides a uniform way to handle projection at all levels. And, for top-level 
arrays such as `column`, EVF also handles per-element projection.
   
   As a result, the only difference between EVF-based projection and 
roll-your-own is that, with EVF, the easiest path is to read the data, give it 
to the column writer, and let the column writer throw it away. Works well for 
sequential formats such as JSON and CSV.
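   
   As an illustration of the "dummy" writer idea, here is a small self-contained 
Java sketch. All names in it (`ToyWriter`, `RealWriter`, `DummyWriter`, `read`) 
are hypothetical and exist only for this example; they are not the EVF 
interfaces. The reader loop writes every column unconditionally, and the 
unprojected ones are silently dropped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of projection via dummy writers; NOT Drill's EVF code. */
class DummyWriterSketch {

    /** Hypothetical column writer interface. */
    interface ToyWriter {
        void setString(String value);
    }

    /** Writer for a projected column: keeps the values. */
    static final class RealWriter implements ToyWriter {
        final List<String> values = new ArrayList<>();
        @Override
        public void setString(String value) { values.add(value); }
    }

    /** Writer for an unprojected column: throws the value away. */
    static final class DummyWriter implements ToyWriter {
        @Override
        public void setString(String value) { /* intentionally discarded */ }
    }

    /** Reads sequential rows; the loop itself never checks projection. */
    static Map<String, List<String>> read(List<String> schema,
                                          Set<String> projected,
                                          List<String[]> rows) {
        Map<String, RealWriter> real = new LinkedHashMap<>();
        List<ToyWriter> writers = new ArrayList<>();
        for (String col : schema) {
            if (projected.contains(col)) {
                RealWriter w = new RealWriter();
                real.put(col, w);
                writers.add(w);
            } else {
                writers.add(new DummyWriter());
            }
        }
        for (String[] row : rows) {
            for (int i = 0; i < row.length; i++) {
                writers.get(i).setString(row[i]);   // same call for every column
            }
        }
        Map<String, List<String>> out = new LinkedHashMap<>();
        real.forEach((name, w) -> out.put(name, w.values));
        return out;
    }
}
```

   The trade-off is that the reader still pays to parse every field, which is 
acceptable when the format has to be scanned sequentially anyway.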
   
   If your format is random-access (you have to request each column, as in 
Parquet), then it is better to ask if the column is projected. But, if your 
data structure is nested, you have to do this at each level.
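   
   For the random-access case, a rough sketch of "ask first, then fetch" might 
look like the following. Everything here is hypothetical (`ToySource`, 
`isProjected`, dotted paths such as `meta.name`), and the rule that a nested 
column is projected when it or any ancestor appears in the project list is an 
assumption made for the example, not a statement of how EVF resolves 
projection.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Toy model of checking projection before a costly column fetch; NOT Drill's EVF code. */
class ProjectionCheckSketch {

    /** Hypothetical random-access source that counts how many columns were fetched. */
    static final class ToySource {
        final Map<String, String> columns;
        int fetches = 0;

        ToySource(Map<String, String> columns) { this.columns = columns; }

        String fetch(String path) {
            fetches++;
            return columns.get(path);
        }
    }

    /** Checks the path itself, then each enclosing level, against the project list. */
    static boolean isProjected(Set<String> projection, String path) {
        String p = path;
        while (true) {
            if (projection.contains(p)) {
                return true;
            }
            int dot = p.lastIndexOf('.');
            if (dot < 0) {
                return false;
            }
            p = p.substring(0, dot);   // move up one nesting level
        }
    }

    /** Fetches only the columns that are projected. */
    static Map<String, String> read(ToySource source, Set<String> projection) {
        Map<String, String> row = new LinkedHashMap<>();
        for (String path : source.columns.keySet()) {
            if (isProjected(projection, path)) {
                row.put(path, source.fetch(path));
            }
        }
        return row;
    }
}
```

   Here the saving is the skipped `fetch` calls; the cost is that the reader has 
to carry the projection check down through every level of nesting.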
   
   So, with that explanation out of the way, what about EVF projection is not 
working the way roll-your-own did? Let's figure that out and fix it.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
