[ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035566#comment-17035566
 ] 

ASF GitHub Bot commented on DRILL-7578:
---------------------------------------

paul-rogers commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail 
with Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585341741
 
 
   I'm a bit confused by the crash at the 16MB limit. The problem description 
is vague. Is there a stack trace somewhere?
   
   EVF is designed to limit individual vectors to 16MB. Once you hit that size, 
EVF does an "overflow" move: it copies the last record (the one that does not 
fit) into a new batch, then tells you to return the now-full batch.
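   To illustrate the idea (this is a hypothetical simulation, not Drill's actual EVF code; the class and field names are invented), the overflow move looks roughly like this: a batch accumulates records up to a byte budget, and the record that would exceed the budget is carried into a fresh batch while the full one is handed back to the caller.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of EVF's "overflow" move (not Drill's real classes).
public class OverflowSketch {
    static final int VECTOR_LIMIT = 16 * 1024 * 1024; // 16 MB per-vector cap

    List<byte[]> batch = new ArrayList<>();
    int batchBytes = 0;

    /** Returns the completed batch when the new record overflows, else null. */
    List<byte[]> write(byte[] record) {
        if (batchBytes + record.length > VECTOR_LIMIT) {
            List<byte[]> full = batch;     // the now-full batch to return
            batch = new ArrayList<>();     // start a fresh batch
            batch.add(record);             // move the overflowing record over
            batchBytes = record.length;
            return full;
        }
        batch.add(record);
        batchBytes += record.length;
        return null;
    }

    public static void main(String[] args) {
        OverflowSketch w = new OverflowSketch();
        byte[] rec = new byte[6 * 1024 * 1024]; // 6 MB records
        List<byte[]> full = null;
        while (full == null) {
            full = w.write(rec);
        }
        // Two 6 MB records fit under 16 MB; the third triggers overflow.
        System.out.println("full batch size = " + full.size());
        System.out.println("carried over = " + w.batch.size());
    }
}
```

   The point is that the caller never sees a vector larger than the cap; the 
overflowing record simply becomes the first record of the next batch.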
   
   If you are seeing a crash, it could be that there is a bug in the overflow 
logic. (That logic is quite complex.) The proper fix, then, would be for me to 
find and fix that bug.
   
   Regarding projection: yes, EVF handles projection. You can ask for writers 
for all your columns; EVF gives you a "dummy" writer for those that are not 
projected. While top-level columns can be handled by a plugin easily (just set 
some flags, say), nested columns are very hard to implement in the plugin. EVF 
provides a uniform way to handle projection at all levels. And, for top-level 
arrays such as `column`, EVF also handles per-element projection.
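   As a rough sketch of the dummy-writer idea (hypothetical code, not Drill's 
actual EVF classes; all names here are invented for illustration): the reader 
asks for a writer for every column it can produce, and unprojected columns get 
a writer that silently discards values, so the reader code stays uniform.

```java
import java.util.*;

// Hypothetical sketch of EVF-style projection via dummy writers.
public class ProjectionSketch {
    interface ColumnWriter { void setString(String v); }

    // Real writer: retains values (stand-in for filling a value vector).
    static class RealWriter implements ColumnWriter {
        final List<String> values = new ArrayList<>();
        public void setString(String v) { values.add(v); }
    }

    // Dummy writer: same interface, but values go nowhere.
    static class DummyWriter implements ColumnWriter {
        public void setString(String v) { /* silently discard */ }
    }

    final Map<String, ColumnWriter> writers = new HashMap<>();

    ProjectionSketch(List<String> allCols, Set<String> projected) {
        for (String col : allCols) {
            writers.put(col, projected.contains(col)
                ? new RealWriter() : new DummyWriter());
        }
    }

    ColumnWriter writer(String col) { return writers.get(col); }

    public static void main(String[] args) {
        // Only "path" is projected; "dataset" data is thrown away.
        ProjectionSketch p = new ProjectionSketch(
            List.of("path", "dataset"), Set.of("path"));
        p.writer("path").setString("/group/ds1");   // retained
        p.writer("dataset").setString("big blob");  // discarded
    }
}
```

   The reader writes every column the same way; projection is decided once, 
when the writers are created.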
   
   As a result, the only difference between EVF-based projection and 
roll-your-own is that, with EVF, the easiest path is to read the data, give it 
to the column writer, and let the column writer throw it away. This works well 
for sequential formats such as JSON and CSV.
   
   If your format is random-access (you have to request each column, as in 
Parquet), then it is better to ask whether the column is projected before 
reading it. But, if your data structure is nested, you have to do this at each 
level.
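   A minimal sketch of that check (again hypothetical, with invented names): 
for a random-access format, test projection first and skip the fetch entirely; 
for nested structures, the same test would repeat at each level of the tree.

```java
import java.util.Set;

// Hypothetical sketch: skip unprojected columns in a random-access format.
public class RandomAccessSketch {
    static int columnsRead = 0;

    static void readColumn(String name, Set<String> projected) {
        if (!projected.contains(name)) {
            return; // not projected: never pay the cost of fetching it
        }
        columnsRead++; // stand-in for the expensive per-column read
    }

    public static void main(String[] args) {
        Set<String> projected = Set.of("id");
        for (String col : new String[] {"id", "payload", "blob"}) {
            readColumn(col, projected);
        }
        System.out.println("columns read = " + columnsRead); // only "id"
    }
}
```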
   
   So, with that explanation out of the way, what about EVF projection is not 
working the way your roll-your-own version did? Let's figure that out and fix 
it.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> HDF5 Metadata Queries Fail with Large Files
> -------------------------------------------
>
>                 Key: DRILL-7578
>                 URL: https://issues.apache.org/jira/browse/DRILL-7578
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.18.0
>            Reporter: Charles Givre
>            Assignee: Charles Givre
>            Priority: Major
>             Fix For: 1.18.0
>
>
> With large files, Drill runs out of memory when attempting to project large 
> datasets in the metadata.  
> This PR adds a configuration option which removes the dataset projection from 
> metadata queries and fixes this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
