[ https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035566#comment-17035566 ]
ASF GitHub Bot commented on DRILL-7578:
---------------------------------------

paul-rogers commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585341741

I'm a bit confused by the crash on the 16MB part. The problem description is vague. Is there a stack trace somewhere? EVF is designed to limit individual vectors to 16MB. Once you hit that size, EVF does an "overflow" move: it copies the last record (the one that does not fit) into a new batch, then tells you to return the now-full batch. If you are seeing a crash, it could be that there is a bug in the overflow logic. (That logic is quite complex.) The proper fix, then, would be for me to find and fix that bug.

Regarding projection: yes, EVF handles projection. You can ask for writers for all your columns; EVF gives you a "dummy" writer for those that are not projected. While top-level columns can be handled by a plugin easily (just set some flags, say), nested columns are very hard to implement in the plugin. EVF provides a uniform way to handle projection at all levels. And, for top-level arrays such as `column`, EVF also handles per-element projection.

As a result, the only difference between EVF-based projection and roll-your-own is that, with EVF, the easiest path is to read the data, give it to the column writer, and let the column writer throw it away. This works well for sequential formats such as JSON and CSV. If your format is random-access (you have to request each column, as in Parquet), then it is better to ask whether the column is projected. But if your data structure is nested, you have to do this at each level.

So, with that explanation out of the way, what about EVF projection is not working the way roll-your-own did? Let's figure that out and fix it.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
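(Editor's note: the overflow move described above can be sketched as a toy model. This is not Drill's EVF API; the class, the byte budget, and all names below are illustrative stand-ins for the 16MB vector limit and the carry-over of the record that does not fit.)

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of EVF-style overflow (illustrative only, not Drill's API):
// a batch accepts rows until a byte budget is reached; the row that does
// not fit is carried over into a fresh batch, and the full batch is emitted.
public class OverflowDemo {
    static final int BATCH_BUDGET = 16; // stand-in for EVF's 16MB vector limit

    final List<List<String>> batches = new ArrayList<>();
    private List<String> current = new ArrayList<>();
    private int bytesUsed = 0;

    void writeRow(String row) {
        if (bytesUsed + row.length() > BATCH_BUDGET && !current.isEmpty()) {
            // Overflow: close out the now-full batch and carry this row forward.
            batches.add(current);
            current = new ArrayList<>();
            bytesUsed = 0;
        }
        current.add(row);
        bytesUsed += row.length();
    }

    void close() {
        if (!current.isEmpty()) {
            batches.add(current);
            current = new ArrayList<>();
        }
    }
}
```

A crash near the 16MB boundary would correspond, in this toy model, to the carry-over branch misbehaving; in real EVF the copy of a partially written record across vectors is far more involved, which is why that logic is called out as complex.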
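(Editor's note: the projection discussion above, i.e. dummy writers that discard values for sequential formats versus asking whether a column is projected for random-access formats, can be sketched as follows. The interface and class names are hypothetical, not Drill's actual EVF types.)

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of EVF-style projection: the reader gets a writer for
// every column it asks for, but unprojected columns receive a no-op "dummy".
interface ColumnWriter {
    void setString(String value);
    boolean isProjected();
}

class RealWriter implements ColumnWriter {
    final List<String> values = new ArrayList<>();
    public void setString(String value) { values.add(value); }
    public boolean isProjected() { return true; }
}

class DummyWriter implements ColumnWriter {
    public void setString(String value) { /* discard: column not projected */ }
    public boolean isProjected() { return false; }
}

public class ProjectionDemo {
    // Sequential formats (CSV, JSON): simplest path is to write every value
    // and let a dummy writer throw unprojected ones away.
    static void writeSequential(ColumnWriter w, String value) {
        w.setString(value);
    }

    // Random-access formats (Parquet, HDF5 datasets): skip the expensive
    // column read entirely when the column is not projected.
    static void writeRandomAccess(ColumnWriter w, Map<String, String> store, String col) {
        if (w.isProjected()) {
            w.setString(store.get(col));
        }
    }
}
```

For nested structures the same check would have to be repeated at each level, which is the cost of roll-your-own projection that EVF's uniform writer tree avoids.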
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> HDF5 Metadata Queries Fail with Large Files
> -------------------------------------------
>
>                 Key: DRILL-7578
>                 URL: https://issues.apache.org/jira/browse/DRILL-7578
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.18.0
>            Reporter: Charles Givre
>            Assignee: Charles Givre
>            Priority: Major
>             Fix For: 1.18.0
>
> With large files, Drill runs out of memory when attempting to project large datasets in the metadata.
> This PR adds a configuration option which removes the dataset projection from metadata queries and fixes this issue.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)