[
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035566#comment-17035566
]
ASF GitHub Bot commented on DRILL-7578:
---------------------------------------
paul-rogers commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail
with Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585341741
I'm a bit confused by the crash at the 16 MB mark; the problem description is
vague. Is there a stack trace somewhere?
EVF is designed to limit individual vectors to 16 MB. Once you hit that size,
EVF does an "overflow" move: it copies the last record (the one that does not
fit) into a new batch, then tells you to return the now-full batch.
If you are seeing a crash, it could be that there is a bug in the overflow
logic. (That logic is quite complex.) The proper fix, then, would be for me to
find and fix that bug.
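To make the overflow move concrete, here is a minimal, self-contained sketch of the idea in plain Java. All names are hypothetical; this is a simplified model of the behavior described above, not Drill's actual ResultSetLoader code.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of EVF's 16 MB overflow move (hypothetical names,
// not Drill's real API): batches fill up to a byte limit; the record
// that does not fit is copied into a fresh batch, and the now-full
// batch is handed back to the caller.
class OverflowSketch {
    static final int BATCH_LIMIT = 16 * 1024 * 1024; // 16 MB per batch

    private List<byte[]> current = new ArrayList<>();
    private int currentSize = 0;

    /**
     * Add a record. If it would push the batch past the limit, start a
     * new batch holding just that record (the "overflow" move) and
     * return the now-full previous batch; otherwise return null.
     */
    List<byte[]> add(byte[] record) {
        if (currentSize + record.length > BATCH_LIMIT && !current.isEmpty()) {
            List<byte[]> full = current;
            current = new ArrayList<>();
            current.add(record);            // carry the overflow record forward
            currentSize = record.length;
            return full;                    // caller sends this batch downstream
        }
        current.add(record);
        currentSize += record.length;
        return null;
    }
}
```

The point of the move is that the reader never has to know about the limit: it just keeps writing, and the framework decides when a batch is full.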
Regarding projection: yes, EVF handles projection. You can ask for writers
for all your columns, and EVF gives you a "dummy" writer for those that are not
projected. While top-level columns can be handled by a plugin easily (just set
some flags, say), nested columns are very hard to implement in the plugin. EVF
provides a uniform way to handle projection at all levels. And, for top-level
arrays such as `column`, EVF also handles per-element projection.
As a result, the only difference between EVF-based projection and
roll-your-own is that, with EVF, the easiest path is to read the data, give it
to the column writer, and let the column writer throw it away. This works well
for sequential formats such as JSON and CSV.
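The "dummy writer" pattern described above can be sketched as follows. This is an illustration only; the interface and class names are made up and do not match Drill's real EVF writer API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of EVF-style projection: the reader writes
// every column unconditionally; unprojected columns get a dummy writer
// that silently discards values.
class ProjectionSketch {
    interface ColumnWriter { void setString(String v); }

    static class RealWriter implements ColumnWriter {
        final List<String> values = new ArrayList<>();
        public void setString(String v) { values.add(v); }
    }

    static class DummyWriter implements ColumnWriter {
        public void setString(String v) { /* not projected: discard */ }
    }

    /** Hand out a real writer only for projected columns. */
    static ColumnWriter writerFor(String col, Set<String> projected) {
        return projected.contains(col) ? new RealWriter() : new DummyWriter();
    }
}
```

The reader code stays identical whether a column is projected or not, which is what makes this path the easiest one for sequential formats.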
If your format is random-access (you have to request each column, as in
Parquet), then it is better to ask whether the column is projected. But if your
data structure is nested, you have to do this at each level.
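For the random-access case, the check happens before the read rather than after, so the unprojected column is never fetched at all. A hedged sketch, again with invented names standing in for a real column-fetch call:

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the random-access case: for formats like
// Parquet, each column read is an explicit request, so it pays to ask
// whether a column is projected before issuing the request.
class RandomAccessSketch {
    /** Fetch only projected columns; skip the I/O for the rest. */
    static int readProjected(Map<String, int[]> file, Set<String> projected) {
        int columnsRead = 0;
        for (String col : file.keySet()) {
            if (!projected.contains(col)) {
                continue;               // never issue the read request
            }
            int[] data = file.get(col); // stand-in for a real column fetch
            columnsRead++;
        }
        return columnsRead;
    }
}
```

With nested data, the same projected-or-not question has to be asked at each level of the structure, which is the complexity the comment above is pointing at.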
So, with that explanation out of the way, what about EVF projection is not
working the way roll-your-own did? Let's figure that out and fix it.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> HDF5 Metadata Queries Fail with Large Files
> -------------------------------------------
>
> Key: DRILL-7578
> URL: https://issues.apache.org/jira/browse/DRILL-7578
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.18.0
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
> Fix For: 1.18.0
>
>
> With large files, Drill runs out of memory when attempting to project large
> datasets in the metadata.
> This PR adds a configuration option which removes the dataset projection from
> metadata queries and fixes this issue.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)