[
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035330#comment-17035330
]
ASF GitHub Bot commented on DRILL-7578:
---------------------------------------
cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with
Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585184636
@vvysotskyi Let me give you some context.
This plugin has two ways of interacting with HDF5 files: metadata queries
and dataset queries. HDF5 is like a filesystem within a file, so it can
contain many datasets. The dataset query looks at a specific dataset and
projects the columns and rows as you would expect.
Metadata queries are intended to explore the HDF5 file itself rather than an
individual dataset. As currently implemented, a metadata query returns the
filename, paths, and dataset types from the HDF5 file. Here's where the
problem arose: the metadata query also maps each dataset to a cell in each
row. This is useful because the user gets a preview of the data that is
actually in each dataset. However, if that dataset is larger than 16MB, Drill
crashes. When I originally implemented this (before EVF), this wasn't an issue
because the plugin itself handled projection pushdown, so all the user had to
do was exclude the dataset from the query. With EVF, it doesn't work that
way.
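To make the failure mode concrete, here is a rough sketch (in Python, not the plugin's Java code) of a metadata row that embeds an entire dataset in one cell. The 16MB threshold comes from the description above; the function name, the row layout, and the use of `MemoryError` as a stand-in for the crash are all assumptions made for illustration.

```python
# Hypothetical illustration only; this is not Drill's actual code or API.
MAX_PREVIEW_BYTES = 16 * 1024 * 1024  # the 16MB threshold described above


def metadata_row(file_name, path, data_type, dataset_bytes):
    """Build one metadata row that embeds the full dataset as a preview cell."""
    if len(dataset_bytes) > MAX_PREVIEW_BYTES:
        # Stand-in for the crash: the preview cell cannot hold the dataset.
        raise MemoryError(
            f"{path}: preview of {len(dataset_bytes)} bytes exceeds "
            f"the {MAX_PREVIEW_BYTES} byte limit"
        )
    return {
        "file_name": file_name,
        "path": path,
        "data_type": data_type,
        "preview": dataset_bytes,
    }
```

Because the preview cell is always built, the user has no way to avoid the failure by leaving a column out of the query.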
Therefore, the options are:
1. Remove this preview functionality entirely.
2. Select some small amount from each dataset and project that in a
metadata query.
3. Add a config option to not generate the preview columns in metadata
queries.
4. Convert the preview to a string and truncate it at the size limit.
Of these options, option 3 felt the easiest and most useful to me, as it
preserves the functionality and gives users a way to make it work.
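As a minimal sketch of option 3 (again in Python rather than the plugin's Java), a format-config flag can gate whether the preview cell is generated at all. The names `Hdf5FormatConfig`, `show_preview`, and `load_dataset` are hypothetical and chosen only for this example; the real option name and row layout may differ.

```python
# Hypothetical sketch of option 3; the flag name and row layout are assumptions.
from dataclasses import dataclass


@dataclass
class Hdf5FormatConfig:
    show_preview: bool = True  # hypothetical option name


def metadata_row(config, file_name, path, data_type, load_dataset):
    """Build a metadata row; read the dataset only when previews are enabled."""
    row = {"file_name": file_name, "path": path, "data_type": data_type}
    if config.show_preview:
        # load_dataset is a callable, so the data is never read when previews are off.
        row["preview"] = load_dataset()
    return row
```

Passing the dataset as a callable means that with the flag off, a large dataset is never loaded, while the default keeps the existing preview behavior.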
----------------------------------------------------------------
> HDF5 Metadata Queries Fail with Large Files
> -------------------------------------------
>
> Key: DRILL-7578
> URL: https://issues.apache.org/jira/browse/DRILL-7578
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.18.0
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
> Fix For: 1.18.0
>
>
> With large files, Drill runs out of memory when attempting to project large
> datasets in the metadata.
> This PR adds a configuration option which removes the dataset projection from
> metadata queries and fixes this issue.