cgivre commented on issue #1978: DRILL-7578: HDF5 Metadata Queries Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#issuecomment-585184636

@vvysotskyi Let me give you some context. This plugin has two ways of interacting with HDF5 files: metadata queries and dataset queries. HDF5 is like a filesystem within a file, so it can contain many datasets. A dataset query looks at a specific dataset and projects the columns and rows as you would expect. Metadata queries are intended to explore the HDF5 file itself rather than an individual dataset. As currently implemented, in metadata queries the plugin returns the filename, paths, and dataset types from the HDF5 file.

Here's where the problem arose. The metadata query also maps each dataset to a cell in each row. This is useful because the user gets a preview of the data that is actually in each dataset. However, if that dataset is larger than 16MB, Drill crashes.

When I originally implemented this (before EVF), this wasn't an issue because the plugin itself handled projection pushdown, so all the user had to do was exclude the dataset column from the query. With EVF, however, it doesn't work that way. Therefore the options are:

1. Remove this preview functionality entirely.
2. Select some small amount from each dataset and project that in a metadata query.
3. Add a config option to not generate the preview columns in metadata queries.
4. Convert the preview to a string and truncate it at the size limit.

Of these options, option 3 felt the easiest and most useful to me, as it preserves the functionality and gives users a way to make it work.
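
For comparison, here is a rough sketch of how options 3 and 4 might look in isolation. It is only an illustration under stated assumptions, not the plugin's actual code: the class `PreviewSketch`, the methods `truncatePreview` and `previewFor`, and the `showPreview` flag are all hypothetical names, and the 16MB constant is simply the size mentioned above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Drill or the HDF5 plugin.
public class PreviewSketch {

    // Assumed limit, taken from the 16MB figure in the discussion above.
    static final int MAX_PREVIEW_BYTES = 16 * 1024 * 1024;

    // Option 4: keep at most maxBytes of the UTF-8 encoding of the preview,
    // without cutting a multi-byte character in half.
    static String truncatePreview(String preview, int maxBytes) {
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must be non-negative");
        }
        byte[] utf8 = preview.getBytes(StandardCharsets.UTF_8);
        if (utf8.length <= maxBytes) {
            return preview; // already small enough
        }
        int end = maxBytes;
        // UTF-8 continuation bytes look like 10xxxxxx. If the first dropped
        // byte is one, back up to the start of that character.
        while (end > 0 && (utf8[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(utf8, 0, end, StandardCharsets.UTF_8);
    }

    // Option 3: a hypothetical boolean config flag that skips the preview
    // column entirely; null stands in for "no preview generated".
    static String previewFor(boolean showPreview, String data, int maxBytes) {
        return showPreview ? truncatePreview(data, maxBytes) : null;
    }
}
```

With the flag off, no preview is built at all, which is why option 3 avoids the crash without touching the dataset contents; option 4 would still have to read and stringify the data before truncating it.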
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
