[jira] [Commented] (DRILL-7578) HDF5 Metadata Queries Fail with Large Files

ASF GitHub Bot (Jira) Mon, 17 Feb 2020 23:09:14 -0800


    [ 
https://issues.apache.org/jira/browse/DRILL-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038834#comment-17038834
 ]


ASF GitHub Bot commented on DRILL-7578:
---------------------------------------

paul-rogers commented on pull request #1978: DRILL-7578: HDF5 Metadata Queries 
Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r380490693
 
 

 ##########
 File path: 
contrib/format-hdf5/src/test/java/org/apache/drill/exec/store/hdf5/TestHDF5Format.java
 ##########
 @@ -98,9 +98,9 @@ public void testStarQuery() throws Exception {
 
     testBuilder()
       .sqlQuery("SELECT * FROM dfs.`hdf5/dset.h5`")
-      .unOrdered()
-      .baselineColumns("path", "data_type", "file_name", "int_data")
-      .baselineValues("/dset", "DATASET", "dset.h5", finalList)
+      .ordered()
 
 Review comment:
   This might be the place to raise the schema question. We have two 
distinct views of a data set: its data and its metadata. The general rule for 
the wildcard (`*`) is to return all available columns. Here, we special-case 
the wildcard to mean "return metadata." This is, unfortunately, nonstandard.
   
   We need some way to express two views of the file. The same problem occurs 
for any database. We could even use it for JSON, CSV and other file formats.
   
   The challenge is, how do we tell the query we want metadata and not data? In 
a normal DB, we query system tables. Perhaps we could jimmy up something in 
Drill:
   
   ```
   SELECT * FROM sys.schema.dfs.`hdf5/dset.h5`
   ```
   
   Or, maybe think of the table as a namespace, and have an optional `.schema` 
tail:
   
   ```
   SELECT * FROM dfs.`hdf5/dset.h5`.schema
   ```
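
   To make the `.schema` tail idea concrete, here is a minimal sketch of how a 
table reference might be split into a path and a requested view. Everything in 
it is an assumption for illustration: `TableViewRef`, `parse` and the 
`.schema` suffix are hypothetical names, not existing Drill APIs, and a real 
design would live in the planner rather than in string handling.

```java
// Hypothetical sketch only: none of these names exist in Drill.
final class TableViewRef {
  // Assumed suffix that selects the metadata view of a table.
  private static final String SCHEMA_TAIL = ".schema";

  final String tablePath;     // table reference with the tail removed
  final boolean metadataView; // true when the reference asked for metadata

  private TableViewRef(String tablePath, boolean metadataView) {
    this.tablePath = tablePath;
    this.metadataView = metadataView;
  }

  static TableViewRef parse(String ref) {
    // Treat ".schema" as a tail only when it follows a closing backtick,
    // so a file literally named `foo.schema` is still read as data.
    if (ref.endsWith("`" + SCHEMA_TAIL)) {
      String path = ref.substring(0, ref.length() - SCHEMA_TAIL.length());
      return new TableViewRef(path, true);
    }
    return new TableViewRef(ref, false);
  }
}
```

   With this sketch, a reference ending in the `.schema` tail after the quoted 
path is flagged as a metadata request, while the plain quoted path is treated 
as a data request, so the wildcard can keep its usual meaning in both views.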
   
   The point is not for you to implement this, or even to design the solution. 
Rather, the point is that the current solution is a hack, and that we need a 
better solution.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> HDF5 Metadata Queries Fail with Large Files
> -------------------------------------------
>
>                 Key: DRILL-7578
>                 URL: https://issues.apache.org/jira/browse/DRILL-7578
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.18.0
>            Reporter: Charles Givre
>            Assignee: Charles Givre
>            Priority: Major
>             Fix For: 1.18.0
>
>
> With large files, Drill runs out of memory when attempting to project large 
> datasets in the metadata.  
> This PR adds a configuration option which removes the dataset projection from 
> metadata queries and fixes this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
