paul-rogers commented on a change in pull request #1978: DRILL-7578: HDF5 
Metadata Queries Fail with Large Files
URL: https://github.com/apache/drill/pull/1978#discussion_r380490693
 
 

 ##########
 File path: 
contrib/format-hdf5/src/test/java/org/apache/drill/exec/store/hdf5/TestHDF5Format.java
 ##########
 @@ -98,9 +98,9 @@ public void testStarQuery() throws Exception {
 
     testBuilder()
       .sqlQuery("SELECT * FROM dfs.`hdf5/dset.h5`")
-      .unOrdered()
-      .baselineColumns("path", "data_type", "file_name", "int_data")
-      .baselineValues("/dset", "DATASET", "dset.h5", finalList)
+      .ordered()
 
 Review comment:
   This might be the place to ask the question about schema. We have two 
distinct views of a data set: its data and its metadata. The general rule of 
the wildcard (`*`) is to return all available columns. Here, we special-case 
the wildcard to mean "return metadata." This is, unfortunately, nonstandard.
   
   We need some way to express two views of the file. The same problem occurs 
for any database. We could even use it for JSON, CSV and other file formats.
   
   The challenge is, how do we tell the query we want metadata and not data? In 
a normal DB, we query system tables. Perhaps we could jimmy up something in 
Drill:
   
   ```
   SELECT * FROM sys.schema.dfs.`hdf5/dset.h5`
   ```
   
   Or, maybe think of the table as a namespace, and have an optional `.schema` 
tail:
   
   ```
   SELECT * FROM dfs.`hdf5/dset.h5`.schema
   ```
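   
   To make the `.schema` tail idea concrete, here is a minimal, self-contained 
sketch of how a table reference could be split into a path and a requested 
view. Everything in it (`TableRef`, `View`, `parse`) is hypothetical and is not 
an existing Drill API. It only illustrates the namespace idea and is not a 
proposed implementation:

```java
// Hypothetical sketch: none of these names exist in Drill.
public final class TableRef {
  /** Which view of the file the query asked for. */
  public enum View { DATA, SCHEMA }

  // Assumed spelling of the optional tail that selects the metadata view.
  private static final String SCHEMA_SUFFIX = ".schema";

  public final String path;
  public final View view;

  private TableRef(String path, View view) {
    this.path = path;
    this.view = view;
  }

  /**
   * Parses a back-quoted table reference such as "`hdf5/dset.h5`.schema".
   * The tail must follow the closing back-quote, so a file that is itself
   * named "notes.schema" is still treated as data.
   */
  public static TableRef parse(String ref) {
    String trimmed = ref.trim();
    View view = View.DATA;
    if (trimmed.endsWith("`" + SCHEMA_SUFFIX)) {
      view = View.SCHEMA;
      trimmed = trimmed.substring(0, trimmed.length() - SCHEMA_SUFFIX.length());
    }
    if (trimmed.length() < 2 || !trimmed.startsWith("`") || !trimmed.endsWith("`")) {
      throw new IllegalArgumentException("Expected a back-quoted path: " + ref);
    }
    return new TableRef(trimmed.substring(1, trimmed.length() - 1), view);
  }
}
```

   With something like this, the wildcard keeps its usual meaning in both 
cases: `*` over the `DATA` view returns all data columns, and `*` over the 
`SCHEMA` view returns all metadata columns.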
   
   The point is not for you to implement this, or even to design the solution. 
Rather, the point is that the current solution is a hack, and that we need a 
better solution.
