Since there's a blocking issue holding up Drill 1.17 and this one is pretty close, could we attempt to get it into v1.17?
Thx,
-- C
> On Nov 20, 2019, at 4:52 AM, Arina Ielchiieva (Jira) <[email protected]> wrote:
>
> [ https://issues.apache.org/jira/browse/DRILL-7233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Arina Ielchiieva updated DRILL-7233:
> ------------------------------------
>     Labels: doc-impacting  (was: doc-impacting ready-to-commit)
>
>> Format Plugin for HDF5
>> ----------------------
>>
>>                 Key: DRILL-7233
>>                 URL: https://issues.apache.org/jira/browse/DRILL-7233
>>             Project: Apache Drill
>>          Issue Type: New Feature
>>    Affects Versions: 1.17.0
>>            Reporter: Charles Givre
>>            Assignee: Charles Givre
>>            Priority: Major
>>              Labels: doc-impacting
>>             Fix For: 1.18.0
>>
>>
>> h2. Drill HDF5 Format Plugin
>>
>> Per Wikipedia, Hierarchical Data Format (HDF) is a set of file formats designed to store and organize large amounts of data. Originally developed at the National Center for Supercomputing Applications, it is supported by The HDF Group, a non-profit corporation whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF.
>>
>> This plugin enables Apache Drill to query HDF5 files.
>>
>> h3. Configuration
>> There are three configuration variables in this plugin:
>> * type: This should be set to hdf5.
>> * extensions: A list of the file extensions used to identify HDF5 files. HDF5 typically uses .h5 or .hdf5 as its file extensions. This defaults to .h5.
>> * defaultPath: The path within the HDF5 file to the data set to query. When set, either here or at query time, Drill returns only the data at that path rather than the file metadata. Defaults to null.
>>
>> h3. Example Configuration
>> For most uses, the configuration below will suffice to enable Drill to query HDF5 files.
>> {{"hdf5": {
>>   "type": "hdf5",
>>   "extensions": [
>>     "h5"
>>   ],
>>   "defaultPath": null
>> }}}
>>
>> h3. Usage
>> Since HDF5 can be viewed as a file system within a file, a single file can contain many datasets. For instance, if you have a simple HDF5 file, a star query will produce the following result:
>> {{apache drill> select * from dfs.test.`dset.h5`;
>> +-------+-----------+-----------+--------------------------------------------------------------------------+
>> | path  | data_type | file_name | int_data                                                                 |
>> +-------+-----------+-----------+--------------------------------------------------------------------------+
>> | /dset | DATASET   | dset.h5   | [[1,2,3,4,5,6],[7,8,9,10,11,12],[13,14,15,16,17,18],[19,20,21,22,23,24]] |
>> +-------+-----------+-----------+--------------------------------------------------------------------------+}}
>>
>> The actual data in this file is mapped to a column called int_data. To access the data effectively, use Drill's FLATTEN() function on the int_data column, which produces the following result:
>> {{apache drill> select flatten(int_data) as int_data from dfs.test.`dset.h5`;
>> +---------------------+
>> | int_data            |
>> +---------------------+
>> | [1,2,3,4,5,6]       |
>> | [7,8,9,10,11,12]    |
>> | [13,14,15,16,17,18] |
>> | [19,20,21,22,23,24] |
>> +---------------------+}}
>>
>> Once you have the data in this form, you can access it much as you would access nested data in JSON or other files:
>> {{apache drill> SELECT int_data[0] as col_0,
>>                        int_data[1] as col_1,
>>                        int_data[2] as col_2
>>                 FROM ( SELECT flatten(int_data) AS int_data
>>                        FROM dfs.test.`dset.h5` );
>> +-------+-------+-------+
>> | col_0 | col_1 | col_2 |
>> +-------+-------+-------+
>> | 1     | 2     | 3     |
>> | 7     | 8     | 9     |
>> | 13    | 14    | 15    |
>> | 19    | 20    | 21    |
>> +-------+-------+-------+}}
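For illustration (this example is not part of the original description), the extracted columns behave like ordinary Drill columns, so you can filter or aggregate on them. A minimal sketch against the same dset.h5 layout shown above; the extra level of nesting is only there because standard SQL does not let a WHERE clause reference aliases defined in the same SELECT:

{{-- keep only the rows whose first value is greater than 6
SELECT col_0, col_1, col_2
FROM (
  SELECT int_data[0] AS col_0,
         int_data[1] AS col_1,
         int_data[2] AS col_2
  FROM ( SELECT flatten(int_data) AS int_data
         FROM dfs.test.`dset.h5` )
)
WHERE col_0 > 6;}}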
>> Alternatively, a better way to query the actual data in an HDF5 file is to use the defaultPath field. If defaultPath is defined, either in the query or in the plugin configuration, Drill will return only the data, rather than the file metadata.
>>
>> *Note: Once you have determined which data set you are querying, it is advisable to use this method to query HDF5 data.*
>>
>> You can set the defaultPath variable either in the plugin configuration or at query time using the table() function, as shown in the example below:
>> {{SELECT *
>> FROM table(dfs.test.`dset.h5` (type => 'hdf5', defaultPath => '/dset'))}}
>>
>> This query will return the result below:
>> {{apache drill> SELECT * FROM table(dfs.test.`dset.h5` (type => 'hdf5', defaultPath => '/dset'));
>> +-----------+-----------+-----------+-----------+-----------+-----------+
>> | int_col_0 | int_col_1 | int_col_2 | int_col_3 | int_col_4 | int_col_5 |
>> +-----------+-----------+-----------+-----------+-----------+-----------+
>> | 1         | 2         | 3         | 4         | 5         | 6         |
>> | 7         | 8         | 9         | 10        | 11        | 12        |
>> | 13        | 14        | 15        | 16        | 17        | 18        |
>> | 19        | 20        | 21        | 22        | 23        | 24        |
>> +-----------+-----------+-----------+-----------+-----------+-----------+
>> 4 rows selected (0.223 seconds)}}
>>
>> If the data at defaultPath is a single column, the column name will be the last part of the path. If the data is multidimensional, the columns are named <data_type>_col_n; a set of integer columns, for example, is named int_col_0, int_col_1, and so on.
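As a further illustration (again, not from the original description), defaultPath combines with ordinary SQL projection and filtering as you would expect. A sketch using the generated int_col_n names from the result above:

{{-- project two of the generated columns and filter on the first one
SELECT int_col_0, int_col_5
FROM table(dfs.test.`dset.h5` (type => 'hdf5', defaultPath => '/dset'))
WHERE int_col_0 > 6;}}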
>> h3. Attributes
>> Occasionally, HDF5 paths will contain attributes. Drill will map these to a map data structure called attributes, as shown in the query below.
>> {{apache drill> SELECT attributes FROM dfs.test.`browsing.h5`;
>> +----------------------------------------------------------------------------------+
>> | attributes                                                                       |
>> +----------------------------------------------------------------------------------+
>> | {}                                                                               |
>> | {"__TYPE_VARIANT__":"TIMESTAMP_MILLISECONDS_SINCE_START_OF_THE_EPOCH"}           |
>> | {}                                                                               |
>> | {}                                                                               |
>> | {"important":false,"__TYPE_VARIANT__timestamp__":"TIMESTAMP_MILLISECONDS_SINCE_START_OF_THE_EPOCH","timestamp":1550033296762} |
>> | {}                                                                               |
>> | {}                                                                               |
>> | {}                                                                               |
>> +----------------------------------------------------------------------------------+
>> 8 rows selected (0.292 seconds)}}
>>
>> You can access the individual fields within the attributes map by using the structure table.map.key. Note that you will have to give the table an alias for this to work properly.
>> {{apache drill> SELECT path, data_type, file_name
>> FROM dfs.test.`browsing.h5` AS t1 WHERE t1.attributes.important = false;
>> +---------+-----------+-------------+
>> | path    | data_type | file_name   |
>> +---------+-----------+-------------+
>> | /groupB | GROUP     | browsing.h5 |
>> +---------+-----------+-------------+}}
>>
>> h3. Known Limitations
>> There are several limitations with the HDF5 format plugin in Drill:
>> * Drill cannot read unsigned 64-bit integers. When the plugin encounters this data type, it will write an INFO message to the log.
>> * Drill cannot read compressed fields in HDF5 files.
>> * HDF5 files can contain nested data sets of up to n dimensions. Since Drill works best with two-dimensional data, datasets with more than two dimensions are flattened.
>
>
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)
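One more illustrative sketch (not from the JIRA description): individual attribute values can also be projected with the same table.map.key syntax, e.g. against the browsing.h5 file used above:

{{-- project the "important" attribute alongside each path
SELECT path, t1.attributes.important AS important
FROM dfs.test.`browsing.h5` AS t1;}}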
