[
https://issues.apache.org/jira/browse/DRILL-7233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arina Ielchiieva updated DRILL-7233:
------------------------------------
Labels: doc-impacting ready-to-commit (was: doc-impacting)
> Format Plugin for HDF5
> ----------------------
>
> Key: DRILL-7233
> URL: https://issues.apache.org/jira/browse/DRILL-7233
> Project: Apache Drill
> Issue Type: New Feature
> Affects Versions: 1.17.0
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
> Labels: doc-impacting, ready-to-commit
> Fix For: 1.18.0
>
>
> h2. Drill HDF5 Format Plugin
> Per Wikipedia, Hierarchical Data Format (HDF) is a set of file formats
> designed to store and organize large amounts of data. Originally developed at
> the National Center for Supercomputing Applications, it is supported by The
> HDF Group, a non-profit corporation whose mission is to ensure continued
> development of HDF5 technologies and the continued accessibility of data
> stored in HDF.
> This plugin enables Apache Drill to query HDF5 files.
> h3. Configuration
> There are three configuration variables in this plugin:
> * type: This should be set to hdf5.
> * extensions: This is a list of the file extensions used to identify HDF5
> files. Typically HDF5 uses .h5 or .hdf5 as file extensions. This defaults to
> .h5.
> * defaultPath: The default path within the HDF5 file to the data you want to
> query. If defined, either here or at query time, Drill returns only the data
> at that path rather than the file metadata. This defaults to null.
> h3. Example Configuration
> For most uses, the configuration below will suffice to enable Drill to query
> HDF5 files.
> {{"hdf5": {
> "type": "hdf5",
> "extensions": [
> "h5"
> ],
> "defaultPath": null
> }}}
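> If you already know which dataset you want to query, you can also set
> defaultPath in the plugin configuration itself. As a sketch, reusing the
> /dset path from the examples below, the configuration might look like this:
> {{"hdf5": {
> "type": "hdf5",
> "extensions": [
> "h5"
> ],
> "defaultPath": "/dset"
> }}}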
> h3. Usage
> Since HDF5 can be viewed as a file system within a file, a single file can
> contain many datasets. For instance, if you have a simple HDF5 file, a star
> query will produce the following result:
> {{apache drill> select * from dfs.test.`dset.h5`;
> +-------+-----------+-----------+--------------------------------------------------------------------------+
> | path  | data_type | file_name | int_data                                                                 |
> +-------+-----------+-----------+--------------------------------------------------------------------------+
> | /dset | DATASET   | dset.h5   | [[1,2,3,4,5,6],[7,8,9,10,11,12],[13,14,15,16,17,18],[19,20,21,22,23,24]] |
> +-------+-----------+-----------+--------------------------------------------------------------------------+}}
> The actual data in this file is mapped to a column called int_data. In order
> to effectively access the data, you should use Drill's FLATTEN() function on
> the int_data column, which produces the following result.
> {{apache drill> select flatten(int_data) as int_data from dfs.test.`dset.h5`;
> +---------------------+
> | int_data            |
> +---------------------+
> | [1,2,3,4,5,6]       |
> | [7,8,9,10,11,12]    |
> | [13,14,15,16,17,18] |
> | [19,20,21,22,23,24] |
> +---------------------+}}
> Once you have the data in this form, you can access it similarly to how you
> might access nested data in JSON or other files.
> {{apache drill> SELECT int_data[0] as col_0,
> . .semicolon> int_data[1] as col_1,
> . .semicolon> int_data[2] as col_2
> . .semicolon> FROM ( SELECT flatten(int_data) AS int_data
> . . . . . .)> FROM dfs.test.`dset.h5`
> . . . . . .)> );
> +-------+-------+-------+
> | col_0 | col_1 | col_2 |
> +-------+-------+-------+
> | 1     | 2     | 3     |
> | 7     | 8     | 9     |
> | 13    | 14    | 15    |
> | 19    | 20    | 21    |
> +-------+-------+-------+}}
> A better way to query the actual data in an HDF5 file is to use the
> defaultPath field. If defaultPath is defined, either in the query or in the
> plugin configuration, Drill will return only the data rather than the file
> metadata.
> *Note: Once you have determined which dataset you are querying, it is
> advisable to use this method to query HDF5 data.*
> You can set the defaultPath variable in either the plugin configuration, or
> at query time using the table() function as shown in the example below:
> {{SELECT *
> FROM table(dfs.test.`dset.h5` (type => 'hdf5', defaultPath => '/dset'))}}
> This query will return the result below:
> {{apache drill> SELECT * FROM table(dfs.test.`dset.h5` (type => 'hdf5', defaultPath => '/dset'));
> +-----------+-----------+-----------+-----------+-----------+-----------+
> | int_col_0 | int_col_1 | int_col_2 | int_col_3 | int_col_4 | int_col_5 |
> +-----------+-----------+-----------+-----------+-----------+-----------+
> | 1         | 2         | 3         | 4         | 5         | 6         |
> | 7         | 8         | 9         | 10        | 11        | 12        |
> | 13        | 14        | 15        | 16        | 17        | 18        |
> | 19        | 20        | 21        | 22        | 23        | 24        |
> +-----------+-----------+-----------+-----------+-----------+-----------+
> 4 rows selected (0.223 seconds)}}
> If the data in defaultPath is a column, the column name will be the last part
> of the path. If the data is multidimensional, the columns are named
> <data_type>_col_n, where n is the zero-based column index. For example, the
> integer columns in the result above are named int_col_0 through int_col_5.
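> As a sketch, reusing the dset.h5 example and the /dset path from above, you
> can reference these generated column names directly in a query:
> {{SELECT int_col_0, int_col_5
> FROM table(dfs.test.`dset.h5` (type => 'hdf5', defaultPath => '/dset'))}}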
> h3. Attributes
> Occasionally, HDF5 paths will contain attributes. Drill will map these to a
> map data structure called attributes, as shown in the query below.
> {{apache drill> SELECT attributes FROM dfs.test.`browsing.h5`;
> +-------------------------------------------------------------------------------------------------------------------------------+
> | attributes                                                                                                                      |
> +-------------------------------------------------------------------------------------------------------------------------------+
> | {}                                                                                                                              |
> | {"__TYPE_VARIANT__":"TIMESTAMP_MILLISECONDS_SINCE_START_OF_THE_EPOCH"}                                                         |
> | {}                                                                                                                              |
> | {}                                                                                                                              |
> | {"important":false,"__TYPE_VARIANT__timestamp__":"TIMESTAMP_MILLISECONDS_SINCE_START_OF_THE_EPOCH","timestamp":1550033296762}  |
> | {}                                                                                                                              |
> | {}                                                                                                                              |
> | {}                                                                                                                              |
> +-------------------------------------------------------------------------------------------------------------------------------+
> 8 rows selected (0.292 seconds)}}
> You can access the individual fields within the attributes map by using the
> structure table.map.key. Note that you will have to give the table an alias
> for this to work properly.
> {{apache drill> SELECT path, data_type, file_name
> FROM dfs.test.`browsing.h5` AS t1 WHERE t1.attributes.important = false;
> +---------+-----------+-------------+
> | path    | data_type | file_name   |
> +---------+-----------+-------------+
> | /groupB | GROUP     | browsing.h5 |
> +---------+-----------+-------------+}}
> h3. Known Limitations
> There are several known limitations of the HDF5 format plugin in Drill:
> * Drill cannot read unsigned 64-bit integers. When the plugin encounters this
> data type, it writes an INFO message to the log.
> * Drill cannot read compressed fields in HDF5 files.
> * HDF5 files can contain nested datasets with an arbitrary number of
> dimensions. Since Drill works best with two-dimensional data, datasets with
> more than two dimensions are flattened.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)