Since there's a blocking issue holding up Drill 1.17 and this one is pretty close, could we attempt to get it into v1.17?
Thx,
-- C
> On Nov 20, 2019, at 4:52 AM, Arina Ielchiieva (Jira) <[email protected]> wrote:
>
> [ https://issues.apache.org/jira/browse/DRILL-7233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Arina Ielchiieva updated DRILL-7233:
> ------------------------------------
>     Labels: doc-impacting  (was: doc-impacting ready-to-commit)
>
>> Format Plugin for HDF5
>> ----------------------
>>
>>                 Key: DRILL-7233
>>                 URL: https://issues.apache.org/jira/browse/DRILL-7233
>>             Project: Apache Drill
>>          Issue Type: New Feature
>>    Affects Versions: 1.17.0
>>            Reporter: Charles Givre
>>            Assignee: Charles Givre
>>            Priority: Major
>>              Labels: doc-impacting
>>             Fix For: 1.18.0
>>
>>
>> h2. Drill HDF5 Format Plugin
>>
>> Per Wikipedia, Hierarchical Data Format (HDF) is a set of file formats designed to store and organize large amounts of data. Originally developed at the National Center for Supercomputing Applications, it is supported by The HDF Group, a non-profit corporation whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF.
>>
>> This plugin enables Apache Drill to query HDF5 files.
>>
>> h3. Configuration
>> There are three configuration variables in this plugin:
>> * type: This should be set to hdf5.
>> * extensions: A list of the file extensions used to identify HDF5 files. HDF5 typically uses .h5 or .hdf5 as its file extensions. This defaults to .h5.
>> * defaultPath: The path within the HDF5 file to the data set to query. When set, either here or at query time, Drill returns only the data at that path rather than the file metadata. Defaults to null.
>>
>> h3. Example Configuration
>> For most uses, the configuration below will suffice to enable Drill to query HDF5 files.
>> {{"hdf5": {
>>   "type": "hdf5",
>>   "extensions": [
>>     "h5"
>>   ],
>>   "defaultPath": null
>> }}}
>>
>> h3. Usage
>> Since HDF5 can be viewed as a file system within a file, a single file can contain many datasets. For instance, if you have a simple HDF5 file, a star query will produce the following result:
>> {{apache drill> select * from dfs.test.`dset.h5`;
>> +-------+-----------+-----------+--------------------------------------------------------------------------+
>> | path  | data_type | file_name | int_data                                                                 |
>> +-------+-----------+-----------+--------------------------------------------------------------------------+
>> | /dset | DATASET   | dset.h5   | [[1,2,3,4,5,6],[7,8,9,10,11,12],[13,14,15,16,17,18],[19,20,21,22,23,24]] |
>> +-------+-----------+-----------+--------------------------------------------------------------------------+}}
>>
>> The actual data in this file is mapped to a column called int_data. To access the data effectively, use Drill's FLATTEN() function on the int_data column, which produces the following result:
>> {{apache drill> select flatten(int_data) as int_data from dfs.test.`dset.h5`;
>> +---------------------+
>> | int_data            |
>> +---------------------+
>> | [1,2,3,4,5,6]       |
>> | [7,8,9,10,11,12]    |
>> | [13,14,15,16,17,18] |
>> | [19,20,21,22,23,24] |
>> +---------------------+}}
>>
>> Once you have the data in this form, you can access it much as you would access nested data in JSON or other files:
>> {{apache drill> SELECT int_data[0] as col_0,
>>                        int_data[1] as col_1,
>>                        int_data[2] as col_2
>>                 FROM ( SELECT flatten(int_data) AS int_data
>>                        FROM dfs.test.`dset.h5` );
>> +-------+-------+-------+
>> | col_0 | col_1 | col_2 |
>> +-------+-------+-------+
>> | 1     | 2     | 3     |
>> | 7     | 8     | 9     |
>> | 13    | 14    | 15    |
>> | 19    | 20    | 21    |
>> +-------+-------+-------+}}
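For illustration (this example is not part of the original description), the extracted columns behave like ordinary Drill columns, so you can filter or aggregate on them. A minimal sketch against the same dset.h5 layout shown above; the extra level of nesting is only there because standard SQL does not let a WHERE clause reference aliases defined in the same SELECT:

{{-- keep only the rows whose first value is greater than 6
SELECT col_0, col_1, col_2
FROM (
  SELECT int_data[0] AS col_0,
         int_data[1] AS col_1,
         int_data[2] AS col_2
  FROM ( SELECT flatten(int_data) AS int_data
         FROM dfs.test.`dset.h5` )
)
WHERE col_0 > 6;}}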
>> Alternatively, a better way to query the actual data in an HDF5 file is to use the defaultPath field. If defaultPath is defined, either in the query or in the plugin configuration, Drill will return only the data, rather than the file metadata.
>>
>> *Note: Once you have determined which data set you are querying, it is advisable to use this method to query HDF5 data.*
>>
>> You can set the defaultPath variable either in the plugin configuration or at query time using the table() function, as shown in the example below:
>> {{SELECT *
>> FROM table(dfs.test.`dset.h5` (type => 'hdf5', defaultPath => '/dset'))}}
>>
>> This query will return the result below:
>> {{apache drill> SELECT * FROM table(dfs.test.`dset.h5` (type => 'hdf5', defaultPath => '/dset'));
>> +-----------+-----------+-----------+-----------+-----------+-----------+
>> | int_col_0 | int_col_1 | int_col_2 | int_col_3 | int_col_4 | int_col_5 |
>> +-----------+-----------+-----------+-----------+-----------+-----------+
>> | 1         | 2         | 3         | 4         | 5         | 6         |
>> | 7         | 8         | 9         | 10        | 11        | 12        |
>> | 13        | 14        | 15        | 16        | 17        | 18        |
>> | 19        | 20        | 21        | 22        | 23        | 24        |
>> +-----------+-----------+-----------+-----------+-----------+-----------+
>> 4 rows selected (0.223 seconds)}}
>>
>> If the data at defaultPath is a single column, the column name will be the last part of the path. If the data is multidimensional, the columns are named <data_type>_col_n; a set of integer columns, for example, is named int_col_0, int_col_1, and so on.
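As a further illustration (again, not from the original description), defaultPath combines with ordinary SQL projection and filtering as you would expect. A sketch using the generated int_col_n names from the result above:

{{-- project two of the generated columns and filter on the first one
SELECT int_col_0, int_col_5
FROM table(dfs.test.`dset.h5` (type => 'hdf5', defaultPath => '/dset'))
WHERE int_col_0 > 6;}}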
>> h3. Attributes
>> Occasionally, HDF5 paths will contain attributes. Drill will map these to a map data structure called attributes, as shown in the query below.
>> {{apache drill> SELECT attributes FROM dfs.test.`browsing.h5`;
>> +----------------------------------------------------------------------------------+
>> | attributes                                                                       |
>> +----------------------------------------------------------------------------------+
>> | {}                                                                               |
>> | {"__TYPE_VARIANT__":"TIMESTAMP_MILLISECONDS_SINCE_START_OF_THE_EPOCH"}           |
>> | {}                                                                               |
>> | {}                                                                               |
>> | {"important":false,"__TYPE_VARIANT__timestamp__":"TIMESTAMP_MILLISECONDS_SINCE_START_OF_THE_EPOCH","timestamp":1550033296762} |
>> | {}                                                                               |
>> | {}                                                                               |
>> | {}                                                                               |
>> +----------------------------------------------------------------------------------+
>> 8 rows selected (0.292 seconds)}}
>>
>> You can access the individual fields within the attributes map by using the structure table.map.key. Note that you will have to give the table an alias for this to work properly.
>> {{apache drill> SELECT path, data_type, file_name
>> FROM dfs.test.`browsing.h5` AS t1 WHERE t1.attributes.important = false;
>> +---------+-----------+-------------+
>> | path    | data_type | file_name   |
>> +---------+-----------+-------------+
>> | /groupB | GROUP     | browsing.h5 |
>> +---------+-----------+-------------+}}
>>
>> h3. Known Limitations
>> There are several limitations with the HDF5 format plugin in Drill:
>> * Drill cannot read unsigned 64-bit integers. When the plugin encounters this data type, it will write an INFO message to the log.
>> * Drill cannot read compressed fields in HDF5 files.
>> * HDF5 files can contain nested data sets of up to n dimensions. Since Drill works best with two-dimensional data, datasets with more than two dimensions are flattened.
>
>
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)
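One more illustrative sketch (not from the JIRA description): individual attribute values can also be projected with the same table.map.key syntax, e.g. against the browsing.h5 file used above:

{{-- project the "important" attribute alongside each path
SELECT path, t1.attributes.important AS important
FROM dfs.test.`browsing.h5` AS t1;}}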
