[GitHub] [drill] cgivre opened a new pull request #1778: Drill-7233: Format Plugin for HDF5

GitBox Thu, 02 May 2019 05:26:46 -0700

cgivre opened a new pull request #1778: Drill-7233: Format Plugin for HDF5
URL: https://github.com/apache/drill/pull/1778
 
 
   # Drill HDF5 Format Plugin
   Per Wikipedia, Hierarchical Data Format (HDF) is a set of file formats designed to store and organize large amounts of data. Originally developed at the National Center for Supercomputing Applications, it is supported by The HDF Group, a non-profit corporation whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF.
   
   This plugin enables Apache Drill to query HDF5 files.  
   
   ## Configuration
   There are three configuration variables in this plugin:
   * `type`:  This should be set to `hdf5`.
   * `extensions`:  This is a list of the file extensions used to identify HDF5 
files.  Typically HDF5 uses `.h5` or `.hdf5` as file extensions.  This defaults 
to `.h5`.
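   * `defaultPath`:  The path within the HDF5 file of the dataset to query.  When it is set, either here or at query time, Drill returns only the data at that path rather than the file metadata.  In the example configuration below it is `null`.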
   
   ### Example Configuration
   For most uses, the configuration below will suffice to enable Drill to query 
HDF5 files.
   ```
   "hdf5": {
     "type": "hdf5",
     "extensions": [
       "h5"
     ],
     "defaultPath": null
   }
   ```
   ## Usage
   Since HDF5 can be viewed as a file system within a file, a single file can 
contain many datasets.  For instance, if you have a simple HDF5 file, a star 
query will produce the following result:
   ```
   apache drill> select * from dfs.test.`dset.h5`;
   +-------+-----------+-----------+--------------------------------------------------------------------------+
   | path  | data_type | file_name |                                 int_data                                 |
   +-------+-----------+-----------+--------------------------------------------------------------------------+
   | /dset | DATASET   | dset.h5   | [[1,2,3,4,5,6],[7,8,9,10,11,12],[13,14,15,16,17,18],[19,20,21,22,23,24]] |
   +-------+-----------+-----------+--------------------------------------------------------------------------+
   ```
   The actual data in this file is mapped to a column called `int_data`.  To access the data effectively, use Drill's `FLATTEN()` function on the `int_data` column, which produces one row per inner array:
   
   ```
   apache drill> select flatten(int_data) as int_data from dfs.test.`dset.h5`;
   +---------------------+
   |      int_data       |
   +---------------------+
   | [1,2,3,4,5,6]       |
   | [7,8,9,10,11,12]    |
   | [13,14,15,16,17,18] |
   | [19,20,21,22,23,24] |
   +---------------------+
   ```
   Once you have the data in this form, you can access individual elements by index, as you would with nested data in JSON or other files.
   
   ```
   apache drill> SELECT int_data[0] as col_0,
   . .semicolon> int_data[1] as col_1,
   . .semicolon> int_data[2] as col_2
   . .semicolon> FROM ( SELECT flatten(int_data) AS int_data
   . . . . . .)> FROM dfs.test.`dset.h5`
   . . . . . .)> );
   +-------+-------+-------+
   | col_0 | col_1 | col_2 |
   +-------+-------+-------+
   | 1     | 2     | 3     |
   | 7     | 8     | 9     |
   | 13    | 14    | 15    |
   | 19    | 20    | 21    |
   +-------+-------+-------+
   ```
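The two steps above can be modelled in a few lines of plain Python. This is only an illustration of the semantics (one output row per inner array, then positional access), not Drill code and not the plugin's implementation. The helper name `flatten` and the row dictionaries are assumptions made for this sketch; the sample values are those of `dset.h5` shown above.

```python
# Conceptual model only: mimics what FLATTEN() and index access do to int_data.
def flatten(rows, column):
    """Yield one output row per element of the list stored in `column`."""
    for row in rows:
        for element in row[column]:
            # Copy the other fields and replace the list with a single element.
            yield {**row, column: element}


# The single row returned by the star query on dset.h5.
star_query = [{
    "path": "/dset",
    "data_type": "DATASET",
    "file_name": "dset.h5",
    "int_data": [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12],
                 [13, 14, 15, 16, 17, 18], [19, 20, 21, 22, 23, 24]],
}]

flattened = list(flatten(star_query, "int_data"))

# Rough equivalent of:
#   SELECT int_data[0] AS col_0, int_data[1] AS col_1, int_data[2] AS col_2
projected = [
    {"col_0": r["int_data"][0], "col_1": r["int_data"][1], "col_2": r["int_data"][2]}
    for r in flattened
]

for row in projected:
    print(row)
```

Each inner array becomes its own row, which is why the projection returns four rows of three columns.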
   
   A better way to query the actual data in an HDF5 file is to use the `defaultPath` field.  If `defaultPath` is defined, either in the query or in the plugin configuration, Drill returns only the data at that path rather than the file metadata.
   
   **Note: Once you have determined which dataset you are querying, it is advisable to use this method to query HDF5 data.**
   
   You can set the `defaultPath` variable either in the plugin configuration or at query time using the `table()` function, as shown in the example below:
    
    ```
   SELECT * 
   FROM table(dfs.test.`dset.h5` (type => 'hdf5', defaultPath => '/dset'))
   ```
    This query will return the result below:
    
    ```
    apache drill> SELECT * FROM table(dfs.test.`dset.h5` (type => 'hdf5', defaultPath => '/dset'));
    +-----------+-----------+-----------+-----------+-----------+-----------+
    | int_col_0 | int_col_1 | int_col_2 | int_col_3 | int_col_4 | int_col_5 |
    +-----------+-----------+-----------+-----------+-----------+-----------+
    | 1         | 2         | 3         | 4         | 5         | 6         |
    | 7         | 8         | 9         | 10        | 11        | 12        |
    | 13        | 14        | 15        | 16        | 17        | 18        |
    | 19        | 20        | 21        | 22        | 23        | 24        |
    +-----------+-----------+-----------+-----------+-----------+-----------+
    4 rows selected (0.223 seconds)
   
   ```
   
   If the data in `defaultPath` is a column, the column name will be the last part of the path.  If the data is multidimensional, the columns are named `<data_type>_col_n`, where `n` is the column index, starting at 0 as in the output above.  A dataset of integers will therefore produce columns such as `int_col_1`.
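
As an illustration of the naming rule, the sketch below labels the rows of a two-dimensional dataset the way the example output does. It is plain Python written for this explanation, not the plugin's code. The function name `name_columns` is hypothetical, and the zero-based index is inferred from the `int_col_0` to `int_col_5` headers in the example result.

```python
def name_columns(rows, data_type):
    """Label each value of a 2-D dataset as <data_type>_col_<n>, with n counted from 0."""
    return [
        {f"{data_type}_col_{n}": value for n, value in enumerate(row)}
        for row in rows
    ]


# First two rows of the /dset dataset from the example.
dset = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]
named = name_columns(dset, "int")

print(list(named[0]))
```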
   
   ### Attributes
   Occasionally, HDF5 paths will contain attributes.  Drill maps these to a map column called `attributes`, as shown in the query below.
   ```
   apache drill> SELECT attributes FROM dfs.test.`browsing.h5`;
   +----------------------------------------------------------------------------------+
   |                                    attributes                                    |
   +----------------------------------------------------------------------------------+
   | {}                                                                               |
   | {"__TYPE_VARIANT__":"TIMESTAMP_MILLISECONDS_SINCE_START_OF_THE_EPOCH"}           |
   | {}                                                                               |
   | {}                                                                               |
   | {"important":false,"__TYPE_VARIANT__timestamp__":"TIMESTAMP_MILLISECONDS_SINCE_START_OF_THE_EPOCH","timestamp":1550033296762} |
   | {}                                                                               |
   | {}                                                                               |
   | {}                                                                               |
   +----------------------------------------------------------------------------------+
   8 rows selected (0.292 seconds)
   ```
   You can access the individual fields within the `attributes` map by using the syntax `table.map.key`.  Note that you will have to give the table an alias for this to work properly.
   ```
   apache drill> SELECT path, data_type, file_name  
   FROM dfs.test.`browsing.h5` AS t1 WHERE t1.attributes.important = false;
   +---------+-----------+-------------+
   |  path   | data_type |  file_name  |
   +---------+-----------+-------------+
   | /groupB | GROUP     | browsing.h5 |
   +---------+-----------+-------------+
   ```
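
The filter above keeps only the row whose `attributes` map has `important` set to `false`; rows where the key is absent do not match. A plain-Python model of that behaviour, using the eight maps from the earlier output, might look like the following. It is a sketch of the semantics, not Drill code, and it assumes that a missing key behaves like SQL `NULL`, so the comparison is never true. The names `attribute_maps` and `matches` are made up for the example.

```python
# The eight `attributes` maps returned by the query on browsing.h5, in order.
attribute_maps = [
    {},
    {"__TYPE_VARIANT__": "TIMESTAMP_MILLISECONDS_SINCE_START_OF_THE_EPOCH"},
    {},
    {},
    {
        "important": False,
        "__TYPE_VARIANT__timestamp__": "TIMESTAMP_MILLISECONDS_SINCE_START_OF_THE_EPOCH",
        "timestamp": 1550033296762,
    },
    {},
    {},
    {},
]


def matches(attributes):
    """Model of `attributes.important = false`: a missing key never matches."""
    return attributes.get("important") is False


# Positions (0-based) of the rows that pass the filter.
matching_rows = [i for i, attrs in enumerate(attribute_maps) if matches(attrs)]
print(matching_rows)  # [4]
```

Only one of the eight rows passes, which is consistent with the single `/groupB` row in the result.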
   
   ### Known Limitations
   There are several limitations with the HDF5 format plugin in Drill.
   * Drill cannot read unsigned 64-bit integers.  When the plugin encounters this data type, it will write an INFO message to the log.
   * Drill cannot read compressed fields in HDF5 files.
   * HDF5 files can contain nested datasets of up to `n` dimensions.  Since Drill works best with two-dimensional data, datasets with more than two dimensions are flattened.


