This PR promotes the `druid-parquet-extensions` module from 'contrib' to 'core' 
and introduces a new hadoop parser that does not convert to avro first; instead 
it uses the `SimpleGroup` based reference implementation from the 
`parquet-column` package of `parquet-mr`. This is likely not the best or 
most efficient way to parse and convert parquet files... but its raw structure 
suited my need to convert `int96` timestamp columns into longs 
(for #5150) and additionally provides the ability to support a `flattenSpec`.

changes:
* `druid-parquet-extensions` now provides 2 types of hadoop parsers, `parquet` 
and `parquet-avro`, which use 
`org.apache.druid.data.input.parquet.simple.DruidParquetInputFormat` and 
`org.apache.druid.data.input.parquet.avro.DruidParquetAvroInputFormat` hadoop 
input formats respectively. 
* `parquet` and `parquet-avro` parsers now both support `flattenSpec`, by 
specifying `parquet` and `avro` as the `parseSpec` format respectively. 
`parquet-avro` reuses the `druid-avro-extensions` spec and flattener. There may 
be minor behavior differences in how parquet logical types are handled.
* extracted abstract type `NestedDataParseSpec<TFlattenSpec>` for ParseSpecs 
which support a `flattenSpec` property, used by `JSONParseSpec`, 
`AvroParseSpec`, and `ParquetParseSpec` (also introduced in this PR)
* lightly modified the behavior of `avro` flattener auto field discovery to be 
more discerning about arrays (only primitive arrays are now considered) and to 
allow nullable primitive fields to be picked up. The array change might need to 
be called out: previously discovery would include the `toString` of array 
contents for complex types, which I don't think is correct behavior, but the 
change could trip up anyone relying on it.
* adds many tests and parquet test files ("donated" from [spark-sql tests 
here](https://github.com/apache/spark/tree/master/sql/core/src/test/resources/test-data)) 
to ensure conversion correctness.
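
For illustration, a hadoop parser spec using the new `parquet` parser with a 
`flattenSpec` might look roughly like the sketch below. The column names and 
JSONPath expression are made up for the example; only the `type`/`format` 
values and overall shape reflect what this PR adds:

```json
{
  "type": "parquet",
  "parseSpec": {
    "format": "parquet",
    "flattenSpec": {
      "useFieldDiscovery": true,
      "fields": [
        { "type": "path", "name": "nestedDim", "expr": "$.someRecord.someField" }
      ]
    },
    "timestampSpec": { "column": "timestamp", "format": "auto" },
    "dimensionsSpec": { "dimensions": [] }
  }
}
```

Swapping `"type": "parquet"` for `"parquet-avro"` and the `parseSpec` format 
for `"avro"` would route through the avro-based input format and flattener 
instead.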

On top of all of the added tests, I've lightly tested both parsers on a local 
druid/hadoop cluster on my laptop. 

Fixes #5150 (I hope).

[ Full content available at: 
https://github.com/apache/incubator-druid/pull/6360 ]