clintropolis opened a new pull request #6360: overhaul 'druid-parquet-extensions' module, promoting from 'contrib' to 'core'

URL: https://github.com/apache/incubator-druid/pull/6360

This PR promotes the `druid-parquet-extensions` module from 'contrib' to 'core' and introduces a new hadoop parser that is not based on converting to avro first, instead using the `SimpleGroup`-based reference implementation from the `parquet-column` package of `parquet-mr`. This is likely not the best or most efficient way to parse and convert parquet files... but its raw structure suited my needs: converting `int96` timestamp columns into longs (for #5150) and additionally supporting a `flattenSpec`.

changes:

* `druid-parquet-extensions` now provides 2 types of hadoop parsers, `parquet` and `parquet-avro`, which use the `org.apache.druid.data.input.parquet.simple.DruidParquetInputFormat` and `org.apache.druid.data.input.parquet.avro.DruidParquetAvroInputFormat` hadoop input formats respectively.
* the `parquet` and `parquet-avro` parsers now both support `flattenSpec`, by specifying `parquet` or `avro` as the `parseSpec` format respectively (see the example spec sketched below). `parquet-avro` re-uses the `druid-avro-extensions` spec and flattener, so there may be minor differences between the two parsers in how parquet logical types are handled.
* extracted an abstract type, `NestedDataParseSpec<TFlattenSpec>`, for ParseSpecs that support a `flattenSpec` property, used by `JSONParseSpec`, `AvroParseSpec`, and `ParquetParseSpec` (the latter also introduced in this PR; rough sketch below).
* lightly modified the behavior of the `avro` flattener's auto field discovery to be more discerning about arrays (only primitive arrays are now considered) and to allow nullable primitive fields to be picked up. The array change might need to be called out, since previously auto discovery would pick up the `toString`-ed contents of arrays of complex types, which I don't think is correct behavior, but the change could trip up anyone relying on it.
* adds many tests and parquet test files ("donated" from [spark-sql tests here](https://github.com/apache/spark/tree/master/sql/core/src/test/resources/test-data)) to ensure conversion correctness

On top of all of the added tests, I've lightly tested both parsers on a local druid/hadoop cluster on my laptop.

Fixes #5150 (i hope)
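To make the new `parquet` parser concrete, here is a minimal sketch of the parser portion of a hadoop ingestion spec using the new `flattenSpec` support. This is an illustration only: the column names, JSONPath expression, and dimension list are hypothetical, and the surrounding `ioConfig` would point its `inputFormat` at `org.apache.druid.data.input.parquet.simple.DruidParquetInputFormat`.

```json
{
  "type": "parquet",
  "parseSpec": {
    "format": "parquet",
    "flattenSpec": {
      "useFieldDiscovery": true,
      "fields": [
        { "type": "path", "name": "nestedDim", "expr": "$.nested.dim1" }
      ]
    },
    "timestampSpec": { "column": "timestamp", "format": "auto" },
    "dimensionsSpec": { "dimensions": [] }
  }
}
```

Swapping `"type": "parquet"` / `"format": "parquet"` for `"type": "parquet-avro"` / `"format": "avro"` (and using the avro input format class in the `ioConfig`) should select the avro-based conversion path instead.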
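And for a rough idea of the `NestedDataParseSpec<TFlattenSpec>` extraction mentioned above, something along these lines (member and method names beyond the class name itself are my assumptions, not necessarily what the PR contains):

```java
import org.apache.druid.data.input.impl.DimensionsSpec;
import org.apache.druid.data.input.impl.ParseSpec;
import org.apache.druid.data.input.impl.TimestampSpec;

// Hypothetical sketch: a ParseSpec that carries a parameterized flattenSpec,
// so JSONParseSpec, AvroParseSpec, and ParquetParseSpec can each supply
// their own flattenSpec type while sharing the common plumbing.
public abstract class NestedDataParseSpec<TFlattenSpec> extends ParseSpec
{
  private final TFlattenSpec flattenSpec;

  protected NestedDataParseSpec(
      final TimestampSpec timestampSpec,
      final DimensionsSpec dimensionsSpec,
      final TFlattenSpec flattenSpec
  )
  {
    super(timestampSpec, dimensionsSpec);
    this.flattenSpec = flattenSpec;
  }

  // Exposed so each format's flattener can read its spec.
  public TFlattenSpec getFlattenSpec()
  {
    return flattenSpec;
  }
}
```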
