This PR promotes the `druid-parquet-extensions` module from 'contrib' to 'core'
and introduces a new hadoop parser that is not based on converting to avro
first, instead using the `SimpleGroup` based reference implementation of the
`parquet-column` package of `parquet-mr`. This is likely not the best or
most efficient way to parse and convert parquet files, but its raw structure
suited my needs: converting `int96` timestamp columns into longs (for #5150)
and supporting a `flattenSpec`.
Changes:
* `druid-parquet-extensions` now provides 2 types of hadoop parsers, `parquet`
and `parquet-avro`, which use
`org.apache.druid.data.input.parquet.simple.DruidParquetInputFormat` and
`org.apache.druid.data.input.parquet.avro.DruidParquetAvroInputFormat` hadoop
input formats respectively.
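For illustration, a sketch of how the new `parquet` parser might be wired up in a hadoop indexing task. The datasource name, paths, and surrounding spec fields are placeholders, not from this PR; the `inputFormat` class is the one introduced here:

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.simple.DruidParquetInputFormat",
        "paths": "example/data.parquet"
      }
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "parquet",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "dimensionsSpec": { "dimensions": [] }
        }
      }
    }
  }
}
```

Swapping in `"type": "parquet-avro"` for the parser and `org.apache.druid.data.input.parquet.avro.DruidParquetAvroInputFormat` for the `inputFormat` selects the avro-conversion path instead.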
* `parquet` and `parquet-avro` parsers now both support `flattenSpec`, by
specifying format `parquet` or `avro` in the `parseSpec` respectively.
`parquet-avro` reuses the `druid-avro-extensions` spec and flattener, so there
may be minor behavior differences in how parquet logical types are handled.
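A hedged sketch of a `flattenSpec` inside a `parquet` format `parseSpec` (the column names and path expression are illustrative, not from this PR):

```json
"parseSpec": {
  "format": "parquet",
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      { "type": "path", "name": "nestedDim", "expr": "$.nested.dim" }
    ]
  },
  "timestampSpec": { "column": "timestamp", "format": "auto" },
  "dimensionsSpec": { "dimensions": [] }
}
```

The same shape with `"format": "avro"` applies to the `parquet-avro` parser.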
* extracted abstract type `NestedDataParseSpec<TFlattenSpec>` for ParseSpecs
which support a `flattenSpec` property, used by `JSONParseSpec`,
`AvroParseSpec`, and `ParquetParseSpec` (also introduced in this PR)
* lightly modified the behavior of the `avro` flattener's auto field discovery
to be more discerning about arrays (only arrays of primitives are now
considered) and to allow nullable primitive fields to be picked up. The array
change might need to be called out: previously the flattener would emit the
`toString` contents of arrays of complex types, which I don't think is correct
behavior, but the change could trip up anyone relying on it.
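For anyone relying on the old array behavior, an explicit `flattenSpec` field should still allow pulling values out of a complex-typed array; e.g., for a hypothetical `records` array of structs (names and expression are illustrative):

```json
"flattenSpec": {
  "useFieldDiscovery": true,
  "fields": [
    { "type": "path", "name": "firstRecordId", "expr": "$.records[0].id" }
  ]
}
```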
* adds many tests and parquet test files ("donated" from [spark-sql tests
here](https://github.com/apache/spark/tree/master/sql/core/src/test/resources/test-data))
to ensure conversion correctness
On top of all of the added tests, I've lightly tested both parsers on a local
druid/hadoop cluster on my laptop.
Fixes #5150 (I hope).
[ Full content available at:
https://github.com/apache/incubator-druid/pull/6360 ]