clintropolis opened a new pull request #6360: overhaul 'druid-parquet-extensions' module, promoting from 'contrib' to 'core'

URL: https://github.com/apache/incubator-druid/pull/6360

This PR promotes the `druid-parquet-extensions` module from 'contrib' to 'core' and introduces a new hadoop parser that is not based on converting to avro first, instead using the `SimpleGroup`-based reference implementation from the `parquet-column` package of `parquet-mr`. This is likely not the best or most efficient way to parse and convert parquet files... but its raw structure suited my needs: converting `int96` timestamp columns into longs (for #5150) and additionally supporting a `flattenSpec`.

changes:

* `druid-parquet-extensions` now provides 2 types of hadoop parsers, `parquet` and `parquet-avro`, which use the `org.apache.druid.data.input.parquet.simple.DruidParquetInputFormat` and `org.apache.druid.data.input.parquet.avro.DruidParquetAvroInputFormat` hadoop input formats respectively.
* the `parquet` and `parquet-avro` parsers now both support `flattenSpec`, by specifying `parquet` or `avro` as the `parseSpec` format respectively (see the example spec sketched below). `parquet-avro` re-uses the `druid-avro-extensions` spec and flattener, so there may be minor differences between the two parsers in how parquet logical types are handled.
* extracted an abstract type, `NestedDataParseSpec<TFlattenSpec>`, for ParseSpecs that support a `flattenSpec` property, used by `JSONParseSpec`, `AvroParseSpec`, and `ParquetParseSpec` (the latter also introduced in this PR; rough sketch below).
* lightly modified the behavior of the `avro` flattener's auto field discovery to be more discerning about arrays (only primitive arrays are now considered) and to allow nullable primitive fields to be picked up. The array change might need to be called out, since previously auto discovery would pick up the `toString`-ed contents of arrays of complex types, which I don't think is correct behavior, but the change could trip up anyone relying on it.
* adds many tests and parquet test files ("donated" from [spark-sql tests here](https://github.com/apache/spark/tree/master/sql/core/src/test/resources/test-data)) to ensure conversion correctness

On top of all of the added tests, I've lightly tested both parsers on a local druid/hadoop cluster on my laptop.

Fixes #5150 (i hope)
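To make the new `parquet` parser concrete, here is a minimal sketch of the parser portion of a hadoop ingestion spec using the new `flattenSpec` support. This is an illustration only: the column names, JSONPath expression, and dimension list are hypothetical, and the surrounding `ioConfig` would point its `inputFormat` at `org.apache.druid.data.input.parquet.simple.DruidParquetInputFormat`.

```json
{
  "type": "parquet",
  "parseSpec": {
    "format": "parquet",
    "flattenSpec": {
      "useFieldDiscovery": true,
      "fields": [
        { "type": "path", "name": "nestedDim", "expr": "$.nested.dim1" }
      ]
    },
    "timestampSpec": { "column": "timestamp", "format": "auto" },
    "dimensionsSpec": { "dimensions": [] }
  }
}
```

Swapping `"type": "parquet"` / `"format": "parquet"` for `"type": "parquet-avro"` / `"format": "avro"` (and using the avro input format class in the `ioConfig`) should select the avro-based conversion path instead.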
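And for a rough idea of the `NestedDataParseSpec<TFlattenSpec>` extraction mentioned above, something along these lines (member and method names beyond the class name itself are my assumptions, not necessarily what the PR contains):

```java
import org.apache.druid.data.input.impl.DimensionsSpec;
import org.apache.druid.data.input.impl.ParseSpec;
import org.apache.druid.data.input.impl.TimestampSpec;

// Hypothetical sketch: a ParseSpec that carries a parameterized flattenSpec,
// so JSONParseSpec, AvroParseSpec, and ParquetParseSpec can each supply
// their own flattenSpec type while sharing the common plumbing.
public abstract class NestedDataParseSpec<TFlattenSpec> extends ParseSpec
{
  private final TFlattenSpec flattenSpec;

  protected NestedDataParseSpec(
      final TimestampSpec timestampSpec,
      final DimensionsSpec dimensionsSpec,
      final TFlattenSpec flattenSpec
  )
  {
    super(timestampSpec, dimensionsSpec);
    this.flattenSpec = flattenSpec;
  }

  // Exposed so each format's flattener can read its spec.
  public TFlattenSpec getFlattenSpec()
  {
    return flattenSpec;
  }
}
```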
