[
https://issues.apache.org/jira/browse/DRILL-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246159#comment-15246159
]
Jesse Yates commented on DRILL-4615:
------------------------------------
I imagine this can be handled with an optional flag and a column/field
separator, which seems easy enough to slide in. However, I'm not terribly
familiar with the Drill code, so any pointers as to where to start would be
great.
It seems like the ParquetGroupScan is already too late in the pipeline, but I'm
not sure where else we can put this.
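As a rough illustration of the flag-plus-separator idea, here is a minimal sketch in Python. It is not Drill code: the function name `parse_partition_dirs` and the `separator` and `enabled` parameters are hypothetical, chosen only to show how a path could be split into column name/value pairs.

```python
from typing import List, Tuple


def parse_partition_dirs(
    path: str, separator: str = "=", enabled: bool = True
) -> List[Tuple[str, str]]:
    """Return (column, value) pairs found in the directory part of a path.

    Hypothetical helper: `enabled` stands in for the optional flag and
    `separator` for the column/field separator mentioned above.
    """
    if not enabled:
        # Flag off: directories are not interpreted at all.
        return []
    components = [c for c in path.split("/") if c]
    pairs = []
    # The last component is the data file itself, so skip it.
    for component in components[:-1]:
        name, sep, value = component.partition(separator)
        # Only directories of the form name<separator>value are treated
        # as partition columns; anything else is ignored.
        if sep and name:
            pairs.append((name, value))
    return pairs


print(parse_partition_dirs("/column1=1/column2=hello/data.parquet"))
# [('column1', '1'), ('column2', 'hello')]
```

The values come back as strings here; mapping them to proper column types would be a separate step and is not shown.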
> Support directory names in schema
> ---------------------------------
>
> Key: DRILL-4615
> URL: https://issues.apache.org/jira/browse/DRILL-4615
> Project: Apache Drill
> Issue Type: Improvement
> Reporter: Jesse Yates
>
> In Spark, partitioned parquet output is written with directories like:
> {code}
> /column1=1
>   /column2=hello
>     /data.parquet
>   /column2=world
>     /moredata.parquet
> /column1=2
> {code}
> However, when querying these files with Drill, the directories are
> interpreted as plain strings, when they are really column names plus values.
> The data files themselves contain only the remaining columns. As a result,
> the partition columns can only hold a couple of data types (far short of
> what Spark/Parquet supports) and still give correct results in Drill.
> Given the size of the data, I don't want to have to CTAS all the parquet
> files (especially as they are being periodically updated).
> I think this ends up being a nice addition for general file directory reads
> as well, since many people already encode meaning into their directory
> structure, but having self-describing directories is even better.