[ https://issues.apache.org/jira/browse/DRILL-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248208#comment-15248208 ]
Jesse Yates commented on DRILL-4615:
------------------------------------
You got it exactly [~sphillips]. I'd definitely be up for an attempt. I think I
see where we do the column/dir filtering in ParquetScanBatchCreator#getBatch,
but FileSystemPartitionDescriptor is less clear to me: does it happen in
#createPartitionSublists or in #populatePartitionVectors? It seems like
PartitionLocation should be the point of abstraction. Right now, the
DFSPartitionLocation just reads the dir[index] and the ParquetPartitionLocation
throws an exception, so I'm not sure how it's all wired together.
Any hints would be appreciated!
> Support directory names in schema
> ---------------------------------
>
> Key: DRILL-4615
> URL: https://issues.apache.org/jira/browse/DRILL-4615
> Project: Apache Drill
> Issue Type: Improvement
> Reporter: Jesse Yates
>
> In Spark, partitioned parquet output is written with directories like:
> {code}
> /column1=1
>   /column2=hello
>     /data.parquet
>   /column2=world
>     /moredata.parquet
> /column1=2
> {code}
> However, when querying these files with Drill, the directories are
> interpreted as strings, when they are really column names plus values. The
> data files contain only the remaining columns. Querying this with Drill
> means that you can really only have a couple of data types (far short of what
> Spark/Parquet supports) in those columns and still have correct operations.
> Given the size of the data, I don't want to have to CTAS all the Parquet
> files (especially as they are being periodically updated).
> I think this ends up being a nice addition for general file directory reads
> as well, since many people already encode meaning into their directory
> structure, but having self-describing directories is even better.
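
For illustration only, the directory layout described above could be turned
into column name/value pairs along these lines. This is a minimal sketch, not
existing Drill code: the class PartitionPathParser and its parse method are
hypothetical names, and it assumes every partition directory has the form
name=value and that the last path segment is the data file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Drill): reads name=value directory segments. */
final class PartitionPathParser {

    private PartitionPathParser() {}

    /**
     * Returns column name -> raw string value for each "name=value" directory
     * segment, in path order. Segments without '=' are skipped, and the final
     * segment is assumed to be the data file and is never inspected.
     */
    static Map<String, String> parse(String path) {
        Map<String, String> partitions = new LinkedHashMap<>();
        String[] segments = path.split("/");
        // Stop before the last segment, which is the file name.
        for (int i = 0; i < segments.length - 1; i++) {
            String segment = segments[i];
            int eq = segment.indexOf('=');
            if (eq > 0) {
                // Text before '=' is the column name, text after it is the value.
                partitions.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return partitions;
    }
}
```

Calling PartitionPathParser.parse("/column1=1/column2=hello/data.parquet")
yields a map with column1 mapped to "1" and column2 mapped to "hello". The
values are still plain strings here; deciding what type each one should have
is the part this issue leaves open.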
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)