[
https://issues.apache.org/jira/browse/DRILL-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248163#comment-15248163
]
Steven Phillips commented on DRILL-4615:
----------------------------------------
It seems that what you are describing is an alternative way of interpreting
directory attributes. Drill's current approach is to create the columns dir0,
dir1, etc., which contain the string values of the directory names. These column
names and values are currently used in two different places in Drill. The first
is partition pruning during the planning stage; the second is the actual
execution of the scan, when the columns are materialized. You can see examples
of these uses in the classes FileSystemPartitionDescriptor and
ParquetScanBatchCreator.
We should probably refactor the code that materializes the partition column
names and values, abstracting it into some sort of Attribute Provider. Then we
could implement an alternate version that interprets the directories the way
Spark and Hive do.
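To make the idea concrete, here is a rough sketch of what such an abstraction might look like. All of the names below (PartitionAttributeProvider, DirNAttributeProvider, KeyValueAttributeProvider, attributesFor) are hypothetical and do not exist in Drill today. The sketch assumes the provider is handed the directory components of a file relative to the table root, keeps every value as a string, and ignores any escaping that Spark or Hive may apply to directory names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical abstraction; none of these names exist in Drill.
interface PartitionAttributeProvider {
    // Maps a file's directory components (relative to the table root)
    // to partition column name -> string value, in directory order.
    Map<String, String> attributesFor(String[] directories);
}

// Mirrors the current behaviour: dir0, dir1, ... hold the raw directory names.
class DirNAttributeProvider implements PartitionAttributeProvider {
    @Override
    public Map<String, String> attributesFor(String[] directories) {
        Map<String, String> attrs = new LinkedHashMap<>();
        for (int i = 0; i < directories.length; i++) {
            attrs.put("dir" + i, directories[i]);
        }
        return attrs;
    }
}

// Alternate version: treats each "name=value" directory as a column,
// in the style of the Spark/Hive layout described in the issue.
class KeyValueAttributeProvider implements PartitionAttributeProvider {
    @Override
    public Map<String, String> attributesFor(String[] directories) {
        Map<String, String> attrs = new LinkedHashMap<>();
        for (int i = 0; i < directories.length; i++) {
            String dir = directories[i];
            int eq = dir.indexOf('=');
            if (eq > 0) {
                attrs.put(dir.substring(0, eq), dir.substring(eq + 1));
            } else {
                // Assumption: fall back to dirN naming when there is no '='.
                attrs.put("dir" + i, dir);
            }
        }
        return attrs;
    }
}
```

For a file under column1=1/column2=hello, the first provider would yield dir0 = "column1=1" and dir1 = "column2=hello", while the second would yield column1 = "1" and column2 = "hello". Turning those strings into typed values would be a separate step that this sketch does not attempt.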
If this is something you are interested in working on, I can help out.
> Support directory names in schema
> ---------------------------------
>
> Key: DRILL-4615
> URL: https://issues.apache.org/jira/browse/DRILL-4615
> Project: Apache Drill
> Issue Type: Improvement
> Reporter: Jesse Yates
>
> In Spark, partitioned Parquet output is written with directories like:
> {code}
> /column1=1
>   /column2=hello
>     /data.parquet
>   /column2=world
>     /moredata.parquet
> /column1=2
> {code}
> However, when querying these files with Drill we end up interpreting the
> directories as strings when what they really are is column names + values. In
> the data files we only have the remaining columns. Querying this with Drill
> means that you can really only have a couple of data types (far short of what
> Spark/Parquet supports) in the column and still have correct operations.
> Given the size of the data, I don't want to have to CTAS all the Parquet
> files (especially as they are being periodically updated).
> I think this ends up being a nice addition for general file directory reads
> as well, since many people already encode meaning into their directory
> structure, but having self-describing directories is even better.