[jira] [Commented] (DRILL-4615) Support directory names in schema

Steven Phillips (JIRA) Tue, 19 Apr 2016 09:56:24 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248163#comment-15248163
 ]


Steven Phillips commented on DRILL-4615:
----------------------------------------

It seems what you are describing is an alternative way of interpreting 
directory attributes. Drill's current approach is to create the columns dir0, 
dir1, etc., which contain the string values of the directory names. These 
column names and values are currently used in two different places in Drill. 
The first is partition pruning during the planning stage; the second is the 
scan itself, where the columns are materialized during execution. You can see 
examples of these uses in the classes FileSystemPartitionDescriptor and 
ParquetScanBatchCreator.

We should probably refactor the code that materializes the partition column 
names and values, abstracting it into some sort of Attribute Provider. Then we 
could implement an alternate version which interprets the directories the way 
Spark and Hive do.
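
To make the idea concrete, here is a minimal sketch of what such an 
abstraction might look like. It is an illustration only: the names 
PartitionAttributeProvider, DirNAttributeProvider and 
KeyValueAttributeProvider are hypothetical and do not exist in Drill, and the 
real code in FileSystemPartitionDescriptor and ParquetScanBatchCreator is 
structured differently. Values are kept as strings here; inferring column 
types is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical abstraction (not an existing Drill interface): turns the
// directory segments between the table root and a file into columns.
interface PartitionAttributeProvider {
    // Returns column name -> string value, in directory order.
    Map<String, String> attributes(String[] dirSegments);
}

// Mirrors the current behavior: dir0, dir1, ... hold the directory names.
class DirNAttributeProvider implements PartitionAttributeProvider {
    @Override
    public Map<String, String> attributes(String[] dirSegments) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 0; i < dirSegments.length; i++) {
            columns.put("dir" + i, dirSegments[i]);
        }
        return columns;
    }
}

// Alternate version: reads "name=value" directories as column name + value.
class KeyValueAttributeProvider implements PartitionAttributeProvider {
    private final PartitionAttributeProvider fallback = new DirNAttributeProvider();

    @Override
    public Map<String, String> attributes(String[] dirSegments) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (String segment : dirSegments) {
            int eq = segment.indexOf('=');
            if (eq <= 0) {
                // Assumption for this sketch: if any segment is not
                // self-describing, fall back to dirN for the whole path.
                return fallback.attributes(dirSegments);
            }
            columns.put(segment.substring(0, eq), segment.substring(eq + 1));
        }
        return columns;
    }
}
```

With the layout from the issue description, the segments "column1=1" and 
"column2=hello" would come out as dir0 and dir1 under the first provider, and 
as columns column1 and column2 under the second. Both the planner and the scan 
could then ask the same provider for names and values.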

If this is something you are interested in working on, I can help out.

> Support directory names in schema
> ---------------------------------
>
>                 Key: DRILL-4615
>                 URL: https://issues.apache.org/jira/browse/DRILL-4615
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Jesse Yates
>
> In Spark, partitioned parquet output is written with directories like:
> {code}
> /column1=1
>   /column2=hello
>      /data.parquet
>   /column2=world
>      /moredata.parquet
> /column1=2
> {code}
> However, when querying these files with Drill we end up interpreting the 
> directories as strings when what they really are is column names + values. In 
> the data files we only have the remaining columns. Querying this with Drill 
> means that you can really only have a couple of data types (far short of what 
> Spark/Parquet supports) in the column and still have correct operations.
> Given the size of the data, I don't want to have to CTAS all the parquet 
> files (especially as they are being periodically updated). 
> I think this ends up being a nice addition for general file directory reads 
> as well since many people already encode meaning into their directory 
> structure, but having self describing directories is even better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
