[jira] [Commented] (DRILL-4615) Support directory names in schema

Jesse Yates (JIRA) Tue, 19 Apr 2016 10:13:46 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248208#comment-15248208
 ]


Jesse Yates commented on DRILL-4615:
------------------------------------

You got it exactly [~sphillips]. I'd definitely be up for an attempt. I think I 
see where we do the column/dir filtering in ParquetScanBatchCreator#getBatch, 
but FileSystemPartitionDescriptor seems a bit more vague - is it in 
#createPartitionSublists or in #populatePartitionVectors? It seems like 
PartitionLocation should be the point of abstraction. Right now, the 
DFSPartitionLocation just reads the dir[index] and the ParquetPartitionLocation 
throws an exception, so I'm not sure how it's all wired together.

Any hints would be appreciated!

> Support directory names in schema
> ---------------------------------
>
>                 Key: DRILL-4615
>                 URL: https://issues.apache.org/jira/browse/DRILL-4615
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Jesse Yates
>
> In Spark, partitioned parquet output is written with directories like:
> {code}
> /column1=1
>   /column2=hello
>      /data.parquet
>   /column2=world
>      /moredata.parquet
> /column1=2
> {code}
> However, when querying these files with Drill, we end up interpreting the 
> directories as strings when what they really are is column names + values. In 
> the data files we only have the remaining columns. Querying this with Drill 
> means that you can really only have a couple of data types (far short of what 
> Spark/Parquet supports) in the column and still have correct operations.
> Given the size of the data, I don't want to have to CTAS all the Parquet 
> files (especially as they are being periodically updated). 
> I think this ends up being a nice addition for general file directory reads 
> as well, since many people already encode meaning into their directory 
> structure, but having self-describing directories is even better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
