I think what Michael means is that people often use this to read existing
partitioned Parquet tables that are defined in a Hive metastore, rather
than data that was generated directly from within Spark and then read
back as a table. I'd expect the latter case to become more common, but
for now most users connect to an existing metastore.

I think you could go this route by creating a partitioned external
table based on the on-disk layout you create. The downside is that
you'd have to go through a Hive metastore, whereas what you are doing
now doesn't need Hive at all.
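
To make that concrete, here's an untested sketch of the HiveContext route
(table, column, and partition names are made up, and the exact DDL will
depend on your Hive version / Parquet serde):

  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

  // Point an external, partitioned table at the layout you already write.
  hiveContext.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS foo (key STRING, value INT)
    PARTITIONED BY (batch STRING)
    STORED AS PARQUET
    LOCATION '/foo'""")

  // Register each existing directory as a partition of that table.
  hiveContext.sql("ALTER TABLE foo ADD PARTITION (batch='d1') LOCATION '/foo/d1'")
  hiveContext.sql("ALTER TABLE foo ADD PARTITION (batch='d2') LOCATION '/foo/d2'")

  // A predicate on the partition column should then only touch the
  // matching directories.
  hiveContext.sql("SELECT key, value FROM foo WHERE batch = 'd2'")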

We should also just fix the case you are mentioning, where a union is
used directly from within Spark. But that's the context.
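
In the meantime, a rough sketch of the manual workaround (column names
'key / 'value are hypothetical; the Symbol DSL needs the sqlContext
implicits): apply the projection and predicate to each table first, so
each ParquetTableScan gets its own pushdown, and union the results.

  import sqlContext._  // implicit conversions for the Symbol-based DSL

  val d1 = sqlContext.parquetFile("/foo/d1").where('value > 10).select('key)
  val d2 = sqlContext.parquetFile("/foo/d2").where('value > 10).select('key)

  // Each side is filtered/projected before the union, so the pushdown
  // happens per table rather than being blocked by the Union node.
  val both = d1.unionAll(d2)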

- Patrick

On Tue, Sep 9, 2014 at 12:01 PM, Cody Koeninger <c...@koeninger.org> wrote:
> Maybe I'm missing something, but I thought Parquet was generally a write-once
> format, and the sqlContext interface to it seems that way as well.
>
> d1.saveAsParquetFile("/foo/d1")
>
> // another day, another table, with same schema
> d2.saveAsParquetFile("/foo/d2")
>
> Will give a directory structure like
>
> /foo/d1/_metadata
> /foo/d1/part-r-1.parquet
> /foo/d1/part-r-2.parquet
> /foo/d1/_SUCCESS
>
> /foo/d2/_metadata
> /foo/d2/part-r-1.parquet
> /foo/d2/part-r-2.parquet
> /foo/d2/_SUCCESS
>
> // ParquetFileReader will fail, because /foo/d1 is a directory, not a
> // parquet partition
> sqlContext.parquetFile("/foo")
>
> // works, but has the noted lack of pushdown
> sqlContext.parquetFile("/foo/d1").unionAll(sqlContext.parquetFile("/foo/d2"))
>
>
> Is there another alternative?
>
>
>
> On Tue, Sep 9, 2014 at 1:29 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> I think usually people add these directories as multiple partitions of the
>> same table instead of using a union.  This actually allows us to efficiently
>> prune directories when reading, in addition to standard column pruning.
>>
>> On Tue, Sep 9, 2014 at 11:26 AM, Gary Malouf <malouf.g...@gmail.com>
>> wrote:
>>
>>> I'm kind of surprised this was not run into before.  Do people not
>>> segregate their data by day/week in the HDFS directory structure?
>>>
>>>
>>> On Tue, Sep 9, 2014 at 2:08 PM, Michael Armbrust <mich...@databricks.com>
>>> wrote:
>>>
>>>> Thanks!
>>>>
>>>> On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger <c...@koeninger.org>
>>>> wrote:
>>>>
>>>> > Opened
>>>> >
>>>> > https://issues.apache.org/jira/browse/SPARK-3462
>>>> >
>>>> > I'll take a look at ColumnPruning and see what I can do
>>>> >
>>>> > On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust <
>>>> mich...@databricks.com>
>>>> > wrote:
>>>> >
>>>> >> On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger <c...@koeninger.org>
>>>> >> wrote:
>>>> >>>
>>>> >>> Is there a reason in general not to push projections and predicates
>>>> down
>>>> >>> into the individual ParquetTableScans in a union?
>>>> >>>
>>>> >>
>>>> >> This would be a great case to add to ColumnPruning.  Would be awesome
>>>> if
>>>> >> you could open a JIRA or even a PR :)
>>>> >>
>>>> >
>>>> >
>>>>
>>>
>>>
>>
