[ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15875188#comment-15875188
 ] 

Nick Dimiduk commented on SPARK-17636:
--------------------------------------

I've proposed a PR to fix this issue. Using [~maropu]'s suggested testing 
strategy of comparing explain plans on Parquet data, I see the following 
before and after applying the proposed patch:

Before:

{noformat}
scala> df.where($"b._1" === "").explain
== Physical Plan ==
*Filter (b#1._1 = )
+- *FileScan parquet [a#0,b#1] Batched: false, Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/ndimiduk/tmp/spark_19638], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct<a:int,b:struct<_1:string,_2:double>>
{noformat}

After:

{noformat}
scala> df.where($"b._1" === "").explain
== Physical Plan ==
*Filter (b._1 = )
+- *FileScan parquet [a#283,b#284] Batched: false, Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/ndimiduk/tmp/spark_19638], PartitionFilters: [], 
PushedFilters: [EqualTo(b._1,)], ReadSchema: 
struct<a:int,b:struct<_1:string,_2:double>>
{noformat}
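
A minimal spark-shell sketch for reproducing a comparison like the one above, assuming Spark 2.x, where the shell provides {{spark}}. The output path and the sample rows are hypothetical and not taken from the original test; only the schema mirrors the ReadSchema shown in the plans.

```scala
// Paste into spark-shell (Spark 2.x); `spark` is the session the shell provides.
import spark.implicits._

// Hypothetical scratch location, not the path used in the plans above.
val path = "/tmp/spark_17636_demo"

// A nested tuple is encoded as a struct column with fields _1 and _2,
// which yields the schema struct<a:int,b:struct<_1:string,_2:double>>.
Seq((1, ("x", 1.0)), (2, ("", 2.0)))
  .toDF("a", "b")
  .write.mode("overwrite").parquet(path)

val df = spark.read.parquet(path)

// Inspect the PushedFilters entry of the FileScan node in the printed plan.
df.where($"b._1" === "").explain()
```

Going by the plans above, PushedFilters should come back empty for the struct field on an unpatched build, and should list an EqualTo on b._1 once the proposed patch is applied.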

> Parquet filter push down doesn't handle struct fields
> -----------------------------------------------------
>
>                 Key: SPARK-17636
>                 URL: https://issues.apache.org/jira/browse/SPARK-17636
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.6.2, 1.6.3, 2.0.2
>            Reporter: Mitesh
>            Priority: Minor
>
> There's a *PushedFilters* for a simple numeric field, but not for a numeric 
> field inside a struct. Not sure if this is a Spark limitation because of 
> Parquet, or only a Spark limitation.
> {noformat}
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct<timestamp:bigint,timezone:string>, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
