[ https://issues.apache.org/jira/browse/SPARK-19638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873021#comment-15873021 ]
Takeshi Yamamuro edited comment on SPARK-19638 at 2/18/17 6:52 AM: ------------------------------------------------------------------- Aha, I got you and you're right; in that case, catalyst does not push down such a condition. {code} scala> Seq((1, ("a", 0.1)), (2, ("b", 0.3))).toDF("a", "b").write.parquet("/Users/maropu/Desktop/data") scala> val df = spark.read.load("/Users/maropu/Desktop/data") df: org.apache.spark.sql.DataFrame = [a: int, b: struct<_1: string, _2: double>] scala> df.where($"a" === 1).explain == Physical Plan == *Project [a#108, b#109] +- *Filter (isnotnull(a#108) && (a#108 = 1)) +- *FileScan parquet [a#108,b#109] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Desktop/data], PartitionFilters: [], PushedFilters: [IsNotNull(a), EqualTo(a,1)], ReadSchema: struct<a:int,b:struct<_1:string,_2:double>> scala> df.where($"b._1" === "b").explain == Physical Plan == *Filter (b#109._1 = b) +- *FileScan parquet [a#108,b#109] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Desktop/data], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int,b:struct<_1:string,_2:double>> {code} BTW, this is not a bug, but improvement in the Type because this kind of queries does not return incorrect results. was (Author: maropu): Aha, I got you and you're right; in that case, catalyst does not push down such a condition. {code} scala> Seq((1, ("a", 0.1)), (2, ("b", 0.3))).toDF("a", "b").write.parquet("/Users/maropu/Desktop/data") scala> val df = spark.read.load("/Users/maropu/Desktop/data") df: org.apache.spark.sql.DataFrame = [a: int, b: struct<_1: string, _2: double>] scala> df.where($"a" === 1).explain == Physical Plan == *Project [a#108, b#109] +- *Filter (isnotnull(a#108) && (a#108 = 1)) +- *FileScan parquet [a#108,b#109] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Desktop/data], PartitionFilters: [], PushedFilters: [IsNotNull(a), EqualTo(a,1)], ReadSchema: struct<a:int,b:struct<_1:string,_2:double>> scala> df.where($"b._1" === "b").explain == Physical Plan == *Filter (b#109._1 = b) +- *FileScan parquet [a#108,b#109] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Desktop/data], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int,b:struct<_1:string,_2:double>> {code} BTW, this is not a bug, but improvement in the Type because this kind of queries does not return incorrect results. > Filter pushdown not working for struct fields > --------------------------------------------- > > Key: SPARK-19638 > URL: https://issues.apache.org/jira/browse/SPARK-19638 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.1.0 > Reporter: Nick Dimiduk > > Working with a dataset containing struct fields, and enabling debug logging > in the ES connector, I'm seeing the following behavior. The dataframe is > created over the ES connector and then the schema is extended with a couple > column aliases, such as. > {noformat} > df.withColumn("f2", df("foo")) > {noformat} > Queries vs those alias columns work as expected for fields that are > non-struct members. > {noformat} > scala> df.withColumn("f2", df("foo")).where("f2 == '1'").limit(0).show > 17/02/16 15:06:49 DEBUG DataSource: Pushing down filters > [IsNotNull(foo),EqualTo(foo,1)] > 17/02/16 15:06:49 TRACE DataSource: Transformed filters into DSL > [{"exists":{"field":"foo"}},{"match":{"foo":"1"}}] > {noformat} > However, try the same with an alias over a struct field, and no filters are > pushed down. > {noformat} > scala> df.withColumn("bar_baz", df("bar.baz")).where("bar_baz == > '1'").limit(1).show > {noformat} > In fact, this is the case even when no alias is used at all. > {noformat} > scala> df.where("bar.baz == '1'").limit(1).show > {noformat} > Basically, pushdown for structs doesn't work at all. > Maybe this is specific to the ES connector? -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org