Hyukjin Kwon created SPARK-20364:
------------------------------------

             Summary: Parquet predicate pushdown on columns with dots returns empty results
                 Key: SPARK-20364
                 URL: https://issues.apache.org/jira/browse/SPARK-20364
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Hyukjin Kwon


Currently, if there are dots in a column name, predicate pushdown in Parquet seems to fail, returning empty results instead of the matching rows.

**With dots**

{code}
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
{code}

{code}
+--------+
|col.dots|
+--------+
+--------+
{code}

**Without dots**

{code}
val path = "/tmp/abcde2"
Seq(Some(1), None).toDF("coldots").write.parquet(path)
spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
{code}

{code}
+-------+
|coldots|
+-------+
|      1|
+-------+
{code}

It seems that when a column name containing a dot is passed to {{FilterApi}}, the name is split on the dot into multiple path segments (a {{ColumnPath}} with multiple column paths), whereas the actual column name is {{col.dots}}. (See [FilterApi.java#L71|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71], which calls [ColumnPath.java#L44|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44].)
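
Below is a minimal demonstration of the difference, run directly against {{ColumnPath}} (assuming parquet-mr 1.8.2 on the classpath; this is not Spark code, just an illustration):

{code}
import org.apache.parquet.hadoop.metadata.ColumnPath

// FilterApi's column factories go through ColumnPath.fromDotString, which
// splits the name on dots into multiple path segments:
ColumnPath.fromDotString("col.dots").toArray
// => Array(col, dots) -- no such nested column exists, so nothing matches.

// ColumnPath.get keeps the whole string as a single segment:
ColumnPath.get("col.dots").toArray
// => Array(col.dots) -- one segment matching the actual flat column name.
{code}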

I tried to come up with ways to resolve this, and I see two options:

One is simply not to push down filters when there are dots in column names, so that Parquet reads everything and the filtering happens on the Spark side (a rough sketch follows).
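
A hypothetical sketch of that option (the name {{createFilterSafely}} and the guard placement are illustrative, not actual Spark code; it assumes {{ParquetFilters.createFilter(schema, predicate)}} and {{Filter.references}} as in branch-2.2):

{code}
import org.apache.parquet.filter2.predicate.FilterPredicate
import org.apache.spark.sql.execution.datasources.parquet.ParquetFilters
import org.apache.spark.sql.sources
import org.apache.spark.sql.types.StructType

// Hypothetical guard: if any column referenced by the predicate contains a
// dot, skip pushdown entirely (return None), so Parquet reads all row groups
// and Spark evaluates the filter itself.
def createFilterSafely(
    schema: StructType,
    predicate: sources.Filter): Option[FilterPredicate] = {
  if (predicate.references.exists(_.contains("."))) {
    None
  } else {
    ParquetFilters.createFilter(schema, predicate)
  }
}
{code}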

The other is to create Spark's own variants of the {{FilterApi}} column constructors (those classes appear to be final) so that the whole name is always used as a single column path on the Spark side (this seems hacky). Since we are not pushing down filters on nested columns currently, we can build the field name via {{ColumnPath.get}} instead of {{ColumnPath.fromDotString}} this way.

I made a rough version of the latter:

{code}
import org.apache.parquet.filter2.predicate.Operators.{Column, SupportsEqNotEq, SupportsLtGt}
import org.apache.parquet.hadoop.metadata.ColumnPath
import org.apache.parquet.io.api.Binary

private[parquet] object ParquetColumns {
  // Each factory mirrors the corresponding one in FilterApi, but builds the
  // ColumnPath via ColumnPath.get so the name stays a single path segment.
  def intColumn(columnPath: String): Column[Integer] with SupportsLtGt = {
    new Column[Integer](ColumnPath.get(columnPath), classOf[Integer]) with SupportsLtGt
  }

  def longColumn(columnPath: String): Column[java.lang.Long] with SupportsLtGt = {
    new Column[java.lang.Long](
      ColumnPath.get(columnPath), classOf[java.lang.Long]) with SupportsLtGt
  }

  def floatColumn(columnPath: String): Column[java.lang.Float] with SupportsLtGt = {
    new Column[java.lang.Float](
      ColumnPath.get(columnPath), classOf[java.lang.Float]) with SupportsLtGt
  }

  def doubleColumn(columnPath: String): Column[java.lang.Double] with SupportsLtGt = {
    new Column[java.lang.Double](
      ColumnPath.get(columnPath), classOf[java.lang.Double]) with SupportsLtGt
  }

  def booleanColumn(columnPath: String): Column[java.lang.Boolean] with SupportsEqNotEq = {
    new Column[java.lang.Boolean](
      ColumnPath.get(columnPath), classOf[java.lang.Boolean]) with SupportsEqNotEq
  }

  def binaryColumn(columnPath: String): Column[Binary] with SupportsLtGt = {
    new Column[Binary](ColumnPath.get(columnPath), classOf[Binary]) with SupportsLtGt
  }
}
{code}
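
For context, a hypothetical usage sketch: building the pushed-down {{`col.dots` IS NOT NULL}} predicate from the reproduction above with these helpers instead of {{FilterApi.intColumn}} ({{notEq}} against a {{null}} value is how Parquet's filter2 API expresses IS NOT NULL):

{code}
import org.apache.parquet.filter2.predicate.FilterApi

// `col.dots` IS NOT NULL, with the column name kept as one path segment.
val isNotNullDots = FilterApi.notEq(
  ParquetColumns.intColumn("col.dots"), null.asInstanceOf[Integer])
{code}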

Neither sounds ideal. Please let me know if anyone has a better idea.



