On 7/22/15 9:03 AM, Ankit wrote:

Thanks a lot Cheng. So it seems even in spark 1.3 and 1.4, parquet ENUMs were treated as Strings in Spark SQL right? So does this mean partitioning for enums already works in previous versions too since they are just treated as strings?

It’s a little bit complicated. A Thrift/Avro/ProtoBuf |ENUM| value is represented in Parquet as a |BINARY| annotated with the original type |ENUM|. For example, an optional |ENUM| field |e| is translated to something like |optional BINARY e (ENUM)| in Parquet, and the underlying data is always a UTF8 string holding the |ENUM| value’s name. However, the Parquet original type |ENUM| is not documented in the format spec, so Spark 1.3 and 1.4 don’t recognize the |ENUM| annotation and just see such a field as a plain |BINARY|. (I didn’t even notice the existence of |ENUM| in Parquet before PR #7048…)
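As a concrete illustration (all names here are made up, not from the original thread), a Thrift struct with an |ENUM| field would end up with a Parquet schema fragment along these lines:

```
// Thrift definition (hypothetical)
enum Color { RED = 0, GREEN = 1 }
struct Event {
  1: optional Color color
}

// Resulting Parquet schema fragment
message Event {
  optional binary color (ENUM);
}
```

The bytes actually stored for |color| are the UTF8 encoding of the enum name, e.g. "RED", not the numeric value.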

On the other hand, Spark SQL has a boolean option named |spark.sql.parquet.binaryAsString|. When this option is set to |true|, all Parquet |BINARY| values are considered to be UTF8 strings and are converted accordingly. This option was originally introduced to work around a bug in Hive, which writes strings as plain Parquet |BINARY| values without the proper |UTF8| annotation.

That said, by calling |sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")| in Scala/Java/Python, or running |SET spark.sql.parquet.binaryAsString=true| in SQL, you can read those |ENUM| values as plain UTF8 strings.
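Putting the two points together, here is a minimal Scala sketch (the path and column name are hypothetical, and this assumes a Spark 1.4 |SQLContext| in scope, e.g. in the shell):

```scala
// Tell Spark SQL to treat all Parquet BINARY values as UTF8 strings,
// so ENUM-annotated columns come back as plain strings.
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

// Read the Thrift-generated Parquet data; the ENUM column ("type" here)
// should now show up with string type in the inferred schema.
val ev = sqlContext.read.parquet("/path/to/events")
ev.printSchema()

// ENUM values can then be compared as ordinary strings.
ev.filter("type = 'NON'").show()
```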


Also, is there a good way to verify that the partitioning is being used? I tried "explain" like (where data is partitioned by "type" column)

scala> ev.filter("type = 'NON'").explain
== Physical Plan ==
Filter (type#4849 = NON)
PhysicalRDD [...cols..], MapPartitionsRDD[103] at map at newParquet.scala:573

but that is the same even with non partitioned data.

Do you mean how to verify whether partition pruning is effective? You should be able to see log lines like this:

   15/07/22 11:14:35 INFO DataSourceStrategy: Selected 1 partitions out
   of 3, pruned 66.66666666666667% partitions.
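For example, given a partitioned directory layout like the following (paths and values hypothetical), filtering on the partition column should produce that log line:

```scala
// Hypothetical layout, with "type" as the discovered partition column:
//   /path/to/events/type=NON/part-00000.parquet
//   /path/to/events/type=ABC/part-00000.parquet
//   /path/to/events/type=XYZ/part-00000.parquet
val ev = sqlContext.read.parquet("/path/to/events")

// Filtering on the partition column lets Spark skip whole directories;
// the DataSourceStrategy log line reports how many partitions were pruned.
ev.filter("type = 'NON'").count()
```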



On Tue, Jul 21, 2015 at 4:35 PM, Cheng Lian <lian.cs....@gmail.com <mailto:lian.cs....@gmail.com>> wrote:

    Parquet support for Thrift/Avro/ProtoBuf ENUM types was just added
    to the master branch. https://github.com/apache/spark/pull/7048

    ENUM types are actually not in the Parquet format spec, which is
    why we didn't have support for them in the first place. Basically,
    ENUMs are now always treated as UTF8 strings in Spark SQL.

    Cheng

    On 7/22/15 3:41 AM, ankits wrote:

        Hi, I am using a custom build of Spark 1.4 with the Parquet
        dependency upgraded to 1.7. I have Thrift data encoded with
        Parquet that I want to partition by a column of type ENUM.
        The Spark programming guide says partition discovery is only
        supported for string and numeric columns, so it seems
        partition discovery won't work out of the box here.

        Is there any workaround that will allow me to partition by
        ENUMs? Will Hive partitioning help here? I am unfamiliar with
        Hive and how it plays into Parquet, Thrift, and Spark, so I
        would appreciate any pointers in the right direction. Thanks.



        --
        View this message in context:
        
http://apache-spark-user-list.1001560.n3.nabble.com/Partition-parquet-data-by-ENUM-column-tp23939.html
        Sent from the Apache Spark User List mailing list archive at
        Nabble.com.

        ---------------------------------------------------------------------
        To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
        <mailto:user-unsubscr...@spark.apache.org>
        For additional commands, e-mail: user-h...@spark.apache.org
        <mailto:user-h...@spark.apache.org>

