Hi Cheng,

I ran into an ENUM-related issue when I tried to use filter push-down. I'm using Spark 1.5.0 (which contains fixes for Parquet filter push-down). The exception is the following:
java.lang.IllegalArgumentException: FilterPredicate column: item's declared
type (org.apache.parquet.io.api.Binary) does not match the schema found in
file metadata. Column item is of type:
FullTypeDescriptor(PrimitiveType: BINARY, OriginalType: ENUM)
Valid types for this column are: null

Is it because Spark does not recognize the ENUM type in Parquet?

Best Regards,
Jerry

On Wed, Jul 22, 2015 at 12:21 AM, Cheng Lian <lian.cs....@gmail.com> wrote:
> On 7/22/15 9:03 AM, Ankit wrote:
>> Thanks a lot Cheng. So it seems even in Spark 1.3 and 1.4, Parquet ENUMs
>> were treated as strings in Spark SQL, right? So does this mean partitioning
>> for ENUMs already works in previous versions too, since they are just
>> treated as strings?
>
> It's a little bit complicated. A Thrift/Avro/ProtoBuf ENUM value is
> represented as a BINARY annotated with the original type ENUM in Parquet.
> For example, an optional ENUM field e is translated to something like
> optional BINARY e (ENUM) in Parquet, and the underlying data is always a
> UTF8 string of the ENUM name. However, the Parquet original type ENUM is
> not documented, so Spark 1.3 and 1.4 don't recognize the ENUM annotation
> and just see it as a normal BINARY. (I didn't even notice the existence of
> ENUM in Parquet before PR #7048...)
>
> On the other hand, Spark SQL has a boolean option named
> spark.sql.parquet.binaryAsString. When this option is set to true, all
> Parquet BINARY values are converted to UTF8 strings. The original purpose
> of this option is to work around a bug in Hive, which writes strings as
> plain Parquet BINARY values without a proper UTF8 annotation.
>
> That said, by using sqlContext.setConf("spark.sql.parquet.binaryAsString",
> "true") in Scala/Java/Python, or SET spark.sql.parquet.binaryAsString=true
> in SQL, you may read those ENUM values as plain UTF8 strings.
>
>> Also, is there a good way to verify that the partitioning is being used?
>> I tried "explain" like this (where the data is partitioned by the "type"
>> column):
>>
>> scala> ev.filter("type = 'NON'").explain
>> == Physical Plan ==
>> Filter (type#4849 = NON)
>>  PhysicalRDD [...cols..], MapPartitionsRDD[103] at map at newParquet.scala:573
>>
>> but that is the same even with non-partitioned data.
>
> Do you mean how to verify whether partition pruning is effective? You
> should be able to see log lines like this:
>
> 15/07/22 11:14:35 INFO DataSourceStrategy: Selected 1 partitions out of 3,
> pruned 66.66666666666667% partitions.
>
>> On Tue, Jul 21, 2015 at 4:35 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
>>> Parquet support for Thrift/Avro/ProtoBuf ENUM types was just added to the
>>> master branch: https://github.com/apache/spark/pull/7048
>>>
>>> ENUM types are actually not in the Parquet format spec; that's why we
>>> didn't have them in the first place. Basically, ENUMs are always treated
>>> as UTF8 strings in Spark SQL now.
>>>
>>> Cheng
>>>
>>> On 7/22/15 3:41 AM, ankits wrote:
>>>> Hi, I am using a custom build of Spark 1.4 with the Parquet dependency
>>>> upgraded to 1.7. I have Thrift data encoded with Parquet that I want to
>>>> partition by a column of type ENUM. The Spark programming guide says
>>>> partition discovery is only supported for string and numeric columns, so
>>>> it seems partition discovery won't work out of the box here.
>>>>
>>>> Is there any workaround that will allow me to partition by ENUMs? Will
>>>> Hive partitioning help here? I am unfamiliar with Hive and how it plays
>>>> into Parquet, Thrift, and Spark, so I would appreciate any pointers in
>>>> the right direction. Thanks.
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Partition-parquet-data-by-ENUM-column-tp23939.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
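[Editor's note] Cheng's description of the on-disk representation can be illustrated without Spark: the ENUM column is a BINARY whose bytes are the UTF-8 encoding of the enum constant's name, which is why binaryAsString makes it readable as a string. A minimal model in plain Python (not Spark or Parquet APIs; the value "NON" and both helper names are illustrative only):

```python
# Sketch (not Spark internals): a Thrift/Avro/ProtoBuf ENUM value lands in
# Parquet as BINARY annotated with the ENUM original type, and the stored
# bytes are just the UTF-8 encoding of the enum constant's name.

def enum_to_parquet_binary(enum_name):
    """What a Parquet writer physically stores for an ENUM value."""
    return enum_name.encode("utf-8")

def binary_as_string(raw_bytes):
    """What Spark SQL sees for that column once
    spark.sql.parquet.binaryAsString=true: the BINARY decoded as UTF-8."""
    return raw_bytes.decode("utf-8")

raw = enum_to_parquet_binary("NON")
print(raw)                     # b'NON'
print(binary_as_string(raw))   # NON
```

This is also why the round-trip is lossless: the ENUM name and the string value are byte-for-byte the same data, only the annotation differs.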
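[Editor's note] The DataSourceStrategy log line Cheng quotes can be reproduced arithmetically. A minimal sketch of Hive-style partition pruning over a partitioned directory layout (plain Python, not Spark code; the directory names and the helper name are made up for illustration):

```python
# Sketch (not Spark internals): with Hive-style partitioning, each distinct
# value of the partition column becomes a directory such as type=NON/.
# An equality filter on that column selects only the matching directories;
# the rest are pruned, which is what the log line reports as a percentage.

def prune_partitions(partition_dirs, column, value):
    """Return (selected_dirs, pruned_percent) for an equality filter."""
    wanted = "%s=%s" % (column, value)
    selected = [d for d in partition_dirs if d == wanted]
    pruned_percent = 100.0 * (len(partition_dirs) - len(selected)) / len(partition_dirs)
    return selected, pruned_percent

dirs = ["type=NON", "type=ABC", "type=XYZ"]
selected, pct = prune_partitions(dirs, "type", "NON")
print(selected, pct)  # ['type=NON'] 66.66666666666667
```

The printed percentage matches the "Selected 1 partitions out of 3, pruned 66.66666666666667%" log line from the thread, so seeing that log output is a direct confirmation that pruning fired.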