On 7/22/15 9:03 AM, Ankit wrote:
Thanks a lot Cheng. So it seems that even in Spark 1.3 and 1.4, Parquet
ENUMs were treated as strings in Spark SQL, right? Does this mean
partitioning by enums already works in those versions, since they are
just treated as strings?
It’s a little bit complicated. A Thrift/Avro/ProtoBuf |ENUM| value is
represented in Parquet as a |BINARY| annotated with the original type
|ENUM|. For example, an optional |ENUM| field |e| is translated to
something like |optional BINARY e (ENUM)|, and the underlying data is
always the UTF8 string of the |ENUM| name. However, the Parquet original
type |ENUM| is not documented, so Spark 1.3 and 1.4 don’t recognize the
|ENUM| annotation and just see such a field as a normal |BINARY|. (I
didn’t even notice the existence of |ENUM| in Parquet before PR #7048…)
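To make that concrete, here is roughly what such a schema looks like in Parquet’s own schema notation (a sketch based on the |optional BINARY e (ENUM)| example above; the message name is hypothetical):

```
message example_schema {
  optional binary e (ENUM);   // payload is the UTF8 bytes of the enum value's name
}
```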
On the other hand, Spark SQL has a boolean option named
|spark.sql.parquet.binaryAsString|. When this option is set to |true|,
all Parquet |BINARY| values are treated as, and converted to, UTF8
strings. This option was originally added to work around a bug in Hive,
which writes strings as plain Parquet |BINARY| values without the proper
|UTF8| annotation.
That said, by using
|sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")| in
Scala/Java/Python, or |SET spark.sql.parquet.binaryAsString=true| in
SQL, you may read those |ENUM| values as plain UTF8 strings.
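The conversion itself is trivial, which is why the workaround is safe here: the |BINARY| payload already holds the UTF8 bytes of the enum name, so reading it as a string is just a byte-array decode. A minimal illustration in plain Scala (the byte array stands in for a raw Parquet |BINARY| cell; this is the idea behind the option, not Spark’s actual code path):

```scala
// A raw Parquet BINARY cell for an ENUM is just the UTF-8 bytes of the name.
val rawBinary: Array[Byte] = Array(0x4e, 0x4f, 0x4e).map(_.toByte) // "NON"

// With binaryAsString=true, Spark SQL reinterprets those bytes as a string.
val asString = new String(rawBinary, "UTF-8")

println(asString) // prints "NON"
```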
Also, is there a good way to verify that the partitioning is being
used? I tried "explain", like this (where the data is partitioned by the
"type" column):
scala> ev.filter("type = 'NON'").explain
== Physical Plan ==
Filter (type#4849 = NON)
PhysicalRDD [...cols..], MapPartitionsRDD[103] at map at
newParquet.scala:573
but that is the same even with non partitioned data.
Do you mean how to verify whether partition pruning is effective? You
should be able to see log lines like this:
15/07/22 11:14:35 INFO DataSourceStrategy: Selected 1 partitions out
of 3, pruned 66.66666666666667% partitions.
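The arithmetic in that log line follows directly from partition discovery: Spark matches the filter against the partition values encoded in the directory names (e.g. |type=NON|) and only scans the matching directories. A rough sketch of the bookkeeping in plain Scala (the directory names are made up, and this is a conceptual model, not Spark’s actual implementation):

```scala
// Hypothetical partitioned layout: one directory per value of the "type" column.
val partitions = Seq("type=NON", "type=ABC", "type=XYZ")

// Partition pruning: keep only directories whose encoded value matches the filter.
val selected = partitions.filter(_ == "type=NON")

// Mirrors the percentage reported in the DataSourceStrategy log line above.
val prunedPct = 100.0 * (partitions.size - selected.size) / partitions.size
println(f"Selected ${selected.size} partitions out of ${partitions.size}, " +
  f"pruned $prunedPct%% partitions.")
```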
On Tue, Jul 21, 2015 at 4:35 PM, Cheng Lian <lian.cs....@gmail.com
<mailto:lian.cs....@gmail.com>> wrote:
Parquet support for Thrift/Avro/ProtoBuf ENUM types was just added
to the master branch: https://github.com/apache/spark/pull/7048
ENUM types are actually not in the Parquet format spec, which is why
we didn't have it in the first place. Basically, ENUMs are always
treated as UTF8 strings in Spark SQL now.
Cheng
On 7/22/15 3:41 AM, ankits wrote:
Hi, I am using a custom build of Spark 1.4 with the Parquet dependency
upgraded to 1.7. I have Thrift data encoded with Parquet that I want to
partition by a column of type ENUM. The Spark programming guide says
partition discovery is only supported for string and numeric columns,
so it seems partition discovery won't work out of the box here.
Is there any workaround that will allow me to partition by ENUMs? Will
Hive partitioning help here? I am unfamiliar with Hive and how it plays
into Parquet, Thrift, and Spark, so I would appreciate any pointers in
the right direction. Thanks.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Partition-parquet-data-by-ENUM-column-tp23939.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.