[
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173779#comment-17173779
]
Ramakrishna Prasad K S edited comment on SPARK-32558 at 8/9/20, 8:57 AM:
-------------------------------------------------------------------------
Hi [~dongjoon]
Thanks for pointing out that this is a Hive bug. I am already aware of the
related Hive bug; I have referenced it in the Jira description itself.
However, the latest Spark documentation states that setting
spark.sql.orc.impl=hive generates ORC files that work with Hive_2.1.1 or below.
That is why I raised this bug: the workaround mentioned in the Spark
documentation is not working for me.
[https://spark.apache.org/docs/latest/sql-migration-guide.html]
!image-2020-08-09-14-07-00-521.png|width=725,height=91!
It clearly says to use spark.sql.orc.impl=hive to create files that work with
Hive_2.1.1 and older. When Hive_2.1.1 does not have the fix for this issue, how
is that a valid workaround?
Regarding the way I am generating the ORC file, I do not agree with your
comments. Please look at this closely.
First I create a normal Spark SQL table and load data into it with an insert
query. Then I read all the data from that table into a Spark DataFrame. Finally
I write the DataFrame contents out as a new ORC file (this flow works for all
the file formats):
{code:java}
scala> spark.sql("CREATE table df_table(col1 string,col2 string)")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert into df_table values('col1val1','col2val1')")
org.apache.spark.sql.DataFrame = []

scala> val dFrame = spark.sql("select * from df_table")
dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

scala> dFrame.show()
+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
{code}
The source and target in this case are independent of each other, I believe.
Once the data is in a DataFrame, it can be written out to any format (ORC,
Avro, Parquet, and so on) regardless of where the data came from.
Please note that I am not trying to create an ORC Spark SQL table here; I am
generating an ORC file using the native Spark_3.0 APIs.
How or from where the data got into the DataFrame is irrelevant here, I
believe; I am loading simple string data as well.
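To make the argument concrete, here is a minimal toy sketch (plain Java, all names invented, nothing to do with Spark internals) of that decoupling: the serializers below only ever see rows in memory and never know which source the rows were read from.

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Toy model: once rows sit in an in-memory structure (the DataFrame in the
// Spark example), any writer can serialize them; the writer never sees where
// the rows originally came from. A "row" is just a (col1, col2) pair here.
class WriterDecoupling {

    static String toCsv(List<String[]> rows) {
        return rows.stream()
                   .map(r -> String.join(",", r))
                   .collect(Collectors.joining("\n"));
    }

    static String toJson(List<String[]> rows) {
        return rows.stream()
                   .map(r -> "{\"col1\":\"" + r[0] + "\",\"col2\":\"" + r[1] + "\"}")
                   .collect(Collectors.joining(",", "[", "]"));
    }

    public static void main(String[] args) {
        // These rows could equally have come from a SQL table, a file, or a literal.
        List<String[]> rows = Arrays.asList(new String[]{"col1val1", "col2val1"});
        System.out.println(toCsv(rows));   // col1val1,col2val1
        System.out.println(toJson(rows));  // [{"col1":"col1val1","col2":"col2val1"}]
    }
}
{code}

Both writers consume the same in-memory rows, which is exactly the relationship between a DataFrame and the ORC/Avro/Parquet writers in Spark.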
The case below is also valid: data can be generated this way and written into
an ORC file, and the issue is observed with the file created in this case as
well.
{code:java}
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])

val records = (1 to 100).map { i =>
  Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}
records.toDF().write.format("orc").save("/tmp/orc_tgt33")
{code}
At the very least, we need to fix the Spark_3.0 documentation or explain there
in detail what the purpose of the spark.sql.orc.impl=hive flag is. If this flag
cannot generate ORC files compatible with Hive_2.1.1 and below, then what is it
for?
Our product is on Hive_2.1.1 and we make calls to the Hive_2.1.1 ORC APIs.
Regarding the tool I am using to retrieve metadata, *hive --orcfiledump*: it
internally calls the Hive_2.1.1 APIs, which is why I use it; with it I am able
to reproduce the problem.
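For reference, my understanding of HIVE-16683 is that the reader resolves the writer-version id recorded in the ORC file footer by indexing into a fixed table of known versions, so an id written by a newer library overruns the table on an unpatched reader. A minimal self-contained sketch of that failure shape (the table contents here are hypothetical, not the real Hive enum):

{code:java}
// Sketch only, not the actual Hive source: the shape of the failure in
// OrcFile.WriterVersion.from(int) before HIVE-16683. An old reader maps the
// writer-version id from the file footer to a name by array index, so an id
// from a newer writer (7, as in the stack trace) overruns the table.
class WriterVersionLookup {

    // Hypothetical subset of ids an old reader knows about (index == id).
    static final String[] KNOWN = {
        "ORIGINAL", "HIVE_8732", "HIVE_4243", "HIVE_12055", "HIVE_13083", "ORC_101", "ORC_135"
    };

    // Pre-HIVE-16683 behavior: index straight into the table.
    static String fromOld(int id) {
        return KNOWN[id]; // id 7 -> ArrayIndexOutOfBoundsException: 7
    }

    // Post-fix behavior: an unknown id degrades to a FUTURE sentinel instead of crashing.
    static String fromFixed(int id) {
        return (id >= 0 && id < KNOWN.length) ? KNOWN[id] : "FUTURE";
    }

    public static void main(String[] args) {
        System.out.println(fromFixed(3)); // HIVE_12055
        System.out.println(fromFixed(7)); // FUTURE
        try {
            fromOld(7);
        } catch (ArrayIndexOutOfBoundsException e) {
            // Mirrors the exception hive --orcfiledump reports on Hive_2.1.1.
            System.out.println("unpatched reader failed: " + e.getMessage());
        }
    }
}
{code}

If this reading is right, the exception appears regardless of which Spark writer implementation produced the file, as long as the file records a writer version that the Hive_2.1.1 reader does not know about.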
I am following up with our Hadoop vendor for a back-port of HIVE-16683 to
Hive_2.1.1. However, the confusing part of the Spark documentation needs to be
fixed, I believe.
Thank you.
Ramakrishna
> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -----------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version.
> (Linux Redhat)
> Reporter: Ramakrishna Prasad K S
> Priority: Major
> Attachments: image-2020-08-09-14-07-00-521.png
>
>
> Steps to reproduce the issue:
> -------------------------------
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>
> Step 1) Create an ORC file by using the default Spark_3.0 Native API from the
> spark shell.
> {code}
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+------+
> |               key| value|
> +------------------+------+
> |spark.sql.orc.impl|native|
> +------------------+------+
>
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)")
> res1: org.apache.spark.sql.DataFrame = []
>
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
>
> scala> val dFrame = spark.sql("select * from df_table")
> dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
>
> scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
> {code}
>
> Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following
> command to analyze or read metadata from the ORC files. As you see below, it
> fails to fetch the metadata from the ORC file.
> {code}
> [adpqa@irlhadoop1 bug]$ hive --orcfiledump
> /tmp/df_table/part-00000-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc
> Processing data file
> /tmp/df_table/part-00000-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
> {code}
> Step 3) Now Create ORC File using the Hive API (as suggested by Spark in
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting
> spark.sql.orc.impl as hive)
> {code}
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-----+
> |               key|value|
> +------------------+-----+
> |spark.sql.orc.impl| hive|
> +------------------+-----+
> scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
> scala> spark.sql("insert into df_table2 values('col1val1','col2val1')")
> res8: org.apache.spark.sql.DataFrame = []
>
> scala> val dFrame2 = spark.sql("select * from df_table2")
> dFrame2: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>
> scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")
> {code}
> Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following
> command to analyze or read metadata from the ORC files. As you see below, it
> fails with the same exception when fetching the metadata, even after
> following the workaround suggested by Spark of setting spark.sql.orc.impl to
> hive.
> {code}
> [adpqa@irlhadoop1 bug]$ hive --orcfiledump
> /tmp/df_table2/part-00000-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc
> Processing data file
> /tmp/df_table2/part-00000-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
> {code}
> *Note: The same case works fine if you try metadata fetch from Hive_2.3 or
> above versions.*
> *So the main concern here is that setting {{spark.sql.orc.impl}} to hive is
> not producing ORC files that will work with Hive_2.1.1 or below.*
> Can someone help here? Is there any other workaround available? Can this be
> looked into on priority? Thank you.
>
> References:
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] (workaround
> of setting spark.sql.orc.impl=hive is mentioned here which is not working):
> {quote}
> Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC
> files by default. To do that, spark.sql.orc.impl and
> spark.sql.orc.filterPushdown change their default values to native and true
> respectively. ORC files created by native ORC writer cannot be read by some
> old Apache Hive releases. Use spark.sql.orc.impl=hive to create the files
> shared with Hive 2.1.1 and older.
> {quote}
> https://issues.apache.org/jira/browse/SPARK-26932
> https://issues.apache.org/jira/browse/HIVE-16683
--
This message was sent by Atlassian Jira
(v8.3.4#803005)