[
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173779#comment-17173779
]
Ramakrishna Prasad K S edited comment on SPARK-32558 at 8/9/20, 8:57 AM:
-------------------------------------------------------------------------
Hi [~dongjoon]
Thanks for pointing out that this is a Hive bug. I am already aware of the
related Hive bug; I have referenced it in the Jira description itself.
However, the latest Spark documentation states that setting
spark.sql.orc.impl=hive generates ORC files that work with Hive_2.1.1 or below.
That is why I raised this bug: the workaround mentioned in the Spark
documentation is not working for me.
[https://spark.apache.org/docs/latest/sql-migration-guide.html]
!image-2020-08-09-14-07-00-521.png|width=725,height=91!
It clearly says to use spark.sql.orc.impl=hive to create files that work with
Hive_2.1.1 and older. When Hive_2.1.1 does not have the fix for this issue, how
is that a valid workaround?
Regarding the way I am generating the ORC file, I do not agree with your
comments. Please look at this closely.
First I create a normal Spark SQL table and load data into it with an insert
query. Then I read all the data from that table into a Spark DataFrame. Finally
I write the DataFrame contents out as a new ORC file (this flow works for all
the file formats):
{code:java}
scala> spark.sql("CREATE table df_table(col1 string,col2 string)")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert into df_table values('col1val1','col2val1')")
org.apache.spark.sql.DataFrame = []

scala> val dFrame = spark.sql("select * from df_table")
dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

scala> dFrame.show()
+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
{code}
The source and target in this case are independent of each other, I believe.
Once the data is in a DataFrame, it can be written out to any format (ORC,
Avro, Parquet, and so on) regardless of where the data came from.
Please note that I am not trying to create an ORC Spark SQL table here; I am
generating an ORC file using the native Spark_3.0 APIs.
How or from where the data got into the DataFrame is irrelevant here, I
believe; I am loading simple string data as well.
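To make the argument concrete, here is a minimal toy sketch (plain Java, all names invented, nothing to do with Spark internals) of that decoupling: the serializers below only ever see rows in memory and never know which source the rows were read from.

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Toy model: once rows sit in an in-memory structure (the DataFrame in the
// Spark example), any writer can serialize them; the writer never sees where
// the rows originally came from. A "row" is just a (col1, col2) pair here.
class WriterDecoupling {

    static String toCsv(List<String[]> rows) {
        return rows.stream()
                   .map(r -> String.join(",", r))
                   .collect(Collectors.joining("\n"));
    }

    static String toJson(List<String[]> rows) {
        return rows.stream()
                   .map(r -> "{\"col1\":\"" + r[0] + "\",\"col2\":\"" + r[1] + "\"}")
                   .collect(Collectors.joining(",", "[", "]"));
    }

    public static void main(String[] args) {
        // These rows could equally have come from a SQL table, a file, or a literal.
        List<String[]> rows = Arrays.asList(new String[]{"col1val1", "col2val1"});
        System.out.println(toCsv(rows));   // col1val1,col2val1
        System.out.println(toJson(rows));  // [{"col1":"col1val1","col2":"col2val1"}]
    }
}
{code}

Both writers consume the same in-memory rows, which is exactly the relationship between a DataFrame and the ORC/Avro/Parquet writers in Spark.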
The case below is also valid: data can be generated this way and written into
an ORC file, and the issue is observed with the file created in this case as
well.
{code:java}
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])

val records = (1 to 100).map { i =>
  Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}
records.toDF().write.format("orc").save("/tmp/orc_tgt33")
{code}
At the very least, we need to fix the Spark_3.0 documentation or explain there
in detail what the purpose of the spark.sql.orc.impl=hive flag is. If this flag
cannot generate ORC files compatible with Hive_2.1.1 and below, then what is it
for?
Our product is on Hive_2.1.1 and we make calls to the Hive_2.1.1 ORC APIs.
Regarding the tool I am using to retrieve metadata, *hive --orcfiledump*: it
internally calls the Hive_2.1.1 APIs, which is why I use it; with it I am able
to reproduce the problem.
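For reference, my understanding of HIVE-16683 is that the reader resolves the writer-version id recorded in the ORC file footer by indexing into a fixed table of known versions, so an id written by a newer library overruns the table on an unpatched reader. A minimal self-contained sketch of that failure shape (the table contents here are hypothetical, not the real Hive enum):

{code:java}
// Sketch only, not the actual Hive source: the shape of the failure in
// OrcFile.WriterVersion.from(int) before HIVE-16683. An old reader maps the
// writer-version id from the file footer to a name by array index, so an id
// from a newer writer (7, as in the stack trace) overruns the table.
class WriterVersionLookup {

    // Hypothetical subset of ids an old reader knows about (index == id).
    static final String[] KNOWN = {
        "ORIGINAL", "HIVE_8732", "HIVE_4243", "HIVE_12055", "HIVE_13083", "ORC_101", "ORC_135"
    };

    // Pre-HIVE-16683 behavior: index straight into the table.
    static String fromOld(int id) {
        return KNOWN[id]; // id 7 -> ArrayIndexOutOfBoundsException: 7
    }

    // Post-fix behavior: an unknown id degrades to a FUTURE sentinel instead of crashing.
    static String fromFixed(int id) {
        return (id >= 0 && id < KNOWN.length) ? KNOWN[id] : "FUTURE";
    }

    public static void main(String[] args) {
        System.out.println(fromFixed(3)); // HIVE_12055
        System.out.println(fromFixed(7)); // FUTURE
        try {
            fromOld(7);
        } catch (ArrayIndexOutOfBoundsException e) {
            // Mirrors the exception hive --orcfiledump reports on Hive_2.1.1.
            System.out.println("unpatched reader failed: " + e.getMessage());
        }
    }
}
{code}

If this reading is right, the exception appears regardless of which Spark writer implementation produced the file, as long as the file records a writer version that the Hive_2.1.1 reader does not know about.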
I am following up with our Hadoop vendor for a back-port of HIVE-16683 to
Hive_2.1.1. However, the confusing part of the Spark documentation needs to be
fixed, I believe.
Thank you.
Ramakrishna
> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -----------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version.
> (Linux Redhat)
> Reporter: Ramakrishna Prasad K S
> Priority: Major
> Attachments: image-2020-08-09-14-07-00-521.png
>
>
> Steps to reproduce the issue:
> -------------------------------
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>
> Step 1) Create an ORC file by using the default Spark_3.0 Native API from the
> spark shell.
> {code}
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+------+
> |               key| value|
> +------------------+------+
> |spark.sql.orc.impl|native|
> +------------------+------+
>
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)")
> res1: org.apache.spark.sql.DataFrame = []
>
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
>
> scala> val dFrame = spark.sql("select * from df_table")
> dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
>
> scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
> {code}
>
> Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following
> command to analyze or read metadata from the ORC files. As you see below, it
> fails to fetch the metadata from the ORC file.
> {code}
> [adpqa@irlhadoop1 bug]$ hive --orcfiledump
> /tmp/df_table/part-00000-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc
> Processing data file
> /tmp/df_table/part-00000-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
> {code}
> Step 3) Now Create ORC File using the Hive API (as suggested by Spark in
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting
> spark.sql.orc.impl as hive)
> {code}
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-----+
> |               key|value|
> +------------------+-----+
> |spark.sql.orc.impl| hive|
> +------------------+-----+
> scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
> scala> spark.sql("insert into df_table2 values('col1val1','col2val1')")
> res8: org.apache.spark.sql.DataFrame = []
>
> scala> val dFrame2 = spark.sql("select * from df_table2")
> dFrame2: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>
> scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")
> {code}
> Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following
> command to analyze or read metadata from the ORC files. As you see below, it
> fails with the same exception when fetching the metadata, even after
> following the workaround suggested by Spark of setting spark.sql.orc.impl to
> hive.
> {code}
> [adpqa@irlhadoop1 bug]$ hive --orcfiledump
> /tmp/df_table2/part-00000-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc
> Processing data file
> /tmp/df_table2/part-00000-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
> {code}
> *Note: The same case works fine if you try metadata fetch from Hive_2.3 or
> above versions.*
> *So the main concern here is that setting {{spark.sql.orc.impl}} to hive is
> not producing ORC files that will work with Hive_2.1.1 or below.*
> Can someone help here? Is there any other workaround available? Can this be
> looked into on priority? Thank you.
>
> References:
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] (workaround
> of setting spark.sql.orc.impl=hive is mentioned here which is not working):
> {quote}
> Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC
> files by default. To do that, spark.sql.orc.impl and
> spark.sql.orc.filterPushdown change their default values to native and true
> respectively. ORC files created by native ORC writer cannot be read by some
> old Apache Hive releases. Use spark.sql.orc.impl=hive to create the files
> shared with Hive 2.1.1 and older.
> {quote}
> https://issues.apache.org/jira/browse/SPARK-26932
> https://issues.apache.org/jira/browse/HIVE-16683
--
This message was sent by Atlassian Jira
(v8.3.4#803005)