[ https://issues.apache.org/jira/browse/HIVE-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marco Gaido updated HIVE-17280:
-------------------------------
    Description: 
Hive concatenation causes data loss if the ORC files in the table were written 
by Spark.

Here are the steps to reproduce the problem:
 - create a table;
{code:java}
hive
hive> create table aa (a string, b int) stored as orc;
{code}
 - insert 2 rows using Spark;
{code:java}
spark-shell
scala> case class AA(a:String, b:Int)
scala> val df = sc.parallelize(Array(AA("b",2),AA("c",3) )).toDF
scala> df.write.insertInto("aa")
{code}
 - change table schema;
{code:java}
hive
hive> alter table aa add columns(aa string, bb int);
{code}
 - insert 2 more rows using Spark;
{code:java}
spark-shell
scala> case class BB(a:String, b:Int, aa:String, bb:Int)
scala> val df = sc.parallelize(Array(BB("b",2,"b",2),BB("c",3,"c",3) )).toDF
scala> df.write.insertInto("aa")
{code}
 - at this point, a select statement run in Hive correctly returns *4 
rows* from the table; then run the concatenation;
{code:java}
hive
hive> alter table aa concatenate;
{code}


After the concatenation, a select returns only *3 rows, i.e. one row is missing*.
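
As a compact sketch of the check, assuming a fresh reproduction of the steps above on the same table {{aa}} (before the concatenation has been run), the same select can be issued immediately before and after the concatenation and the number of returned rows compared. According to the steps above, the first select should return 4 rows and the second only 3.
{code:java}
hive
hive> select * from aa;
hive> alter table aa concatenate;
hive> select * from aa;
{code}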


  was:
Hive concatenation causes data loss if the ORC files in the table were written 
by Spark.

Here are the steps to reproduce the problem:
 - create a table;
{code:java}
hive
hive> create table aa (a string, b int) stored as orc;
{code}
 - insert 2 rows using Spark;
{code:java}
spark-shell
scala> case class AA(a:String, b:Int)
scala> val df = sc.parallelize(Array(AA("b",2),AA("c",3) )).toDF
scala> df.write.insertInto("aa")
{code}
 - change table schema;
{code:java}
hive
hive> alter table aa add columns(aa string, bb int);
{code}
 - insert other 2 rows with Spark
{code:java}
spark-shell
scala> case class BB(a:String, b:Int, aa:String, bb:Int)
scala> val df = sc.parallelize(Array(BB("b",2,"b",2),BB("c",3,"c",3) )).toDF
scala> df.write.insertInto("aa")
{code}
 - at this point, running a select statement with Hive returns correctly 4 rows 
in the table; then run the concatenation
{code:java}
hive
hive> alter table aa concatenate;
{code}
At this point, a select returns only* 3 rows, ie. a row is missing*.



> Data loss in CONCATENATE ORC created by Spark
> ---------------------------------------------
>
>                 Key: HIVE-17280
>                 URL: https://issues.apache.org/jira/browse/HIVE-17280
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, Spark
>    Affects Versions: 1.2.1
>         Environment: Tested in HDP-2.6
>            Reporter: Marco Gaido
>
> Hive concatenation causes data loss if the ORC files in the table were 
> written by Spark.
> Here are the steps to reproduce the problem:
>  - create a table;
> {code:java}
> hive
> hive> create table aa (a string, b int) stored as orc;
> {code}
>  - insert 2 rows using Spark;
> {code:java}
> spark-shell
> scala> case class AA(a:String, b:Int)
> scala> val df = sc.parallelize(Array(AA("b",2),AA("c",3) )).toDF
> scala> df.write.insertInto("aa")
> {code}
>  - change table schema;
> {code:java}
> hive
> hive> alter table aa add columns(aa string, bb int);
> {code}
>  - insert 2 more rows using Spark;
> {code:java}
> spark-shell
> scala> case class BB(a:String, b:Int, aa:String, bb:Int)
> scala> val df = sc.parallelize(Array(BB("b",2,"b",2),BB("c",3,"c",3) )).toDF
> scala> df.write.insertInto("aa")
> {code}
>  - at this point, a select statement run in Hive correctly returns *4 
> rows* from the table; then run the concatenation;
> {code:java}
> hive
> hive> alter table aa concatenate;
> {code}
> After the concatenation, a select returns only *3 rows, i.e. one row is missing*.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
