[ https://issues.apache.org/jira/browse/HIVE-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146914#comment-16146914 ]

Prasanth Jayachandran commented on HIVE-17280:
----------------------------------------------

That is certainly not the format that Hive expects. After concatenation, merged 
and unmerged (incompatible) files get moved to a staging directory. Then 
MoveTask moves the files from the staging directory to the final destination 
directory (which, in the case of concatenation, is also the source directory). 
MoveTask makes certain assumptions about filenames for bucketing, speculative 
execution, etc. In the example files you provided, part-00000_copy_1 and 
part-00001_copy_1 will be considered the same file written by different tasks 
(from speculative execution), and the largest file will be picked as the winner 
of the speculative execution. This is the same issue as HIVE-17403. Hive 
usually writes files in the format 000000_0, where 000000 is the task id/bucket 
id and the digit after _ is the task attempt. I am working on a patch that will 
restrict concatenation for external tables; for Hive managed tables, the load 
data command will make sure the filenames conform to Hive's expectations. 
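
For illustration, here is a minimal Scala sketch of why those two filenames 
collide. The regex is my approximation of Hive's file-name-to-task-id parsing 
(cf. Utilities.getTaskIdFromFilename); treat it as a sketch of the failure 
mode, not the exact implementation.

{code:java}
object TaskIdSketch {
  // Assumed pattern (approximation, not Hive's exact code): an optional
  // prefix, the task id digits, an optional _<attempt> suffix, and an
  // optional file extension.
  private val FileNameToTaskId = """^.*?([0-9]+)(_[0-9])?(\..*)?$""".r

  def taskId(fileName: String): Option[String] = fileName match {
    case FileNameToTaskId(id, _, _) => Some(id)
    case _                          => None
  }

  def main(args: Array[String]): Unit = {
    // A Hive-written file parses as task id 000000, but both Spark-written
    // files parse as task id 1, so they look like two attempts of one task.
    Seq("000000_0", "part-00000_copy_1", "part-00001_copy_1").foreach { f =>
      println(s"$f -> taskId=${taskId(f)}")
    }
  }
}
{code}

With that parsing, a dedup step that keeps only the largest file per task id 
(the speculative-execution winner) silently drops one of the two Spark files, 
which matches the 4-rows-before / 3-rows-after symptom described below.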

> Data loss in CONCATENATE ORC created by Spark
> ---------------------------------------------
>
>                 Key: HIVE-17280
>                 URL: https://issues.apache.org/jira/browse/HIVE-17280
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, Spark
>    Affects Versions: 1.2.1
>         Environment: Spark 1.6.3
>            Reporter: Marco Gaido
>            Priority: Critical
>
> Hive concatenation causes data loss if the ORC files in the table were 
> written by Spark.
> Here are the steps to reproduce the problem:
>  - create a table;
> {code:java}
> hive
> hive> create table aa (a string, b int) stored as orc;
> {code}
>  - insert 2 rows using Spark;
> {code:java}
> spark-shell
> scala> case class AA(a:String, b:Int)
> scala> val df = sc.parallelize(Array(AA("b", 2), AA("c", 3))).toDF
> scala> df.write.insertInto("aa")
> {code}
>  - change the table schema;
> {code:java}
> hive
> hive> alter table aa add columns(aa string, bb int);
> {code}
>  - insert 2 more rows with Spark;
> {code:java}
> spark-shell
> scala> case class BB(a:String, b:Int, aa:String, bb:Int)
> scala> val df = sc.parallelize(Array(BB("b", 2, "b", 2), BB("c", 3, "c", 3))).toDF
> scala> df.write.insertInto("aa")
> {code}
>  - at this point, a select statement run in Hive correctly returns *4 
> rows*; then run the concatenation
> {code:java}
> hive
> hive> alter table aa concatenate;
> {code}
> At this point, a select returns only *3 rows, i.e. a row is missing*.


