[ 
https://issues.apache.org/jira/browse/HIVE-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754632#action_12754632
 ] 

Todd Lipcon commented on HIVE-718:
----------------------------------

Even in the case of LOAD...OVERWRITE, we currently lack atomicity. The old 
directory is deleted before the new directory is moved in, so there's a small 
window of time in which neither old nor new data is present. Sure, this is 
probably on the order of half a second, but it is still not correct. There's 
also the general case of a query that has already computed its input splits 
being affected by a concurrent LOAD DATA OVERWRITE. I think the versioning 
solution gets us partly there, but it will be very tricky to implement 
correctly while maintaining performance, since HDFS has no built-in 
copy-on-write directory snapshot. So I think a significant amount of work will 
have to be done in the metastore, and we'll definitely have to drop the 
"external process load" ability that currently exists.
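
To make the versioning idea concrete, here is a rough sketch of what I have in 
mind. It is not Hive code: the local filesystem stands in for HDFS, and 
VersionedTable, load_overwrite, resolve and read_rows are hypothetical names 
made up for illustration. Each overwrite writes into a fresh directory and then 
flips a single pointer, so a query that has already resolved its input path 
keeps reading the old version.

```python
import os
import shutil
import tempfile


class VersionedTable:
    """Hypothetical stand-in for a metastore entry that points at a versioned data directory."""

    def __init__(self, root):
        self.root = root
        self.current = None  # name of the live version directory, or None
        self._next = 0

    def load_overwrite(self, rows):
        # Write the new data into a fresh directory that no reader can see yet.
        version = "v%d" % self._next
        self._next += 1
        path = os.path.join(self.root, version)
        os.makedirs(path)
        with open(os.path.join(path, "part-00000"), "w") as f:
            f.write("\n".join(rows) + "\n")
        # Single pointer flip: readers see either the old or the new version.
        old, self.current = self.current, version
        return old  # caller decides when the old version is safe to delete

    def resolve(self):
        # A query resolves the pointer once, when it computes its input splits.
        return os.path.join(self.root, self.current)


def read_rows(path):
    with open(os.path.join(path, "part-00000")) as f:
        return f.read().splitlines()


def demo():
    root = tempfile.mkdtemp()
    try:
        table = VersionedTable(root)
        table.load_overwrite(["a", "b", "c"])
        planned = table.resolve()           # query has computed its splits
        old = table.load_overwrite(["x"])   # concurrent overwrite
        in_flight = read_rows(planned)      # still reads the old snapshot
        fresh = read_rows(table.resolve())  # new queries see the new data
        shutil.rmtree(os.path.join(root, old))  # clean up once no reader needs it
        return in_flight, fresh
    finally:
        shutil.rmtree(root)
```

In this sketch the pointer lives in memory; in Hive it would have to live in 
the metastore. A process that writes straight into the table directory never 
goes through the pointer, which is why I think external loads would have to go. 
Deciding when an old version is safe to delete is the hard part, and the sketch 
does not address it.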

Should we open a new JIRA for the general concurrency control/locking issues 
we're discussing here? It seems like this ticket should be used for the 
0.3->0.4 regression, even if it's just a temporary fix. We can then put a more 
general correct solution on the roadmap for 0.5 or later, since it's looking 
like it will be a complicated project.

> Load data inpath into a new partition without overwrite does not move the file
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-718
>                 URL: https://issues.apache.org/jira/browse/HIVE-718
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.4.0
>            Reporter: Zheng Shao
>         Attachments: HIVE-718.1.patch, HIVE-718.2.patch, hive-718.txt
>
>
> The bug can be reproduced as follows. Note that it only happens for 
> partitioned tables. The select after the first load returns nothing, while 
> the second returns the data correctly.
> insert.txt in the current local directory contains 3 lines: "a", "b" and "c".
> {code}
> > create table tmp_insert_test (value string) stored as textfile;
> > load data local inpath 'insert.txt' into table tmp_insert_test;
> > select * from tmp_insert_test;
> a
> b
> c
> > create table tmp_insert_test_p (value string) partitioned by (ds string) stored as textfile;
> > load data local inpath 'insert.txt' into table tmp_insert_test_p partition (ds = '2009-08-01');
> > select * from tmp_insert_test_p where ds = '2009-08-01';
> > load data local inpath 'insert.txt' into table tmp_insert_test_p partition (ds = '2009-08-01');
> > select * from tmp_insert_test_p where ds = '2009-08-01';
> a       2009-08-01
> b       2009-08-01
> c       2009-08-01
> {code}
