[ https://issues.apache.org/jira/browse/HIVE-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754632#action_12754632 ]
Todd Lipcon commented on HIVE-718:
----------------------------------
Even in the case of LOAD...OVERWRITE, we currently lack atomicity. The old
directory is deleted before the new directory is moved in, so there is a small
window of time in which neither the old nor the new data is present. Sure, this
is probably on the order of half a second, but it is still not correct. There's
also the general case of a query that has already computed its input splits
being affected by a concurrent LOAD DATA OVERWRITE. I think the versioning
solution gets us partly there, but it will be very tricky to implement
correctly while maintaining performance, since HDFS has no built-in way to take
a copy-on-write directory snapshot. So I think a significant amount of work
will have to be done in the metastore, and we'll definitely have to drop the
"external process load" ability that currently exists.
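
To make the window concrete, here is a minimal sketch of the two orderings on a
local filesystem. It is only an analogy under stated assumptions: it is not
Hive or HDFS code, and the function names, the `.old` backup suffix, the
`staged` directory and the `part-00000` file name are all hypothetical. It
shows that delete-then-move leaves the partition path empty for the duration
of the delete, while renaming the old directory aside first still leaves a
gap, only a shorter one in which the old data has not yet been destroyed.

```python
import os
import shutil
import tempfile


def overwrite_delete_then_move(target, staged, probe):
    """Mimic the ordering described above: delete old, then move new in."""
    shutil.rmtree(target)          # old data is gone from here on
    probe("after delete")          # a concurrent reader would see nothing
    os.rename(staged, target)      # new data becomes visible only now
    probe("after move")


def overwrite_rename_aside(target, staged, probe):
    """Alternative ordering (an assumption, not Hive's behavior):
    park the old data under a backup name before moving the new data in."""
    backup = target + ".old"       # hypothetical naming scheme
    os.rename(target, backup)      # old data still exists, just renamed
    probe("after rename aside")    # target path is still briefly missing
    os.rename(staged, target)      # new data becomes visible
    probe("after move")
    shutil.rmtree(backup)          # the slow delete happens outside the gap


def make_dir(root, name, content):
    """Create a directory holding one data file with the given content."""
    path = os.path.join(root, name)
    os.mkdir(path)
    with open(os.path.join(path, "part-00000"), "w") as f:
        f.write(content)
    return path


def run(overwrite):
    """Run one overwrite and record what a reader of `target` sees per step."""
    root = tempfile.mkdtemp()
    try:
        target = make_dir(root, "ds=2009-08-01", "old\n")
        staged = make_dir(root, "staged", "new\n")
        seen = []

        def probe(step):
            data_file = os.path.join(target, "part-00000")
            if os.path.exists(data_file):
                with open(data_file) as f:
                    seen.append((step, f.read().strip()))
            else:
                seen.append((step, None))  # neither old nor new data visible

        overwrite(target, staged, probe)
        return seen
    finally:
        shutil.rmtree(root)


if __name__ == "__main__":
    print(run(overwrite_delete_then_move))
    print(run(overwrite_rename_aside))
```

In both variants the probe between the two steps finds no data at the partition
path, which is why reordering the filesystem operations alone does not give
atomicity and some form of versioning or locking is still needed.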
Should we open a new JIRA for the general concurrency control/locking issues
we're discussing here? It seems like this ticket should be used for the
0.3->0.4 regression, even if the fix is only temporary. We can then put a more
general, correct solution on the roadmap for 0.5 or later, since it looks like
it will be a complicated project.
> Load data inpath into a new partition without overwrite does not move the file
> ------------------------------------------------------------------------------
>
> Key: HIVE-718
> URL: https://issues.apache.org/jira/browse/HIVE-718
> Project: Hadoop Hive
> Issue Type: Bug
> Affects Versions: 0.4.0
> Reporter: Zheng Shao
> Attachments: HIVE-718.1.patch, HIVE-718.2.patch, hive-718.txt
>
>
> The bug can be reproduced as follows. Note that it only happens for
> partitioned tables. The select after the first load returns nothing, while
> the second returns the data correctly.
> insert.txt in the current local directory contains 3 lines: "a", "b" and "c".
> {code}
> > create table tmp_insert_test (value string) stored as textfile;
> > load data local inpath 'insert.txt' into table tmp_insert_test;
> > select * from tmp_insert_test;
> a
> b
> c
> > create table tmp_insert_test_p ( value string) partitioned by (ds string)
> > stored as textfile;
> > load data local inpath 'insert.txt' into table tmp_insert_test_p partition
> > (ds = '2009-08-01');
> > select * from tmp_insert_test_p where ds= '2009-08-01';
> > load data local inpath 'insert.txt' into table tmp_insert_test_p partition
> > (ds = '2009-08-01');
> > select * from tmp_insert_test_p where ds= '2009-08-01';
> a 2009-08-01
> b 2009-08-01
> c 2009-08-01
> {code}