The first query will not return until it has copied the files to the destination directory, and that operation is atomic (FileSystem.rename() guarantees it). Since the second query is not executed until the first query returns, this problem may be due to a bug in HDFS (highly unlikely), an issue with the HDFS configuration, or something related to EC2.
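[Editor's note, not part of the original thread: the rename atomicity Prasad refers to can be illustrated with a local-filesystem analogue. The sketch below uses Python's os.replace(), which plays the role of HDFS FileSystem.rename(): a reader opening the destination path sees either the old file or the new file in full, never a partial one. Paths and contents are made up for illustration.]

```python
import os
import tempfile

# Local analogue of an atomic HDFS rename: os.replace() swaps the
# destination in a single step on POSIX filesystems.
workdir = tempfile.mkdtemp()
dest = os.path.join(workdir, "part-00000")

# Existing file at the destination path.
with open(dest, "w") as f:
    f.write("old data")

# Write the new contents to a temporary path first...
tmp = os.path.join(workdir, ".part-00000.tmp")
with open(tmp, "w") as f:
    f.write("new data")

# ...then atomically move it into place. There is no moment at which
# "dest" is missing or half-written.
os.replace(tmp, dest)

with open(dest) as f:
    print(f.read())  # -> new data
```

Note that atomicity only protects readers of the *same* path; it does not help a reader holding the name of a file that an overwrite has deleted, which is the failure mode discussed below.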
The second query knows the file name ‘sessionsFacts_P20090909T021823L20090908T09-r-00006’, so the Hive client was able to call getFileStatus() on it successfully, but the mapper (of the second query) is not able to do the same. So either this file was deleted after the Hive client accessed it but before the mapper accessed it, or the machine on which the mapper is executing cannot see this file. Can you manually check whether the file exists at all after the job fails?

Prasad

________________________________
From: Eva Tse <[email protected]>
Reply-To: <[email protected]>
Date: Wed, 9 Sep 2009 10:19:24 -0700
To: <[email protected]>
Subject: Re: Files does not exist error: concurrency control on hive queries...

Prasad,

We believe the problem is that one of the queries is doing an ‘insert overwrite ... select from’, which actually deletes and merges the small files. The other query somehow couldn’t find the files it thought it had seen before, and failed. So it looks like a concurrency issue.

Yongqiang, could you elaborate a bit on why you say this is not a bug?

Thanks,
Eva.

On 9/9/09 9:55 AM, "Prasad Chakka" <[email protected]> wrote:

If a certain input file/dir does not exist, then the job can’t be submitted. Since only a few reducers are failing, the problem could be something else.

Eva, does the same job succeed on a second try? I.e., is the file/dir available eventually? What is the replication factor?

Prasad

________________________________
From: Yongqiang He <[email protected]>
Reply-To: <[email protected]>
Date: Wed, 9 Sep 2009 04:07:31 -0700
To: <[email protected]>
Subject: Re: Files does not exist error: concurrency control on hive queries...

Hi Eva,

After a closer look at the code, I think this is not a bug. We need to find out how to avoid it.

Thanks,
Yongqiang

On 09-9-9 1:31 PM, "He Yongqiang" <[email protected]> wrote:

Hi Eva,

Can you open a new jira for this? And let’s discuss and resolve this issue.
I guess this is because the partition metadata is added before the data is available.

Thanks,
Yongqiang

On 09-9-9 1:18 PM, "Eva Tse" <[email protected]> wrote:

We are planning to start enabling ad-hoc querying on our Hive warehouse, and when we tested some concurrent queries we found the following issue:

Query 1 – doing ‘insert overwrite table yyy .... partition (dateint = xxx) select ... from yyy where dateint = xxx’. This is done to merge small files within a partition in table yyy.

Query 2 – doing some select on the same table, joining another table.

What we found is that query 2 would fail with the following exception in multiple reducers:

java.io.FileNotFoundException: File does not exist: hdfs://xxxxxxxxxxxxx.ec2.internal:9000/user/hive/dataeng/warehouse/nccp_session_facts/dateint=20090908/hour=9/sessionsFacts_P20090909T021823L20090908T09-r-00006
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
        at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:671)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:63)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:236)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:336)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

Is this expected? If so, is there a jira, or is it planned to be addressed? We are trying to think of a workaround, but haven’t thought of a good one, as the swapping of files would ideally be handled inside Hive.

Please let us know your feedback.

Thanks,
Eva.
