We found this error had to do with the Hive query plan getting stepped on because of shared state in org.apache.hadoop.hive.ql.exec.Utilities.
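The shared-state failure mode can be sketched as follows. The class and field names here are hypothetical (the actual fix is the patch attached to HIVE-80), but the pattern is the same: a single static slot for the query plan lets concurrent queries clobber each other, while keying the cache per plan path keeps them isolated.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the failure mode: one shared static slot for the
// serialized query plan, versus a cache keyed by each query's plan path.
public class PlanCacheSketch {
    // Broken pattern: one static slot shared by all concurrent queries.
    static String sharedPlan;

    // Safer pattern: each query's plan stored under its own key.
    static final Map<String, String> planByPath = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        // Query A publishes its plan, then query B overwrites the shared slot.
        sharedPlan = "planA";
        sharedPlan = "planB";            // query A's plan is now lost
        System.out.println(sharedPlan);  // planB, for both queries

        // With per-path keys, both plans survive concurrent execution.
        planByPath.put("/tmp/queryA/plan.xml", "planA");
        planByPath.put("/tmp/queryB/plan.xml", "planB");
        System.out.println(planByPath.get("/tmp/queryA/plan.xml"));  // planA
    }
}
```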
I've attached a patch to HIVE-80 that fixed this for us.

-Cliff

On 09/09/2009 01:29 PM, Prasad Chakka wrote:
> The first query will not return until it has copied the files to the
> destination directory, and that operation is atomic (FileSystem.rename()
> guarantees it). Since the second query is not executed until the first
> query returns, this problem may be due to a bug in HDFS (highly
> unlikely), an issue with the HDFS configuration, or something related to
> EC2.
>
> The second query knows the file name
> 'sessionsFacts_P20090909T021823L20090908T09-r-00006', so the Hive client
> was able to call getFileStatus() on it successfully, but the mapper (of
> the second query) is not able to do the same thing. So either this file
> was deleted after the Hive client accessed it but before the mapper
> accessed it, or the machine on which the mapper is being executed can't
> see this file. Can you manually check whether the file exists at all
> after the job fails?
>
> Prasad
>
> ------------------------------------------------------------------------
> *From:* Eva Tse <[email protected]>
> *Reply-To:* <[email protected]>
> *Date:* Wed, 9 Sep 2009 10:19:24 -0700
> *To:* <[email protected]>
> *Subject:* Re: File does not exist error: concurrency control on hive
> queries...
>
> Prasad,
> We believe the problem is that one of the queries is doing an 'insert
> overwrite ... select from', which actually deletes and merges the small
> files. The other query somehow couldn't find files that it thought it
> had seen before, and failed. So it looks like a concurrency issue.
>
> Yongqiang,
> Could you elaborate a bit on why you say this is not a bug?
>
> Thanks,
> Eva.
>
> On 9/9/09 9:55 AM, "Prasad Chakka" <[email protected]> wrote:
>
> If a certain input file/dir does not exist, then the job can't be
> submitted. Since only a few reducers are failing, the problem could be
> something else.
> Eva, does the same job succeed on a second try? I.e., is the file/dir
> available eventually?
> What is the replication factor?
>
> Prasad
>
> ------------------------------------------------------------------------
> *From:* Yongqiang He <[email protected]>
> *Reply-To:* <[email protected]>
> *Date:* Wed, 9 Sep 2009 04:07:31 -0700
> *To:* <[email protected]>
> *Subject:* Re: File does not exist error: concurrency control on hive
> queries...
>
> Hi Eva,
> After a closer look at the code, I think this is not a bug. We need to
> find out how to avoid this.
>
> Thanks,
> Yongqiang
>
> On 09-9-9 1:31 PM, "He Yongqiang" <[email protected]> wrote:
>
> Hi Eva,
> Can you open a new JIRA for this? And let's discuss and resolve this
> issue.
> I guess this is because the partition metadata is added before the data
> is available.
>
> Thanks,
> Yongqiang
>
> On 09-9-9 1:18 PM, "Eva Tse" <[email protected]> wrote:
>
> We are planning to start enabling ad-hoc querying on our Hive
> warehouse, and we tested some concurrent queries and found the
> following issue:
>
> Query 1 – doing 'insert overwrite table yyy ... partition
> (dateint = xxx) select ... from yyy where dateint = xxx'. This is done
> to merge small files within a partition of table yyy.
> Query 2 – doing some select on the same table, joining another table.
>
> What we found is that query 2 would fail with the following exceptions
> in multiple reducers.
> java.io.FileNotFoundException: File does not exist:
> hdfs://xxxxxxxxxxxxx.ec2.internal:9000/user/hive/dataeng/warehouse/nccp_session_facts/dateint=20090908/hour=9/sessionsFacts_P20090909T021823L20090908T09-r-00006
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>         at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:671)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
>         at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
>         at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:63)
>         at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:236)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:336)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> Is this expected? If so, is there a JIRA for it, or is it planned to be
> addressed? We are trying to think of a workaround, but haven't come up
> with a good one, as the swapping of files would ideally be handled
> inside Hive.
>
> Please let us know your feedback.
>
> Thanks,
> Eva.
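The race Eva describes can be reconstructed in miniature on a local filesystem. This is a hypothetical sketch, not Hive's actual code path: the reader (query 2) lists the partition's files when planning its splits, the writer (query 1's INSERT OVERWRITE) then deletes and merges those files, and the reader's later open of a remembered path hits exactly the "File does not exist" condition in the stack trace above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical reconstruction of the race between an overwriting query
// and a concurrent reader, using local files in place of HDFS.
public class StaleSplitSketch {
    // Returns whether the path the reader remembered still exists after
    // the writer's overwrite; false reproduces the failure condition.
    static boolean staleSplitStillExists() throws IOException {
        Path partition = Files.createTempDirectory("dateint-20090908");
        Path small = partition.resolve("part-r-00006");
        Files.writeString(small, "rows");

        // Query 2 plans its job: it lists the partition and remembers paths.
        List<Path> splits;
        try (Stream<Path> s = Files.list(partition)) {
            splits = s.collect(Collectors.toList());
        }

        // Query 1's INSERT OVERWRITE replaces the partition's contents:
        // the small file is deleted and its rows land in a merged file.
        Files.delete(small);
        Files.writeString(partition.resolve("part-r-00000"), "merged rows");

        // Query 2's mapper now tries to open the remembered path.
        return Files.exists(splits.get(0));
    }

    public static void main(String[] args) throws IOException {
        System.out.println("stale split still exists: " + staleSplitStillExists());
    }
}
```

The sketch suggests why rerunning query 2 succeeds: a fresh job lists the partition again and sees only the merged files.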
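Prasad's point upthread, that FileSystem.rename() is atomic, is the guarantee Hive's load path relies on: a concurrent reader sees either the old file set or the new one, never a half-moved file. A minimal local-filesystem analogue using java.nio (the file names are made up for illustration) is:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustration of atomic rename semantics on the local filesystem,
// analogous to HDFS FileSystem.rename() within one namespace.
public class AtomicRenameSketch {
    static boolean renameAtomically() throws IOException {
        Path dir = Files.createTempDirectory("rename-demo");
        Path tmp = dir.resolve("_tmp.part-r-00006");
        Path dest = dir.resolve("part-r-00006");

        Files.writeString(tmp, "session facts");
        // ATOMIC_MOVE throws rather than falling back to copy-then-delete,
        // so a concurrent reader sees either no file or the whole file.
        Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE);

        return Files.exists(dest) && !Files.exists(tmp);
    }

    public static void main(String[] args) throws IOException {
        System.out.println("rename was atomic: " + renameAtomically());
    }
}
```

Note that atomicity of the rename does not prevent the failure in this thread: the problem is not a half-written file but that the overwrite deletes file names a concurrent reader has already recorded in its job plan.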
