I thought your script runs the two jobs sequentially. If these two queries
are run in parallel, then the error is to be expected, since Hive doesn’t
try to acquire locks before reading or writing. I don’t think there are any
plans to support this kind of locking (it can only be done if all queries go
through HiveServer; otherwise a lot of orphaned locks would bring the system
to a halt). I think you should do some kind of locking yourself (possibly
with HDFS files) to prevent the queries from being executed simultaneously.
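
A minimal sketch of what that HDFS-file locking could look like, relying on
FileSystem.create(path, overwrite=false) failing atomically when the file
already exists. The lock path and polling interval are made-up placeholders,
and (per the caveat above) a job that dies without releasing the lock will
leave it orphaned:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsQueryLock {
        // Hypothetical lock location; any path both jobs agree on works.
        private static final Path LOCK =
            new Path("/user/hive/locks/nccp_session_facts.lock");

        // create(path, overwrite=false) fails if the file exists, and the
        // namenode applies it atomically, so at most one job gets the lock.
        static boolean tryLock(FileSystem fs) {
            try {
                fs.create(LOCK, false).close();
                return true;
            } catch (IOException alreadyHeld) {
                return false;
            }
        }

        static void unlock(FileSystem fs) throws IOException {
            fs.delete(LOCK, false);
        }

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            while (!tryLock(fs)) {
                Thread.sleep(30000); // poll until the competing job finishes
            }
            try {
                // run the merge or the read-only queries here
            } finally {
                unlock(fs); // if the JVM dies before this, the lock is orphaned
            }
        }
    }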

Any other ideas?

Prasad


________________________________
From: Eva Tse <[email protected]>
Reply-To: <[email protected]>
Date: Wed, 9 Sep 2009 12:36:11 -0700
To: <[email protected]>, Dhruba Borthakur <[email protected]>
Subject: Re: Files does not exist error: concurrency control on hive queries...

Hi Prasad,

Are you implying that the expected behavior is for Hive to run these
queries sequentially because one is read/write and one is read-only?

To clarify, these two queries are running concurrently in two separate
jobs, as described below.

Query 1 is run within a job that essentially does the following:
For every hour:
   - parse log files to generate completed-sessions information
   - load the completed sessions into 48 partitions (for the prior 48 hours)
   - merge small files using ‘insert overwrite ... select from’ on every
other 8 partitions. Essentially, we issue 6 separate queries to merge 6
partitions at the same time, not sequentially. (We do this to minimize the
time required.) This is query 1; a rough sketch of the parallel merges is
below.
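
The parallel issuing looks roughly like the following sketch; the table
name, partition values, and the choice of shelling out to the hive CLI are
illustrative placeholders, not our actual job code:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelMerges {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(6);
            for (int i = 0; i < 6; i++) {
                // Placeholder merge statement; the real job parameterizes
                // the table and the dateint/hour values.
                final String sql = "insert overwrite table nccp_session_facts"
                    + " partition (dateint=20090908, hour=" + i + ")"
                    + " select * from nccp_session_facts"
                    + " where dateint=20090908 and hour=" + i;
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            // one hive CLI invocation per merge query
                            new ProcessBuilder("hive", "-e", sql)
                                .start().waitFor();
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }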

Query 2 is run within another job that does a select over 24 partitions
(i.e., a daily summary) for the previous day. For testing purposes, this job
just runs the query in a loop.

The error comes from query 2, which reports ‘file not found’ for a file that
query 1 is merging at that point in time.

We need to rerun the test and catch the failure as it happens to see whether
the file exists at that instant. In the previous run the merge query
succeeded, so I would imagine the file was gone after the merge, but I am
not sure whether it was still there at the instant the failure happened.

Thanks for the help!
Eva.

On 9/9/09 10:29 AM, "Prasad Chakka" <[email protected]> wrote:

The first query will not return until it has copied the files to the dest
directory, and that operation is atomic (FileSystem.rename() guarantees
that). Since the second query is not executed until the first query returns,
this problem may be due to a bug in HDFS (highly unlikely), an issue with
the HDFS configuration, or something related to EC2.
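
For reference, the publish step being relied on is roughly this
write-then-rename pattern (the paths here are illustrative placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RenamePublish {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path scratch = new Path("/tmp/demo-scratch/part-00000"); // job output
            Path dest = new Path("/tmp/demo-final/part-00000");      // final home
            fs.create(scratch, true).close(); // stand-in for writing the data
            fs.mkdirs(dest.getParent());      // rename needs the parent to exist
            // rename() is one atomic namenode metadata operation: a reader sees
            // the destination either absent or complete, never half-copied.
            fs.rename(scratch, dest);
        }
    }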

The second query knows the file name
‘sessionsFacts_P20090909T021823L20090908T09-r-00006’, so the Hive client was
able to call getFileStatus() on it successfully, but the mapper (of the
second query) is not able to do the same. So either this file was deleted
after the Hive client accessed it but before the mapper accessed it, or the
machine on which the mapper is executed can’t see this file. Can you
manually check whether the file exists at all after the job fails?
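
A contrived sketch of that window, with the competing delete inlined into
one program to force the failure (the path is a scratch placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CheckThenReadRace {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path part = new Path("/tmp/race-demo/part-r-00006"); // stand-in file
            fs.create(part, true).close();
            // Plan time: the Hive client stats the file and builds splits.
            long len = fs.getFileStatus(part).getLen();
            // A concurrent 'insert overwrite' can delete and rewrite the
            // partition in this window; here the delete is inlined to force it.
            fs.delete(part, false);
            // Task time: opening the planned split now fails just like the
            // FileNotFoundException in the reported stack trace.
            fs.open(part).close(); // throws java.io.FileNotFoundException
            System.out.println("unreachable: planned to read " + len + " bytes");
        }
    }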

Prasad


________________________________
From: Eva Tse <[email protected]>
Reply-To: <[email protected]>
Date: Wed, 9 Sep 2009 10:19:24 -0700
To: <[email protected]>
Subject: Re: Files does not exist error: concurrency control on hive queries...


Prasad,
We believe the problem is that one of the queries is doing an ‘insert
overwrite ... select from’, which actually deletes and merges the small
files. The other query somehow couldn’t find files that it thought it had
seen before, and failed. So it looks like a concurrency issue.

Yongqiang,
Could you elaborate a bit on why you say this is not a bug?

Thanks,
Eva.


On 9/9/09 9:55 AM, "Prasad Chakka" <[email protected]> wrote:

If a certain input file/dir does not exist, then the job can’t be submitted.
Since only a few reducers are failing, the problem could be something else.
Eva, does the same job succeed on a second try? I.e., is the file/dir
available eventually? What is the replication factor?

Prasad


________________________________
From: Yongqiang He <[email protected]>
Reply-To: <[email protected]>
Date: Wed, 9 Sep 2009 04:07:31 -0700
To: <[email protected]>
Subject: Re: Files does not exist error: concurrency control on hive queries...

Hi Eva,
   After a close look at the code, I think this is not a bug. We need to
find out how to avoid this.

Thanks,
Yongqiang
On 09-9-9 1:31 PM, "He Yongqiang" <[email protected]> wrote:

Hi Eva,
    Can you open a new JIRA for this? And let’s discuss and resolve this issue.
I guess this is because the partition metadata is added before the data is 
available.

Thanks
Yongqiang
On 09-9-9 1:18 PM, "Eva Tse" <[email protected]> wrote:


We are planning to start enabling ad-hoc querying on our Hive warehouse. We
tested some concurrent queries and found the following issue:

Query 1 – doing ‘insert overwrite table yyy .... partition (dateint = xxx)
select ... from yyy where dateint = xxx’. This is done to merge small files
within a partition of table yyy.
Query 2 – doing a select on the same table, joining another table.

What we found is that query 2 would fail with the following exceptions in 
multiple reducers.
java.io.FileNotFoundException: File does not exist: hdfs://xxxxxxxxxxxxx.ec2.internal:9000/user/hive/dataeng/warehouse/nccp_session_facts/dateint=20090908/hour=9/sessionsFacts_P20090909T021823L20090908T09-r-00006
 at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
 at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:671)
 at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
 at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
 at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
 at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:63)
 at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:236)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:336)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)

Is this expected? If so, is there a JIRA, or is it planned to be addressed?
We are trying to think of a workaround, but haven’t come up with a good one,
as the swapping of files would ideally be handled inside Hive.

Please let us know your feedback.

Thanks,
Eva.





