ZooKeeper sounds like a decent alternative, though it would add a new dependency for deployment. Maybe we could open a JIRA for it first to track this issue?

Thanks,
Eva.
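(For concreteness, here is a rough sketch of the ephemeral-node locking scheme Prasad describes below, using the plain ZooKeeper Java client. This is illustrative only, not existing Hive code: the znode layout, session timeout, and class name are invented, and it assumes the parent znodes already exist.)

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    /** Hypothetical per-partition lock built on ZooKeeper ephemeral nodes. */
    public class PartitionLock {
        private final ZooKeeper zk;

        public PartitionLock(String connectString) throws Exception {
            // The session timeout acts as the lease: if a client dies without
            // releasing its lock, ZooKeeper expires the session and deletes the
            // ephemeral node, so no hanging locks are left behind.
            this.zk = new ZooKeeper(connectString, 30000, event -> { });
        }

        /** Try to take an exclusive lock on a partition; false if already held. */
        public boolean tryLock(String table, String partition) throws Exception {
            try {
                zk.create("/hive_locks/" + table + "/" + partition,  // made-up layout
                          new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return true;
            } catch (KeeperException.NodeExistsException e) {
                return false;  // another client holds the lock
            }
        }

        public void unlock(String table, String partition) throws Exception {
            zk.delete("/hive_locks/" + table + "/" + partition, -1);  // -1 = any version
        }
    }

A merge job would call tryLock() on the partition before its 'insert overwrite' and unlock() afterwards; if the client crashes, the expired session cleans up the node automatically, which is exactly the lease behavior discussed in the thread below.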
On 9/9/09 2:49 PM, "Prasad Chakka" <[email protected]> wrote:

> Yeah, the metastore DB is the logical place to do locking, but there would
> have to be periodic cleanups (when clients die without releasing locks),
> etc., which is hacky and so less preferable. Another option is to point
> Hive at a ZooKeeper cluster and ask Hive to use it for locking. That way,
> those who are not concerned about concurrency control don't have to
> install ZooKeeper, but others can. ZooKeeper provides leases, so there
> won't be any problem of hanging locks, and it will be easier for admins
> to clean them up.
>
> I suppose it depends on whoever wants to take this task up :)
>
> Prasad
>
> From: Eva Tse <[email protected]>
> Reply-To: <[email protected]>
> Date: Wed, 9 Sep 2009 14:32:20 -0700
> To: <[email protected]>
> Subject: Re: Files does not exist error: concurrency control on hive queries...
>
> Regardless of whether the user uses a HiveServer, it looks like the
> logical place to do locking or concurrency control would be the metastore
> DB. This is actually one big advantage of Hive. The r/w lock or access
> control could be achieved with a DB row holding a lock count for each
> partition, etc. This might be over-simplifying it, but the metastore DB
> seems to be the ideal candidate. Thoughts?
>
> On 9/9/09 12:52 PM, "Prasad Chakka" <[email protected]> wrote:
>
>> I thought your script ran the two jobs sequentially. If these two queries
>> are run in parallel, then the error is expected, since Hive doesn't try
>> to acquire locks before reading or writing. I don't think there are any
>> plans to support this kind of locking (it could only be done if all
>> queries went through HiveServer; otherwise, lots of orphaned locks would
>> bring the system to a halt). I think you should do some kind of locking
>> (possibly with HDFS files) to prevent the queries from being executed
>> simultaneously.
>>
>> Any other ideas?
>>
>> Prasad
>>
>> From: Eva Tse <[email protected]>
>> Reply-To: <[email protected]>
>> Date: Wed, 9 Sep 2009 12:36:11 -0700
>> To: <[email protected]>, Dhruba Borthakur <[email protected]>
>> Subject: Re: Files does not exist error: concurrency control on hive queries...
>>
>> Hi Prasad,
>>
>> Are you implying that the expected behavior is for Hive to run these
>> queries sequentially, because one is r/w and one is read-only?
>>
>> For clarification, these two queries are running concurrently in two
>> separate jobs, as below.
>>
>>> Query 1 is run within a job that essentially does the following.
>>> For every hour:
>>> - parse log files to generate completed-session information
>>> - load completed sessions into 48 partitions (for the prior 48 hours)
>>> - merge small files using 'insert overwrite ... select from' on every
>>>   other 8 partitions. Essentially, we issue 6 separate queries to merge
>>>   6 partitions at the same time, not sequentially. (We do this to
>>>   minimize the time required.) This is query 1.
>>>
>>> Query 2 is run within another job that does a select on 24 partitions
>>> (i.e., a daily summary) for the previous day. This job just runs the
>>> query in a loop for testing purposes.
>>
>> The error comes from query 2, saying 'file not found' for a file that we
>> are merging in query 1 at that point in time.
>>
>> We need to rerun the test to catch the failure at that time and see if
>> the file was there at that instant. In the previous run, the merge query
>> succeeded, so I would imagine the file was not there after the merge.
>> And I am not sure whether the file was still there at the instant the
>> failure happened.
>>
>> Thanks for the help!
>> Eva.
>>
>> On 9/9/09 10:29 AM, "Prasad Chakka" <[email protected]> wrote:
>>
>>> The first query will not return until it has copied the files to the
>>> destination directory, and that operation is atomic (FileSystem.rename()
>>> guarantees it). Since the second query is not executed until the first
>>> query returns, this problem may be due to a bug in HDFS (highly
>>> unlikely), an issue with the HDFS configuration, or something related
>>> to EC2.
>>>
>>> The second query knows the file name
>>> 'sessionsFacts_P20090909T021823L20090908T09-r-00006', so the Hive client
>>> was able to call getFileStatus() on it successfully, but the mapper (of
>>> the second query) is not able to do the same thing. So either this file
>>> was deleted after the Hive client accessed it but before the mapper
>>> accessed it, or the machine on which the mapper is executing can't see
>>> the file. Can you manually check whether the file exists at all after
>>> the job fails?
>>>
>>> Prasad
>>>
>>> From: Eva Tse <[email protected]>
>>> Reply-To: <[email protected]>
>>> Date: Wed, 9 Sep 2009 10:19:24 -0700
>>> To: <[email protected]>
>>> Subject: Re: Files does not exist error: concurrency control on hive queries...
>>>
>>> Prasad,
>>> We believe the problem is that one of the queries is doing an 'insert
>>> overwrite ... select from', which actually deletes and merges the small
>>> files. The other query somehow couldn't find files that it thought it
>>> had seen before, and failed. So it looks like a concurrency issue.
>>>
>>> Yongqiang,
>>> Could you elaborate a bit on why you say this is not a bug?
>>>
>>> Thanks,
>>> Eva.
>>>
>>> On 9/9/09 9:55 AM, "Prasad Chakka" <[email protected]> wrote:
>>>
>>>> If a certain input file/dir does not exist, then the job can't be
>>>> submitted. Since only a few reducers are failing, the problem could be
>>>> something else. Eva, does the same job succeed on a second try? I.e.,
>>>> is the file/dir available eventually? What is the replication factor?
>>>>
>>>> Prasad
>>>>
>>>> From: Yongqiang He <[email protected]>
>>>> Reply-To: <[email protected]>
>>>> Date: Wed, 9 Sep 2009 04:07:31 -0700
>>>> To: <[email protected]>
>>>> Subject: Re: Files does not exist error: concurrency control on hive queries...
>>>>
>>>> Hi Eva,
>>>> After a closer look at the code, I think this is not a bug. We need to
>>>> find out how to avoid it.
>>>>
>>>> Thanks,
>>>> Yongqiang
>>>>
>>>> On 09-9-9 1:31 PM, "He Yongqiang" <[email protected]> wrote:
>>>>
>>>>> Hi Eva,
>>>>> Can you open a new JIRA for this? Then let's discuss and resolve the
>>>>> issue. I guess this happens because the partition metadata is added
>>>>> before the data is available.
>>>>>
>>>>> Thanks,
>>>>> Yongqiang
>>>>>
>>>>> On 09-9-9 1:18 PM, "Eva Tse" <[email protected]> wrote:
>>>>>
>>>>>> We are planning to start enabling ad-hoc querying on our Hive
>>>>>> warehouse, and in testing some concurrent queries we found the
>>>>>> following issue:
>>>>>>
>>>>>> Query 1 – doing 'insert overwrite table yyy .... partition (dateint =
>>>>>> xxx) select ... from yyy where dateint = xxx'. This is done to merge
>>>>>> small files within a partition of table yyy.
>>>>>> Query 2 – doing some select on the same table, joining another table.
>>>>>>
>>>>>> What we found is that query 2 would fail with the following exception
>>>>>> in multiple reducers.
>>>>>> java.io.FileNotFoundException: File does not exist:
>>>>>> hdfs://xxxxxxxxxxxxx.ec2.internal:9000/user/hive/dataeng/warehouse/nccp_session_facts/dateint=20090908/hour=9/sessionsFacts_P20090909T021823L20090908T09-r-00006
>>>>>>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>>>>>>   at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:671)
>>>>>>   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
>>>>>>   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
>>>>>>   at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
>>>>>>   at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:63)
>>>>>>   at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:236)
>>>>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:336)
>>>>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>>>>   at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>>>>
>>>>>> Is this expected? If so, is there a JIRA for it, or is it planned to
>>>>>> be addressed? We are trying to think of a workaround but haven't come
>>>>>> up with a good one, as the swapping of files would ideally be handled
>>>>>> inside Hive.
>>>>>>
>>>>>> Please let us know your feedback.
>>>>>>
>>>>>> Thanks,
>>>>>> Eva.
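(Similarly, Prasad's interim workaround of locking with HDFS files, mentioned earlier in the thread, could be approximated with an atomic create. Again only a sketch: the lock path is invented, and a job that dies before reaching the finally block leaves an orphaned lock file that an admin has to remove by hand, which is exactly the drawback raised above.)

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsLockExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Invented location: one lock file per table (or per partition).
            Path lock = new Path("/user/hive/locks/nccp_session_facts.lock");

            try {
                // create(path, overwrite=false) is atomic in HDFS and throws
                // if the file already exists, giving a crude test-and-set.
                fs.create(lock, false).close();
            } catch (IOException alreadyLocked) {
                System.err.println("Lock held by another job; retry later.");
                return;
            }
            try {
                // ... run the merge ('insert overwrite') or the big select here ...
            } finally {
                fs.delete(lock, true);  // released here, but orphaned if the JVM dies
            }
        }
    }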
