ZooKeeper sounds like a decent alternative, though it would add a new dependency for deployment. Maybe we could open a JIRA for it first to track this issue?

Thanks,
Eva.
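(For concreteness, here is a rough sketch of the ephemeral-node locking scheme Prasad describes below, using the plain ZooKeeper Java client. This is illustrative only, not existing Hive code: the znode layout, session timeout, and class name are invented, and it assumes the parent znodes already exist.)

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    /** Hypothetical per-partition lock built on ZooKeeper ephemeral nodes. */
    public class PartitionLock {
        private final ZooKeeper zk;

        public PartitionLock(String connectString) throws Exception {
            // The session timeout acts as the lease: if a client dies without
            // releasing its lock, ZooKeeper expires the session and deletes the
            // ephemeral node, so no hanging locks are left behind.
            this.zk = new ZooKeeper(connectString, 30000, event -> { });
        }

        /** Try to take an exclusive lock on a partition; false if already held. */
        public boolean tryLock(String table, String partition) throws Exception {
            try {
                zk.create("/hive_locks/" + table + "/" + partition,  // made-up layout
                          new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return true;
            } catch (KeeperException.NodeExistsException e) {
                return false;  // another client holds the lock
            }
        }

        public void unlock(String table, String partition) throws Exception {
            zk.delete("/hive_locks/" + table + "/" + partition, -1);  // -1 = any version
        }
    }

A merge job would call tryLock() on the partition before its 'insert overwrite' and unlock() afterwards; if the client crashes, the expired session cleans up the node automatically, which is exactly the lease behavior discussed in the thread below.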
On 9/9/09 2:49 PM, "Prasad Chakka" <[email protected]> wrote:

> Yeah, the metastore DB is the logical place to do locking, but there would
> have to be periodic cleanups (when clients die without releasing locks),
> etc., which is hacky and so less preferable. Another option is to point
> Hive at a ZooKeeper cluster and ask Hive to use it for locking. That way,
> those who are not concerned about concurrency control don't have to
> install ZooKeeper, but others can. ZooKeeper provides leases, so there
> won't be any problem of hanging locks, and it will be easier for admins
> to clean them up.
>
> I suppose it depends on whoever wants to take this task up :)
>
> Prasad
>
> From: Eva Tse <[email protected]>
> Reply-To: <[email protected]>
> Date: Wed, 9 Sep 2009 14:32:20 -0700
> To: <[email protected]>
> Subject: Re: Files does not exist error: concurrency control on hive queries...
>
> Regardless of whether the user uses a HiveServer, it looks like the
> logical place to do locking or concurrency control would be the metastore
> DB. This is actually one big advantage of Hive. The r/w lock or access
> control could be achieved with a DB row holding a lock count for each
> partition, etc. This might be over-simplifying it, but the metastore DB
> seems to be the ideal candidate. Thoughts?
>
> On 9/9/09 12:52 PM, "Prasad Chakka" <[email protected]> wrote:
>
>> I thought your script ran the two jobs sequentially. If these two queries
>> are run in parallel, then the error is expected, since Hive doesn't try
>> to acquire locks before reading or writing. I don't think there are any
>> plans to support this kind of locking (it could only be done if all
>> queries went through HiveServer; otherwise, lots of orphaned locks would
>> bring the system to a halt). I think you should do some kind of locking
>> (possibly with HDFS files) to prevent the queries from being executed
>> simultaneously.
>>
>> Any other ideas?
>>
>> Prasad
>>
>> From: Eva Tse <[email protected]>
>> Reply-To: <[email protected]>
>> Date: Wed, 9 Sep 2009 12:36:11 -0700
>> To: <[email protected]>, Dhruba Borthakur <[email protected]>
>> Subject: Re: Files does not exist error: concurrency control on hive queries...
>>
>> Hi Prasad,
>>
>> Are you implying that the expected behavior is for Hive to run these
>> queries sequentially, because one is r/w and one is read-only?
>>
>> For clarification, these two queries are running concurrently in two
>> separate jobs, as below.
>>
>>> Query 1 is run within a job that essentially does the following.
>>> For every hour:
>>> - parse log files to generate completed-session information
>>> - load completed sessions into 48 partitions (for the prior 48 hours)
>>> - merge small files using 'insert overwrite ... select from' on every
>>>   other 8 partitions. Essentially, we issue 6 separate queries to merge
>>>   6 partitions at the same time, not sequentially. (We do this to
>>>   minimize the time required.) This is query 1.
>>>
>>> Query 2 is run within another job that does a select on 24 partitions
>>> (i.e., a daily summary) for the previous day. This job just runs the
>>> query in a loop for testing purposes.
>>
>> The error comes from query 2, saying 'file not found' for a file that we
>> are merging in query 1 at that point in time.
>>
>> We need to rerun the test to catch the failure at that time and see if
>> the file was there at that instant. In the previous run, the merge query
>> succeeded, so I would imagine the file was not there after the merge.
>> And I am not sure whether the file was still there at the instant the
>> failure happened.
>>
>> Thanks for the help!
>> Eva.
>>
>> On 9/9/09 10:29 AM, "Prasad Chakka" <[email protected]> wrote:
>>
>>> The first query will not return until it has copied the files to the
>>> destination directory, and that operation is atomic (FileSystem.rename()
>>> guarantees it). Since the second query is not executed until the first
>>> query returns, this problem may be due to a bug in HDFS (highly
>>> unlikely), an issue with the HDFS configuration, or something related
>>> to EC2.
>>>
>>> The second query knows the file name
>>> 'sessionsFacts_P20090909T021823L20090908T09-r-00006', so the Hive client
>>> was able to call getFileStatus() on it successfully, but the mapper (of
>>> the second query) is not able to do the same thing. So either this file
>>> was deleted after the Hive client accessed it but before the mapper
>>> accessed it, or the machine on which the mapper is executing can't see
>>> the file. Can you manually check whether the file exists at all after
>>> the job fails?
>>>
>>> Prasad
>>>
>>> From: Eva Tse <[email protected]>
>>> Reply-To: <[email protected]>
>>> Date: Wed, 9 Sep 2009 10:19:24 -0700
>>> To: <[email protected]>
>>> Subject: Re: Files does not exist error: concurrency control on hive queries...
>>>
>>> Prasad,
>>> We believe the problem is that one of the queries is doing an 'insert
>>> overwrite ... select from', which actually deletes and merges the small
>>> files. The other query somehow couldn't find files that it thought it
>>> had seen before, and failed. So it looks like a concurrency issue.
>>>
>>> Yongqiang,
>>> Could you elaborate a bit on why you say this is not a bug?
>>>
>>> Thanks,
>>> Eva.
>>>
>>> On 9/9/09 9:55 AM, "Prasad Chakka" <[email protected]> wrote:
>>>
>>>> If a certain input file/dir does not exist, then the job can't be
>>>> submitted. Since only a few reducers are failing, the problem could be
>>>> something else. Eva, does the same job succeed on a second try? I.e.,
>>>> is the file/dir available eventually? What is the replication factor?
>>>>
>>>> Prasad
>>>>
>>>> From: Yongqiang He <[email protected]>
>>>> Reply-To: <[email protected]>
>>>> Date: Wed, 9 Sep 2009 04:07:31 -0700
>>>> To: <[email protected]>
>>>> Subject: Re: Files does not exist error: concurrency control on hive queries...
>>>>
>>>> Hi Eva,
>>>> After a closer look at the code, I think this is not a bug. We need to
>>>> find out how to avoid it.
>>>>
>>>> Thanks,
>>>> Yongqiang
>>>>
>>>> On 09-9-9 1:31 PM, "He Yongqiang" <[email protected]> wrote:
>>>>
>>>>> Hi Eva,
>>>>> Can you open a new JIRA for this? Then let's discuss and resolve the
>>>>> issue. I guess this happens because the partition metadata is added
>>>>> before the data is available.
>>>>>
>>>>> Thanks,
>>>>> Yongqiang
>>>>>
>>>>> On 09-9-9 1:18 PM, "Eva Tse" <[email protected]> wrote:
>>>>>
>>>>>> We are planning to start enabling ad-hoc querying on our Hive
>>>>>> warehouse, and in testing some concurrent queries we found the
>>>>>> following issue:
>>>>>>
>>>>>> Query 1 – doing 'insert overwrite table yyy .... partition (dateint =
>>>>>> xxx) select ... from yyy where dateint = xxx'. This is done to merge
>>>>>> small files within a partition of table yyy.
>>>>>> Query 2 – doing some select on the same table, joining another table.
>>>>>>
>>>>>> What we found is that query 2 would fail with the following exception
>>>>>> in multiple reducers.
>>>>>> java.io.FileNotFoundException: File does not exist:
>>>>>> hdfs://xxxxxxxxxxxxx.ec2.internal:9000/user/hive/dataeng/warehouse/nccp_session_facts/dateint=20090908/hour=9/sessionsFacts_P20090909T021823L20090908T09-r-00006
>>>>>>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>>>>>>   at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:671)
>>>>>>   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
>>>>>>   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
>>>>>>   at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
>>>>>>   at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:63)
>>>>>>   at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:236)
>>>>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:336)
>>>>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>>>>   at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>>>>
>>>>>> Is this expected? If so, is there a JIRA for it, or is it planned to
>>>>>> be addressed? We are trying to think of a workaround but haven't come
>>>>>> up with a good one, as the swapping of files would ideally be handled
>>>>>> inside Hive.
>>>>>>
>>>>>> Please let us know your feedback.
>>>>>>
>>>>>> Thanks,
>>>>>> Eva.
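(Similarly, Prasad's interim workaround of locking with HDFS files, mentioned earlier in the thread, could be approximated with an atomic create. Again only a sketch: the lock path is invented, and a job that dies before reaching the finally block leaves an orphaned lock file that an admin has to remove by hand, which is exactly the drawback raised above.)

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsLockExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Invented location: one lock file per table (or per partition).
            Path lock = new Path("/user/hive/locks/nccp_session_facts.lock");

            try {
                // create(path, overwrite=false) is atomic in HDFS and throws
                // if the file already exists, giving a crude test-and-set.
                fs.create(lock, false).close();
            } catch (IOException alreadyLocked) {
                System.err.println("Lock held by another job; retry later.");
                return;
            }
            try {
                // ... run the merge ('insert overwrite') or the big select here ...
            } finally {
                fs.delete(lock, true);  // released here, but orphaned if the JVM dies
            }
        }
    }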
