Thanks Ning. Our TRANSFORM script is pretty quick: the whole query took only 30 seconds when we added "and project='abc'" to limit the amount of input data. So it seems the slowdown is somewhere in handling the data before (or after?) it gets to the transform script. I can try adding some logging in our script, though, so we can see how far it does get through the data.
We'll take a look at 0.20.2, but it might take a while until we can upgrade our main cluster to it.

On Fri, Oct 1, 2010 at 12:24 PM, Ning Zhang <nzh...@facebook.com> wrote:
> Hi Dave,
>
> There may be 2 issues here:
> 1) According to the HDFS JIRA, it seems to be a bug introduced in Hadoop
> 0.18 and fixed by another Hadoop JIRA in 0.20.2, as suggested by Hairong;
> you may want to try 0.20.2 if that's possible.
> 2) The lease expiration seems to be caused by the open file not receiving
> any writes for a long time. From your query, the TRANSFORM script is
> called for each row that is going to be written to the HDFS file. Does the
> Python script cause long latency (the time between two output records from
> the transform script)?
>
>
> On Oct 1, 2010, at 8:34 AM, Dave Brondsema wrote:
>
> We're trying to insert into a table, using a dynamic partition, but the
> query runs for a while and then dies with a LeaseExpiredException. The
> Hadoop details & some discussion are at
> https://issues.apache.org/jira/browse/HDFS-198
> Is there a way to configure Hive, or our query, to work around this? If we
> adjust our query to handle less data at once, it can complete in under 10
> minutes, but then we have to run the query many more times to get all the
> data processed.
>
> The query is:
> FROM (
>   FROM (
>     SELECT file, os, country, dt, project
>     FROM downloads WHERE dt='2010-10-01'
>     DISTRIBUTE BY project
>     SORT BY project ASC, file ASC
>   ) a
>   SELECT TRANSFORM(file, os, country, dt, project)
>   USING 'transformwrap reduce.py'
>   AS (file, downloads, os, country, project)
> ) b
> INSERT OVERWRITE TABLE dl_day PARTITION (dt='2010-10-01', project)
> SELECT file, downloads, os, country, FALSE, project
>
> The project partition has roughly 100000 values.
>
> We're using Hive trunk from about a month ago.
> Hadoop 0.18.3-14.cloudera.CH0_3
>
> --
> Dave Brondsema
> Software Engineer
> Geeknet
>
> www.geek.net

--
Dave Brondsema
Software Engineer
Geeknet

www.geek.net