Thanks Ning,

Our TRANSFORM script is pretty quick; the whole query took only 30 seconds
when we added "and project='abc'" to limit the amount of input data.  So it
seems like the slowdown is somewhere in handling the data before (or after?)
it gets to the transform script.  I can try adding some logging to our
script, though, so we can see how far it gets through the data (something
like the sketch below).
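
Something like this minimal sketch is what I have in mind (just an
illustration, assuming reduce.py reads the five tab-separated input columns
from stdin and writes the output columns to stdout; the 60-second and
100,000-row thresholds are arbitrary):

#!/usr/bin/env python
# Sketch of progress logging for reduce.py (not our real script): counters
# and the gap between consecutive output records go to stderr, which ends up
# in the task logs; stdout stays reserved for the transform output itself.
import sys
import time

rows_in = 0
rows_out = 0
last_out = time.time()

for line in sys.stdin:
    rows_in += 1
    # incoming columns: file, os, country, dt, project
    f, o, c, dt, proj = line.rstrip('\n').split('\t')
    # ... real aggregation elided; just emit one row per input here,
    # with a placeholder download count of 1 ...
    sys.stdout.write('\t'.join([f, '1', o, c, proj]) + '\n')
    rows_out += 1
    now = time.time()
    if now - last_out > 60:
        sys.stderr.write('%.0fs gap before output row %d\n' % (now - last_out, rows_out))
    last_out = now
    if rows_in % 100000 == 0:
        sys.stderr.write('read %d rows, wrote %d\n' % (rows_in, rows_out))

If big gaps show up in the task logs right before the lease expires, that
would point at the per-record latency you asked about.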

We'll take a look at 0.20.2, but it might take a while before we can upgrade
our main cluster to it.

On Fri, Oct 1, 2010 at 12:24 PM, Ning Zhang <nzh...@facebook.com> wrote:

> Hi Dave,
>
> There may be 2 issues here:
>  1) According to the HDFS JIRA, this seems to be a bug introduced in Hadoop
> 0.18 that was fixed by another Hadoop JIRA in 0.20.2, as Hairong suggested;
> you may want to try 0.20.2 if that's possible.
>  2) The lease expiry seems to be caused by the open file not receiving any
> writes for a long time. From your query, the TRANSFORM script is called for
> each row that is going to be written to the HDFS file. Does the Python
> script introduce long latency (i.e. a long time between two output records
> from the transform script)?
>
>
> On Oct 1, 2010, at 8:34 AM, Dave Brondsema wrote:
>
> We're trying to insert into a table, using a dynamic partition, but the
> query runs for a while and then dies with a LeaseExpiredException.  The
> Hadoop details & some discussion are at
> https://issues.apache.org/jira/browse/HDFS-198.  Is there a way to
> configure Hive, or our query, to work around this?  If we adjust our query
> to handle less data at once, it can complete in under 10 minutes, but then
> we have to run the query many more times to process all the data.
>
> The query is:
> FROM (
>     FROM (
>         SELECT file, os, country, dt, project
>         FROM downloads WHERE dt='2010-10-01'
>         DISTRIBUTE BY project
>         SORT BY project asc, file asc
>     ) a
>     SELECT TRANSFORM(file, os, country, dt, project)
>     USING 'transformwrap reduce.py'
>     AS (file, downloads, os, country, project)
> ) b
> INSERT OVERWRITE TABLE dl_day PARTITION (dt='2010-10-01', project)
> SELECT file, downloads, os, country, FALSE, project
>
> The project partition column has roughly 100,000 distinct values.
>
> We're using Hive trunk from about a month ago, on Hadoop
> 0.18.3-14.cloudera.CH0_3.
>
> --
> Dave Brondsema
> Software Engineer
> Geeknet
>
> www.geek.net


-- 
Dave Brondsema
Software Engineer
Geeknet

www.geek.net
