Hi Dave,

There may be two issues here:
 1) According to the HDFS JIRA, this seems to be a bug introduced in Hadoop
0.18 that was fixed in 0.20.2 by another Hadoop JIRA, as Hairong suggested.
You may want to try 0.20.2 if that's possible.
 2) The lease expiry seems to be caused by the open file not receiving any
writes for a long time. From your query, the TRANSFORM script is called for
each row that is going to be written to the HDFS file. Does the Python script
cause long latency (a long gap between two consecutive output records from
the transform script)?
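
If it helps to check, here is a minimal sketch of instrumenting a streaming
transform script to log the gap between consecutive output records. It assumes
reduce.py reads tab-separated rows on stdin and writes them to stdout; the
emit() helper and the 60-second threshold are hypothetical, for illustration
only:

import sys
import time

last_emit = time.time()

def emit(fields):
    # Write one tab-separated output row, and warn on stderr if it has been
    # a long time since the previous row went out (a long gap between writes
    # is what lets the HDFS lease lapse).
    global last_emit
    now = time.time()
    gap = now - last_emit
    if gap > 60:  # hypothetical threshold; tune to your timeout settings
        sys.stderr.write("WARN: %.1f seconds since last output record\n" % gap)
    sys.stdout.write("\t".join(fields) + "\n")
    sys.stdout.flush()
    last_emit = now

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    # Real aggregation logic would go here; this placeholder echoes rows through.
    emit(fields)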


On Oct 1, 2010, at 8:34 AM, Dave Brondsema wrote:

We're trying to insert into a table, using a dynamic partition, but the query 
runs for a while and then dies with a LeaseExpiredException.  The Hadoop
details & some discussion are at https://issues.apache.org/jira/browse/HDFS-198
Is there a way to configure hive, or our query, to work around this?  If we 
adjust our query to handle less data at once, it can complete in under 10 
minutes, but then we have to run the query many more times to get all the data 
processed.

The query is:
FROM (
    FROM (
        SELECT file, os, country, dt, project
        FROM downloads WHERE dt='2010-10-01'
        DISTRIBUTE BY project
        SORT BY project asc, file asc
    ) a
    SELECT TRANSFORM(file, os, country, dt, project)
    USING 'transformwrap reduce.py'
    AS (file, downloads, os, country, project)
) b
INSERT OVERWRITE TABLE dl_day PARTITION (dt='2010-10-01', project)
SELECT file, downloads, os, country, FALSE, project

The project partition has roughly 100,000 values.
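
For example, one "handle less data at once" adjustment is to process a subset
of projects per run (the first-character bucketing below is just illustrative;
we'd repeat the query once per bucket):

FROM (
    FROM (
        SELECT file, os, country, dt, project
        FROM downloads
        WHERE dt='2010-10-01' AND substr(project, 1, 1) = 'a'
        DISTRIBUTE BY project
        SORT BY project asc, file asc
    ) a
    SELECT TRANSFORM(file, os, country, dt, project)
    USING 'transformwrap reduce.py'
    AS (file, downloads, os, country, project)
) b
INSERT OVERWRITE TABLE dl_day PARTITION (dt='2010-10-01', project)
SELECT file, downloads, os, country, FALSE, project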

We're using Hive trunk from about a month ago, with Hadoop 0.18.3-14.cloudera.CH0_3.

--
Dave Brondsema
Software Engineer
Geeknet

www.geek.net
