[ https://issues.apache.org/jira/browse/HIVE-217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662111#action_12662111 ]
Joydeep Sen Sarma commented on HIVE-217:
----------------------------------------
OK, got it. Hadoop itself is fine: it already reports progress whenever any
data is consumed or any data is emitted to the output collector.

The issue is that we are not sending data to the output collector; instead, we
write the data out to the filesystem ourselves. The stack trace indicates that
we have consumed the entire reduce group and are writing to the filesystem,
which means Hadoop gets no opportunity to report progress (it would have
reported progress if we were writing to the output collector).
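To make the mechanism concrete, here is a minimal sketch against the old
mapred API (the 0.18 line this issue was hit on) of a reducer that bypasses
the output collector, the same shape as what the Hive reduce side does; the
class name and output path are hypothetical, this is not Hive code:

{code:java}
// Hypothetical sketch, not Hive code: a reducer that writes its results
// straight to DFS instead of going through the OutputCollector. Nothing in
// the write loop reports progress, so a slow DFS write can exceed the task
// timeout and get the task killed mid-write.
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class DirectWriteReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private FileSystem fs;

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // output.collect(...) would report progress as a side effect;
    // a direct filesystem write like this does not.
    FSDataOutputStream out = fs.create(new Path("/tmp/side-output/" + key));
    while (values.hasNext()) {
      out.writeBytes(values.next().toString() + "\n");
    }
    out.close();
  }
}
{code}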
I am not sure why we haven't seen this problem in our environment yet; perhaps
DFS I/O is slower in your environment. The fix is simple: we can report
progress in either the FileSinkOperator or the JoinOperator. I think we should
do it in the JoinOperator (for the particular case where we have already gone
through the entire reduce group), since that's the only place we are
vulnerable right now.
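As a rough illustration of that fix (a hedged sketch of the idea, not the
actual patch), the long-running side of the reduce just needs an explicit
Reporter.progress() call while it does the direct filesystem write:

{code:java}
// Hypothetical fix sketch, not the actual HIVE-217 patch: explicitly ping
// the framework via reporter.progress() during work that never touches the
// OutputCollector, so the task is not killed for inactivity.
public void reduce(Text key, Iterator<Text> values,
    OutputCollector<Text, Text> output, Reporter reporter)
    throws IOException {
  FSDataOutputStream out = fs.create(new Path("/tmp/side-output/" + key));
  long rows = 0;
  while (values.hasNext()) {
    out.writeBytes(values.next().toString() + "\n");
    if (++rows % 1000 == 0) {
      reporter.progress(); // keep the task alive during the long DFS write
    }
  }
  reporter.progress(); // once more before close(), which can also block on DFS
  out.close();
}
{code}

In Hive itself the equivalent call would live in the JoinOperator's
end-of-group path, as described above.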
I can post a patch you can try out in a couple of hours.
> Stream closed exception
> -----------------------
>
> Key: HIVE-217
> URL: https://issues.apache.org/jira/browse/HIVE-217
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Serializers/Deserializers
> Environment: Hive from trunk, hadoop 0.18.2, ~20 machines
> Reporter: Johan Oskarsson
> Priority: Critical
> Fix For: 0.2.0
>
> Attachments: HIVE-217.log
>
>
> When running a query similar to the following:
> "insert overwrite table outputtable select a, b, cast(sum(counter) as INT)
> from tablea join tableb on (tablea.username=tableb.username) join tablec on
> (tablec.userid = tablea.userid) join tabled on (tablec.id=tabled.id) where
> insertdate >= 'somedate' and insertdate <= 'someotherdate' group by a, b;"
> where one table is ~40 GB and the others are a couple of hundred MB, the
> error happens in the first mapred job, the one that processes the ~40 GB
> table.
> I get the following exception (see attached file for full stack trace):
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Stream closed.
> at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:162)
> It happens in one reduce task and is reproducible: running the same query
> gives the same error.