[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899152#action_12899152 ]

Ning Zhang commented on HIVE-1492:
----------------------------------

Agree that we should catch the exception in (Combine)HiveRecordReader, but those 
readers are only used on the map side. In the reducer, the RecordReader is not 
called, and there could also be exceptions thrown outside of reduce(). This fix 
catches that case as well.

I've filed another JIRA, HIVE-1543, for catching exceptions in RecordReaders.
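
To illustrate where such a catch would sit, here is a minimal sketch of a 
delegating RecordReader. This is only an illustration, not the HIVE-1543 patch: 
SafeRecordReader is a hypothetical class, and wrapping RuntimeExceptions into 
IOExceptions is just one possible approach.

import java.io.IOException;
import org.apache.hadoop.mapred.RecordReader;

// Hypothetical sketch (not the actual HIVE-1543 patch): a delegating
// RecordReader that converts unexpected runtime failures into IOExceptions
// so the map task fails visibly instead of silently dropping rows.
public class SafeRecordReader<K, V> implements RecordReader<K, V> {
  private final RecordReader<K, V> delegate;

  public SafeRecordReader(RecordReader<K, V> delegate) {
    this.delegate = delegate;
  }

  public boolean next(K key, V value) throws IOException {
    try {
      return delegate.next(key, value);
    } catch (RuntimeException e) {
      // Surface the failure so Hadoop can retry or fail the task attempt.
      throw new IOException("Error reading record", e);
    }
  }

  public K createKey() { return delegate.createKey(); }
  public V createValue() { return delegate.createValue(); }
  public long getPos() throws IOException { return delegate.getPos(); }
  public void close() throws IOException { delegate.close(); }
  public float getProgress() throws IOException { return delegate.getProgress(); }
}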



> FileSinkOperator should remove duplicated files from the same task based on 
> file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to 
> retain only one file for each task. A task could produce multiple files due 
> to failed attempts or speculative execution. For each task, the largest file 
> should be retained rather than the first one encountered.
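
The retention logic the description calls for amounts to something like the 
sketch below. This is a hand-written illustration, not the attached patch: 
taskIdFromFileName() is a hypothetical helper standing in for Hive's actual 
task-ID parsing, and output files are assumed to be named taskId_attemptId 
(e.g. 000003_0).

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LargestFilePerTask {

  // Hypothetical helper: extract the task ID from a name like "000003_0".
  // Hive's real parsing in Utilities is more involved.
  private static String taskIdFromFileName(String name) {
    return name.split("_")[0];
  }

  // Keep only the largest file produced for each task ID; delete the rest.
  public static void removeDuplicates(FileSystem fs, Path dir) throws IOException {
    Map<String, FileStatus> largest = new HashMap<String, FileStatus>();
    for (FileStatus file : fs.listStatus(dir)) {
      String taskId = taskIdFromFileName(file.getPath().getName());
      FileStatus prev = largest.get(taskId);
      if (prev == null) {
        largest.put(taskId, file);
      } else if (file.getLen() > prev.getLen()) {
        // A larger duplicate from a failed or speculative attempt:
        // keep it and delete the smaller file kept so far.
        fs.delete(prev.getPath(), false);
        largest.put(taskId, file);
      } else {
        fs.delete(file.getPath(), false);
      }
    }
  }
}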

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
