[ 
https://issues.apache.org/jira/browse/HIVE-14886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546146#comment-15546146
 ] 

Sergey Shelukhin commented on HIVE-14886:
-----------------------------------------

[~brocknoland] [~mohitsabharwal] you guys seem to have touched list bucketing 
last... are you familiar with that feature?

> File deduplication in FSOP is not used correctly for list bucketing
> -------------------------------------------------------------------
>
>                 Key: HIVE-14886
>                 URL: https://issues.apache.org/jira/browse/HIVE-14886
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>
> While making things work for MM tables, I noticed this after adding logging
> to the removeTempOrDuplicateFiles/2 method, which is called from FSOP:
> {noformat}
>       } else /* sershe: means "if !isTempPath(one)" */ {
>         String taskId = getPrefixedTaskIdFromFilename(one.getPath().getName());
>         Utilities.LOG14535.info("removeTempOrDuplicateFiles pondering "
>             + one.getPath() + ", taskId " + taskId);
> {noformat}
> This is called from FSOP jobCloseOp via Utilities.mvFileToFinalPath, and then
> via the non-dynpart path in removeTempOrDuplicateFiles/4.
> The taskId line is from the original code; its value is used later to decide
> the fate of the file.
> The files passed in are from the root of the table, disregarding list 
> bucketing, so what happens is this:
> {noformat}
> 2016-10-03T19:01:38,615  INFO [912dde0f-91af-4a27-b358-5d782897ed1d main] Log14535: removeTempOrDuplicateFiles pondering hdfs://localhost:63026/build/ql/test/data/warehouse/skew_mm/.hive-staging_hive_2016-10-03_19-01-38_324_9113577068018508885-1/_tmp.-ext-10000/HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME, taskId HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME
> 2016-10-03T19:01:38,616  INFO [912dde0f-91af-4a27-b358-5d782897ed1d main] Log14535: removeTempOrDuplicateFiles pondering hdfs://localhost:63026/build/ql/test/data/warehouse/skew_mm/.hive-staging_hive_2016-10-03_19-01-38_324_9113577068018508885-1/_tmp.-ext-10000/k1=0, taskId 0 [sershe: this is only true by coincidence; the task ID comes from the k1 value]
> {noformat}
> When I started calling the method correctly on the MM path, it started
> deleting files from different LB directories, treating them as duplicates of
> each other. Some special logic, similar to the dpCtx handling, may be needed
> for this.
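
To illustrate the failure mode described above, here is a minimal Java sketch. It is not Hive code: the class LbDedupSketch, its methods, and the file names such as 000000_0 are hypothetical, and the task ID extraction is a simplified stand-in for getPrefixedTaskIdFromFilename. It also keeps the first file seen per key, which may differ from how Hive picks a winner. The point is only that keying deduplication on the task ID alone collapses files from different list bucketing directories, while keying on the directory plus the task ID does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names are Hive APIs.
class LbDedupSketch {

    // Simplified stand-in for task ID extraction: take the file name and
    // drop a trailing "_<attempt>" suffix, e.g. "000000_1" -> "000000".
    static String taskIdOf(String relativePath) {
        String name = relativePath.substring(relativePath.lastIndexOf('/') + 1);
        int underscore = name.lastIndexOf('_');
        return underscore > 0 ? name.substring(0, underscore) : name;
    }

    // Parent directory of a relative path, or "" for a file at the root.
    static String dirOf(String relativePath) {
        int slash = relativePath.lastIndexOf('/');
        return slash < 0 ? "" : relativePath.substring(0, slash);
    }

    // Problematic variant: one surviving file per task ID, regardless of
    // which list bucketing directory the file lives in.
    static List<String> dedupByTaskId(List<String> paths) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (String p : paths) {
            kept.putIfAbsent(taskIdOf(p), p);
        }
        return new ArrayList<>(kept.values());
    }

    // Directory-aware variant: one surviving file per (directory, task ID).
    static List<String> dedupByDirAndTaskId(List<String> paths) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (String p : paths) {
            kept.putIfAbsent(dirOf(p) + "/" + taskIdOf(p), p);
        }
        return new ArrayList<>(kept.values());
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
            "k1=0/000000_0",
            "HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME/000000_0",
            "k1=0/000000_1"); // second attempt of the same task: a real duplicate
        System.out.println("by task ID only:    " + dedupByTaskId(files));
        System.out.println("by dir and task ID: " + dedupByDirAndTaskId(files));
    }
}
```

With the three sample paths, the first variant keeps a single file and silently drops the one under HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME, whereas the second keeps one file per LB directory and still removes the duplicate attempt under k1=0.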


