[
https://issues.apache.org/jira/browse/HIVE-14886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546146#comment-15546146
]
Sergey Shelukhin commented on HIVE-14886:
-----------------------------------------
[~brocknoland] [~mohitsabharwal] you guys seem to have touched list bucketing
last... are you familiar with that feature?
> File deduplication in FSOP is not used correctly for list bucketing
> -------------------------------------------------------------------
>
> Key: HIVE-14886
> URL: https://issues.apache.org/jira/browse/HIVE-14886
> Project: Hive
> Issue Type: Bug
> Reporter: Sergey Shelukhin
>
> While making things work for MM tables, I noticed this after adding
> logging to the removeTempOrDuplicateFiles/2 method that is called from FSOP:
> {noformat}
> } else /* sershe: means "if !isTempPath(one)" */ {
>   String taskId = getPrefixedTaskIdFromFilename(one.getPath().getName());
>   Utilities.LOG14535.info("removeTempOrDuplicateFiles pondering "
>       + one.getPath() + ", taskId " + taskId);
> {noformat}
> This is called from FSOP jobCloseOp, via Utilities.mvFileToFinalPath, then
> via the non-dynpart path in removeTempOrDuplicateFiles/4.
> The taskId line is from the original code; its value is used later to decide
> the fate of the file.
> The files passed in are from the root of the table, disregarding list
> bucketing, so what happens is this:
> {noformat}
> 2016-10-03T19:01:38,615 INFO [912dde0f-91af-4a27-b358-5d782897ed1d main]
> Log14535: removeTempOrDuplicateFiles pondering
> hdfs://localhost:63026/build/ql/test/data/warehouse/skew_mm/.hive-staging_hive_2016-10-03_19-01-38_324_9113577068018508885-1/_tmp.-ext-10000/HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME,
> taskId HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME
> 2016-10-03T19:01:38,616 INFO [912dde0f-91af-4a27-b358-5d782897ed1d main]
> Log14535: removeTempOrDuplicateFiles pondering
> hdfs://localhost:63026/build/ql/test/data/warehouse/skew_mm/.hive-staging_hive_2016-10-03_19-01-38_324_9113577068018508885-1/_tmp.-ext-10000/k1=0,
> taskId 0 [sershe: this is only true by coincidence, task id comes from k1
> value]
> {noformat}
> When I started calling the method correctly on the MM path, it started
> deleting files from different LB directories, thinking they were duplicates
> of each other... so some special logic, similar to dpCtx, may be needed for
> this.
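
To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not Hive code: `LbDedupSketch`, `taskIdOf`, `flatDuplicates`, and `perDirectoryDuplicates` are hypothetical names, and the regex is a simplified stand-in for task-id extraction that was chosen only because it reproduces the two values seen in the log (`k1=0` gives `0`, a name without digits is returned unchanged). Real deduplication also decides which copy to keep; this sketch simply keeps the first one seen. Under those assumptions, it compares deduplicating over one flat set of task ids with deduplicating per list-bucketing directory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LbDedupSketch {
    // Assumption: simplified stand-in for task-id extraction, not Hive's actual regex.
    // Captures the first digit run that can be followed only by an optional
    // "_attempt" suffix and an optional ".extension" up to the end of the name.
    private static final Pattern TASK_ID =
        Pattern.compile("^.*?([0-9]+)(_[0-9]{1,6})?(\\..*)?$");

    // Returns the extracted task id, or the name itself when nothing matches.
    static String taskIdOf(String name) {
        Matcher m = TASK_ID.matcher(name);
        return m.matches() ? m.group(1) : name;
    }

    // Flat view: all paths share one task-id namespace, regardless of directory.
    // Returns the paths that would be treated as duplicates.
    static List<String> flatDuplicates(List<String> paths) {
        Set<String> seen = new HashSet<>();
        List<String> dups = new ArrayList<>();
        for (String p : paths) {
            String name = p.substring(p.lastIndexOf('/') + 1);
            if (!seen.add(taskIdOf(name))) {
                dups.add(p);
            }
        }
        return dups;
    }

    // Per-directory view: task ids are compared only within the same parent directory.
    static List<String> perDirectoryDuplicates(List<String> paths) {
        Map<String, Set<String>> seenByDir = new HashMap<>();
        List<String> dups = new ArrayList<>();
        for (String p : paths) {
            int slash = p.lastIndexOf('/');
            String dir = slash < 0 ? "" : p.substring(0, slash);
            String name = p.substring(slash + 1);
            Set<String> seen = seenByDir.computeIfAbsent(dir, d -> new HashSet<>());
            if (!seen.add(taskIdOf(name))) {
                dups.add(p);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        // Directory names run through the extractor give meaningless "task ids".
        System.out.println(taskIdOf("k1=0"));
        System.out.println(taskIdOf("HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME"));

        // The same task wrote one file into each LB directory.
        List<String> paths = Arrays.asList(
            "k1=0/000000_0",
            "k1=1/000000_0",
            "HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME/000000_0");
        System.out.println("flat: " + flatDuplicates(paths));
        System.out.println("per-directory: " + perDirectoryDuplicates(paths));
    }
}
```

In the flat view, the second and third files are flagged as duplicates of the first even though they hold different data. The per-directory view flags nothing for this input, while still catching two attempts of the same task inside one directory.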