Sergey Shelukhin created HIVE-14886:
---------------------------------------
Summary: File deduplication in FSOP is not used correctly for list bucketing
Key: HIVE-14886
URL: https://issues.apache.org/jira/browse/HIVE-14886
Project: Hive
Issue Type: Bug
Reporter: Sergey Shelukhin
While making things work for MM tables, I noticed this after adding logging to
the removeTempOrDuplicateFiles/2 method, which is called from FSOP:
{noformat}
} else /* sershe: means "if !isTempPath(one)" */ {
  String taskId = getPrefixedTaskIdFromFilename(one.getPath().getName());
  Utilities.LOG14535.info("removeTempOrDuplicateFiles pondering " +
      one.getPath() + ", taskId " + taskId);
{noformat}
This is called from FSOP jobCloseOp, via Utilities.mvFileToFinalPath, and then
via the non-dynpart path in removeTempOrDuplicateFiles/4.
The files passed in are those listed from the root destination, disregarding
list bucketing, so the LB directories themselves are treated as files:
{noformat}
2016-10-03T19:01:38,615 INFO [912dde0f-91af-4a27-b358-5d782897ed1d main]
Log14535: removeTempOrDuplicateFiles pondering
hdfs://localhost:63026/build/ql/test/data/warehouse/skew_mm/.hive-staging_hive_2016-10-03_19-01-38_324_9113577068018508885-1/_tmp.-ext-10000/HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME,
taskId HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME
2016-10-03T19:01:38,616 INFO [912dde0f-91af-4a27-b358-5d782897ed1d main]
Log14535: removeTempOrDuplicateFiles pondering
hdfs://localhost:63026/build/ql/test/data/warehouse/skew_mm/.hive-staging_hive_2016-10-03_19-01-38_324_9113577068018508885-1/_tmp.-ext-10000/k1=0,
taskId 0 [sershe: this is only true by coincidence, the task id comes from the
k1 value]
{noformat}
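To show why the directory names produce these task ids, here is a minimal sketch. The regex and the "return the whole name when nothing matches" fallback are assumptions chosen to reproduce the log output above, not the actual Hive implementation; TaskIdSketch and extractTaskId are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TaskIdSketch {
    // Hypothetical pattern (not copied from Hive): any prefix, then a run of
    // digits taken as the task id, an optional _attempt suffix, and an
    // optional extension.
    private static final Pattern TASK_ID =
        Pattern.compile("^.*?([0-9]+)(_[0-9]{1,6})?(\\..*)?$");

    static String extractTaskId(String name) {
        Matcher m = TASK_ID.matcher(name);
        // Assumed fallback: if the name does not match, use the name itself.
        return m.matches() ? m.group(1) : name;
    }

    public static void main(String[] args) {
        // A regular output file name yields its task id.
        System.out.println(extractTaskId("000000_0"));
        // An LB directory name yields a digit taken from the skewed value.
        System.out.println(extractTaskId("k1=0"));
        // A name with no digits falls through unchanged.
        System.out.println(extractTaskId("HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME"));
    }
}
```

Under these assumptions, "k1=0" maps to task id "0" only because the skewed value happens to be a number, which is the coincidence noted in the log.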
When I started calling the method correctly on the MM path, it started deleting
files from different LB directories, treating them as duplicates of each other.
Some special logic, similar to what exists for dpCtx, may be needed for this.
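The difference between deduplicating across the whole destination and deduplicating per LB directory can be sketched as follows. This is an illustration only, not a proposed patch: LbDedupSketch, taskId, parentDir, dedupFlat and dedupPerDirectory are hypothetical names, paths are plain strings, the task id is assumed to be the part of the file name before the first underscore, and the first file seen for a key is kept purely for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LbDedupSketch {
    // Hypothetical helper: task id is the file name up to the first '_'.
    static String taskId(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        int us = name.indexOf('_');
        return us < 0 ? name : name.substring(0, us);
    }

    // Hypothetical helper: everything before the last '/' (empty if none).
    static String parentDir(String path) {
        int slash = path.lastIndexOf('/');
        return slash < 0 ? "" : path.substring(0, slash);
    }

    // Flat dedup: one survivor per task id, regardless of directory.
    static List<String> dedupFlat(List<String> paths) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (String p : paths) {
            kept.putIfAbsent(taskId(p), p);
        }
        return new ArrayList<>(kept.values());
    }

    // Directory-aware dedup: one survivor per (directory, task id) pair.
    static List<String> dedupPerDirectory(List<String> paths) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (String p : paths) {
            kept.putIfAbsent(parentDir(p) + "/" + taskId(p), p);
        }
        return new ArrayList<>(kept.values());
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
            "k1=0/000000_0",
            "k1=0/000000_1", // second attempt of the same task: a real duplicate
            "HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME/000000_0");
        // Flat dedup also drops the file in the other LB directory.
        System.out.println(dedupFlat(files));
        // Per-directory dedup drops only the duplicate attempt.
        System.out.println(dedupPerDirectory(files));
    }
}
```

In this sketch the flat version keeps a single file for task 000000, while the per-directory version keeps one file in each LB directory.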