[jira] [Updated] (HIVE-14886) File deduplication in FSOP is not used correctly for list bucketing

Sergey Shelukhin (JIRA) Tue, 04 Oct 2016 10:59:17 -0700

     [ https://issues.apache.org/jira/browse/HIVE-14886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Sergey Shelukhin updated HIVE-14886:
------------------------------------
    Description: 
While making things work for MM tables, I noticed this after adding logging to 
the removeTempOrDuplicateFiles/2 method, which is called from FSOP:
{noformat}
      } else /* sershe: means "if !isTempPath(one)" */ {
        String taskId = getPrefixedTaskIdFromFilename(one.getPath().getName());
        Utilities.LOG14535.info("removeTempOrDuplicateFiles pondering "
            + one.getPath() + ", taskId " + taskId);
{noformat}
This is called from FSOP jobCloseOp, via Utilities.mvFileToFinalPath, and then 
via the non-dynpart path in removeTempOrDuplicateFiles/4.
The taskId line is from the original code; its value is used later to decide 
the fate of the file.
The files passed in are listed from the root of the table, disregarding list 
bucketing, so what happens is this:
{noformat}
2016-10-03T19:01:38,615  INFO [912dde0f-91af-4a27-b358-5d782897ed1d main] 
Log14535: removeTempOrDuplicateFiles pondering 
hdfs://localhost:63026/build/ql/test/data/warehouse/skew_mm/.hive-staging_hive_2016-10-03_19-01-38_324_9113577068018508885-1/_tmp.-ext-10000/HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME,
 taskId HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME
2016-10-03T19:01:38,616  INFO [912dde0f-91af-4a27-b358-5d782897ed1d main] 
Log14535: removeTempOrDuplicateFiles pondering 
hdfs://localhost:63026/build/ql/test/data/warehouse/skew_mm/.hive-staging_hive_2016-10-03_19-01-38_324_9113577068018508885-1/_tmp.-ext-10000/k1=0,
 taskId 0 [sershe: this is only true by coincidence, task id comes from k1 
value]
{noformat}
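
To illustrate how a directory name can end up with a "task id" as in the log 
above, here is a minimal sketch. It assumes the id is taken from the trailing 
digits of the last path component and that the whole name is returned when 
nothing matches. The pattern and the names TaskIdSketch and taskIdFromName are 
hypothetical; they are not copied from getPrefixedTaskIdFromFilename.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaskIdSketch {
    // Hypothetical pattern (an assumption, not Hive's): trailing digits,
    // optionally followed by an "_<attempt>" suffix.
    private static final Pattern TRAILING_ID =
        Pattern.compile("^.*?([0-9]+)(_[0-9]+)?$");

    // Returns the captured digits, or the whole name if the pattern does not match.
    static String taskIdFromName(String name) {
        Matcher m = TRAILING_ID.matcher(name);
        return m.matches() ? m.group(1) : name;
    }

    public static void main(String[] args) {
        String[] names = {
            "000000_0",                             // a regular task output file
            "k1=0",                                 // a list bucketing directory
            "HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME"  // the default LB directory
        };
        for (String name : names) {
            System.out.println(name + " -> " + taskIdFromName(name));
        }
    }
}
```

Under these assumptions "k1=0" yields "0" only because the skew value happens 
to be numeric, and the default LB directory name is returned unchanged, which 
is consistent with the two log lines.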

When I started calling the method correctly on the MM path, it started deleting 
files in different LB directories, treating them as duplicates of each other. 
Some special handling, similar to what is done for dpCtx, may be needed.
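
A rough sketch of the collision described above, under stated assumptions: 
duplicates are detected by a task id derived from the file name alone, and the 
first file seen per key is kept (a simplification). The class LbDedupSketch and 
its helpers are hypothetical and only illustrate why the key would need to 
include the LB directory; this is not the actual Utilities code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LbDedupSketch {
    // Hypothetical helper: the text before the last '_' is treated as the task id.
    static String taskIdOf(String fileName) {
        int i = fileName.lastIndexOf('_');
        return i < 0 ? fileName : fileName.substring(0, i);
    }

    // Directory part of a relative path, or "" if there is none.
    static String parentOf(String path) {
        int i = path.lastIndexOf('/');
        return i < 0 ? "" : path.substring(0, i);
    }

    // Last path component.
    static String nameOf(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Keeps the first path seen per key and returns the paths that would be
    // removed as duplicates. With keyByDirectory the key also includes the
    // parent (LB) directory, so equal task ids in different directories
    // no longer collide.
    static List<String> duplicates(List<String> paths, boolean keyByDirectory) {
        Map<String, String> kept = new HashMap<>();
        List<String> removed = new ArrayList<>();
        for (String p : paths) {
            String key = taskIdOf(nameOf(p));
            if (keyByDirectory) {
                key = parentOf(p) + "/" + key;
            }
            if (kept.putIfAbsent(key, p) != null) {
                removed.add(p);
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
            "k1=0/000000_0",
            "HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME/000000_0");
        System.out.println("name-only key removes: " + duplicates(files, false));
        System.out.println("per-directory key removes: " + duplicates(files, true));
    }
}
```

With the name-only key, the file in the second LB directory is flagged as a 
duplicate of the first even though the two hold different data; with the 
per-directory key nothing is flagged.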

  was:
I am making things work for MM tables, so I noticed this after adding the 
logging to removeTempOrDuplicateFiles/2 method that is called from FSOP:
{noformat}
      } else /* sershe: means "if !isTempPath(one)" */ {
        String taskId = getPrefixedTaskIdFromFilename(one.getPath().getName());
        Utilities.LOG14535.info("removeTempOrDuplicateFiles pondering " + 
one.getPath() + ", taskId " + taskId);
{noformat}
This is called from FSOP jobCloseOp, via Utilities.mvFileToFinalPath, then via 
non-dynpart path in removeTempOrDuplicateFiles/4.
taskId line is origin, so it's used later.
The files passed are from the root destination, disregarding list bucketing, so 
what happens is this:
{noformat}
2016-10-03T19:01:38,615  INFO [912dde0f-91af-4a27-b358-5d782897ed1d main] 
Log14535: removeTempOrDuplicateFiles pondering 
hdfs://localhost:63026/build/ql/test/data/warehouse/skew_mm/.hive-staging_hive_2016-10-03_19-01-38_324_9113577068018508885-1/_tmp.-ext-10000/HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME,
 taskId HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME
2016-10-03T19:01:38,616  INFO [912dde0f-91af-4a27-b358-5d782897ed1d main] 
Log14535: removeTempOrDuplicateFiles pondering 
hdfs://localhost:63026/build/ql/test/data/warehouse/skew_mm/.hive-staging_hive_2016-10-03_19-01-38_324_9113577068018508885-1/_tmp.-ext-10000/k1=0,
 taskId 0 [sershe: this is only true by coincidence, task id comes from k1 
value]
{noformat}

When I started calling the method correctly on MM path, it started deleting 
files for different LB directories thinking they are the same stuff... so, some 
special logic may be needed for this similar to dpCtx.


> File deduplication in FSOP is not used correctly for list bucketing
> -------------------------------------------------------------------
>
>                 Key: HIVE-14886
>                 URL: https://issues.apache.org/jira/browse/HIVE-14886
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
