[ https://issues.apache.org/jira/browse/HIVE-21164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062047#comment-17062047 ]
Sungwoo commented on HIVE-21164:
--------------------------------
[~kuczoram] Do you know whether Utilities.handleDirectInsertTableFinalPath()
can be called more than once with the same arguments from
FileSinkOperator.jobCloseOp() while running a single query? More specifically,
assuming that we use S3 instead of HDFS, I wonder whether the following
scenario is feasible, or whether Utilities.handleDirectInsertTableFinalPath()
is guaranteed never to be called more than once with the same arguments.
1. Utilities.handleDirectInsertTableFinalPath() is called
   - manifests[] is computed okay
   - directInsertDirectories[] is computed okay
   - committed[] is computed okay from manifests[]
   - manifest directory is deleted
   - directInsertDirectories[] is inspected against committed[] in
     cleanDirectInsertDirectory(), and no output file is deleted.
2. Utilities.handleDirectInsertTableFinalPath() is called again
   - manifest directory has been deleted, so manifests[] remains empty.
   - directInsertDirectories[] is computed okay
   - committed[] remains empty.
   - directInsertDirectories[] is inspected against committed[] in
     cleanDirectInsertDirectory(), and every output file is deleted because
     committed[] is empty.
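
To make the two-call scenario concrete, here is a minimal in-memory sketch in
Java. It is not Hive's code: the class DirectInsertCleanupSketch, its fields,
and the method handleFinalPath() are hypothetical stand-ins that only mimic
the order of steps listed above (compute committed[] from the manifests,
delete the manifest directory, then delete every output file that is not in
committed[]).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of the scenario; none of these names exist in Hive.
public class DirectInsertCleanupSketch {
    // Stand-in for the manifest directory: manifest name -> committed file names.
    final Map<String, List<String>> manifests = new HashMap<>();
    // Stand-in for the files found under the direct insert directories.
    final Set<String> outputFiles = new HashSet<>();

    // Mimics one call of the final-path handling; returns how many files it deleted.
    int handleFinalPath() {
        // committed[] is derived only from the manifests that still exist.
        Set<String> committed = new HashSet<>();
        for (List<String> files : manifests.values()) {
            committed.addAll(files);
        }
        // The manifest directory is deleted before the cleanup step.
        manifests.clear();
        // Cleanup: drop every output file that is not listed in committed[].
        int before = outputFiles.size();
        outputFiles.retainAll(committed);
        return before - outputFiles.size();
    }

    public static void main(String[] args) {
        DirectInsertCleanupSketch fs = new DirectInsertCleanupSketch();
        fs.manifests.put("task_0.manifest", Arrays.asList("bucket_00000", "bucket_00001"));
        fs.outputFiles.addAll(Arrays.asList("bucket_00000", "bucket_00001"));

        // First call: manifests are present, so nothing committed is deleted.
        System.out.println("first call deleted " + fs.handleFinalPath());
        // Second call: manifests are gone, committed[] is empty, everything is deleted.
        System.out.println("second call deleted " + fs.handleFinalPath());
    }
}
```

In this toy model the second call removes both files, because an empty
committed[] cannot be told apart from "nothing was committed".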
This patch works okay when tested with HDFS, but it shows the above behavior
when tested with S3. (However, this result does not necessarily indicate a bug
in this patch because I did not use Tez as the execution engine.)
> ACID: explore how we can avoid a move step during inserts/compaction
> --------------------------------------------------------------------
>
> Key: HIVE-21164
> URL: https://issues.apache.org/jira/browse/HIVE-21164
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 3.1.1
> Reporter: Vaibhav Gumashta
> Assignee: Marta Kuczora
> Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-21164.1.patch, HIVE-21164.10.patch,
> HIVE-21164.11.patch, HIVE-21164.11.patch, HIVE-21164.12.patch,
> HIVE-21164.13.patch, HIVE-21164.14.patch, HIVE-21164.14.patch,
> HIVE-21164.15.patch, HIVE-21164.16.patch, HIVE-21164.17.patch,
> HIVE-21164.18.patch, HIVE-21164.19.patch, HIVE-21164.2.patch,
> HIVE-21164.20.patch, HIVE-21164.21.patch, HIVE-21164.22.patch,
> HIVE-21164.3.patch, HIVE-21164.4.patch, HIVE-21164.5.patch,
> HIVE-21164.6.patch, HIVE-21164.7.patch, HIVE-21164.8.patch, HIVE-21164.9.patch
>
>
> Currently, we write compacted data to a temporary location and then move the
> files to a final location, which is an expensive operation on some cloud file
> systems. Since HIVE-20823 is already in, it can control the visibility of
> compacted data for the readers. Therefore, we can perhaps avoid writing data
> to a temporary location and directly write compacted data to the intended
> final path.