[ 
https://issues.apache.org/jira/browse/HIVE-23956?focusedWorklogId=464725&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-464725
 ]

ASF GitHub Bot logged work on HIVE-23956:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 30/Jul/20 20:24
            Start Date: 30/Jul/20 20:24
    Worklog Time Spent: 10m 
      Work Description: pvargacl commented on a change in pull request #1339:
URL: https://github.com/apache/hive/pull/1339#discussion_r463249932



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -1574,45 +1576,46 @@ public int compareTo(CompressedOwid other) {
       this.orcSplit = orcSplit;
 
       try {
-        final Path[] deleteDeltaDirs = getDeleteDeltaDirsFromSplit(orcSplit);
-        if (deleteDeltaDirs.length > 0) {
+        if (orcSplit.getDeltas().size() > 0) {
           AcidOutputFormat.Options orcSplitMinMaxWriteIds =
               AcidUtils.parseBaseOrDeltaBucketFilename(orcSplit.getPath(), conf);
           int totalDeleteEventCount = 0;
-          for (Path deleteDeltaDir : deleteDeltaDirs) {
-            if (!isQualifiedDeleteDeltaForSplit(orcSplitMinMaxWriteIds, deleteDeltaDir)) {
-              continue;
-            }
-            Path[] deleteDeltaFiles = OrcRawRecordMerger.getDeltaFiles(deleteDeltaDir, bucket,
-                new OrcRawRecordMerger.Options().isCompacting(false), null);
-            for (Path deleteDeltaFile : deleteDeltaFiles) {
-              try {
-                ReaderData readerData = getOrcTail(deleteDeltaFile, conf, cacheTag);
-                OrcTail orcTail = readerData.orcTail;
-                if (orcTail.getFooter().getNumberOfRows() <= 0) {
-                  continue; // just a safe check to ensure that we are not reading empty delete files.
-                }
-                OrcRawRecordMerger.KeyInterval deleteKeyInterval = findDeleteMinMaxKeys(orcTail, deleteDeltaFile);
-                if (!deleteKeyInterval.isIntersects(keyInterval)) {
-                  // If there is no intersection between data and delete delta, do not read delete file
-                  continue;
-                }
-                // Reader can be reused if it was created before for getting orcTail: mostly for non-LLAP cache cases.
-                // For LLAP cases we need to create it here.
-                Reader deleteDeltaReader = readerData.reader != null ? readerData.reader :
-                    OrcFile.createReader(deleteDeltaFile, OrcFile.readerOptions(conf));
-                totalDeleteEventCount += deleteDeltaReader.getNumberOfRows();
-                DeleteReaderValue deleteReaderValue = new DeleteReaderValue(deleteDeltaReader,
-                    deleteDeltaFile, readerOptions, bucket, validWriteIdList, isBucketedTable, conf,
-                    keyInterval, orcSplit);
-                DeleteRecordKey deleteRecordKey = new DeleteRecordKey();
-                if (deleteReaderValue.next(deleteRecordKey)) {
-                  sortMerger.put(deleteRecordKey, deleteReaderValue);
-                } else {
-                  deleteReaderValue.close();
+          for (AcidInputFormat.DeltaMetaData deltaMetaData : orcSplit.getDeltas()) {
+            for (Path deleteDeltaDir : deltaMetaData.getPaths(orcSplit.getRootDir())) {
+              if (!isQualifiedDeleteDeltaForSplit(orcSplitMinMaxWriteIds, deleteDeltaDir)) {
+                LOG.debug("Skipping delete delta dir {}", deleteDeltaDir);
+                continue;
+              }
+              for (AcidInputFormat.DeltaFileMetaData fileMetaData : deltaMetaData.getDeltaFiles()) {
+                Path deleteDeltaFile = fileMetaData.getPath(deleteDeltaDir, bucket);
+                try {
+                  ReaderData readerData = getOrcTail(deleteDeltaFile, conf, cacheTag, fileMetaData.getFileId(deleteDeltaDir, bucket));
+                  OrcTail orcTail = readerData.orcTail;
+                  if (orcTail.getFooter().getNumberOfRows() <= 0) {
+                    continue; // just a safe check to ensure that we are not reading empty delete files.
+                  }
+                  OrcRawRecordMerger.KeyInterval deleteKeyInterval = findDeleteMinMaxKeys(orcTail, deleteDeltaFile);
+                  if (!deleteKeyInterval.isIntersects(keyInterval)) {
+                    // If there is no intersection between data and delete delta, do not read delete file
+                    continue;
+                  }
+                  // Reader can be reused if it was created before for getting orcTail: mostly for non-LLAP cache cases.
+                  // For LLAP cases we need to create it here.
+                  Reader deleteDeltaReader = readerData.reader != null ? readerData.reader : OrcFile
+                      .createReader(deleteDeltaFile, OrcFile.readerOptions(conf));
+                  totalDeleteEventCount += deleteDeltaReader.getNumberOfRows();
+                  DeleteReaderValue deleteReaderValue =
+                      new DeleteReaderValue(deleteDeltaReader, deleteDeltaFile, readerOptions, bucket, validWriteIdList,
+                          isBucketedTable, conf, keyInterval, orcSplit);
+                  DeleteRecordKey deleteRecordKey = new DeleteRecordKey();
+                  if (deleteReaderValue.next(deleteRecordKey)) {
+                    sortMerger.put(deleteRecordKey, deleteReaderValue);
+                  } else {
+                    deleteReaderValue.close();
+                  }
+                } catch (FileNotFoundException fnf) {

Review comment:
       Technically we need this, because the multistatement case is not handled well. There is one DeltaMetaData per writeId, and the statementIds are collected there. I did not want to disturb this structure, but as a result there is one merged fileList for the different folders, and each file is tried under each folder. This is far from ideal, but I don't think it is worth the effort to change this before the multistatement feature is developed. I will change the comment to reflect that.
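To make the trade-off concrete, here is a minimal, self-contained sketch of the structure described above. The class and field names are illustrative stand-ins, not Hive's actual AcidInputFormat types; it only demonstrates why the directory-by-file cross product can probe files a given statement never wrote, which is what the FileNotFoundException handler absorbs.

import java.util.Arrays;
import java.util.List;

// Illustrative sketch (hypothetical names): one DeltaMetaData covers a single
// writeId, collecting the statementIds of a multistatement transaction and a
// single file list merged across all of those statement directories.
public class MergedFileListSketch {

  static class DeltaMetaData {
    final long writeId;
    final List<Integer> statementIds;   // one delete delta dir per statementId
    final List<String> mergedFileNames; // union of files from all those dirs

    DeltaMetaData(long writeId, List<Integer> statementIds, List<String> mergedFileNames) {
      this.writeId = writeId;
      this.statementIds = statementIds;
      this.mergedFileNames = mergedFileNames;
    }
  }

  public static void main(String[] args) {
    // writeId 7 ran two statements; bucket_00001 was only written by statement 0.
    DeltaMetaData delta = new DeltaMetaData(7L,
        Arrays.asList(0, 1),
        Arrays.asList("bucket_00000", "bucket_00001"));

    for (int stmtId : delta.statementIds) {
      String dir = String.format("delete_delta_%07d_%07d_%04d",
          delta.writeId, delta.writeId, stmtId);
      for (String file : delta.mergedFileNames) {
        // The merged list is probed under every statement directory, so some
        // combinations point at files that do not exist; the real reader
        // swallows the resulting FileNotFoundException and moves on.
        System.out.println("probe " + dir + "/" + file);
      }
    }
  }
}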




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 464725)
    Time Spent: 2h 20m  (was: 2h 10m)

> Delete delta directory file information should be pushed to execution side
> --------------------------------------------------------------------------
>
>                 Key: HIVE-23956
>                 URL: https://issues.apache.org/jira/browse/HIVE-23956
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Peter Varga
>            Assignee: Peter Varga
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Since HIVE-23840 the LLAP cache is used to retrieve the tails of the ORC
> bucket files in the delete deltas, but to use the cache the fileId must be
> determined, so one extra FileSystem call is issued for each bucket.
> This fileId is already available during compilation in the AcidState
> calculation; we should serialise it into the OrcSplit and remove the
> unnecessary FS calls.
> Furthermore, instead of sending the SyntheticFileId directly, we should pass
> the attemptId rather than the standard path hash. This way the path and the
> SyntheticFileId can be calculated, and it will keep working even if move-free
> delete operations are introduced.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
