[GitHub] [hive] difin commented on a diff in pull request #3559: HIVE-26496,Improvement to fetch operator to scan only delete_delta fo…

GitBox Wed, 14 Sep 2022 12:35:03 -0700


difin commented on code in PR #3559:
URL: https://github.com/apache/hive/pull/3559#discussion_r971208288



##########
ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcSplit.java:
##########
@@ -104,13 +104,32 @@ public OrcSplit(Path path, Object fileId, long offset, long length, String[] hos
     this.isOriginal = isOriginal;
     this.hasBase = hasBase;
     this.rootDir = rootDir;
-    this.deltas.addAll(filterDeltasByBucketId(deltas, AcidUtils.parseBucketId(path)));
+    this.deltas.addAll(filterDeleteDeltasByWriteIds
+            (filterDeltasByBucketId(deltas, AcidUtils.parseBucketId(path)), conf));
     this.projColsUncompressedSize = projectedDataSize <= 0 ? length : projectedDataSize;
     // setting file length to Long.MAX_VALUE will let orc reader read file length from file system
     this.fileLen = fileLen <= 0 ? Long.MAX_VALUE : fileLen;
     this.syntheticAcidProps = syntheticAcidProps;
   }
 
+  /**
+   * For every split we want to filter out the delete deltas that contain events that happened only
+   * in the past relative to the split
+   * @param deltas
+   * @param conf
+   * @return
+   */
+  protected List<AcidInputFormat.DeltaMetaData> filterDeleteDeltasByWriteIds(
+          List<AcidInputFormat.DeltaMetaData> deltas, Configuration conf) throws IOException {
+
+    AcidOutputFormat.Options orcSplitMinMaxWriteIds =
+            AcidUtils.parseBaseOrDeltaBucketFilename(getPath(), conf);

Review Comment:
   Hi @deniskuzZ,
   After I switched from AcidUtils.parseBaseOrDeltaBucketFilename() to
   AcidUtils.ParsedDeltaLight.parse(), many tests failed with an exception on this line:
   
https://github.com/apache/hive/blob/e352684d5c87df1483444afc4c3ee897270bd413/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcSplit.java#L108
   
   As I understand it, the split is not always in a delta folder; it can be in some older
   format that ParsedDeltaLight does not support. I saw that
   AcidUtils.parseBaseOrDeltaBucketFilename() uses ParsedDeltaLight.parse() internally in
   some cases, but not always. Could you please advise whether I should revert to
   AcidUtils.parseBaseOrDeltaBucketFilename(), which worked in all cases, or whether there
   is a better way?
   
   
https://github.com/apache/hive/blob/f6bd0eb80767adfa9ce9f47a6d02a4940903effb/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java#L538-L552
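
To illustrate the concern, here is a minimal standalone sketch of the fallback behaviour described above. It is not the Hive implementation and calls no Hive API: the class `SplitWriteIdSketch`, both helper methods, and the regular expression are hypothetical. It assumes the usual ACID directory naming (`delta_<min>_<max>[_<stmt>]`, `delete_delta_<min>_<max>[_<stmt>]`, optionally followed by a `_v<id>` suffix) and assumes that a split whose parent directory is not a delta directory (a base directory or an original-format file) should be treated as having minimum write ID 0, so that no delete delta is filtered out for it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitWriteIdSketch {
    // Assumed naming: delta_<min>_<max>[_<stmt>][_v<id>], optionally prefixed with "delete_".
    private static final Pattern DELTA_DIR =
        Pattern.compile("^(?:delete_)?delta_(\\d+)_(\\d+)(?:_\\d+)?(?:_v\\d+)?$");

    /**
     * Hypothetical helper: minimum write ID encoded in the split's parent
     * directory name, or 0 when the name is not a delta directory
     * (for example a base directory or an original-format table directory).
     */
    static long minWriteIdOf(String parentDirName) {
        Matcher m = DELTA_DIR.matcher(parentDirName);
        return m.matches() ? Long.parseLong(m.group(1)) : 0L;
    }

    /**
     * Keeps the delete deltas whose maximum write ID is not older than the
     * split's minimum write ID. Names that cannot be parsed are kept, which
     * is the conservative choice.
     */
    static List<String> relevantDeleteDeltas(String splitParentDir, List<String> deleteDeltaDirs) {
        long splitMin = minWriteIdOf(splitParentDir);
        List<String> kept = new ArrayList<>();
        for (String dir : deleteDeltaDirs) {
            Matcher m = DELTA_DIR.matcher(dir);
            if (!m.matches() || Long.parseLong(m.group(2)) >= splitMin) {
                kept.add(dir);
            }
        }
        return kept;
    }
}
```

With this fallback, a split under `delta_0000005_0000005_0000` would skip `delete_delta_0000003_0000003_0000`, while a split under a base directory would keep every delete delta, because a parser that only understands delta directory names has no minimum write ID to work with.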



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

