[GitHub] [hudi] bhasudha commented on a change in pull request #1817: [HUDI-651] Fix incremental queries in MOR tables

GitBox Wed, 05 Aug 2020 11:41:09 -0700


bhasudha commented on a change in pull request #1817:
URL: https://github.com/apache/hudi/pull/1817#discussion_r465927669




##########
File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java
##########
@@ -165,11 +261,15 @@ private static void cleanProjectionColumnIds(Configuration conf) {
     LOG.info("Creating record reader with readCols :" + jobConf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR)
         + ", Ids :" + jobConf.get(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR));
     // sanity check
-    ValidationUtils.checkArgument(split instanceof HoodieRealtimeFileSplit,
+    ValidationUtils.checkArgument(split instanceof HoodieRealtimeFileSplit || split instanceof HoodieMORIncrementalFileSplit,

Review comment:
       @satishkotha  There are a few requirements we need to satisfy in order
to support this in HoodieRealtimeFileSplit:
   
   - The start and end times should be honored by the incremental query. If the
end time is not specified, it can be assumed to be the min commit from
(maxNumberOfCommits, mostRecentCommit). Currently this is not happening as
intended.
   - The base file and log files can each be optional. This can happen when the
boundaries of the incremental query filter are such that the start commit time
matches a log file and/or the end commit time matches only the base file across
file slices, or when the incremental query touches a FileSlice that is not
compacted yet.
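
   The two requirements above can be illustrated with a minimal, standalone
sketch. It is not Hudi code: the class and method names are hypothetical, commit
times are modeled as plain strings compared lexicographically, and the range is
assumed to be exclusive of the start commit and inclusive of the end commit. The
default for a missing end time is one possible reading of "min commit from
(maxNumberOfCommits, mostRecentCommit)".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; none of these names are Hudi APIs.
final class IncrementalRangeSketch {

  /** Files of one file slice that survive the filter. Either part may be absent. */
  static final class SelectedFiles {
    final Optional<String> baseFileCommit;
    final List<String> logFileCommits;

    SelectedFiles(Optional<String> baseFileCommit, List<String> logFileCommits) {
      this.baseFileCommit = baseFileCommit;
      this.logFileCommits = logFileCommits;
    }
  }

  /**
   * Requirement 1: honor an explicit end time. Otherwise stop at whichever comes
   * first: the maxNumberOfCommits-th commit after the start, or the most recent commit.
   * commitsAfterStart is assumed to be sorted in ascending order.
   */
  static Optional<String> resolveEndCommit(List<String> commitsAfterStart,
                                           Optional<String> requestedEnd,
                                           int maxNumberOfCommits) {
    if (requestedEnd.isPresent()) {
      return requestedEnd;
    }
    if (commitsAfterStart.isEmpty() || maxNumberOfCommits <= 0) {
      return Optional.empty();
    }
    int lastIndex = Math.min(maxNumberOfCommits, commitsAfterStart.size()) - 1;
    return Optional.of(commitsAfterStart.get(lastIndex));
  }

  /** Assumed range semantics: start exclusive, end inclusive. */
  static boolean inRange(String commit, String start, String end) {
    return commit.compareTo(start) > 0 && commit.compareTo(end) <= 0;
  }

  /**
   * Requirement 2: keep only the files whose commit time falls in the range,
   * so a slice may yield only log files, only a base file, or nothing.
   */
  static SelectedFiles select(Optional<String> baseFileCommit,
                              List<String> logFileCommits,
                              String start,
                              String end) {
    Optional<String> base = baseFileCommit.filter(c -> inRange(c, start, end));
    List<String> logs = new ArrayList<>();
    for (String c : logFileCommits) {
      if (inRange(c, start, end)) {
        logs.add(c);
      }
    }
    return new SelectedFiles(base, logs);
  }
}
```

   In this sketch, a slice whose base file was written at the start commit
yields only its log files, and a slice whose log files were all written after
the end commit yields only its base file. A split type that must always carry a
base file cannot represent the first case, which is the constraint the comment
describes.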
   
   When I initially started, I was not sure how big the refactor, and the
testing for it, would be to achieve both of the above requirements in the same
HoodieRealtimeFileSplit. It would also require regression testing of snapshot
queries in all query engines, plus testing of the new incremental query path in
all query engines. So instead of impacting the snapshot query code path, which
is running fine, I conservatively branched out to make these changes apply only
to the incremental query path, intending to consolidate the two in the long
term after stabilizing and gaining more confidence.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
