bhasudha commented on a change in pull request #1817:
URL: https://github.com/apache/hudi/pull/1817#discussion_r465927669
##########
File path:
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java
##########
@@ -165,11 +261,15 @@ private static void
cleanProjectionColumnIds(Configuration conf) {
LOG.info("Creating record reader with readCols :" +
jobConf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR)
+ ", Ids :" +
jobConf.get(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR));
// sanity check
- ValidationUtils.checkArgument(split instanceof HoodieRealtimeFileSplit,
+ ValidationUtils.checkArgument(split instanceof HoodieRealtimeFileSplit ||
split instanceof HoodieMORIncrementalFileSplit,
Review comment:
@satishkotha There are a few requirements we need to satisfy in order to
support this in HoodieRealtimeFileSplit:
- The start and end time should be honored by the incremental query. If the end
time is not specified, it can be assumed to be the min commit of
(maxNumberOfCommits, mostRecentCommit). Currently this is not happening as
intended.
- The base file and log files can be optional. This can be the case when the
boundaries of the incremental query filter are such that the start commit time
matches a log file and/or the end commit time matches only the base file across
file slices, or when the incremental query touches a FileSlice that is not
compacted yet.
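To make the two requirements concrete, here is a minimal sketch, not the code in this PR. The class `IncrementalRangeSketch`, its `Slice` type, and both methods are hypothetical names used only for illustration. It assumes commit times are strings that sort lexicographically, that the start time is exclusive and the end time inclusive, and that a file slice can be reduced to the commit times of its base file and log files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical illustration only; none of these names are Hudi APIs. */
final class IncrementalRangeSketch {

  /** A file slice reduced to commit times: an optional base file plus log files. */
  static final class Slice {
    final Optional<String> baseCommit;
    final List<String> logCommits;

    Slice(Optional<String> baseCommit, List<String> logCommits) {
      this.baseCommit = baseCommit;
      this.logCommits = logCommits;
    }
  }

  /**
   * Requirement 1: honor the requested end time. When none is given, fall back
   * to the earlier of (a) the commit reached after maxNumberOfCommits commits
   * past the start and (b) the most recent commit.
   * Assumes `commits` is sorted ascending.
   */
  static Optional<String> resolveEndCommit(List<String> commits,
                                           String startExclusive,
                                           Optional<String> requestedEnd,
                                           int maxNumberOfCommits) {
    if (requestedEnd.isPresent()) {
      return requestedEnd;
    }
    List<String> after = new ArrayList<>();
    for (String commit : commits) {
      if (commit.compareTo(startExclusive) > 0) {
        after.add(commit);
      }
    }
    if (after.isEmpty() || maxNumberOfCommits < 1) {
      return Optional.empty(); // nothing to read incrementally
    }
    int index = Math.min(maxNumberOfCommits, after.size()) - 1;
    return Optional.of(after.get(index));
  }

  /**
   * Requirement 2: after applying the (start, end] filter, either the base
   * file or the log files of a slice may be absent.
   */
  static Slice restrict(Slice slice, String startExclusive, String endInclusive) {
    Optional<String> base =
        slice.baseCommit.filter(c -> inRange(c, startExclusive, endInclusive));
    List<String> logs = new ArrayList<>();
    for (String commit : slice.logCommits) {
      if (inRange(commit, startExclusive, endInclusive)) {
        logs.add(commit);
      }
    }
    return new Slice(base, logs);
  }

  private static boolean inRange(String commit, String startExclusive, String endInclusive) {
    return commit.compareTo(startExclusive) > 0 && commit.compareTo(endInclusive) <= 0;
  }
}
```

With commits `001` through `004`, a start of `001`, and no end time, a `maxNumberOfCommits` of 2 resolves the end to `003`. A slice whose base file is at `001` with logs at `002` and `003` then comes back with no base file and two log files, which is the kind of split the existing sanity check would reject.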
When I initially started, I was not sure how big the refactor and its testing
would be to achieve both of the above requirements in the same
HoodieRealtimeFileSplit. This would also require regression testing of snapshot
queries in all query engines and of the new incremental query path in all query
engines. So instead of impacting the snapshot query code path that is running
fine, I conservatively branched out to make these changes applicable only to
the incremental query path, and intended to consolidate them in the long term
after stabilizing and gaining more confidence.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]