Re: [PR] feat: introduce pk filter to log file [hudi]

via GitHub Thu, 06 Nov 2025 23:03:28 -0800


TheR1sing3un commented on code in PR #14205:
URL: https://github.com/apache/hudi/pull/14205#discussion_r2501915239



##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:
##########
@@ -78,16 +80,16 @@ class SparkFileFormatInternalRowReaderContext(baseFileReader: SparkColumnarFileR
       assert(getRecordContext.supportsParquetRowIndex())
     }
     val structType = HoodieInternalRowUtils.getCachedSchema(requiredSchema)
+    val (readSchema, readFilters) = getSchemaAndFiltersForRead(structType, hasRowIndexField)

Review Comment:
   > I see the filters also include `requiredFilters`, can you investigate a 
little more what it is for MOR reading.
   
   Thanks, Danny, you are right. I dug further into the `requiredFilters` 
logic in MOR reading. `requiredFilters` only has an actual value when we 
perform an incremental MOR read. In that case it filters data on the 
`_hoodie_commit_time` metadata column, which does not meet the conditions of a 
primary key filter. That doesn't matter, though: pushing this filter down does 
not affect the correctness of our data. Because of the particular nature of the 
commit time itself, we can push this filter down at any time.
   <img width="868" height="219" alt="image" src="https://github.com/user-attachments/assets/4d37bbee-954a-422a-b8ce-affe7b442f67" />
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
