yihua opened a new pull request, #5840:
URL: https://github.com/apache/hudi/pull/5840

   ## What is the purpose of the pull request
   
   When reading the metadata table directly with the metadata table path in Spark, i.e., `spark.read.format("hudi").load("<base_path>/.hoodie/metadata/").show`, it throws a `NullPointerException` from `getLogRecordScanner`:
   ```
   Caused by: java.lang.NullPointerException
      at org.apache.hudi.metadata.HoodieBackedTableMetadata.getLogRecordScanner(HoodieBackedTableMetadata.java:484)
      at org.apache.hudi.HoodieMergeOnReadRDD$.scanLog(HoodieMergeOnReadRDD.scala:342)
      at org.apache.hudi.HoodieMergeOnReadRDD$LogFileIterator.<init>(HoodieMergeOnReadRDD.scala:173)
      at org.apache.hudi.HoodieMergeOnReadRDD$RecordMergingFileIterator.<init>(HoodieMergeOnReadRDD.scala:252)
      at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:101)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   ```
   The root cause is that, in `HoodieMergeOnReadRDD.scanLog`, `tableState.metadataConfig` does not have `hoodie.metadata.enable` set to `true` by default.  As a result, the `HoodieBackedTableMetadata` instantiated from that config does not properly initialize `metadataMetaClient`, causing the NPE.  In this use case, since the user explicitly specifies the metadata table path for reading, `hoodie.metadata.enable` should be overridden to `true` for proper read behavior.
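   The override can be illustrated with a self-contained sketch. Note this is not Hudi's actual code: plain `java.util.Properties` stands in for `HoodieMetadataConfig`, and the object and method names are illustrative; only the key `hoodie.metadata.enable` comes from the fix described above.

   ```scala
   import java.util.Properties

   // Illustrative sketch of the idea behind the fix in HoodieMergeOnReadRDD.scanLog:
   // when the path being read is the metadata table itself, rebuild the config
   // with hoodie.metadata.enable forced to true, so that the metadata meta client
   // is initialized and getLogRecordScanner does not hit the NPE.
   object MetadataConfigOverride {
     val EnableKey = "hoodie.metadata.enable"

     def overrideForMetadataTable(original: Properties, isMetadataTable: Boolean): Properties = {
       // Copy the original config rather than mutating it in place.
       val rebuilt = new Properties()
       original.stringPropertyNames().forEach(k => rebuilt.setProperty(k, original.getProperty(k)))
       if (isMetadataTable) {
         // The user explicitly pointed Spark at <base_path>/.hoodie/metadata/,
         // so the flag must be on regardless of its (unset) default.
         rebuilt.setProperty(EnableKey, "true")
       }
       rebuilt
     }
   }
   ```

   For example, `overrideForMetadataTable(new Properties(), isMetadataTable = true).getProperty("hoodie.metadata.enable")` yields `"true"`, while passing `isMetadataTable = false` leaves the flag unset.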
   
   ## Brief change log
   
     - In `HoodieMergeOnReadRDD.scanLog`, rebuild the `HoodieMetadataConfig` 
with `hoodie.metadata.enable` set to `true`
     - Fix `TestMetadataTableWithSparkDataSource` to follow the common pattern for reading the metadata table, i.e., `spark.read.format("hudi").load("<base_path>/.hoodie/metadata/")`, without setting any options
    
   ## Verify this pull request
   
   Before this PR, `TestMetadataTableWithSparkDataSource` fails when reading via `spark.read.format("hudi").load("<base_path>/.hoodie/metadata/")`.  After this PR, the test class passes.  The Spark read of the metadata table is also verified with Spark 2.4.4, 3.1.3, and 3.2.1, both locally and on S3.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   

