yihua opened a new pull request, #5840:
URL: https://github.com/apache/hudi/pull/5840
## What is the purpose of the pull request
Reading the metadata table directly by its path in Spark, e.g.,
`spark.read.format("hudi").load("<base_path>/.hoodie/metadata/").show`,
throws a `NullPointerException` from `getLogRecordScanner`:
```
Caused by: java.lang.NullPointerException
    at org.apache.hudi.metadata.HoodieBackedTableMetadata.getLogRecordScanner(HoodieBackedTableMetadata.java:484)
    at org.apache.hudi.HoodieMergeOnReadRDD$.scanLog(HoodieMergeOnReadRDD.scala:342)
    at org.apache.hudi.HoodieMergeOnReadRDD$LogFileIterator.<init>(HoodieMergeOnReadRDD.scala:173)
    at org.apache.hudi.HoodieMergeOnReadRDD$RecordMergingFileIterator.<init>(HoodieMergeOnReadRDD.scala:252)
    at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:101)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
```
The root cause is that, in `HoodieMergeOnReadRDD.scanLog`,
`tableState.metadataConfig` does not have `hoodie.metadata.enable` set to
`true` by default. As a result, the `HoodieBackedTableMetadata` instantiated
from that config does not initialize `metadataMetaClient`, which causes the
NPE. Because the user explicitly specifies the metadata table path for reading
in this use case, `hoodie.metadata.enable` should be overridden to `true` so
that the read behaves correctly.
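
To make the failure mode concrete, here is a minimal, self-contained sketch of how a disabled flag can leave a field uninitialized and surface later as an NPE. It is not Hudi's code: `SketchMetadataReader`, `NpeSketch`, the string used in place of a real meta client, and the `/tmp/table` path are all hypothetical, and treating an unset flag as `false` is an assumption that mirrors the behavior described above.

```java
import java.util.Properties;

// Hypothetical stand-in for a metadata reader; this is NOT Hudi's actual class.
final class SketchMetadataReader {
    static final String ENABLE_KEY = "hoodie.metadata.enable";

    // Stays null unless the enable flag is true, mirroring the failure described above.
    private final String metaClient;

    SketchMetadataReader(Properties props, String metadataBasePath) {
        // Assumption: an unset flag is treated as false in this sketch.
        boolean enabled = Boolean.parseBoolean(props.getProperty(ENABLE_KEY, "false"));
        this.metaClient = enabled ? "metaClient@" + metadataBasePath : null;
    }

    // Dereferences metaClient, so it throws NullPointerException when the flag was off.
    int scanLog() {
        return metaClient.length();
    }
}

final class NpeSketch {
    // Returns true if scanning with the given properties hits a NullPointerException.
    static boolean scanThrowsNpe(Properties props) {
        try {
            // Hypothetical path, used only for illustration.
            new SketchMetadataReader(props, "/tmp/table/.hoodie/metadata").scanLog();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Properties defaults = new Properties(); // flag not set
        Properties overridden = new Properties();
        overridden.setProperty(SketchMetadataReader.ENABLE_KEY, "true");

        System.out.println("default config throws NPE: " + scanThrowsNpe(defaults));
        System.out.println("overridden config throws NPE: " + scanThrowsNpe(overridden));
    }
}
```

In this sketch the constructor succeeds either way and the failure only appears when the field is first dereferenced, which matches the stack trace pointing at the scan path and not at construction.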
## Brief change log
- In `HoodieMergeOnReadRDD.scanLog`, rebuild the `HoodieMetadataConfig`
with `hoodie.metadata.enable` set to `true`
- Fix `TestMetadataTableWithSparkDataSource` to follow the common pattern
for reading the metadata table, i.e.,
`spark.read.format("hudi").load("<base_path>/.hoodie/metadata/")`, without
setting any options
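
The first change above, rebuilding the config with the flag forced on, can be illustrated with plain `java.util.Properties`. This is a sketch under stated assumptions and not the code in this PR: `ConfigRebuildSketch`, `withMetadataEnabled`, and the `example.other.setting` key are hypothetical names, and the real change operates on `HoodieMetadataConfig` instead of raw properties.

```java
import java.util.Properties;

// Hypothetical helper illustrating the "rebuild with override" pattern.
final class ConfigRebuildSketch {
    static final String ENABLE_KEY = "hoodie.metadata.enable";

    // Copies every entry of the original properties, then forces the enable flag to true.
    // The original object is left untouched.
    static Properties withMetadataEnabled(Properties original) {
        Properties rebuilt = new Properties();
        for (String name : original.stringPropertyNames()) {
            rebuilt.setProperty(name, original.getProperty(name));
        }
        rebuilt.setProperty(ENABLE_KEY, "true");
        return rebuilt;
    }

    public static void main(String[] args) {
        Properties fromTableState = new Properties();
        fromTableState.setProperty(ENABLE_KEY, "false");
        fromTableState.setProperty("example.other.setting", "42"); // hypothetical key

        Properties rebuilt = withMetadataEnabled(fromTableState);
        System.out.println("rebuilt flag: " + rebuilt.getProperty(ENABLE_KEY));
        System.out.println("other setting kept: " + rebuilt.getProperty("example.other.setting"));
        System.out.println("original flag: " + fromTableState.getProperty(ENABLE_KEY));
    }
}
```

Building a new object instead of mutating the existing one keeps the override local to the read path, so any other code holding the original config still sees its own values.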
## Verify this pull request
Before this PR, `TestMetadataTableWithSparkDataSource` fails when reading with
`spark.read.format("hudi").load("<base_path>/.hoodie/metadata/")`. After this
PR, the test class passes. Reading the metadata table through Spark was also
verified with Spark 2.4.4, 3.1.3, and 3.2.1, both locally and on S3.
## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under
an umbrella JIRA.