[
https://issues.apache.org/jira/browse/HUDI-7583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sagar Sumit closed HUDI-7583.
-----------------------------
Resolution: Fixed
> Read log block header only for the schema and instant time
> ----------------------------------------------------------
>
> Key: HUDI-7583
> URL: https://issues.apache.org/jira/browse/HUDI-7583
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.15.0, 1.0.0
>
>
> The TableSchemaResolver reads the schema from the log file header. However,
> the log reader is currently instantiated without lazy reading, so the entire
> content of every block is read even though only the header is needed. This
> causes an OOM on the Spark driver when clustering rewrites a file group that
> contains log files, because the current logic derives the schema from the
> file group.
> {code:java}
> public static MessageType readSchemaFromLogFile(FileSystem fs, Path path) throws IOException {
>   try (Reader reader = HoodieLogFormat.newReader(fs, new HoodieLogFile(path), null)) {
>     HoodieDataBlock lastBlock = null;
>     while (reader.hasNext()) {
>       HoodieLogBlock block = reader.next();
>       if (block instanceof HoodieDataBlock) {
>         lastBlock = (HoodieDataBlock) block;
>       }
>     }
>     return lastBlock != null ? new AvroSchemaConverter().convert(lastBlock.getSchema()) : null;
>   }
> }
> {code}
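
The lazy-read idea described in the issue can be illustrated with a minimal, self-contained sketch. It does not use Hudi's reader API and is not the actual fix: the class LazyLogBlockSketch, its nested Block type, and the lastSchema helper are hypothetical names introduced only for illustration. It assumes each block carries a small header (schema and instant time) and a content payload that is loaded only on demand, so a scan over headers never touches the payload.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class LazyLogBlockSketch {

    /** Hypothetical stand-in for a log block: a cheap header plus a deferred content loader. */
    static final class Block {
        final String schema;        // header field, cheap to read
        final String instantTime;   // header field, cheap to read
        private final Supplier<byte[]> contentLoader; // deferred, potentially large
        private byte[] content;     // populated on the first getContent() call

        Block(String schema, String instantTime, Supplier<byte[]> contentLoader) {
            this.schema = schema;
            this.instantTime = instantTime;
            this.contentLoader = contentLoader;
        }

        /** Loads the content on first access only. */
        byte[] getContent() {
            if (content == null) {
                content = contentLoader.get();
            }
            return content;
        }
    }

    /** Returns the schema of the last block using header fields only. */
    static String lastSchema(List<Block> blocks) {
        String schema = null;
        for (Block block : blocks) {
            schema = block.schema; // never calls getContent()
        }
        return schema;
    }

    public static void main(String[] args) {
        int[] loads = {0}; // counts how many times block content is actually loaded
        List<Block> blocks = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            final int version = i;
            blocks.add(new Block("schema-v" + version, "00" + version, () -> {
                loads[0]++;
                return new byte[1024];
            }));
        }
        System.out.println(lastSchema(blocks)); // prints "schema-v2"
        System.out.println(loads[0]);           // prints "0": no content was loaded
    }
}
```

In this sketch the memory held during the scan is bounded by the header sizes rather than by the block contents, which is the property the issue title asks for.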