[ 
https://issues.apache.org/jira/browse/HUDI-7583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-7583.
-----------------------------
    Resolution: Fixed

> Read log block header only for the schema and instant time
> ----------------------------------------------------------
>
>                 Key: HUDI-7583
>                 URL: https://issues.apache.org/jira/browse/HUDI-7583
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.15.0, 1.0.0
>
>
> The TableSchemaResolver reads the schema from the log file header.  The 
> log reader is currently instantiated without lazy reading, so the full 
> content of every block is read even though only the header is needed.  This 
> causes an OOM on the Spark driver when clustering rewrites a file group 
> that contains log files, because the current logic derives the schema from 
> the file group.
> {code:java}
>   public static MessageType readSchemaFromLogFile(FileSystem fs, Path path) throws IOException {
>     try (Reader reader = HoodieLogFormat.newReader(fs, new HoodieLogFile(path), null)) {
>       HoodieDataBlock lastBlock = null;
>       while (reader.hasNext()) {
>         HoodieLogBlock block = reader.next();
>         if (block instanceof HoodieDataBlock) {
>           lastBlock = (HoodieDataBlock) block;
>         }
>       }
>       return lastBlock != null ? new AvroSchemaConverter().convert(lastBlock.getSchema()) : null;
>     }
>   }
> {code}
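
The idea of reading only block headers can be illustrated with a minimal, self-contained sketch. It does not use Hudi's API or on-disk layout: the block framing (header length, header bytes, content length, content bytes) and the names HeaderOnlyLogScan, writeBlock, and readLastHeader are all hypothetical, chosen only to show a scan that keeps the last header while skipping each content payload instead of loading it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Toy log format (an assumption for illustration, not Hudi's layout):
// each block is [int headerLen][header bytes][int contentLen][content bytes].
class HeaderOnlyLogScan {

  /** Appends one block to the stream using the toy framing above. */
  public static void writeBlock(DataOutputStream out, String header, byte[] content)
      throws IOException {
    byte[] headerBytes = header.getBytes(StandardCharsets.UTF_8);
    out.writeInt(headerBytes.length);
    out.write(headerBytes);
    out.writeInt(content.length);
    out.write(content);
  }

  /**
   * Returns the header of the last block, or null if the log is empty.
   * Content payloads are skipped, never copied into a buffer.
   */
  public static String readLastHeader(DataInputStream in) throws IOException {
    String last = null;
    while (true) {
      int headerLen;
      try {
        headerLen = in.readInt();
      } catch (EOFException endOfLog) {
        return last; // clean end of log: no further blocks
      }
      byte[] headerBytes = new byte[headerLen];
      in.readFully(headerBytes);
      last = new String(headerBytes, StandardCharsets.UTF_8);

      // Skip the payload; skip() may advance less than requested, so loop.
      long remaining = in.readInt();
      while (remaining > 0) {
        long skipped = in.skip(remaining);
        if (skipped <= 0) {
          throw new EOFException("truncated block content");
        }
        remaining -= skipped;
      }
    }
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buffer);
    writeBlock(out, "schema-v1", new byte[1 << 20]);
    writeBlock(out, "schema-v2", new byte[1 << 20]);

    DataInputStream in =
        new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
    System.out.println(readLastHeader(in));
  }
}
```

In this sketch the memory held by the scan is bounded by the largest header rather than the largest block, which is the property the issue asks for when only the schema and instant time are needed.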



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
