Alexey Kudinkin created HUDI-3709:
-------------------------------------
Summary: Writing through Spark does not respect Parquet Max File
Size setting
Key: HUDI-3709
URL: https://issues.apache.org/jira/browse/HUDI-3709
Project: Apache Hudi
Issue Type: Bug
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
Currently, writing through the Spark DataSource connector does not respect the
"hoodie.parquet.max.file.size" setting: in the snippet pasted below I'm trying
to limit the file size to 16 MB, but on disk I'm getting ~80 MB files.
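(The original snippet is not reproduced in this message; a minimal sketch of the kind of write being described, assuming the standard Hudi Spark DataSource options and a hypothetical table name and path, would look like:)

```scala
// Illustrative sketch only, not the reporter's actual snippet.
// "hoodie.parquet.max.file.size" is the standard Hudi config key;
// table name and output path below are placeholders.
df.write
  .format("hudi")
  .option("hoodie.table.name", "test_table")
  .option("hoodie.parquet.max.file.size", String.valueOf(16 * 1024 * 1024)) // 16 MB cap
  .mode("overwrite")
  .save("/tmp/hudi/test_table")
```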
The reason seems to be that we rely on ParquetWriter to control the file size
(via the `canWrite` method), which in turn relies on the FileSystem to track how
much has actually been written to the FS.
The problem with this approach is that Spark writes {*}lazily{*}:
it creates instances of ParquetWriter, which buffer the whole row
group in memory as the `write` methods are invoked, and only flush the data to
the FS _when the Writer is closed_ (i.e. when `close` is invoked).
--
This message was sent by Atlassian Jira
(v8.20.1#820001)