Alexey Kudinkin created HUDI-3709:
-------------------------------------

             Summary: Writing through Spark does not respect Parquet Max File Size setting
                 Key: HUDI-3709
                 URL: https://issues.apache.org/jira/browse/HUDI-3709
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Alexey Kudinkin
            Assignee: Alexey Kudinkin


Currently, writing through the Spark DataSource connector does not respect the 
"hoodie.parquet.max.file.size" setting: in the snippet pasted below I'm trying 
to limit the file size to 16MB, while on disk I'm getting ~80MB files.
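For illustration, a minimal sketch of the kind of write being described (the paths, table name, and key fields here are hypothetical placeholders, not taken from the original report):

```java
// Hypothetical sketch: writing a Dataset<Row> `df` through the Hudi Spark
// DataSource with "hoodie.parquet.max.file.size" capped at 16MB.
df.write()
  .format("hudi")
  .option("hoodie.table.name", "test_table")                       // placeholder
  .option("hoodie.datasource.write.recordkey.field", "uuid")       // placeholder
  .option("hoodie.datasource.write.precombine.field", "ts")        // placeholder
  .option("hoodie.parquet.max.file.size", String.valueOf(16 * 1024 * 1024))
  .mode("append")
  .save("/tmp/hudi/test_table");                                   // placeholder
```

With the described bug, files on disk come out around ~80MB despite the 16MB cap.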

 

The reason for that seems to be that we rely on ParquetWriter to control the 
file size (the `canWrite` method), which in turn relies on the FileSystem to 
track how much was actually written to it.
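A simplified sketch of that size check (class and method names here are hypothetical stand-ins, not Hudi's actual classes): `canWrite` compares the byte count the FileSystem reports against the configured cap, which is exactly what breaks if the writer buffers data the FileSystem has not yet seen.

```java
// Simplified, hypothetical sketch of a size-gated writer: canWrite() compares
// FileSystem-reported bytes against the configured max file size.
class SizeGatedWriter {
    private final long maxFileSize;
    private long bytesReportedByFs = 0; // stand-in for FileSystem-tracked bytes

    SizeGatedWriter(long maxFileSize) {
        this.maxFileSize = maxFileSize;
    }

    // The gate: keep writing to this file only while the FS-visible size
    // is under the cap.
    boolean canWrite() {
        return bytesReportedByFs < maxFileSize;
    }

    // Invoked when data actually reaches the FileSystem (e.g. on flush/close).
    void onBytesFlushed(long n) {
        bytesReportedByFs += n;
    }
}
```

If bytes only reach the FileSystem on close, `canWrite` keeps returning true long past the cap.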

 

The problem with this approach is that Spark is writing *lazily*:

it creates instances of ParquetWriter which in turn cache the whole row 
group when the `write` methods are invoked, and flush the data to the FS only 
_when closing_ the Writer (i.e. when `close` is invoked).
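The lazy-buffering behavior described above can be illustrated with a toy model (not the actual Parquet APIs): the FS-visible byte count stays at zero no matter how much is written, until `close` flushes the cached row group.

```java
// Toy illustration of lazy flushing: write() only caches bytes in the
// writer's in-memory row group; the "file system" sees nothing until close().
// Any size check based on fsVisibleBytes() observes 0 until close.
class BufferingWriter {
    private long buffered = 0; // bytes cached in the writer (the row group)
    private long fsBytes = 0;  // bytes the FileSystem has actually received

    void write(long recordSize) {
        buffered += recordSize;  // cached in memory, nothing hits the FS
    }

    void close() {
        fsBytes += buffered;     // data reaches the FS only here
        buffered = 0;
    }

    long fsVisibleBytes() {
        return fsBytes;
    }
}
```

This is why a `canWrite`-style check driven by FS-reported size never trips mid-write, and files blow past the configured limit.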



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
