Alexey Kudinkin created HUDI-3709:
-------------------------------------
Summary: Writing through Spark does not respect Parquet Max File
Size setting
Key: HUDI-3709
URL: https://issues.apache.org/jira/browse/HUDI-3709
Project: Apache Hudi
Issue Type: Bug
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
Currently, writing through the Spark DataSource connector does not respect the
"hoodie.parquet.max.file.size" setting: in the snippet pasted below I'm trying
to limit the file size to 16 MB, but on disk I'm getting ~80 MB files.
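(The original snippet is not reproduced in this message; a minimal sketch of the kind of write being described, assuming the standard Hudi Spark DataSource options and a hypothetical table name and path, would look like:)

```scala
// Illustrative sketch only, not the reporter's actual snippet.
// "hoodie.parquet.max.file.size" is the standard Hudi config key;
// table name and output path below are placeholders.
df.write
  .format("hudi")
  .option("hoodie.table.name", "test_table")
  .option("hoodie.parquet.max.file.size", String.valueOf(16 * 1024 * 1024)) // 16 MB cap
  .mode("overwrite")
  .save("/tmp/hudi/test_table")
```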
The reason seems to be that we rely on ParquetWriter to control the file size
(via the `canWrite` method), which in turn relies on the FileSystem to track how
much has actually been written to the FS.
The problem with this approach is that Spark writes {*}lazily{*}:
it creates instances of ParquetWriter, which buffer the whole row
group in memory as the `write` methods are invoked, and only flush the data to
the FS _when the Writer is closed_ (i.e. when `close` is invoked).
--
This message was sent by Atlassian Jira
(v8.20.1#820001)