Hi All,

I frequently write large (10+ GB) Parquet files to S3 using the
ParquetWriter class in Python.  These files are written using the S3
multipart upload functionality provided by the underlying S3FileSystem
implementation: I call S3FileSystem.open_output_stream() and pass the
resulting stream to the ParquetWriter.
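
For reference, the write path looks roughly like this (the bucket, key,
region, schema, and data below are placeholders for illustration):

    import pyarrow as pa
    import pyarrow.parquet as pq
    from pyarrow import fs

    s3 = fs.S3FileSystem(region="us-east-1")    # placeholder region
    schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

    # Open an S3 output stream; large writes go through a multipart upload.
    sink = s3.open_output_stream("my-bucket/data/big.parquet")
    with pq.ParquetWriter(sink, schema) as writer:
        # In practice this loop runs over many large record batches.
        writer.write_table(
            pa.table({"id": [1, 2], "value": [0.1, 0.2]}, schema=schema))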

Sometimes, while I'm writing the Parquet file in Python, an exception is
raised (you all know, things happen).  The problem is that Arrow doesn't
abort the multipart S3 upload that's in progress (which would cancel the
creation of the S3 key); instead, it completes the upload as if everything
were fine.

The result is that a partially complete Parquet file ends up in S3,
because the upload is finalized even though the write failed.  That isn't
great.
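
To make the failure mode concrete (same placeholder names as above, with a
simulated exception):

    sink = s3.open_output_stream("my-bucket/data/big.parquet")
    with pq.ParquetWriter(sink, schema) as writer:
        writer.write_table(first_chunk)           # placeholder table; succeeds
        raise RuntimeError("source went away")    # simulated mid-write failure
    # The exception propagates, but the writer and the stream are still
    # closed on the way out, which completes the multipart upload, so a
    # truncated Parquet file appears at my-bucket/data/big.parquet anyway.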

The ParquetWriter API has no way to abort an in-progress write; there is
only a close() method.  On a local filesystem, it would make sense to
unlink the file (since it only contains partial data); on S3, it would be
logical to abort the multipart upload.
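
About the best I can do today is clean up after the fact, along the lines
of the sketch below (write_all_batches and the path are placeholders): let
the unwanted upload complete and then delete the resulting object.

    path = "my-bucket/data/big.parquet"   # placeholder
    sink = s3.open_output_stream(path)
    writer = pq.ParquetWriter(sink, schema)
    try:
        write_all_batches(writer)         # application code (placeholder)
    except Exception:
        writer.close()                    # finalizes the unwanted upload
        s3.delete_file(path)              # then delete the object afterwards
        raise
    else:
        writer.close()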

I propose expanding the ParquetWriter API with an abort() method that
behaves as described above.  I'd also like to change the ParquetWriter
context manager so that it aborts the write whenever an exception is
raised inside the writer's context.
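
To illustrate, explicit use of the proposed (not yet existing) abort()
would look something like this:

    writer = pq.ParquetWriter(sink, schema)
    try:
        write_all_batches(writer)    # application code (placeholder)
    except Exception:
        writer.abort()               # proposed: abort the multipart upload,
                                     # or unlink the partial local file
        raise
    else:
        writer.close()

and with the context-manager change, the try/except above collapses to:

    with pq.ParquetWriter(sink, schema) as writer:
        write_all_batches(writer)    # an exception here would trigger abort()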

Concretely, this would mean changing the context manager's __exit__ to
check whether an exception was raised in the context and, if so, call
abort() instead of close():

https://github.com/apache/arrow/blob/5d61c62008f2f6e02951670f8d0996d90638114c/python/pyarrow/parquet/core.py#L994
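
Roughly, a sketch of the proposed __exit__ (not the current code):

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is None:
            self.close()
        else:
            self.abort()   # proposed: cancel the output instead of finalizing it
        return False       # don't suppress the exception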

Rusty
