Hi All,

I frequently write large (10+ GB) Parquet files to S3 using the ParquetWriter class in Python. The files are written with the S3 multipart upload functionality provided by the underlying S3FileSystem implementation: I call S3FileSystem.open_output_stream() and pass the resulting stream to the ParquetWriter.
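For context, the write path looks roughly like this (the bucket, schema, and produce_tables() generator are made up for illustration):

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")  # illustrative region
schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

# open_output_stream() starts an S3 multipart upload under the hood
with s3.open_output_stream("my-bucket/data/big.parquet") as sink:
    with pq.ParquetWriter(sink, schema) as writer:
        for table in produce_tables():  # placeholder for my actual data source
            writer.write_table(table)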
Sometimes, while I'm writing the Parquet file in Python, an exception is raised (you all know, things happen). The problem is that Arrow doesn't abort the in-progress multipart S3 upload (which would cancel the creation of the S3 key); instead, it completes the upload as if everything were fine. As a result, a partially written Parquet file ends up on S3. That isn't great.

The ParquetWriter API offers no way to abandon the output; there is only a close() method. On a local filesystem it would make sense to unlink the file (since it only contains partial data); on S3, the logical action is to abort the multipart upload.

I propose expanding the ParquetWriter API with an abort() method that behaves as described above. I'd also like to change the ParquetWriter context manager so that it aborts the write when an exception is raised inside the writer's context. That would mean checking in __exit__ whether an exception occurred and calling abort() instead of close(): https://github.com/apache/arrow/blob/5d61c62008f2f6e02951670f8d0996d90638114c/python/pyarrow/parquet/core.py#L994
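To make the proposed semantics concrete, here's a rough sketch (abort() and the exception-aware __exit__ are hypothetical; they don't exist in pyarrow today):

class ParquetWriter:
    # ... existing methods unchanged ...

    def abort(self):
        """Hypothetical: discard everything written so far instead of finalizing.

        On a local filesystem this would close and unlink the file; on S3 it
        would abort the in-progress multipart upload so no key is created.
        """
        ...

    def __exit__(self, exc_type, exc_value, traceback):
        # Proposed behavior: only finalize the file on a clean exit;
        # otherwise throw away the partial output.
        if exc_type is None:
            self.close()
        else:
            self.abort()
        return False  # never swallow the exception

Rusty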