nurpax opened a new issue, #40800:
URL: https://github.com/apache/arrow/issues/40800

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   Suppose I have some code like:
   
   ```
   import pyarrow as pa
   import pyarrow.parquet as pq

   parquet_path = "dbg/index.parquet"
   with pq.ParquetWriter(parquet_path, schema, compression='snappy',
                         filesystem=s3_fs) as w:
       for batch in batches:
           b = pa.RecordBatch.from_pydict(batch, schema=schema)
           w.write_batch(b)
   ```
   
   What if an exception is thrown in the `batches` loop, or the program gets SIGKILLed, or a write fails? Will the resulting file behind the filesystem be partially written?
   
   I'm asking because on a local filesystem I'd expect this to leave behind a corrupt file. In those cases I write to a tmp file and rename it on successful completion, as sketched below.
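
   For reference, here's roughly what that pattern looks like (a minimal sketch; the `write_parquet_atomically` helper name and the `.tmp` suffix are just my own convention):

   ```
   import os

   import pyarrow as pa
   import pyarrow.parquet as pq

   def write_parquet_atomically(path, schema, batches):
       # Write to a temporary file first, so a crash mid-write never
       # leaves a partial file at the final path.
       tmp_path = path + ".tmp"
       with pq.ParquetWriter(tmp_path, schema, compression='snappy') as w:
           for batch in batches:
               w.write_batch(pa.RecordBatch.from_pydict(batch, schema=schema))
       # os.replace is atomic on POSIX filesystems.
       os.replace(tmp_path, path)
   ```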
   
   But I'm now planning to write the Parquet file to S3. Will the data be flushed to S3 atomically, i.e., can I trust that either a successful write creates a new object or nothing is created at all?
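
   If the answer is no, my fallback would be to write the whole file locally first and then upload the finished file in one go, e.g. with `pyarrow.fs.copy_files` (again just a sketch; it assumes `s3_fs` is a `pyarrow.fs.S3FileSystem`, and the bucket path is made up):

   ```
   import pyarrow.fs as pafs

   # Write the complete file locally (tmp file + rename as above),
   # then copy it to S3 as a single object.
   local_path = "dbg/index.parquet"
   pafs.copy_files(local_path, "my-bucket/dbg/index.parquet",
                   source_filesystem=pafs.LocalFileSystem(),
                   destination_filesystem=s3_fs)
   ```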
   
   ### Component(s)
   
   Python

