nurpax commented on issue #40800:
URL: https://github.com/apache/arrow/issues/40800#issuecomment-2022146767

   > IMHO, this looks hacky to me as it depends on the implementation detail of 
the underlying file system
   
   It's a little more complicated than that in practice.  The whole processing 
work is split into thousands of Slurm jobs which individually can take days to 
weeks to complete.  The jobs may fail due to a number of reasons: canceled by 
admins, OOM, programming errors, networking, intermittent I/O errors.  I need 
to keep track of the overall process progress somehow and currently this state 
is persisted in the filesystem like explained above.  
   
   I can do this safely with the local file system, but as I was planning to 
save the parquets to s3, I noticed that I may not be able to rely on atomic 
operation here.  
   
   For reference, here's how I currently save parquets on the filesystem so 
that the result is either a complete parquet or nothing at all:
   
   ```
       def save_parquet(self, pq_path: str, batches: list[pa.RecordBatch], 
schema: pa.Schema):
           logging.info(f'Saving database in {pq_path}, {len(batches)} batches')
           try:
               tmp_fd, tmp_name = tempfile.mkstemp(dir=os.path.dirname(pq_path))
               with os.fdopen(tmp_fd, 'wb') as tmpf:
                   with pq.ParquetWriter(tmpf, schema) as writer:
                       for batch in batches:
                           writer.write_batch(batch)
               os.rename(tmp_name, pq_path)
           except:
               logging.exception(f'parquet write to {pq_path} failed')
               os.unlink(tmp_name)
   ```
   
   The idea being that the parquet is written to a temp file and should there 
be any failures along the way, the result parquet will be deleted.  Otherwise 
the temp file gets moved to the final parquet file.
   
   As far as I know, an object store like s3 should be able to handle this kind 
of "atomic write" well.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to