nurpax commented on issue #40800:
URL: https://github.com/apache/arrow/issues/40800#issuecomment-2022146767
> IMHO, this looks hacky to me as it depends on the implementation detail of the underlying file system
It's a little more complicated than that in practice. The processing work is split into thousands of Slurm jobs that individually can take days to weeks to complete. The jobs may fail for a number of reasons: cancellation by admins, OOM, programming errors, networking problems, intermittent I/O errors. I need to keep track of overall progress somehow, and currently this state is persisted in the filesystem as explained above.
I can do this safely with the local file system, but as I was planning to save the Parquet files to S3, I noticed that I may not be able to rely on atomic operations there.
For reference, here's how I currently save Parquet files on the local filesystem so that the result is either a complete Parquet file or nothing at all:
```
# Assumes: import logging, os, tempfile
#          import pyarrow as pa, pyarrow.parquet as pq
def save_parquet(self, pq_path: str, batches: list[pa.RecordBatch], schema: pa.Schema):
    logging.info(f'Saving database in {pq_path}, {len(batches)} batches')
    # Create the temp file in the destination directory so the final
    # os.rename() stays on the same filesystem and is atomic.
    tmp_fd, tmp_name = tempfile.mkstemp(dir=os.path.dirname(pq_path))
    try:
        with os.fdopen(tmp_fd, 'wb') as tmpf:
            with pq.ParquetWriter(tmpf, schema) as writer:
                for batch in batches:
                    writer.write_batch(batch)
        # Atomically publish the completed file.
        os.rename(tmp_name, pq_path)
    except Exception:
        logging.exception(f'parquet write to {pq_path} failed')
        os.unlink(tmp_name)
```
The idea is that the Parquet data is written to a temp file first; should anything fail along the way, the temp file is deleted, and otherwise it is renamed to the final Parquet path.
As far as I know, an object store like S3 should be able to handle this kind of "atomic write" well.