[ https://issues.apache.org/jira/browse/AVRO-3013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17254594#comment-17254594 ]
Michael A. Smith commented on AVRO-3013:
----------------------------------------
Oh, I see you're really racing the filesystem here. DataFileWriter takes a
file-like writer. I could see this going a few ways:
h2. Method on DataFileWriter:
Add a method that just wraps os.fsync, which users can call if they want to.
{code:python}
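import io
import os

class DataFileWriter(_DataFile):
    …
    def fsync(self):
        """Ask the OS to write the underlying file to disk."""
        try:
            os.fsync(self.writer)
        except (TypeError, io.UnsupportedOperation) as e:
            # TypeError: the writer has no fileno() method.
            # io.UnsupportedOperation: fileno() exists but has no descriptor
            # behind it (e.g. io.BytesIO).
            raise DataFileException(
                f"Cannot fsync a writer of type {type(self.writer)}") from e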
{code}
Users would need to handle the exception if they pass a writer that cannot be
fsynced.
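To illustrate which writers that affects, here is a minimal sketch outside of avro. {{can_fsync}} is a hypothetical helper, not part of any library. It relies on os.fsync accepting either a file descriptor or an object with a fileno() method: an object without fileno() raises TypeError, while an in-memory stream such as io.BytesIO raises io.UnsupportedOperation from its fileno().

```python
import io
import os
import tempfile


def can_fsync(writer):
    """Hypothetical helper: report whether os.fsync accepts this writer."""
    try:
        os.fsync(writer)
    except (TypeError, io.UnsupportedOperation):
        # TypeError: the object has no fileno() method.
        # io.UnsupportedOperation: fileno() exists but no real descriptor does.
        return False
    return True


with tempfile.TemporaryFile() as real_file:
    real_file.write(b"data")
    real_file.flush()  # push Python's buffer to the OS before syncing
    print(can_fsync(real_file))  # a real file backed by a descriptor

print(can_fsync(io.BytesIO()))  # in-memory stream, no descriptor
print(can_fsync(object()))      # no fileno() at all
```

Only the first call should succeed; the other two are the cases a caller of the proposed fsync() method would have to be prepared for.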
h2. Configuration on DataFileWriter:
Add optional arguments to the constructor so the DataFileWriter fsyncs
automatically when the user wants it to.
{code:python}
class DataFileWriter(_DataFile):
    def __init__(self, writer, datum_writer, writers_schema=None,
                 codec=NULL_CODEC, fsync_on_flush=False, fsync_on_close=False):
        self.fsync_on_flush = fsync_on_flush
        self.fsync_on_close = fsync_on_close
        …

    def flush(self):
        """Flush the current state of the file, including metadata."""
        self._write_block()
        self.writer.flush()
        if self.fsync_on_flush:
            self.fsync()

    def close(self):
        """Close the file."""
        self.flush()
        if self.fsync_on_close:
            self.fsync()
        self.writer.close()
{code}
h2. Leave it up to the writer
All of this can be implemented without changing the avro code at all, if the
writer object passed in does it:
{code:python}
import io
import os

import avro.datafile

class FsyncIO(io.FileIO):
    """A FileIO object that fsyncs on close."""
    def close(self):
        # close() may be called more than once; only sync while still open.
        if not self.closed:
            super().flush()
            os.fsync(self)
        super().close()

class UnbufferedFsyncIO(io.FileIO):
    """A FileIO object that fsyncs on flush."""
    def flush(self):
        super().flush()
        os.fsync(self)

dfw = avro.datafile.DataFileWriter(FsyncIO('my/file.avro', 'w'), …)
{code}
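For the copy, append, close, rename workflow described in the issue, the writer-side approach could look roughly like the sketch below. It is not avro code: {{replace_durably}} is a hypothetical helper, plain bytes stand in for whatever DataFileWriter would write, and the {{closed}} check in close() is an added assumption so that a second close() is a no-op.

```python
import io
import os
import tempfile


class FsyncIO(io.FileIO):
    """A FileIO object that fsyncs on close."""

    def close(self):
        # Guard (an assumption added here): skip the sync if already closed.
        if not self.closed:
            super().flush()
            os.fsync(self)
        super().close()


def replace_durably(path, payload):
    """Hypothetical helper: write payload to a temp file, fsync it, rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Keep the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with FsyncIO(fd, "w") as tmp:  # FsyncIO takes ownership of fd
            view = memoryview(payload)
            while len(view) > 0:  # raw writes may be partial, so loop until done
                written = tmp.write(view)
                view = view[written:]
        # Leaving the with-block called close(), so the data has been fsynced.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With avro, the FsyncIO object would be passed to DataFileWriter instead of being written to directly. On POSIX systems, making the rename itself durable may additionally require fsyncing the containing directory, which this sketch leaves out.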
In my opinion the third option is best. Handling low-level file operations is
outside Avro's domain, and we should avoid wrapping them in our code whenever
we can. I recognize that the Java implementation did this, but that may just
reflect differences in how flexibly this could be implemented in Java versus
Python at the time.
But maybe I'm still missing something. Is there a compelling reason why this
should be implemented in avro itself, instead of in the writer?
> Avro files should allow fsync-ing files to disk in Python
> ---------------------------------------------------------
>
> Key: AVRO-3013
> URL: https://issues.apache.org/jira/browse/AVRO-3013
> Project: Apache Avro
> Issue Type: New Feature
> Components: python
> Reporter: He Chen
> Priority: Major
>
> I am new to Apache, but here I am...
> In our use case, we need to constantly update an existing avro file. The way
> we did it is that we copy the old avro file to a temporary file, append data
> to the temporary file, close the temporary file, and rename the temporary
> file to the original avro file. This is problematic since closing a file does
> not guarantee to write data to disk. The bug caused by this is hard to track
> since it's hard to reproduce.
> I noticed that there is a ticket that addresses this for the Java client
> https://issues.apache.org/jira/browse/AVRO-1388. Why isn't it implemented for
> the Python client? If there are no objections, I'd like to submit a patch. Or
> perhaps I am missing something here? Please let me know!