Re: [Numpy-discussion] savetxt -> gzip: nondeterministic because of time stamp

Derek Homeier Wed, 14 Apr 2021 15:19:13 -0700

On 14 Apr 2021, at 11:15 pm, Robert Kern <robert.k...@gmail.com> wrote:
> 
> On Wed, Apr 14, 2021 at 4:37 PM Joachim Wuttke <j.wut...@fz-juelich.de> wrote:
> > Regarding numpy, I'd propose a bolder measure:
> > To let savetxt(fname, X, ...) store exactly the same information in
> > compressed and uncompressed files, always invoke gzip with mtime = 0.
> 
> I agree.
>  
I would caution, though, that relying on a checksum (or similar) of the
compressed data still does not seem a very robust check of the data itself:
the compressed file would still definitely change with any change in
compression level, and quite possibly with changes in the linked compression
library (perhaps even a simple version update).
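
To illustrate the point, here is a minimal sketch using only the standard
library. The helper name gzip_bytes and the sample payload are made up for this
example; the idea is that mtime=0 makes repeated runs reproducible, while
streams written at different compression levels are only guaranteed to agree
after decompression.

```python
import gzip
import io


def gzip_bytes(payload: bytes, level: int, mtime: float = 0) -> bytes:
    """Compress payload in memory with an explicit level and header timestamp.

    Hypothetical helper for illustration; not part of numpy.
    """
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=level, mtime=mtime) as gz:
        gz.write(payload)
    return buf.getvalue()


payload = b"1.0 2.0 3.0\n" * 1000

fast = gzip_bytes(payload, level=1)
best = gzip_bytes(payload, level=9)

# Same settings and mtime=0: the output is reproducible from run to run.
reproducible = gzip_bytes(payload, level=6) == gzip_bytes(payload, level=6)

# Different levels: the compressed bytes need not match, the data does.
same_compressed_bytes = fast == best
same_data = gzip.decompress(fast) == gzip.decompress(best) == payload
```

A checksum taken over fast and best would typically disagree even though both
files hold the same numbers.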


Wouldn’t it be better to verify the data buffers themselves? Outside of Python,
the xzcmp utility (part of XZ Utils), for example, allows binary comparison of
the content of compressed files independently of the timestamp or even the
compression method, and it handles gzip and bzip2 files as well.
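
Within Python, something similar could be done by hashing the decompressed
stream rather than the file on disk. This is only a sketch: content_digest, the
file names and the sample payload are assumptions chosen for the example.

```python
import gzip
import hashlib
import os
import tempfile


def content_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the decompressed contents of a gzip file."""
    digest = hashlib.sha256()
    with gzip.open(path, "rb") as fh:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


payload = b"0.5 1.5 2.5\n" * 100

with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.txt.gz")
    path_b = os.path.join(tmp, "b.txt.gz")

    # Same data, but different timestamps and compression levels.
    with gzip.GzipFile(path_a, "wb", compresslevel=1, mtime=1) as gz:
        gz.write(payload)
    with gzip.GzipFile(path_b, "wb", compresslevel=9, mtime=2) as gz:
        gz.write(payload)

    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        raw_equal = fa.read() == fb.read()

    digests_equal = content_digest(path_a) == content_digest(path_b)
```

Here the raw files differ (the header timestamps alone guarantee that), while
the digests of the decompressed content agree.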

> > I would like to follow up with a pull request, but I am unable to
> > find out how numpy.savetxt is invoking gzip.
> 
> `savetxt` uses the abstractions in this module to invoke gzip when the 
> filename calls for it:
> 
> https://github.com/numpy/numpy/blob/main/numpy/lib/_datasource.py#L115
> 
One obstacle would be that this is setting gzip.open as _file_opener, which
does not have the mtime option; replacing it with gzip.GzipFile and still
supporting text mode would require somehow replicating gzip.open’s
io.TextIOWrapper wrapping of GzipFile.
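
A rough sketch of what such a replacement opener might look like is below. The
name open_gzip_mtime0 is hypothetical and this is not what numpy does; it only
mirrors, under my assumptions, the way gzip.open wraps a GzipFile in an
io.TextIOWrapper when a text mode is requested, while pinning mtime to 0.

```python
import gzip
import io
import os
import tempfile


def open_gzip_mtime0(filename, mode="rb", compresslevel=9,
                     encoding=None, errors=None, newline=None):
    """Hypothetical stand-in for gzip.open that always writes mtime=0."""
    if "t" in mode and "b" in mode:
        raise ValueError(f"Invalid mode: {mode!r}")
    # GzipFile itself only knows binary modes, so strip the text flag.
    binary = gzip.GzipFile(filename, mode.replace("t", ""), compresslevel, mtime=0)
    if "t" in mode:
        return io.TextIOWrapper(binary, encoding, errors, newline)
    return binary


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt.gz")
    outputs = []
    for _ in range(2):
        with open_gzip_mtime0(path, "wt", encoding="utf-8") as fh:
            fh.write("1.0 2.0\n3.0 4.0\n")
        with open(path, "rb") as raw:
            outputs.append(raw.read())

    with open_gzip_mtime0(path, "rt", encoding="utf-8") as fh:
        round_trip = fh.read()
```

Note that GzipFile still stores the base file name in the header when opened by
name, so two outputs would only be byte-identical if they are also written
under the same name.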

Cheers,
                                        Derek

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
