On 4/30/2018 11:47 AM, Paul Sandoz wrote:
On Apr 27, 2018, at 4:30 AM, Alan Bateman <alan.bate...@oracle.com> wrote:
On 27/04/2018 05:51, Joe Wang wrote:
Hi,
Considering extending isSameFile to add isSameContent to Files. Please review.
JBS: https://bugs.openjdk.java.net/browse/JDK-8202285
webrev: http://cr.openjdk.java.net/~joehw/jdk11/8202285/webrev/
specdiff:
http://cr.openjdk.java.net/~joehw/jdk11/8202285/specdiff/java/nio/file/Files.html
I assume we should ignore the implementation for now as the eventual
implementation won't use readAllBytes (at least not for large files).
Yes, as long as we don’t forget to follow up on a replacement (using
memory-mapped files, say).
True, updated now :-)
The existing isSameFile is specified as "Tests if two paths locate the same file" and it
would be good if the new method could be somewhat consistent with that, e.g. "Tests if the
content of two files is identical".
Specifying that two paths that locate the same file always return true is
reasonable. This could be made clearer by saying that the method always returns
true when path and path2 are equal, even if the file does not exist.
The @return should say that it returns true if path and path2 locate the same
file or the content of both files is identical.
The javadoc for SecurityException has "to the file", I assume this should be
"to both files".
We might also want to say the contents of the two files are assumed to be held
constant during the operation.
Added a statement.
—
It’s tempting (well, to me at least) to generalize to a mismatch method (like
for arrays) returning the mismatching location in bytes, then you can determine
if one file is a prefix of another given the file sizes. Bound-accepting
methods would also be useful to mismatch on partial content (including within
the same file). If you use memory mapped files we can use direct byte buffers
to efficiently perform the mismatch.
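
A minimal sketch of the mismatch idea described above, under stated
assumptions: the class name FileMismatch, the method signature, and the 64 MiB
chunk size are hypothetical and not part of the webrev. It maps both files in
read-only chunks and relies on ByteBuffer.mismatch (available since JDK 11).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class FileMismatch {
    // Hypothetical chunk size; a single mapping cannot exceed Integer.MAX_VALUE bytes.
    private static final long CHUNK = 1L << 26; // 64 MiB

    /**
     * Returns the offset of the first differing byte, or -1 if the contents
     * are identical. If one file is a proper prefix of the other, returns
     * the length of the shorter file.
     */
    static long mismatch(Path p1, Path p2) throws IOException {
        try (FileChannel c1 = FileChannel.open(p1, StandardOpenOption.READ);
             FileChannel c2 = FileChannel.open(p2, StandardOpenOption.READ)) {
            long size1 = c1.size();
            long size2 = c2.size();
            long common = Math.min(size1, size2);
            // Compare the common prefix one mapped chunk at a time.
            for (long pos = 0; pos < common; pos += CHUNK) {
                long len = Math.min(CHUNK, common - pos);
                ByteBuffer b1 = c1.map(FileChannel.MapMode.READ_ONLY, pos, len);
                ByteBuffer b2 = c2.map(FileChannel.MapMode.READ_ONLY, pos, len);
                int i = b1.mismatch(b2); // -1 when the two chunks are equal
                if (i >= 0) {
                    return pos + i;
                }
            }
            // Common prefix matches: identical only if the sizes agree.
            return size1 == size2 ? -1L : common;
        }
    }
}
```

With a method of this shape, "same content" is mismatch(p1, p2) == -1, and the
prefix case is a result equal to the smaller file size. This sketch assumes,
as noted above, that neither file changes during the comparison.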
Are there real-life use cases? It may be useful for example to check if
the files have the same header.
We did a bit of use-case study where we compared a bunch of possible
options, including read string with bound, or by specifying patterns,
and/or read into a list with a regex/pattern as separator (vs the
default line-separator). We concluded that readString is in popular
demand, and it's usually a quick read of small files, e.g. a config
file, a SQL query file, etc. The methods cover the
String <==> File transformation: a direct and quick way of converting
a String to a File and vice versa.
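
To illustrate the String <==> File round trip, here is a small sketch using
the signatures Files.writeString(Path, CharSequence) and
Files.readString(Path) as they exist in JDK 11. The class name, helper method,
and temp-file name are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class StringFileRoundTrip {
    /** Writes the text to a temporary file and reads it back (UTF-8 by default). */
    static String roundTrip(String text) throws IOException {
        Path tmp = Files.createTempFile("query", ".sql"); // hypothetical file name
        try {
            Files.writeString(tmp, text);   // String -> File
            return Files.readString(tmp);   // File -> String
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```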
The demand for isSameContent isn't necessarily as popular as readString,
but there were still some real use cases where people asked how to do it
quickly. When we have String <==> File, it's natural to at least have a
comparison method since String.equals is essential to it. Plus, we
already had isSameFile.
Best,
Joe
To Remi’s point, this might dissuade/guide developers from using this method
when there are other more efficient techniques available when operating at
larger scales. However, it is unfortunately harder than it should be in Java to
hash the contents of a file, a byte[] or ByteBuffer, according to some chosen
algorithm (or a good default).
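
For reference, a sketch of what hashing a file's contents takes with the
existing MessageDigest API. The class and method names are hypothetical, and
the algorithm name (for example "SHA-256") is chosen by the caller; this is
one possible approach, not a proposed API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class FileHash {
    /** Streams the file through a MessageDigest and returns the raw digest. */
    static byte[] digest(Path path, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(path)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n); // feed only the bytes actually read
            }
        }
        return md.digest();
    }

    /** Lower-case hex rendering of a digest. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

Comparing two digests reads every byte of both files, so it mostly pays off
when a hash can be stored and reused across many comparisons.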
Paul.