Thanks, Matt.

I looked around for the algorithm in the src, but couldn't discern how things 
worked myself. Thank you for the explanation and reasoning behind how it's done.

The aws s3 sync documentation states that it operates like so:

"A local file will require uploading if the size of the local file is 
different than the size of the s3 object, the last modified time of the local 
file is newer than the last modified time of the s3 object, or the local file 
does not exist under the specified bucket and prefix."

Am I right to read this as meaning the aws s3 cli always does a HEAD on 
every object to get the s3 mod date for comparison with the local mod date? I 
realize the md5 comparison is ultimately preferable, and the aws s3 cli doesn't 
offer that.
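
Read literally, the quoted rule can be sketched in Python as below. The 
FileInfo type and the needs_upload name are hypothetical and not taken from the 
aws cli source. The sketch assumes the S3 side is already available as a size 
and last-modified pair; it says nothing about which request the aws cli 
actually uses to obtain them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileInfo:
    """Hypothetical record of what is known about one side of the sync."""
    size: int      # size in bytes
    mtime: float   # last modified time, seconds since the epoch


def needs_upload(local: FileInfo, remote: Optional[FileInfo]) -> bool:
    """Apply the three conditions from the quoted aws s3 sync text."""
    if remote is None:
        # The key does not exist under the bucket and prefix.
        return True
    if local.size != remote.size:
        # Sizes differ.
        return True
    # Same size: upload only if the local file is newer.
    return local.mtime > remote.mtime
```

Note that a file edited in place with the same size and an older timestamp 
would not be picked up by a rule like this.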

It's good to know how these 2 tools (which I've come to think of as the 2 main 
cli tools for s3) treat and use that s3 mod date differently.

Mike

On Apr 13, 2014, at 10:52 PM, Matt Domsch <m...@domsch.com> wrote:

> The algorithm is in S3/FileLists.py compare_filelists().
> 
> Check if one side has a file (by name) the other doesn't.  If so, there's 
> nothing to compare.
> 
> Check that both files have the same size (as reported by stat() for local 
> files, and in the remote directory listing)
> 
> If checking MD5:
>   calculate or get the MD5, compare the two values
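
The three steps above (name, then size, then optional MD5) can be sketched as 
follows. This is only an illustration of the order of checks as described, not 
the actual compare_filelists() code; the files_to_transfer name and the dict 
layout with "size" and "md5" keys are assumptions made for the example.

```python
def files_to_transfer(src, dst, check_md5=True):
    """Return the sorted names in src that differ from, or are missing in, dst.

    src and dst map a file name to a dict with "size" and "md5" keys
    (a hypothetical layout chosen for this sketch).
    """
    changed = []
    for name, s in src.items():
        d = dst.get(name)
        if d is None:
            # Only one side has the file: nothing to compare, transfer it.
            changed.append(name)
        elif s["size"] != d["size"]:
            # Sizes differ, so the content must differ.
            changed.append(name)
        elif check_md5 and s["md5"] != d["md5"]:
            # Same size; only the checksum can reveal a change.
            changed.append(name)
    return sorted(changed)
```

With check_md5=False, a file whose content changed but whose size did not is 
reported as unchanged, which is the trade-off the man page warns about.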
> 
> Date isn't actually compared, because what's in the XML returned by the 
> object listing from S3 doesn't contain the file date, only the 
> "Last-Modified" header, which is when the file was uploaded to S3.
> 
> Date (really, ctime, mtime, atime) as obtained from local files, when 
> --preserve is used (the default) _is_ stored in the x-amz-meta-s3cmd-attrs 
> metadata value for an object.  But getting this value back from S3 requires 
> doing a HEAD on the object itself, for every object, which is really 
> expensive.  So, we don't do that, unless we are comparing MD5s, and the MD5 
> (really, ETag) value returned in the directory listing indicates the file was 
> uploaded using multipart upload, in which case the ETag value isn't the MD5 
> value for the whole file, but only for the last chunk of the file committed 
> to disk (not necessarily even the MD5 of the last chunk of the file).  We 
> don't, in general, get that value.
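
A minimal sketch of how a listing ETag might be classified before trusting it 
as an MD5. It assumes the commonly documented convention that a multipart 
ETag carries a "-<number of parts>" suffix while a single-part ETag is 32 hex 
digits; the function name is hypothetical and this is not s3cmd's own check.

```python
import re

# A plain MD5 is exactly 32 hexadecimal digits.
_PLAIN_MD5 = re.compile(r"[0-9a-f]{32}")


def etag_is_plain_md5(etag):
    """Return True if the ETag looks like a whole-object MD5.

    ETags usually arrive wrapped in double quotes, so strip them first.
    Anything else (for example "<hex>-3" from a multipart upload) is
    treated as not comparable to a local MD5.
    """
    cleaned = etag.strip('"').lower()
    return _PLAIN_MD5.fullmatch(cleaned) is not None
```

Only when this returns False would the more expensive per-object lookup 
described above be needed.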
> 
> Now, if we are syncing from remote to local, we get x-amz-meta-s3cmd-attrs 
> value "for free" as a header when we do the GET to get the object, so we do 
> use it to set the values back to what were originally stored there.
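
As an illustration of restoring times from that header on download, here is a 
small sketch. It assumes the header value is a "/"-separated list of key:value 
pairs such as "mtime:1397430000"; that layout, and both function names, are 
assumptions for the example and are not stated in this thread.

```python
import os


def parse_s3cmd_attrs(value):
    """Split an attrs string like 'uid:1000/mtime:1397430000' into a dict.

    The '/'-separated key:value layout is an assumption of this sketch.
    """
    attrs = {}
    for field in value.split("/"):
        key, sep, val = field.partition(":")
        if sep:  # skip fragments that have no ':' separator
            attrs[key] = val
    return attrs


def restore_times(path, attrs):
    """Set atime and mtime on path from parsed attrs; return True if applied."""
    if "mtime" not in attrs:
        return False
    mtime = int(attrs["mtime"])
    # Fall back to mtime when no atime was recorded.
    atime = int(attrs.get("atime", mtime))
    os.utime(path, (atime, mtime))
    return True
```

ctime is left alone here because it generally cannot be set directly on a 
local file.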
> 
> So, the manpage is correct: date is not used in the comparison for syncing 
> purposes.  One could argue that the expensive HEAD call is still cheaper than 
> calculating the local MD5 of a file, but we can mitigate the local expense 
> using the --cache-file mechanism such that we only read the local file once 
> and then read its md5 out of the cache until it changes, so the HEAD isn't 
> cheaper in general.
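
The caching idea can be sketched like this: hash a file once, then reuse the 
stored value until the file's size or mtime changes. This is an in-memory 
illustration only. The Md5Cache class and the choice of (size, mtime) as the 
invalidation key are assumptions, not the actual --cache-file format or logic.

```python
import hashlib
import os


class Md5Cache:
    """Hypothetical in-memory MD5 cache keyed by path, size and mtime."""

    def __init__(self):
        self._entries = {}         # path -> ((size, mtime_ns), md5 hex)
        self.hashes_computed = 0   # how many times a file was actually read

    def md5(self, path):
        st = os.stat(path)
        key = (st.st_size, st.st_mtime_ns)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == key:
            # File looks unchanged: reuse the stored digest.
            return cached[1]
        digest = hashlib.md5()
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so large files do not fill memory.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        self.hashes_computed += 1
        self._entries[path] = (key, digest.hexdigest())
        return digest.hexdigest()
```

A second call on an unchanged file costs one stat() instead of a full read, 
which is why the local MD5 need not be more expensive than a HEAD per object.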
> 
> To detect a change to a file whose size hasn't changed, but its content has, 
> we have to do the HEAD call, and calculate the MD5 of the local file (and use 
> --cache-file to record that for posterity), and compare.
> 
> 
> 
> 
> On Sun, Apr 13, 2014 at 5:08 PM, WagnerOne <wag...@wagnerone.com> wrote:
> Hi,
> 
> The man page states the following:
> 
> --no-check-md5
>    Do not check MD5 sums when comparing files for [sync].  Only size will be 
> compared. May significantly speed up transfer but may also miss some changed 
> files.
> 
> When this says "only size will be compared", I'm taking it to mean that, of 
> size and md5, only size will be compared?
> 
> Are date (and name, of course) still used in addition to size if 
> --no-check-md5 is passed?
> 
> Thanks,
> Mike
> 
> --
> wag...@wagnerone.com
> "I have no complaints, ever, about anything." -Steve McQueen
> 
> 
> 
> _______________________________________________
> S3tools-general mailing list
> S3tools-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/s3tools-general
> 
> 

-- 
wag...@wagnerone.com
"Always consider the possibility your assumptions are wrong."-Wheel of Time


