On 17.05.2012 22:26, Brad McEvoy wrote:
Hi Brad,

thanks for your interesting feedback! I think your post did not make it to the mailing list, so I'll forward it along with this answer.


I'm not a developer on ownCloud, but I did a dotcom startup a while back
trying to build a file sync service like Dropbox, though we were a bit
late to the party.

I'm now converting that to an open source project (see
https://github.com/Spliffy/spliffy - similar goals, but much less
advanced than ownCloud, Java based). I posted to this list a few months
back suggesting that we share experience and work towards a
standards-based and interoperable toolset. I think standards and
interoperability would generally strengthen the open source offerings as
opposed to the closed source services currently proliferating.
Yes, standards are good. And I tried to stay as close to WebDAV as possible, precisely to keep the door open for interoperability.

Regarding your question below, I'd like to share my experience. I first
implemented path-based sync, as you have done. I have since come to
believe this is far from optimal, and others from mature and established
sync product companies share that view.

What git does, and I think this is a good model for any sync tool, is
calculate hashes (i.e. checksums) for files and for directories, where
the hash for a directory is the checksum of a formatted list of its
members' names and hashes. This means that the root folder has a hash
which uniquely identifies the current state of everything inside it. The
client can calculate the same hash for its contents. So, to check if
files are in sync, you simply compare the hashes of the root directories
on client and server. If they differ, you walk down the directory tree,
ignoring directories that have the same hash on client and server, and
locating changed items based on their respective checksums. This is very
fast, very efficient, and very robust. It's easy to integrate into a
WebDAV server, as it's just an extra property in a PROPFIND response or
a header in a HEAD response. It requires server support so that any
change to any resource results in updated hashes right up to the
synchronisation root.
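To make the scheme concrete, here is a minimal sketch in Python. The function names and the exact list format are mine, not taken from git or any of the projects mentioned; git hashes object contents with headers, this just illustrates the recursive idea:

```python
import hashlib
import os

def file_hash(path):
    """Checksum of a file's contents (streamed, so large files are fine)."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def dir_hash(path):
    """Checksum of a directory: the hash of a formatted list of its
    members' names and hashes. Sorting makes the result deterministic."""
    h = hashlib.sha1()
    for name in sorted(os.listdir(path)):
        child = os.path.join(path, name)
        child_hash = dir_hash(child) if os.path.isdir(child) else file_hash(child)
        h.update(("%s %s\n" % (name, child_hash)).encode("utf-8"))
    return h.hexdigest()
```

Two trees with identical contents get identical root hashes, and any change anywhere below the root changes the root hash, which is exactly the property the comparison walk relies on.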

I understand the concept, and indeed it's good. It's very close to what I want to implement, with the only difference that instead of hash sums I'd like to use the mtimes, as csync does. Why do we think that's a benefit? Based on the mtimes it's decidable which version is newer. Moreover, the mtime is already natural metadata in every file system, so we do not have to add anything new. As a result, csync currently runs without any server support.

What is missing is the propagation of the mtimes from individual files and directories up to their parent directories. If we do that with ownCloud server support, I think we will have the same benefits that you described above. Since we have the data in a database on the server side, we will be able to retrieve it quickly.
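As a sketch of what that propagation could look like on a change event, assuming a plain path-to-mtime map standing in for the server-side fscache table (the function name and data shape are hypothetical, just to illustrate the walk up to the sync root):

```python
import os

def propagate_mtime(fscache, path, mtime):
    """Record a changed entry's mtime and push it up to every ancestor
    directory, so the root always carries the newest mtime below it.
    'fscache' is a dict mapping path -> mtime."""
    fscache[path] = mtime
    parent = os.path.dirname(path)
    while True:
        # Only ever raise an ancestor's mtime, never lower it.
        if fscache.get(parent, 0) < mtime:
            fscache[parent] = mtime
        if parent in ("/", ""):
            break
        parent = os.path.dirname(parent)
```

With this in place, reading the newest mtime under any directory is a single lookup instead of a tree walk.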


Note that there is a related RFC - http://tools.ietf.org/html/rfc6578 -
however I'm not confident that the approach outlined there is quite right.
Do you know if it's implemented in a WebDAV server already?
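For reference, the approach in RFC 6578 is token-based rather than hash-based: the client sends a DAV:sync-collection REPORT against the sync root, quoting the sync token it got last time, and the server returns only the members that changed since then. The request body, per the RFC, looks roughly like this (an empty sync-token asks for the initial full listing):

```xml
<?xml version="1.0" encoding="utf-8" ?>
<D:sync-collection xmlns:D="DAV:">
  <D:sync-token/>
  <D:sync-level>1</D:sync-level>
  <D:prop>
    <D:getetag/>
  </D:prop>
</D:sync-collection>
```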

Of course, finding which files are new or updated is one thing;
communicating those changes efficiently is another. Spliffy uses a
similar approach to Bup (https://github.com/apenwarr/bup), splitting
files into blobs whose boundaries are stable with respect to file
changes. Only changed blobs are transmitted.

The hashsplitting algorithm is **very** simple, and if you're not doing
something like this yet, I suggest you take a peek -
https://github.com/HashSplit4J/hashsplit-lib
That's cool, and it's a problem we still have on our list to tackle.
I stumbled over this already and wonder if there is a C or C++ library for it.
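For what it's worth, the core idea fits in a few lines; this is a much simplified stand-in for Bup's rolling checksum (a plain windowed sum rather than Bup's rollsum), just to illustrate content-defined boundaries:

```python
WINDOW = 64            # bytes in the rolling window
MASK = (1 << 13) - 1   # split where the low 13 bits of the sum are all ones

def split_blobs(data):
    """Split a byte string into content-defined blobs. The boundary
    decision looks only at a sliding window of bytes, so inserting
    data early in the file only reshapes the blobs near the insertion;
    later blobs keep their old boundaries and hashes."""
    blobs = []
    start = 0
    rollsum = 0
    for i in range(len(data)):
        rollsum += data[i]
        if i >= WINDOW:
            rollsum -= data[i - WINDOW]   # slide the window forward
        if (rollsum & MASK) == MASK:
            blobs.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        blobs.append(data[start:])
    return blobs
```

A real implementation would use a better-mixing rolling hash and enforce minimum/maximum blob sizes, but the stability property - unchanged regions produce unchanged blobs - already holds for this toy version.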

Sorry for the long post, and I hope this is of some assistance.
Great, I really appreciate your input.

Best,

Klaas


On 17/05/2012 9:12 p.m., Klaas Freitag wrote:
Hi,

one of the biggest shortcomings of the sync client currently is that
it does a full scan of the ownCloud directories via WebDAV to
query the last modified times. That causes load and other trouble. It
would be great to find out more cheaply whether something has changed
server side.

We have the file system cache, which also has the mod times in the
database. My idea now is: instead of querying every single file, I
just issue a HEAD request on the top sync directory and get back the
latest modtime of all files in that dir. If that is newer than the one
I know, I have to do a sync.

I know that it could be even cooler, i.e. delivering the list of
changed files back etc., but let's take small steps. Doing just one
HEAD instead of querying the whole tree will already be great.

The implementation seems easy: just get all database IDs of the
fscache table entries below the top directory of the sync dir and do
something like

SELECT MAX(mtime) FROM fscache WHERE id IN ( list-of-all-ids-in dir );

That should be fast enough.
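On the client side, the check then boils down to comparing the returned header against the stored mtime. A sketch, assuming the proposed server semantics - i.e. that the Last-Modified header of a HEAD on the sync root carries the maximum mtime of everything below it (the function name is mine, not from the sync client):

```python
import email.utils

def needs_sync(last_modified_header, last_known_mtime):
    """Decide whether a sync run is needed. 'last_modified_header' is
    the Last-Modified value returned by a HEAD request on the top sync
    directory; 'last_known_mtime' is the Unix timestamp we stored after
    the previous successful sync."""
    server_mtime = email.utils.parsedate_to_datetime(
        last_modified_header).timestamp()
    return server_mtime > last_known_mtime
```

One HTTP round trip and one comparison replaces the full PROPFIND walk in the common "nothing changed" case.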

My question now is: How do we do that? Should we have another app
called /files/sync? Or do we want to enhance the WebDAV server to be
able to do the described logic if a HEAD request on a dir comes in?

I think the latter is more "within the concept" of doing the sync via
WebDAV, OTOH a sync app could be useful anyway for other sync related
server support.

What do you think?

Thanks,

Klaas
_______________________________________________
Owncloud mailing list
[email protected]
https://mail.kde.org/mailman/listinfo/owncloud
