> From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED]
>
> HDFS doesn't allow random overwrites or appends, so even if
> HDFS were mountable, I'm guessing we couldn't just rsync to a
> DFS mount (I've never looked at the rsync code, but I assume
> it does appends/random writes). Any emulation of rsync would
> end up having to delete and recreate changed files in HDFS.


Thanks for the reply.  Most of the things I've used rsync for are probably 
compatible with this... I believe the default behavior is to create a .hidden 
temp file, write and close it, then rename it to the final name.  So if someone 
were to replace certain filesystem calls with the HDFS API, it would probably 
work seamlessly for most users.
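For illustration, that create-then-rename dance can be sketched against an ordinary POSIX filesystem (this is a hypothetical helper, not rsync's actual code; HDFS would substitute its own create/rename calls):

```python
import os
import tempfile

def atomic_write(path, data):
    """Write to a hidden temp file in the target directory, then rename.
    rename() is atomic on POSIX, so readers never observe a partial file;
    the final name only appears once the data is fully written and closed."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(prefix=".", dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.rename(tmp, path)
    except BaseException:
        os.unlink(tmp)  # don't leave the hidden temp file behind on failure
        raise
```

Since the file is only ever created, written sequentially, closed, and renamed, this access pattern fits HDFS's write-once model with no random writes needed.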

I know rsync has a delta-transfer feature for when the file already exists on 
the destination: instead of transferring the whole thing, it somewhat 
intelligently determines which blocks have changed and only sends those.  I 
admit I don't actually know whether it writes a second file in this case or 
not.  For my purposes I'd be fine with just disabling the features that modify 
files in place... I probably wouldn't even go to the trouble of working around 
it.
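A much-simplified sketch of the block-comparison idea (comparing same-offset blocks only; real rsync also uses a rolling weak checksum so it can match blocks at any offset, and these helper names are made up for illustration):

```python
import hashlib

BLOCK = 4096

def changed_blocks(old, new, block=BLOCK):
    """Return (offset, data) pairs for blocks of `new` whose checksum
    differs from the same-offset block of `old`. Unchanged blocks are
    never transferred."""
    deltas = []
    for off in range(0, len(new), block):
        new_blk = new[off:off + block]
        old_blk = old[off:off + block]
        if hashlib.md5(new_blk).digest() != hashlib.md5(old_blk).digest():
            deltas.append((off, new_blk))
    return deltas

def apply_deltas(old, deltas, new_len):
    """Reconstruct the new file from the old copy plus the changed blocks."""
    buf = bytearray(old[:new_len].ljust(new_len, b"\0"))
    for off, data in deltas:
        buf[off:off + len(data)] = data
    return bytes(buf)
```

Note that `apply_deltas` patches the old copy in memory; on HDFS the result would still have to be written out as a fresh file and renamed into place, since in-place patching isn't possible.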

> If your data/processing is mostly log files, replication to
> HDFS can take advantage of some strong assumptions (a file
> only changes at the end; one file can become multiple files
> as long as the mapping can be inferred easily).

Excellent... that confirms what I was thinking.  Our app is mostly small files 
(around 1 MB), but they are almost all write-once, read-many... it's extremely 
rare to replace a file after it's written, and even then it's almost always a 
complete replacement.

Thanks again!
