On Saturday, January 24, 2015 at 3:00:20 AM UTC+11, Nikolaus Rath wrote:
>
> On 01/21/2015 08:16 PM, Rabid Mutant wrote: 
> > 
> > Does S3QL really need 1 file handle per cache entry? 
>
> In principle, no. The way it's currently programmed, yes. 
>
 
Looking briefly at the code, it seems I might be able to replace access to 
the file handle with a call to a cache manager, and everything should just 
work...but that's based on one quick look. Would that be a correct 
assessment?



> > I could use rsync to compare files and update only the new & changed 
> > files without any unnecessary network I/O. It would also allow for the 
> > possibility of offline use. 
>
> rsync by default uses file name, modification time, and size to check if 
> a file has changed, so it won't incur any network IO apart from what's 
> necessary to transfer new and changed files. 


> This changes if you use the -c option, but I'd be rather curious why 
> you'd need that. 
>


Some applications (notably PostgreSQL) do not update inode dates when they 
update files, specifically to reduce I/O load; i.e., the data is changed, 
but the modification date (and quite probably the size) is not. So the -c 
option becomes important, at least in this case.

I also (sometimes) change the file modification dates on photos to the 
original photo date after trivial edits, e.g., changing EXIF data. In this 
case the date and size remain the same.
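
A minimal sketch of the situation described above: a file whose contents 
change while its size and modification time stay the same. The helper names 
(quick_check_same, checksum_same, demo) are made up for illustration, and 
the size + mtime test is only a simplification of what rsync does by 
default; it is not rsync's actual code.

```python
import hashlib
import os
import tempfile


def quick_check_same(a, b):
    """Simplified stand-in for a size + mtime comparison."""
    sa, sb = os.stat(a), os.stat(b)
    return sa.st_size == sb.st_size and int(sa.st_mtime) == int(sb.st_mtime)


def checksum_same(a, b):
    """Compare full file contents via a hash, as a checksum mode would."""
    def digest(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(65536), b""):
                h.update(block)
        return h.hexdigest()
    return digest(a) == digest(b)


def demo():
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "src.bin")
        dst = os.path.join(d, "dst.bin")
        for path in (src, dst):
            with open(path, "wb") as f:
                f.write(b"A" * 1024)

        # Change one byte in place: contents differ, size does not.
        with open(src, "r+b") as f:
            f.seek(10)
            f.write(b"B")

        # Force identical timestamps, as if the mtime was never updated
        # (or was deliberately reset after the edit).
        for path in (src, dst):
            os.utime(path, (1_000_000_000, 1_000_000_000))

        return quick_check_same(src, dst), checksum_same(src, dst)


if __name__ == "__main__":
    looks_same, really_same = demo()
    print("size+mtime says unchanged:", looks_same)
    print("checksum says unchanged:  ", really_same)
```

In this sketch the size + mtime test reports the two files as identical 
while the content hash does not, which is the case where -c matters.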


Another factor, and I agree it's probably minor: decompression is usually 
considerably faster than compression, and my expectation was that using 
'rsync -c' on a fully cached file system (thereby comparing uncompressed 
data) would be faster than compressing the data and comparing the 
checksums. This would need to be verified, but it seems likely, since the 
following would occur:

Normal copy:

1. Compress chunk
2. Compare hash in DB
3. If different:
  a. Send compressed chunk from step 1

rsync -c copy with complete cache:

1. Decompress chunk from cache
2. Compare data
3. If different:
  a. Compress chunk
  b. Send

In the case of completely changed files this will clearly be slower, 
assuming the cache is compressed. But in the case of larger files and file 
systems with few changes, it should be considerably faster.

But the benefit does seem situational.
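
The two procedures listed above can be sketched roughly as follows, just to 
count how many compression passes each one needs. This follows the steps as 
written in this message, not S3QL's actual implementation: zlib stands in 
for whatever compression is configured, plain dicts stand in for the hash 
DB and the cache, and the function names are hypothetical.

```python
import hashlib
import zlib


def normal_copy(chunks, hash_db):
    """Compress every chunk, compare its hash with the DB, send if different."""
    sent, compressions = [], 0
    for key, data in chunks.items():
        compressed = zlib.compress(data)
        compressions += 1
        digest = hashlib.sha256(compressed).hexdigest()
        if hash_db.get(key) != digest:
            sent.append(key)  # would send `compressed` here
    return sent, compressions


def checksum_copy(chunks, cache):
    """Decompress the cached chunk, compare data, compress only if different."""
    sent, compressions = [], 0
    for key, data in chunks.items():
        cached = cache.get(key)
        if cached is not None and zlib.decompress(cached) == data:
            continue  # unchanged: no compression needed
        zlib.compress(data)
        compressions += 1
        sent.append(key)  # would send the compressed chunk here
    return sent, compressions


def demo():
    old = {"a": b"x" * 1000, "b": b"y" * 1000, "c": b"z" * 1000}
    # Assumed state: the cache holds compressed chunks, the DB their hashes.
    cache = {k: zlib.compress(v) for k, v in old.items()}
    hash_db = {k: hashlib.sha256(c).hexdigest() for k, c in cache.items()}

    # Only chunk "b" changes.
    new = dict(old, b=b"Y" * 1000)
    return normal_copy(new, hash_db), checksum_copy(new, cache)


if __name__ == "__main__":
    normal, checksum = demo()
    print("normal copy:   sent %s, %d compressions" % normal)
    print("checksum copy: sent %s, %d compressions" % checksum)
```

Both paths send the same chunk, but with one changed chunk out of three the 
second path compresses once instead of three times. Whether that translates 
into a real speed-up depends on the relative cost of decompression, which 
the sketch does not measure.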

 

>
> > I guess that this leads to the secondary question(s): have I 
> > misinterpreted the way S3QL works, and is there a better way to do 
> > minimal I/O sync, and to support an offline mode? 
>
> S3QL is meant to store its data remotely. The easiest way to get an 
> "offline" mode is to explicitly store a local copy of all the data and 
> periodically sync it to the S3QL fs. 
>

Thanks, this is probably what I will do, and just never use PostgreSQL on 
those fs.


 
