Perkeep in general is too aggressive about fsyncing per blob, and it cuts files up into lots of small blobs, so importing lots of data is slow. There's a plan to fix this, but life (baby) got in the way, so it's kinda on hold until I find a few minutes to think. The high-level plan is to let clients specify transactions either implicitly or explicitly. Implicitly: one multipart/MIME POST of a bunch of blobs is one transaction, so it should get one fsync total, not one per blob, serially. The more complex, explicit variant involves API changes: clients create their own transactions, associate, say, a whole file or directory upload with a transaction, and then wait for all the associated blobs to be committed (fsynced, or whatever the blob storage implementation requires) before noting locally that it's good.
As for (2), though, pk-put won't repeat any work it's already done. It'll still walk your local filesystem to see what's there, but it'll learn what's already been uploaded from either its local cache or from the server before it uploads chunks again. So it might be slow (throughput-wise), but holding 2TB should be no problem, and auto-resume should work. If you run pk-put with the verbose option, it'll show lots of stats about where each phase is at.

On Fri, May 3, 2019 at 11:13 AM Ian Denhardt <[email protected]> wrote:

> Hey All,
>
> I have about 2TB of files that I'm looking at importing into perkeep. I
> have a couple of questions.
>
> First, do others have experience they can share re: how perkeep performs
> holding this much data? From what I've read, it sounds like
> architecturally it should be manageable, but I'd like to know if anyone
> can say how that's worked out in practice for them.
>
> Assuming this is realistic, I have some logistical questions about
> getting the data in there in the first place.
>
> I left a pk-put going on a large sub-tree last night, and came back to
> it today. It had spent about 12 hours copying things, finally running
> into some hiccough uploading a particular file (I don't have the error
> message recorded, but it was something along the lines of "server did
> not receive blob"). Trying to upload that file again worked fine, so I
> assume it was some transient thing.
>
> During the transfer, usage on the drives holding the blobs grew by about
> 80 GiB. This is transferring data between two hard drives connected to
> the same machine via USB 3.0. Questions:
>
> 1. Is that kind of performance normal for pk-put?
> 2. Is there currently any way to do a "resumable" version of pk-put,
>    where it can quickly pick up where it left off?
>
> If the answer to (2) is no, I might be interested in contributing such a
> feature, and would appreciate pointers as to where to start.
>
> Thanks.
>
> -Ian
>
> --
> You received this message because you are subscribed to the Google Groups
> "Perkeep" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
