On Mon, May 6, 2019 at 4:24 PM Ian Denhardt <[email protected]> wrote:

> Thanks for the pointers. I've managed to solve the performance issue,
> through two things:
>
> 1. I wrote a simple seccomp wrapper that just silently ignores calls to
>    fsync & sync. Obviously I don't have any intention of using this
>    after the initial import; I'm not crazy. But as expected, this sped
>    things up a lot.
> 2. The bigger difference came from switching to diskpacked storage. Is
>    there a reason this isn't the default?
>

diskpacked is good for write throughput, but not great for reads (often bad
locality). blobpacked (the default) has perfect locality for files and has
fast paths that cut through a bunch of the layering for sequential reads
(e.g. downloading a file that is otherwise in thousands of logical blobs),
but does more work on uploads. When your data grows slowly over time, that
tradeoff makes sense; when you're mass-importing data, it's not ideal.
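If you do want diskpacked despite that, it's selected in the low-level server config rather than a flag. Roughly like this (from memory, so double-check against the server config docs; the path is made up):

```json
"/bs/": {
    "handler": "storage-diskpacked",
    "handlerArgs": {
        "path": "/srv/perkeep/blobs/packed"
    }
}
```

The generated default config uses blobpacked (the "packRelated" option in the high-level config) for the locality reasons above.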


> It managed to get through copying about 20GiB of data while I was in the
> shower, so I think this solves my immediate issue.
>
> Thanks again,
>
> -Ian
>
> Quoting Brad Fitzpatrick (2019-05-03 15:16:25)
> >    Perkeep in general is very (too) aggressive at fsyncing per blob, and
> >    it cuts up files into lots of small blobs, so importing lots of data
> >    is slow. There's a plan to fix this, but life (baby) got in the way,
> >    so it's kinda on hold until I find a few minutes to think. The two
> >    high level plans are to let clients specify transactions implicitly
> >    or explicitly: implicitly = one multipart/mime POST of a bunch of
> >    blobs is one transaction so should be 1 fsync, not 1 per blob,
> >    serially. The more complex one involves API changes and lets clients
> >    create their own transactions and associate, say, a whole file or
> >    directory upload with that transaction, and then wait on all the
> >    associated blobs to be committed (fsynced, or whatever the blob
> >    storage impl requires) before noting that it's good locally.
> >    As for (2), though, pk-put won't repeat any work it's done. It'll
> >    still walk your local filesystem to see what's there, but it'll learn
> >    that it's already uploaded from either its local cache or from the
> >    server before it uploads chunks again.
> >    So it might be slow (throughput-wise) but holding 2TB should be no
> >    problem, and auto-resume should work. If you run with the pk-put
> >    verbose option it'll show lots of stats about where each phase is at.
> >
> >    On Fri, May 3, 2019 at 11:13 AM Ian Denhardt <[email protected]>
> >    wrote:
> >
> >      Hey All,
> >      I have about 2TB of files that I'm looking at importing into
> >      perkeep. I have a couple questions.
> >      First, do others have experience they can share re: how perkeep
> >      performs holding this much data? From what I've read it sounds like
> >      architecturally it should be manageable, but I'd like to know if
> >      anyone can say how that's worked out in practice for them.
> >      Assuming this is realistic, I have some logistical questions about
> >      getting the data in there in the first place.
> >      I left a pk-put going on a large sub-tree last night, and came back
> >      to it today. It had spent about 12 hours copying things, finally
> >      running into some hiccough uploading a particular file (I don't
> >      have the error message recorded, but it was something along the
> >      lines of "server did not receive blob"). Trying to upload that file
> >      again worked fine, so I assume it was some transient thing.
> >      During the transfer, usage on the drives holding the blobs grew by
> >      about 80 GiB. This is transferring data between two hard drives
> >      connected to the same machine via USB 3.0. Questions:
> >      1. Is that kind of performance normal for pk-put?
> >      2. Is there currently any way to do a "resumable" version of
> >         pk-put, where it can quickly pick up where it left off?
> >      If the answer to (2) is no, I might be interested in contributing
> >      such a feature, and would appreciate pointers as to where to start.
> >      Thanks.
> >      -Ian

-- 
You received this message because you are subscribed to the Google Groups 
"Perkeep" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/perkeep/CAKPOmqMGEFTa%3DojzjrikHqw53rNNqPTsH1Tev-QFzX%2Beg4FF0A%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
