On 04/09/13 18:10, Anand Avati wrote:
On Wed, Sep 4, 2013 at 6:37 AM, Xavier Hernandez <xhernan...@datalab.es> wrote:

    On 04/09/13 14:05, Jeff Darcy wrote:

        On 09/04/2013 04:27 AM, Xavier Hernandez wrote:

            I would also like to note that each node can store
            multiple elements. The current implementation creates a
            node for each byte in the key. In my implementation I
            only create a node when two or more keys share a common
            prefix. This reduces the number of nodes and the number
            of indirections.
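
(Just to make the idea above concrete, the node I'm describing is roughly
like the sketch below. It is only a sketch with invented field names; the
real code differs.)

#include <stddef.h>

/* A compressed (radix) trie node: instead of one node per byte of the
 * key, each node stores a whole prefix fragment, and child nodes only
 * exist where two or more keys diverge. */
struct radix_node {
        char              *fragment;  /* shared prefix fragment, may be >1 byte */
        size_t             frag_len;
        void              *value;     /* set when a key terminates at this node */
        struct radix_node *child;     /* first child after the fragment */
        struct radix_node *next;      /* next sibling (different first byte) */
};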


        Whatever we do, we should try to make sure that the changes
        are profiled
        against real usage.  When I was making my own dict
        optimizations back in March
        of last year, I started by looking at how they're actually
        used. At that time,
        a significant majority of dictionaries contained just one
        item. That's why I
        only implemented a simple mechanism to pre-allocate the first
        data_pair instead
        of doing something more ambitious.  Even then, the difference
        in actual
        performance or CPU usage was barely measurable.  Dict usage
        has certainly
        changed since then, but I think you'd still be hard pressed to
        find a case
        where a single dict contains more than a handful of entries,
        and approaches
        that are optimized for dozens to hundreds might well perform
        worse than simple
        ones (e.g. because of cache aliasing or branch misprediction).

        If you're looking for other optimization opportunities that
        might provide even
        bigger "bang for the buck" then I suggest that stack-frame or
        frame->local
        allocations are a good place to start.  Or string copying in
        places like
        loc_copy.  Or the entire fd_ctx/inode_ctx subsystem.  Let me
        know and I'll come
        up with a few more.  To put a bit of a positive spin on
        things, the GlusterFS
        code offers many opportunities for improvement in terms of CPU
        and memory
        efficiency (though it's surprisingly still way better than
        Ceph in that regard).

    Yes. The optimizations on the dictionary structures are not a big
    improvement in the overall performance of GlusterFS. I tried them
    in a real scenario and the benefit was only marginal. However, I
    didn't test new features like an atomic "lookup and remove if
    found" (because I would have had to review all the code). I think
    this kind of functionality could improve the results I obtained a
    bit more.

    However, this is not the only reason to make these changes. While
    writing code I've found it tedious to do some things simply
    because dict_t lacks such functions. Some actions require
    multiple calls and multiple error checks, which adds complexity
    and hurts the readability of the code. Many of these situations
    could be solved with functions similar to the ones I proposed.

    On the other hand, if dict_t must truly be treated as a
    concurrent structure, a lot of race conditions can appear when
    performing some operations. It would take a great effort to
    handle all these possibilities everywhere. It would be better to
    pack most of these cases into functions inside dict_t itself,
    where it is easier to combine operations safely.
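
(The kind of helper I have in mind is something like the sketch below. It is
not the real dict_t code; the internals are simplified to a locked linked
list and the names are invented, but it shows how the lookup and the removal
can be combined under a single lock so callers don't have to deal with the
race between a get and a del themselves.)

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

typedef struct pair {
        char        *key;
        void        *value;
        struct pair *next;
} pair_t;

typedef struct {
        pthread_mutex_t lock;
        pair_t         *head;
} dict_sketch_t;

/* Atomically look up 'key' and, if present, unlink the pair and return
 * its value; returns NULL when the key is not found. */
static void *dict_lookup_and_remove(dict_sketch_t *d, const char *key)
{
        void *value = NULL;

        pthread_mutex_lock(&d->lock);
        for (pair_t **pp = &d->head; *pp != NULL; pp = &(*pp)->next) {
                if (strcmp((*pp)->key, key) == 0) {
                        pair_t *found = *pp;
                        *pp = found->next;          /* unlink */
                        value = found->value;
                        free(found->key);           /* assumes keys were strdup'd */
                        free(found);
                        break;
                }
        }
        pthread_mutex_unlock(&d->lock);

        return value;
}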

    By the way, I've run some tests with multiple bricks and there
    seems to be a clear loss of speed on directory listings as the
    number of bricks increases. Since bricks should be independent
    and able to work in parallel, I didn't expect such a big
    performance degradation.


The likely reason is that, even though bricks are parallel for I/O, readdir is essentially a sequential operation, and DHT has a limitation that a readdir reply batch does not cross server boundaries. So if you have 10 files and 1 server, all 10 entries are returned in one call to the app/libc. If you have 10 files evenly distributed over 10 servers, the app/libc has to perform 10 calls, getting one file at a time. This problem goes away when each server has enough files to fill up a readdir batch. It's only when you have too few files and too many servers that this "dilution" problem shows up. However, this is just a theory, and your problem may be something else entirely.
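
(To check that I understand the effect, here is a quick toy model of this
"dilution". It is not DHT code; it just counts the readdir calls the
application has to make when a reply batch cannot cross a server boundary,
with a made-up batch size.)

#include <stdio.h>

/* How many readdir calls the application needs if 'files' entries are
 * spread evenly over 'servers' bricks and a reply batch never mixes
 * entries from two bricks. */
static int calls_needed(int files, int servers, int batch_size)
{
        int calls = 0;

        for (int s = 0; s < servers; s++) {
                int here = files / servers + (s < files % servers ? 1 : 0);
                calls += (here + batch_size - 1) / batch_size;
        }
        return calls;
}

int main(void)
{
        int batch = 10; /* entries per reply batch, invented value */

        printf("10 files,    1 server : %d calls\n", calls_needed(10, 1, batch));
        printf("10 files,   10 servers: %d calls\n", calls_needed(10, 10, batch));
        printf("1000 files,  1 server : %d calls\n", calls_needed(1000, 1, batch));
        printf("1000 files, 10 servers: %d calls\n", calls_needed(1000, 10, batch));
        return 0;
}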

I didn't know that DHT was doing a sequential brick scan on readdir(p) (my fault). Why is that? Why can't it return entries that cross a server boundary? Is it due to a technical reason, or only to the current implementation?

I've run a test using only directories (50 directories with 50 subdirectories each). I started with one brick and measured the time to do a recursive 'ls'. Then I added bricks one at a time, up to 6 (all of them physically independent), and repeated the ls. The time increased linearly with the number of bricks. As more bricks were added, the rebalancing time also grew linearly.

I think this is a big problem for scalability. It can be partially hidden by using some caching or preloading mechanisms, but it will be there and it will hit sooner or later.

Note that Brian Foster's readdir-ahead patch should address this problem to a large extent. When loaded on top of DHT, the prefiller effectively collapses the smaller chunks returned by DHT into a larger chunk requested by the app/libc.

I've seen it; however, I think it will only partially mitigate and hide an existing problem. Imagine you have several hundred or a thousand bricks. I doubt readdir-ahead or anything else can hide the enormous latency that the sequential DHT scan will generate in that case.

The main problem I see is that the full directory structure is read many times sequentially. I think it would be better to do the readdir(p) calls in parallel and combine them (possibly in the background). This way the time to scan the directory structure would be almost constant, regardless of the number of bricks.
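
(What I mean is roughly the sketch below. It is obviously not GlusterFS code,
just the idea expressed with pthreads, a fake per-brick readdir and invented
names: send the readdirp to all bricks at once and combine the replies, so
the total time is close to that of the slowest brick instead of the sum of
all of them.)

#include <pthread.h>
#include <stdio.h>

#define BRICKS    4
#define PER_BRICK 3

struct brick_result {
        int  brick;
        char names[PER_BRICK][64];
};

/* Stand-in for sending one readdirp request to one brick. */
static void *brick_readdir(void *arg)
{
        struct brick_result *res = arg;

        for (int i = 0; i < PER_BRICK; i++)
                snprintf(res->names[i], sizeof(res->names[i]),
                         "brick%d-entry%d", res->brick, i);
        return NULL;
}

int main(void)
{
        pthread_t           threads[BRICKS];
        struct brick_result results[BRICKS];

        /* Fan out: one request per brick, all in flight at the same time. */
        for (int b = 0; b < BRICKS; b++) {
                results[b].brick = b;
                pthread_create(&threads[b], NULL, brick_readdir, &results[b]);
        }

        /* Combine: wait for the replies and append each brick's entries to
         * the aggregated listing (a real implementation would have to merge
         * duplicated directory entries and keep per-brick d_off state). */
        for (int b = 0; b < BRICKS; b++) {
                pthread_join(threads[b], NULL);
                for (int i = 0; i < PER_BRICK; i++)
                        printf("%s\n", results[b].names[i]);
        }
        return 0;
}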

Xavi

Avati


    However, the tests were neither exhaustive nor run under ideal
    conditions, so they might be misleading. Anyway, it seems to me
    that there might be a problem with some mutexes forcing too much
    serialization of requests (though I have no real proof, it's only
    a feeling). Maybe more asynchrony in the calls between
    translators could help.

    Only some thoughts...

    Best regards,

    Xavi








_______________________________________________
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel
