On 04/11/2018 03:29 PM, Brian Bouterse wrote:
I think we should look into this in the near term. Changing an interface on an object used by all plugins will be significantly easier the earlier we do it.

On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel <jor...@redhat.com <mailto:jor...@redhat.com>> wrote:

    On 04/11/2018 10:59 AM, Brian Bouterse wrote:

    On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel <jor...@redhat.com
    <mailto:jor...@redhat.com>> wrote:





        On 04/06/2018 09:15 AM, Brian Bouterse wrote:
        Several plugins have started using the Changesets including
        pulp_ansible, pulp_python, pulp_file, and perhaps others.
        The Changesets provide several distinct points of value
        which are great, but there are two challenges I want to
        bring up. I want to focus only on the problem statements first.

        1. There is redundant "differencing" code in all plugins.
        The Changeset interface requires the plugin writer to
        determine what units need to be added and those to be
        removed. This requires all plugin writers to write the same
        non-trivial differencing code over and over. For example,
        you can see the same non-trivial differencing code present
        in pulp_ansible
        and pulp_python.
        Line-wise, this "differencing" code makes up a large portion
        (maybe 50%) of the sync code itself in each plugin.
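[Editor's note: a minimal sketch of the kind of "differencing" logic being described, so the duplication is concrete. The function name find_delta matches the method discussed later in this thread, but the signature and key shapes here are illustrative, not the actual Pulp plugin API.]

```python
# Hypothetical sketch of the differencing logic each plugin currently
# repeats. Natural keys are modeled as tuples; the real plugins derive
# them from their own content models.

def find_delta(remote_keys, local_keys):
    """Compute which content units to add and which to remove.

    remote_keys: set of natural-key tuples parsed from upstream metadata.
    local_keys: set of natural-key tuples already in the repository version.
    """
    additions = remote_keys - local_keys  # upstream but not local: add
    removals = local_keys - remote_keys   # local but gone upstream: remove
    return additions, removals


remote = {("pkg", "1.0"), ("pkg", "1.1")}
local = {("pkg", "0.9"), ("pkg", "1.0")}
adds, removes = find_delta(remote, local)
# adds -> {("pkg", "1.1")}, removes -> {("pkg", "0.9")}
```

The set arithmetic itself is short; the surrounding code that builds the key sets and turns the delta into additions/removals objects is where most of the duplicated lines live.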

        Ten lines of trivial set logic hardly seems like a big deal
        but any duplication is worth exploring.

    It's more than ten lines. Take pulp_ansible, for example. By my
    count (the linked-to section) it's 89 lines, which out of 306
    lines of plugin sync code is 29% redundant code. The other
    plugins have similar numbers. So with those numbers in mind,
    what do you think?

    I was counting the lines (w/o comments) in find_delta() based on
    the linked code.  Which functions are you counting?

I was counting the find_delta, build_additions, and build_removals methods. Regardless of how the lines are counted, that differencing code is the duplication I'm talking about. There isn't a way to use the changesets without duplicating that differencing code in a plugin.

The differencing code is limited to find_delta() and perhaps build_removals().  Agreed, the line count is less useful than specifically identifying the duplicated code.  Outside of find_delta(), I see similar code (in part because it was copied from the file plugin) but I'm not seeing actual duplication.  Can you be more specific?

So a shorter, simpler problem statement is: "to use the changesets, plugin writers have to do extra work to compute the additions and removals parameters".

This statement ^ is better but still too vague to actually solve. Can we elaborate on specifically what "to do extra work" means?

        2. Plugins can't do end-to-end stream processing. The
        Changesets themselves do stream processing, but when you
        call into changeset.apply_and_drain() you have to have fully
        parsed the metadata already. Currently when fetching all
        metadata from Galaxy, pulp_ansible takes about 380 seconds
        (6+ min). This means that the actual Changeset content
        downloading starts 380 seconds later than it could. At the
        heart of the problem, the fetching+parsing of the metadata
        is not part of the stream processing.

        The additions/removals can be any iterable (such as a generator),
        and by using ChangeSet.apply() and iterating the returned
        object, the plugin can "turn the crank" while downloading and
        processing the metadata. The ChangeSet.apply_and_drain() is
        just a convenience method.  I don't see how this is a
        limitation of the ChangeSet.
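[Editor's note: a sketch of the generator-fed pattern being described, where metadata parsing and content handling overlap. ChangeSet, apply(), and the metadata helpers here are illustrative stand-ins for the real Pulp objects.]

```python
# Hypothetical sketch: feeding additions from a generator so that
# metadata fetching/parsing and content processing are interleaved,
# rather than fully parsing all metadata up front.

def fetch_metadata_pages():
    """Stand-in for streaming paginated metadata from a remote (e.g. Galaxy)."""
    yield [{"name": "unit-1"}, {"name": "unit-2"}]
    yield [{"name": "unit-3"}]


def additions():
    """Yield one unit at a time as each metadata page arrives, so the
    consumer can start work before every page has been fetched."""
    for page in fetch_metadata_pages():
        for record in page:
            yield record["name"]


class ChangeSet:
    """Minimal stand-in that consumes its additions lazily ("turns the crank")."""

    def __init__(self, additions):
        self._additions = additions

    def apply(self):
        for unit in self._additions:
            # The real ChangeSet would download and save the unit here.
            yield unit


downloaded = list(ChangeSet(additions()).apply())
```

Because additions() is a generator, the (stand-in) ChangeSet pulls units as they are parsed instead of waiting for a fully built list, which is the end-to-end stream processing at issue here.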

    That is new info for me (and maybe everyone). OK so Changesets
    have two interfaces. apply() and apply_and_drain(). Why do we
    have two interfaces when apply() can support all existing use
    cases (that I know of) and do end-to-end stream processing but
    apply_and_drain() cannot? I see all of our examples (and all of
    our new plugins) using apply_and_drain().

    The ChangeSet.apply() was how I designed (and documented) it.  Not
    sure when/who added the apply_and_drain().  +1 for removing it.

I read through the changeset docs. I think this stream-processing issue is still a problem, but perhaps it lies in how we present the Changeset and its arguments. I don't think apply() versus apply_and_drain() is related at all. Whether you use apply() or apply_and_drain(), the Changeset requires 'additions' and 'removals' arguments. This sends a clear message to the plugin writer that they need to compute additions and removals up front, so they fetch all the metadata first, which is mostly how the changeset documentation reads. That they could instead pass a generator, letting the metadata be fetched lazily from inside the Changeset, is, I feel, non-obvious. I want the high-performing implementation to be the obvious one.

So what about a problem statement like this: "Changesets are presented such that when you call into them you should already have fetched the metadata"?

I'm not sure what is meant by "presented".  If this means that we should provide an example of how the ChangeSet can be used by plugins (with large metadata) in such a way that does not require downloading all the metadata first - that sounds like a good idea.

        Do you see the same challenges I do? Are these the right
        problem statements? I think with clear problem statements a
        solution will be easy to see and agree on.

        I'm not convinced that these are actual problems/challenges
        that need to be addressed in the near term.


        Pulp-dev mailing list
        Pulp-dev@redhat.com <mailto:Pulp-dev@redhat.com>
