On 04/12/2018 10:01 AM, Brian Bouterse wrote:

On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel <jor...@redhat.com <mailto:jor...@redhat.com>> wrote:

    On 04/11/2018 03:29 PM, Brian Bouterse wrote:
    I think we should look into this in the near-term. Changing an
    interface on an object used by all plugins will be significantly
    easier, earlier.

    On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel <jor...@redhat.com
    <mailto:jor...@redhat.com>> wrote:

        On 04/11/2018 10:59 AM, Brian Bouterse wrote:

        On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel
        <jor...@redhat.com <mailto:jor...@redhat.com>> wrote:





            On 04/06/2018 09:15 AM, Brian Bouterse wrote:
            Several plugins have started using the Changesets
            including pulp_ansible, pulp_python, pulp_file, and
            perhaps others. The Changesets provide several distinct
            points of value which are great, but there are two
            challenges I want to bring up. I want to focus only on
            the problem statements first.

            1. There is redundant "differencing" code in all
            plugins. The Changeset interface requires the plugin
            writer to determine what units need to be added and
            those to be removed. This requires all plugin writers
            to write the same non-trivial differencing code over
            and over. For example, you can see the same non-trivial
            differencing code present in pulp_ansible
            and pulp_python
            Line-wise, this "differencing" code makes up a large
            portion (maybe 50%) of the sync code itself in each plugin.

            Ten lines of trivial set logic hardly seems like a big
            deal but any duplication is worth exploring.

        It's more than ten lines. Take pulp_ansible for example. By
        my count (the linked to section) it's 89 lines, which out of
        306 lines of plugin code for sync is 29% of extra redundant
        code. The other plugins have similar numbers. So with those
        numbers in mind, what do you think?

        I was counting the lines (w/o comments) in find_delta() based
        on the linked code.  Which functions are you counting?

    I was counting the find_delta, build_additions, and
    build_removals methods. Regardless of how the lines are counted,
    that differencing code is the duplication I'm talking about.
    There isn't a way to use the changesets without duplicating that
    differencing code in a plugin.

    The differencing code is limited to find_delta() and perhaps
    build_removals().  Agreed, the line count is less useful than
    specifically identifying duplicate code.  Outside of find_delta(),
    I see similar code (in part because it got copied from file
    plugin) but not seeing actual duplication.  Can you be more specific?

Very similar code or identical code, I think it begs the question why are we having plugin writer's do this at all? What value are they creating with it? I don't have a reasonable answer to that question, so the requirement for plugin writer's to write that code brings me back to the problem statement: "plugin writers have redundant differencing code when using Changesets". More info on why it is valuable for the plugin writer to do the differencing code versus the Changesets would be helpful.

The ChangeSet abstraction (and API) is based on following division of responsibility:

The plugin  (with an understanding of the remote and its content):
  - Download metadata.
  - Parse metadata
  - Based on the metadata:
    - determine content to be added to the repository.
      - define how artifacts are downloaded.
      - construct content
    - determine content to be removed to the repository.

Core (without understand of specific remote or its content):
  - Provide low level API for plugin to affect the changes it has determined need to be made to the repository.  This is downloaders, models etc.   - Provide high(er) level API for plugin to affect the changes it has determined need to be made to the repository.  This is the ChangeSet.

Are you proposing that this is not the correct division?

    So a shorter, simpler problem statement is: "to use the
    changesets plugin writers have to do extra work to compute
    additions and removals parameters".

    This statement ^ is better but still too vague to actually solve. 
    Can we elaborate on specifically what "to do extra work" means?

Sure. Removing that vague language is one way to resolve its vagueness. Here's a revised problem statement: "to use the changesets plugin writers have to compute additions and removals parameters". This problem statement would be resolved by a solution that causes the plugin writer to never have to produce these parameters and be replaced by an interface that would require less effort from a plugin writer.

I think it's the plugin's responsibility to determine the difference.  Aside from that: without an understanding of the metadata and content type, how could the ChangeSet do this?  What might that looks like?

            2. Plugins can't do end-to-end stream processing. The
            Changesets themselves do stream processing, but when
            you call into changeset.apply_and_drain() you have to
            have fully parsed the metadata already. Currently when
            fetching all metadata from Galaxy, pulp_ansible takes
            about 380 seconds (6+ min). This means that the actual
            Changeset content downloading starts 380 seconds later
            than it could. At the heart of the problem, the
            fetching+parsing of the metadata is not part of the
            stream processing.

            The additions/removals can be any interable (like
            generator) and by using ChangeSet.apply() and iterating
            the returned object, the pluign can "turn the crank"
            while downloading and processing the metadata.  The
            ChangeSet.apply_and_drain() is just a convenience
            method.  I don't see how this is a limitation of the

        That is new info for me (and maybe everyone). OK so
        Changesets have two interfaces. apply() and
        apply_and_drain(). Why do we have two interfaces when
        apply() can support all existing use cases (that I know of)
        and do end-to-end stream processing but apply_and_drain()
        cannot? I see all of our examples (and all of our new
        plugins) using apply_and_drain().

        The ChangeSet.apply() was how I designed (and documented) it.
        Not sure when/who added the apply_and_drain().  +1 for
        removing it.

    I read through the changeset docs. I think this stream processing
    thing is still a problem but perhaps in how we're presenting the
    Changeset with it's arguments. I don't think apply() versus
    apply_and_drain() are at all related. Regardless of if you are
    using apply() or apply_and_drain(), the Changeset requires an
    'additions' and 'removals' arguments. This sends a clear message
    to the plugin writer that they need to compute additions and
    removals. They will fetch the metadata to compute these which is
    mostly how the changeset documentation reads. To know that they
    could present a generator that would correctly allow the metdata
    from inside the Changeset is I feel as non-obvious. I want the
    high-performing implementation to be the obvious one.

    So what about a problem statement like this: "Changesets are
    presented such that when you call into them you should already
    have fetched the metadata"?

    I'm not sure what is meant by "presented".  If this means that we
    should provide an example of how the ChangeSet can be used by
    plugins (with large metadata) in such a way that does not require
    downloading all the metadata first - that sounds like a good idea.

Cool so this is transitioning to ideas for resolution. The solution to add documentation on how to do this with the existing interface is one option. My concern with adding additional docs on how to use the current interface better is that if users choose to follow the existing docs then they will have the stream processing problem once again. To me, this suggests that this new example should actually replace the existing documentation.

Seems like both example would be useful.  I'm not convinced that all plugins would benefit from this.  For example: the File plugin manifest is small and would likely not benefit from the extra complexity.  For complicated plugins (like RPM), can differencing decision be made before analyzing the entire metadata (eg: primary.xml)?  Also, it's not clear to me how this would work using the Downloader.  Are you suggesting that the plugin would parse/process metadata files while they're being downloaded? Perhaps a better understanding of the flow to be supported would help me understand this.

            Do you see the same challenges I do? Are these the
            right problem statements? I think with clear problem
            statements a solution will be easy to see and agree on.

            I'm not convinced that these are actual
            problems/challenges that need to be addressed in the
            near term.


            Pulp-dev mailing list
            Pulp-dev@redhat.com <mailto:Pulp-dev@redhat.com>

            Pulp-dev mailing list
            Pulp-dev@redhat.com <mailto:Pulp-dev@redhat.com>

Pulp-dev mailing list

Reply via email to