tl;dr - i'm suggesting a new file syncing protocol
for portage syncing.  details are in section 2.


1. background
-------------
rsync needs to check every file in the tree in
order to compare the two sides.  this is too
expensive and doesn't scale as portage's tree
grows in size.

on the other hand, git avoids this by maintaining
a history of edits.  so git doesn't need to
compare all files; instead it walks through the
history.

but git has another issue:  the history gets too
big.  this causes:
    - `git clone` to needlessly take too long, as
      many old histories become irrelevant once
      they get fully overwritten by newer ones.
    - `git pull` to be slower than needed, as the
      history is not ideally compressed.
    - disk space to be wasted on histories.


2. new protocol
---------------
to solve the issues above, i think the ideal
solution is this protocol:
    - each history is a number representing a
      logical clock.  1st history is 0, 2nd is 1,
      etc.
    - the server maintains a list of the past N
      histories of the portage tree.
    - when a client requests to update its portage
      tree, it tells the server its current
      history.  e.g. say the client is currently
      at logical time 1234567.
    - the server is maintaining only the past N
      histories:
        - if 1234567 is behind those maintained N
          ones, then the server sends a full
          portage tree from scratch.
        - if 1234567 is within those maintained N
          ones, then the server has two options:
            (1) either send all changes since
                1234567, as they happened
                historically.  this is a bad idea.
                no good reason for it.

            (2) better: the server can send the
                compressed histories.  compressed
                histories are computed once, and
                cached, in a scalable way.  the
                cache itself is incremental, so
                updating the cache is cheap
                (details in section 2.2).

                e.g. if there are 5000 histories
                that the client lacks since time
                1234567, then there is a chance
                that many of the changes are just
                a waste of time.  e.g. add a file,
                then delete the same file, then
                add a different file.  so why not
                just lie about the history, and
                send only the last file, skipping
                the ones in the middle?  the same
                reasoning applies to diffs of code
                blocks.
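
a minimal sketch of the server-side decision and of option (2), under
assumptions of mine: a history is modelled as a map from path to new
content (None meaning "deleted"), and the names `History`, `squash`
and `respond` are hypothetical, not part of the proposal.  it works on
whole files only and ignores diffs within a file.

```python
# hypothetical sketch: names and data shapes are assumptions, not a spec.
from typing import Dict, List, Optional, Tuple

# one history = the changes of a single clock tick:
# path -> new file content, or None if the file was deleted.
History = Dict[str, Optional[bytes]]


def squash(histories: List[History]) -> History:
    """collapse consecutive histories, keeping only the last state of each path."""
    merged: History = {}
    for h in histories:      # oldest first
        merged.update(h)     # later changes overwrite earlier ones
    return merged


def respond(client_clock: int, oldest: int, log: List[History],
            full_tree: History) -> Tuple[str, History]:
    """log[i] moves the tree from clock oldest+i to clock oldest+i+1."""
    newest = oldest + len(log)
    if client_clock < oldest or client_clock > newest:
        # client is outside the N maintained histories: send everything.
        return "full", dict(full_tree)
    # client is inside the window: send one squashed delta, not the raw log.
    return "delta", squash(log[client_clock - oldest:])


if __name__ == "__main__":
    # add a file, delete it, then add a different file.
    log = [{"a": b"1"}, {"a": None}, {"b": b"2"}]
    print(respond(100, 100, log, {"b": b"2"}))
```

in this sketch a file that was added and later deleted still shows up
as a delete marker in the delta.  a client that never had the file can
simply ignore it.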

2.1. properties of this new protocol
------------------------------------
so this new protocol has these properties:
    - unlike rsync, it doesn't need to compare all
      files individually.
    - unlike git, the history doesn't grow on the
      client.  history remains only a single
      number representing a logical clock.
    - the history on the server is limited to N
      past entries.  no devs will cry, because
      this is not a code collaboration app, but
      simply a file synchronisation app to replace
      rsync.  so the admins are free to set N as
      small as they please, without worrying about
      harming collaborating devs.
    - the server has the option to send compressed
      histories to clients, and these compressed
      histories are cacheable for better
      performance.


2.2. how it will feel to admins/devs
------------------------------------
    - the devs simply commit their changes to the
      portage tree via git.
    - the git server will have hooks to execute an
      external command for this new protocol, that
      will calculate all diffs necessary in order
      to build a new history.

      e.g. if the current history is 30000, and a
      dev makes a new commit via git, then the git
      hooks will execute the external command to
      calculate the diff for the files affected by
      the git commit, such that history 30001 is
      created.

      the hooked external command will also see if
      it can compress the histories for the past M
      entries up to 30001, so that clients that
      live in time 30001-M, who ask for 30001, can
      get the compressed history instead of the
      raw histories from 30001-M to 30001.
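
one possible way to keep that cache incremental, as a rough sketch
only: the class `DeltaCache` and its methods are hypothetical names of
mine, and the hook is assumed to hand over the files touched by one
commit as a map from path to new content (None meaning "deleted").
each commit folds the new changes into the M cached deltas instead of
recomputing them from scratch.

```python
# hypothetical sketch: names and data shapes are assumptions, not a spec.
from typing import Dict, Optional

History = Dict[str, Optional[bytes]]


class DeltaCache:
    """cache[c] is the squashed delta that moves a client from clock c to self.clock."""

    def __init__(self, clock: int, m: int) -> None:
        self.clock = clock      # current history number
        self.m = m              # how many past clocks to keep deltas for
        self.cache: Dict[int, History] = {}

    def commit(self, changes: History) -> int:
        """called by the (assumed) git hook with the files touched by one commit."""
        for delta in self.cache.values():
            delta.update(changes)               # newer changes overwrite older ones
        self.cache[self.clock] = dict(changes)  # delta for clients at the old clock
        self.clock += 1
        self.cache.pop(self.clock - self.m - 1, None)  # forget clocks older than M
        return self.clock

    def delta_for(self, client_clock: int) -> Optional[History]:
        """squashed delta for a client, or None if it needs a full tree."""
        if client_clock == self.clock:
            return {}           # already up to date
        return self.cache.get(client_clock)


if __name__ == "__main__":
    cache = DeltaCache(clock=30000, m=2)
    cache.commit({"a": b"1"})   # history 30001 is created
    print(cache.clock, cache.delta_for(30000))
```

the cost per commit in this sketch is roughly M times the number of
files touched by the commit, independent of the size of the tree.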

ty,
cm.

