Re: [git-users] Synchronizing air gapped git repositories using bundles

Philip Oakley Thu, 02 Feb 2017 16:17:30 -0800

Hi Lowell,

An aside question: If all the machines were co-located, would they still be 
air-gapped those few feet between them, or would they be linked? I ask because 
it helps clarify the way you would serve between the machines.


The bundle is simply a compact version of the on-wire transfer, where the 
negotiation of 'wants', and 'haves'  (which is normally on-the-wire) is done by 
the user. When you fetch from the bundle it is just like fetching from a 
remote, and only the requested parts (the refspec) are extracted from the bundle.

It may help clarify discussions about what is being transferred.
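
As a rough sketch of the "fetch only what the refspec asks for" point: the script below builds a throwaway repository, bundles every ref, and then fetches just master into a second repository. All names here (src, dst, sidebranch, all.bndl, mirror-src, the Tester identity) are made up for the illustration and are not from the original setup.

```shell
#!/bin/sh
set -e
demo=$(mktemp -d)
cd "$demo"

# Throwaway identity for the test commits (hypothetical values).
export GIT_AUTHOR_NAME=Tester GIT_AUTHOR_EMAIL=tester@example.com
export GIT_COMMITTER_NAME=Tester GIT_COMMITTER_EMAIL=tester@example.com

# Source repository with two branches.
git init -q src
cd src
git symbolic-ref HEAD refs/heads/master   # force the branch name regardless of local defaults
echo one > a.txt; git add a.txt; git commit -q -m "first on master"
git checkout -q -b sidebranch
echo two > b.txt; git add b.txt; git commit -q -m "work on sidebranch"
git checkout -q master

# Bundle every ref. No exclusions are given, so the bundle has no prerequisites.
git bundle create ../all.bndl --all

# Receiving side: see which refs the bundle offers...
cd ..
git init -q dst
cd dst
git bundle list-heads ../all.bndl

# ...then fetch only master, exactly as from a remote.
git fetch -q ../all.bndl refs/heads/master:refs/heads/mirror-src

# Only the requested ref has been created here; sidebranch is not listed.
git branch
```

The bundle carries both branches, but the receiving repository ends up with a single ref, mirror-src, because that is all the refspec asked for.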

My example would be:

git bundle create mc1-20161201-20170202.bndl --all --since="2016.12.01" master

so the bundle filename contains the full details of what where when...
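
If the names are to be generated rather than typed, a small helper along these lines might do. The function name bundle_name and its argument order are my own invention for the sketch; only the server-datethen-datenow.bndl pattern comes from the thread.

```shell
#!/bin/sh
# bundle_name <server> <since-date> [<now-date>]
# Dates may be written YYYY.MM.DD or YYYY-MM-DD; the separators are stripped
# for the file name. If <now-date> is omitted, today's date is used.
bundle_name() {
    server=$1
    since=$(printf '%s' "$2" | tr -d '.-')
    now=$(printf '%s' "${3:-$(date +%Y-%m-%d)}" | tr -d '.-')
    printf '%s-%s-%s.bndl\n' "$server" "$since" "$now"
}

# Prints: mc1-20161201-20170202.bndl
bundle_name mc1 2016.12.01 2017.02.02

# Possible use when creating the bundle (not run here):
#   git bundle create "$(bundle_name mc1 2016.12.01)" --all --since="2016.12.01"
```

Passing the same since date to both the helper and --since keeps the file name and the actual cut point from drifting apart.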

I've not come up against a 'verify' scenario where the bundle has a partial 
orphan branch from the recipients view, so that if the refspec wanted every ref 
then that one would be metaphorically shallow (and hence not be connected).

The bundles are usually quite small and compact (relative to say a zip file of 
every revision - e.g. right click the top directory and send to a Compressed 
(zipped) folder - been there, done that), so it shouldn't be an issue (try the 
alternate to compare!)

"Bundle stacking" was the second one as you thought - fetch from Nov16.bndl, 
then Dec16.bndl, then Jan17.bndl, etc.
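
A sketch of what that stacking looks like in practice, using a throwaway pair of repositories. The last-export tag, the mirror-src branch, and the repository names are assumptions made for the example; the cut points here are tag based so the script does not depend on the wall clock, but --since dates would serve the same role.

```shell
#!/bin/sh
set -e
stack=$(mktemp -d)
cd "$stack"

# Throwaway identity for the test commits (hypothetical values).
export GIT_AUTHOR_NAME=Tester GIT_AUTHOR_EMAIL=tester@example.com
export GIT_COMMITTER_NAME=Tester GIT_COMMITTER_EMAIL=tester@example.com

# commit_on <YYYY-MM-DD>: add a commit with a faked date.
commit_on() {
    echo "$1" >> log.txt
    git add log.txt
    GIT_AUTHOR_DATE="$1T12:00:00" GIT_COMMITTER_DATE="$1T12:00:00" \
        git commit -q -m "change on $1"
}

git init -q src
cd src
git symbolic-ref HEAD refs/heads/master   # force the branch name regardless of local defaults

# One bundle per month; each one starts where the previous one stopped.
commit_on 2016-11-15
git bundle create ../Nov16.bndl master
git tag last-export master

commit_on 2016-12-15
git bundle create ../Dec16.bndl last-export..master
git tag -f last-export master

commit_on 2017-01-15
git bundle create ../Jan17.bndl last-export..master

# Receiving side: fetch oldest first, so each bundle's prerequisites are
# already present by the time it is verified and fetched.
cd ..
git init -q dst
cd dst
for b in ../Nov16.bndl ../Dec16.bndl ../Jan17.bndl; do
    git bundle verify "$b"
    git fetch -q "$b" master:mirror-src
done

git log --oneline mirror-src
```

Fetching out of order would make the verify step fail on the first bundle whose prerequisite commit is still missing, which is a useful early warning rather than a problem.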

Using the --branches option is one way to go, but you may get later 
synchronisation problems with the other branches if you don't get sufficient 
depth into the bundle - i.e. getting the right since date, or tag, or merge 
base "master..sidebranch" (two dot notation).
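
To illustrate the merge-base case, here is a small sketch with invented names (src, sidebranch, side.bndl). It bundles only the commits that are on sidebranch but not on master, and shows that the prerequisite recorded in the bundle is the merge base of the two branches.

```shell
#!/bin/sh
set -e
sandbox=$(mktemp -d)
cd "$sandbox"

# Throwaway identity for the test commits (hypothetical values).
export GIT_AUTHOR_NAME=Tester GIT_AUTHOR_EMAIL=tester@example.com
export GIT_COMMITTER_NAME=Tester GIT_COMMITTER_EMAIL=tester@example.com

git init -q src
cd src
git symbolic-ref HEAD refs/heads/master   # force the branch name regardless of local defaults
echo base > f.txt; git add f.txt; git commit -q -m "shared base"
git checkout -q -b sidebranch
echo side > s.txt; git add s.txt; git commit -q -m "side work"
git checkout -q master
echo more >> f.txt; git add f.txt; git commit -q -m "master moves on"

# The merge base is the commit the receiver must already have.
base=$(git merge-base master sidebranch)

# Two-dot range: commits reachable from sidebranch but not from master.
git bundle create ../side.bndl master..sidebranch

# verify lists the prerequisite commit; compare it with the merge base.
git bundle verify ../side.bndl
echo "merge base: $base"
```

If the receiving repository does not yet contain that merge-base commit, the bundle will not verify there, which is the "insufficient depth" problem in miniature.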


The other point you probably already know is that you can practice this all on 
a single machine! Just init a test repo, add a few commits and branches (with 
faked commit dates etc), then bundle parts of it, now init another repo and try 
fetching from that bundle, back to the first repo and get another partial 
bundle, and transfer that, etc. etc. This means you can test out all the issues 
easily from the comfort of the chair!
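
For what it's worth, that practice loop might look something like the sketch below. Everything in it is a stand-in: the repository names (src, dst), the mirror-src branch, the bundle file names, the Tester identity, and the faked dates are all chosen for the example only.

```shell
#!/bin/sh
# Practice run: two throwaway repos on one machine, linked only by bundle files.
set -e
work=$(mktemp -d)
cd "$work"

# Throwaway identity for the test commits (hypothetical values).
export GIT_AUTHOR_NAME=Tester GIT_AUTHOR_EMAIL=tester@example.com
export GIT_COMMITTER_NAME=Tester GIT_COMMITTER_EMAIL=tester@example.com

# commit_on <file> <date>: make a commit with a faked author/committer date.
commit_on() {
    echo "$2" >> "$1"
    git add "$1"
    GIT_AUTHOR_DATE="$2" GIT_COMMITTER_DATE="$2" git commit -q -m "change on $2"
}

# "Source" repo with a little history on master.
git init -q src
cd src
git symbolic-ref HEAD refs/heads/master   # force the branch name regardless of local defaults
commit_on file.txt "2016-11-10T12:00:00"
commit_on file.txt "2016-12-10T12:00:00"

# First transfer: everything so far.
git bundle create ../full.bndl master

# "Destination" repo fetches master from the bundle into a mirror branch.
cd ..
git init -q dst
cd dst
git fetch -q ../full.bndl master:mirror-src

# Back to the source: more work, then a partial bundle with a date cut point.
cd ../src
commit_on file.txt "2017-01-10T12:00:00"
git bundle create ../partial.bndl --since="2016-12-01" master

# Destination again: check the prerequisites, then fetch the increment.
# The December commit overlaps with the first transfer, which is harmless.
cd ../dst
git bundle verify ../partial.bndl
git fetch -q ../partial.bndl master:mirror-src

git log --oneline mirror-src
```

From here it is easy to vary one thing at a time (a later --since date, an extra branch, a skipped bundle) and see how verify and fetch react.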

The final bit is that you don't (shouldn't) get any "result in unreferenced 
revisions in the destination repositories", all those revisions fetched from 
the bundle will be linked to a known ref. There may be revs in the bundle that 
aren't brought in, which isn't quite the same question.

Hope that bit of rambling helps.

Philip
PS I understand enough of how I think git works to get me into trouble....  
also.
  ----- Original Message ----- 
  From: Lowell Alleman 
  To: Git for human beings 
  Cc: philipoak...@iee.org 
  Sent: Thursday, February 02, 2017 6:23 PM
  Subject: Re: [git-users] Synchronizing air gapped git repositories using 
bundles


  Philip,


  Thanks for the reply!


  The reason I'm looking at using a script is mostly for standardization.  So 
that the file names are consistent and to capture some bundle metadata 
necessary for the file transfer process (file name, size, checksum, ...)  We 
capture some metadata about the bundle such as: Revision count and some delta 
details (specifically, the output of  "git diff --stat" and "git log 
--stat").  This helps answer the question about what is being transferred in a 
given bundle.  (And to the best of my knowledge, there's no way to get this 
info from the bundle file itself.)   Secondary reasons for the script come 
down to mixed levels of user fluency with git, a general mandate to automate 
tasks, and, currently, the script is responsible for tracking the "last export 
point" via tags.  (Oh, and I found it easy to forget to include refs, like 
refs/heads/master in the bundle, and then importing became super painful on the 
other side.)


  I was trying to stick with specific revisions and avoid overlapping exports, 
for a few reasons:  (1) so that we could build a change "manifest" to go along 
with the bundle that would only include what's "new", (2) so that if we need to 
release multiple fixes in a short period of time, like more than once a day, we 
don't end up just copying the same stuff around over and over again (we are 
looking at a scheduled monthly sync up to keep divergence from becoming 
significant, but we may need to sync up multiple times a day on rare 
occasions), and (3) just to generally minimize file transfer size (not a huge 
deal, talking a few MBs).


  I fully agree on the file transfer rejections point.  Hasn't happened yet.  
Policy work is ongoing.



  When you refer to bundle stacking.  Is there a way to specify multiple 
locations to pull from at once, or are you just referring to the fact that you 
can sequentially pull from multiple bundle files.  (I'm assuming the second.)


  Yes, "recording what has been transferred" is exactly the core issue I'm 
facing.  I've noted above some of the reasons I was trying to use a tighter 
revisions selection (using tags) vs using dates, but I'm certainly 
reconsidering that thought process.  The more I think about it the more I'm 
liking it.  That would dramatically simplify the process and work around the 
inherent issues/limitations of tags (specifically regarding the moving of 
tags).  The fundamental challenge is that there is no repository (in the 
general sense, not necessarily git) that can be accessed from all of the 
environments.  Ah the joys of air gapped networks... so much fun!


  I guess the biggest down side is just transferring extra stuff around and 
having to write down dates.  (Probably on a wiki, or something like that)


  So let's say we setup a "monthly" transfer schedule, I should be able to use 
something like this:  


     git bundle create mystuff-Jan2017.bundle --since=1.month.ago master


  And if I accidentally skip a month, (which I determine, only after 
transferring the above file), I should be able to do this:


     git bundle create mystuff-Dec2016.bundle --since=2.month.ago --before=1.month.ago master




  One thing I'm trying to figure out is if I should include the 
"--branches=master" filter as well?  I don't really want to synchronize other 
branches, which are mostly used for merging or for really big changes.  I also 
don't want to include the "mirror" branches, but that's probably not a big deal 
since most of the revisions are shared between the branches anyways.  And if I 
don't limit the revisions to just the "master" branch, I'm assuming that the 
unwanted branch contains revisions prior to the exported time frame, then I 
suspect that I may end up with revisions dependencies in the bundle that I 
don't want to have.


  In other words, say a bundle contains revisions from "master" and 
"rewrite-it-all" branches.  (These branches are independent of each other.  
Specifically, no merging occurs between these two branches during the exported 
timeframe.)  So because each branch started before the export timeframe, there 
are 2 external revision dependencies.  (Let's say:   aaa for master, and bbb for 
rewrite-it-all).   So I have "aaa" in my local repository, but not "bbb".  So 
this means that git bundle verify would fail.  So at this point, can I still 
import the "master" branch from the bundle?  Or does git require that the 
repository have ALL of the revision dependencies for all revisions in the 
bundle.




  On the other hand, I've also had issues getting bundle create to include some 
very specific merge commit revisions in the past.  (I think these were explicit 
no-change merge commits.)  I fought with it for quite some time but ultimately 
ended up just bundling up the entire repository and distributing that.  I 
really want to avoid that in the future (as the total repository size is 
becoming more significant.)


  So I guess the more fundamental question is this:  Is it better to use the 
macro approach (and risk pushing around lots of extra stuff) which could result 
in unreferenced revisions in the destination repositories, or is it better to 
use the micro-mode and be strategic about just the specific branches/revisions 
I want to synchronize.


  Right now I've been grabbing just the branch I want from the bundle, and 
normally that's all it includes anyways.  (e.g.  "git checkout mirror-REPO1; git 
fetch my.repo1.bundle master").  And I'm now wondering if in doing so, I could 
ultimately end up missing necessary revisions that would be imported if I used 
"git fetch".  Is that possible?




  Okay, that's enough rambling.   Thanks again for any help you can provide.   
As you may have gathered, I've been fighting with this process for quite some 
time now.  I probably have an unbalanced knowledge of git that's currently 
working against me.  I understand enough of how I think git works to get me 
into trouble, but not enough to get back out of it. ;-)  And I'm working with a 
rather old version of git 1.7.


  Thanks in advance!




  On Wednesday, February 1, 2017 at 6:47:33 PM UTC-5, Philip Oakley wrote:
    Hi Lowell,

    You can use all of the options in the rev-list for selecting which commits 
are in the bundle (which is just a thin wrapper around the pack file that would 
be sent over the wire). 

    You can include more commits in the bundle than you need [1], that is, have 
an overlap. One option is simply to use the --since=<date> option as a way of 
ensuring you go far enough back in history. Plus the --all to get everything 
after that date [2].

    I suspect that part of the problem is finding a way of recording what has 
been transferred in the three way transfer - I'd suggest it's just as easy to 
use a small note book (or formal admin log) for recording the date of transfers 
and use that to guide the bundle creation.

    Plus you can always stack up the bundles, so can fetch first from the 
oldest bundle, and then from the newer bundle, etc. 

    I see you have the typical 'transfer review' process for the bundle 
exchange (implies a certain kind of environment ;-) - does it ever fail/reject 
the transfer? or is it simply making sure it is what you thought it was and 
have recorded the transfer correctly (I expect it's actually the latter). If 
you get true rejection you have more issues.

    I don't really think you need a special 'script' (beyond satisfying some 
edict), as the bundle and fetch commands should be sufficient for doing the 
transfer.

    Probably the biggest issue at that point is having a standardised naming 
convention for the bundle file, e.g. server<n>-<datethen>-<datenow>.bndl so 
that you know where it came from, where the --since cut point was, and when it 
was created.

    Then it becomes fairly easy to import/fetch from the bundle according to the 
carefully mandated process. 

    Philip

    [1] https://git-scm.com/docs/git-bundle
    It is okay to err on the side of caution, causing the bundle file to 
contain objects already in the destination, as these are ignored when unpacking 
at the destination.
    [2] 
http://stackoverflow.com/questions/11792671/how-to-git-bundle-a-complete-repo
      ----- Original Message ----- 
      From: Lowell Alleman 
      To: Git for human beings 
      Sent: Wednesday, February 01, 2017 9:58 PM
      Subject: [git-users] Synchronizing air gapped git repositories using 
bundles


      I have 3 separate air-gapped git repositories (hosted on local GitHub 
enterprise) that I'm trying to keep in sync.   Currently, I'm using "git 
bundle" to push revisions back and forth, which worked fairly well with just 2 
repositories, but I'm struggling a bit since the 3rd (and final) repository has 
been added to the mix.  I was using a single tag to track the point of last 
export as noted in the "git bundle" docs, but I'm struggling to make that scale 
with 2+ total repositories. 


      In terms of information flow, we've deemed one of the repositories as 
"primary" and the other two as "secondary" repositories.  So in a sense we are 
using the "primary" repository like a development and merging area so that all 
changes go through the primary repository and trickle down to the secondary 
repositories.  Changes are always pushed upstream to primary, and then synced 
down to the other secondary repository. 


      Please note that our use of git is more like a "versioned file system" 
than the typical developer use case.  I go on to explain that a bit more later, 
but wanted to get to my main question before everyone gives up on reading this 
really long and complicated explanation of the mess I made. 


      Q:  Does anyone know of any existing scripts, documented methods, or best 
practices to follow when syncing a branch between multiple air-gapped 
repositories?


      How we are using git:  As noted above, this is NOT a typical 
development-centered use-case.  Branching is very infrequent, and most work is 
done on the "master" branch in each repository.  Unlike typical 
developer-centric approaches, each clone (working copy) ends up tied to a 
specific server, rather than a single developer.  So multiple users end up 
working in the same working copy and committing code from one place.  The team 
is small and the changes are infrequent enough that this works for us, despite 
the atypical and less-than-ideal use case.



      How we are using branches:   We treat each repository as if it has just 
one branch, a single "master".  However, because of the synchronization 
requirements, we create special purpose branches in each repository that 
essentially mirror the master branches of the other repositories.  So the 
primary repository has 2 mirrored branches, one for each of the secondary 
repositories.  And each secondary repository has a single mirrored branch that 
represents the primary (upstream) repository.  (By convention, we have agreed 
never to synchronize revisions directly between the two secondary 
repositories.)  Local changes are never applied to a mirrored repository 
branch, so that it should match the "master" branch of the mirrored repository 
exactly.  (That is, the only changes to these mirrored branches are 
fast-forward only "pull"s made from bundle files exported from the mirrored 
repository.)   The process of merging changes between branches is manual, and I 
think I want to keep it that way for the foreseeable future.  (Perhaps one day 
I'll make fast-forward merges apply automatically, but in general I want a 
human to be responsible for this step.)  So while each repositories' "master" 
branch may diverge, or at least have a slightly different history, in the end, 
they should all end up with the same content.  Well, at least that's the 
ultimate goal. 


      File transfer:  Transferring bundle files between air-gapped environments 
involves multiple human steps including content review, approval, and some 
safety checks for compliance.  Therefore, there's no way to automatically 
schedule synchronization, which is a bummer.   That being said, I'd like to 
make this as painless as possible within the realm of what I can control.  I'm 
looking to create import and export scripts (or find existing ones to borrow 
from) that handle bundle creation and the import process. 


      I'm looking for a little help designing an appropriate synchronization 
solution, and would appreciate any feedback you may have.  The combination of 
using git bundle and our non-traditional use case has made it difficult to find 
relevant resources. If there is anything I've missed, please point me in the 
right direction.




      -- 
      You received this message because you are subscribed to the Google Groups 
"Git for human beings" group.
      To unsubscribe from this group and stop receiving emails from it, send an 
email to git-users+...@googlegroups.com.
      For more options, visit https://groups.google.com/d/optout.

