Re: [RFC/PATCH] Supporting non-blob notes

2014-02-24 Thread ydirson
Johan Herland jo...@herland.net wrote on 02/24/2014 02:29:10: 
 On Wed, Feb 19, 2014 at 12:10 AM, Duy Nguyen pclo...@gmail.com wrote: 
  On Tue, Feb 18, 2014 at 9:46 PM, Johan Herland jo...@herland.net wrote: 
  On Mon, Feb 17, 2014 at 11:48 AM, yann.dir...@bertin.fr wrote: 
  The recent "git-note -C changes commit type?" thread
  ( http://thread.gmane.org/gmane.comp.version-control.git/241950 ) looks 
  like a good occasion to discuss possible uses of non-blob notes. 
  
  The use-case we're thinking about is the storage of testrun logs as 
  notes (think: being able to justify that a given set of tests were 
  successfully run on a given revision). 
  
  I think this is a good use of notes, and organizing the testrun logs 
  into a tree of files seems like a natural way to proceed. 
  
  Notes from the previous attempt to store trees as notes (something to 
  watch out for, maybe, when you do it again)
  
  http://article.gmane.org/gmane.comp.version-control.git/197712 
 
 Thanks for that link. It is good to see that these issues have been 
 considered/discussed previously. 

Yes, it sheds some useful light on the problem, thanks. 

 I've been thinking about this for a while now, and I find myself 
 agreeing more and more with Junio's argument in the linked thread. 
 
 I think notes are fundamentally - like file contents from Git's POV - 
 an unstructured stream of bytes. Any real structure in a git note is 
 imposed by the surrounding application/context, and having Git impose 
 its own object model onto the contents of notes would likely be an 
 unnecessary distraction. 

OTOH, it looks like a good idea to allow the surrounding application/context 
to benefit from existing infrastructure. So far I have identified:
(i) diffing/grepping trees 
(ii) efficiency of indexing through notes fanout 
(iii) reachability 
(iv) content packing 

 In Yann's example, the testrun logs are probably best structured as a 
 hierarchy of files, but that does not necessarily mean that they MUST 
 be stored as a Git tree object (with accompanying sub-trees and 
 blobs). For example, one could imagine many different solutions for 
 storing the testrun logs: 
 
 (a) Storing the logs statically on some server, and putting the 
 corresponding URL in a notes blob. Reachability is manual/on-demand 
 (by retrieving the URL).

Would require redoing (ii) and (iv) in a way that does not impair (i).

 (b) Storing the logs in a .tar.gz archive, and adding that archive as 
 a blob note. Reachability is implicit/automatic (by unpacking the 
 archive). 

Interferes with (i) and (iv), i.e. it does not allow benefiting from the
similarity between the contents of (unpacked) notes.

 (c) Storing the logs on some ref in an external repo, and putting the 
 repo URL + ref in a notes blob. Reachability is manual/on-demand (by 
 cloning/fetching the repo). 
 (d) Storing the logs on some ref/commit in the same repo, and putting 
 the ref/commit name in a notes blob. Reachability depends on the 
 application/user to sync the ref/commit along with the notes. 

Better than (a), but still does not address (ii). 
And indeed, my intent was to let the notes live in a separate fork repo, 
so ordinary users need not fetch the testrun contents systematically with the 
code. 

 (e) Storing the logs in a commit, putting the commit name in a blob 
 note, and then creating/rewriting the notes history to include the 
 commit in its ancestry. Reachability is automatic (i.e. follows the
 notes), but the application must control/manipulate the notes history. 

And finally, that one does address all points in my case. 
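
To illustrate why (b) interferes with (iv), here is a rough sketch. It is not Git's packing code: zlib stands in for pack compression, the log lines are fabricated, and the helper names are hypothetical. It compares how two nearly identical testrun logs compress together when stored raw and when each one was compressed beforehand, as a .tar.gz blob would be.

```python
import hashlib
import zlib


def make_log(failing_test=None, count=200):
    """Build a fabricated testrun log, one line per test."""
    lines = []
    for i in range(count):
        # A deterministic pseudo-random token stands in for timings etc.
        token = hashlib.sha1(str(i).encode()).hexdigest()[:8]
        status = "FAIL" if i == failing_test else "ok"
        lines.append(f"test_{i:03d} {token} {status}\n")
    return "".join(lines).encode()


def stored_size(blobs):
    """Size of all blobs compressed as one stream (a crude proxy for a pack)."""
    return len(zlib.compress(b"".join(blobs)))


run1 = make_log()
run2 = make_log(failing_test=0)  # same run, except the first test fails

# Raw logs: the second log is almost entirely redundant with the first.
raw_size = stored_size([run1, run2])

# Pre-compressed logs, as in alternative (b): the redundancy is hidden.
archived_size = stored_size([zlib.compress(run1), zlib.compress(run2)])

print(raw_size, archived_size)
```

Under these assumptions the raw pair should come out noticeably smaller, because the compressor can still see that the second log repeats the first.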

 Whichever of these (or other) solutions is most appropriate depends on 
 the particular application/context, and (from Git's perspective), none 
 of them are inherently superior to any of the others. Even the question
 of whether testrun logs should or should not be reachable by default, 
 depends on the surrounding application/context. 

Wouldn't it make sense to mention these possibilities in the git-notes 
manpage, to help people use the mechanism as intended?

 Now, the intention of Yann's RFC is to store the testrun logs directly 
 in a notes _tree_. This is not too different from alternative (e) 
 above, in that reachability is automatic. However, instead of having 
 the surrounding application manipulate the notes history to ensure 
 reachability, the RFC would rather teach Git's notes code to 
 accommodate the (likely rather special) case of having a note that is
 BOTH structured like (or at least easily mapped to) a Git tree object, 
 AND that should be automatically reachable. 

Incidentally, proposal (e) would allow the use of commits, although
doing so would probably cause problems, since not all of the children of
the commit used as an annotation would have the same relationship to
their parent.

Are you suggesting using a slightly different mechanism than
the parent relationship?

 Even though there is a certain elegance to storing such a tree object 
 

Re: [RFC/PATCH] Supporting non-blob notes

2014-02-24 Thread Johan Herland
On Mon, Feb 24, 2014 at 11:27 AM,  ydir...@free.fr wrote:
 Johan Herland jo...@herland.net wrote on 02/24/2014 02:29:10:
 I've been thinking about this for a while now, and I find myself
 agreeing more and more with Junio's argument in the linked thread.

 I think notes are fundamentally - like file contents from Git's POV -
 an unstructured stream of bytes. Any real structure in a git note is
 imposed by the surrounding application/context, and having Git impose
 its own object model onto the contents of notes would likely be an
 unnecessary distraction.

 OTOH, it looks like a good idea to allow the surrounding application/context
 to benefit from existing infrastructure. So far I have identified:

 (i) diffing/grepping trees
 (ii) efficiency of indexing through notes fanout

All of my proposed alternatives store some sort of reference to the
real data in a notes object; even when using a tree object directly
as a note, the notes tree itself only stores a SHA1 reference to the
tree object. As such, all alternatives (a) through (e) (even including
your RFC) benefit from indexing through the notes fanout, and I'm not
sure what is gained by attaching the real data more directly to the
notes. In all of (a) through (e), the lookup of a specific commit's
testrun logs always starts with a lookup of the notes associated
with a given commit. Once that is done, the remainder of the work is
about resolving that reference and retrieving the associated resource.
Whether that consists of loading an HTTP URL, fetching a remote Git
repo, or looking up a local tree object is ultimately an
implementation detail, and does not affect the indexing itself.
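
As a rough illustration of that first lookup step, here is a simplified model of how a notes tree with fanout maps an annotated object's SHA1 to a path. Nested dicts stand in for tree objects, and the function names and sample SHA1 are made up for the example; Git's real notes code picks the fanout level itself.

```python
def note_path(object_sha1, fanout):
    """Return the path of an object's note inside a notes tree.

    `fanout` is the number of leading two-hex-digit directory levels.
    """
    if len(object_sha1) != 40:
        raise ValueError("expected a 40-character hex SHA-1")
    parts = [object_sha1[2 * i:2 * i + 2] for i in range(fanout)]
    parts.append(object_sha1[2 * fanout:])
    return "/".join(parts)


def lookup_note(notes_tree, object_sha1, fanout):
    """Walk nested dicts (standing in for trees) down to the note entry."""
    node = notes_tree
    for part in note_path(object_sha1, fanout).split("/"):
        node = node[part]
    return node  # the SHA-1 of whatever object is attached as the note


commit = "1234567890abcdef1234567890abcdef12345678"
notes_tree = {
    "12": {"34567890abcdef1234567890abcdef12345678": "<sha1 of note object>"},
}
print(note_path(commit, 1))
print(lookup_note(notes_tree, commit, 1))
```

The lookup returns only a SHA1, whatever the type of the object behind it, which is why the indexing step is the same for all the alternatives.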

 (iii) reachability
 (iv) content packing

These four criteria/requirements apply to your specific use case, but
they do not necessarily apply to _all_ use cases. I can easily imagine
a slightly different scenario: For example, a company setting with
highly-available internal servers, and where testrun logs are
primarily interesting to a small subset of users (e.g. most developers
only look at them very occasionally). Now assume there is already a
(third-party) system in place for archiving and indexing the testrun
logs (i.e. providing (i), (ii) and (iv)), and direct reachability
(iii) is not desired as including the testrun logs in the repo would
add nothing but bloat for most users. In this scenario, simply adding
a note with the appropriate URL to the third-party service would be a
sufficient and preferable solution.

 In Yann's example, the testrun logs are probably best structured as a
 hierarchy of files, but that does not necessarily mean that they MUST
 be stored as a Git tree object (with accompanying sub-trees and
 blobs). For example, one could imagine many different solutions for
 storing the testrun logs:

 (a) Storing the logs statically on some server, and putting the
 corresponding URL in a notes blob. Reachability is manual/on-demand
 (by retrieving the URL).

 Would require redoing (ii) and (iv) in a way that does not impair (i).

 (b) Storing the logs in a .tar.gz archive, and adding that archive as
 a blob note. Reachability is implicit/automatic (by unpacking the
 archive).

 Interferes with (i) and (iv), i.e. it does not allow benefiting from the
 similarity between the contents of (unpacked) notes.

 (c) Storing the logs on some ref in an external repo, and putting the
 repo URL + ref in a notes blob. Reachability is manual/on-demand (by
 cloning/fetching the repo).
 (d) Storing the logs on some ref/commit in the same repo, and putting
 the ref/commit name in a notes blob. Reachability depends on the
 application/user to sync the ref/commit along with the notes.

 Better than (a), but still does not address (ii).
 And indeed, my intent was to let the notes live in a separate fork repo,
 so ordinary users need not fetch the testrun contents systematically with the
 code.

Just to clarify, my alternatives (except for (e) below) were not
intended to satisfy the exact criteria for your use case, but only to
demonstrate that there exist a variety of solutions for a variety of
slightly different problems. When we consider adding significant
complexity to the notes code, we must justify that with real and
tangible benefits, not only for your exact use case, but preferably
also for a larger group of related use cases. So far I don't see how
allowing the direct use of tree objects as notes benefits more than
your specific use case...

 (e) Storing the logs in a commit, putting the commit name in a blob
 note, and then creating/rewriting the notes history to include the
 commit in its ancestry. Reachability is automatic (i.e. follows the
 notes), but the application must control/manipulate the notes history.

 And finally, that one does address all points in my case.

 Whichever of these (or other) solutions is most appropriate depends on
 the particular application/context, and (from Git's perspective), none
 of them are inherently superior to any of the others. Even the

Re: [RFC/PATCH] Supporting non-blob notes

2014-02-23 Thread Johan Herland
On Wed, Feb 19, 2014 at 12:10 AM, Duy Nguyen pclo...@gmail.com wrote:
 On Tue, Feb 18, 2014 at 9:46 PM, Johan Herland jo...@herland.net wrote:
 On Mon, Feb 17, 2014 at 11:48 AM,  yann.dir...@bertin.fr wrote:
 The recent "git-note -C changes commit type?" thread
 (http://thread.gmane.org/gmane.comp.version-control.git/241950) looks
 like a good occasion to discuss possible uses of non-blob notes.

 The use-case we're thinking about is the storage of testrun logs as
 notes (think: being able to justify that a given set of tests were
 successfully run on a given revision).

 I think this is a good use of notes, and organizing the testrun logs
 into a tree of files seems like a natural way to proceed.

 Notes from the previous attempt to store trees as notes (something to
 watch out for, maybe, when you do it again)

 http://article.gmane.org/gmane.comp.version-control.git/197712

Thanks for that link. It is good to see that these issues have been
considered/discussed previously.

I've been thinking about this for a while now, and I find myself
agreeing more and more with Junio's argument in the linked thread.

I think notes are fundamentally - like file contents from Git's POV -
an unstructured stream of bytes [1]. Any real structure in a git note is
imposed by the surrounding application/context, and having Git impose
its own object model onto the contents of notes would likely be an
unnecessary distraction.

In Yann's example, the testrun logs are probably best structured as a
hierarchy of files, but that does not necessarily mean that they MUST
be stored as a Git tree object (with accompanying sub-trees and
blobs). For example, one could imagine many different solutions for
storing the testrun logs:

 (a) Storing the logs statically on some server, and putting the
corresponding URL in a notes blob. Reachability is manual/on-demand
(by retrieving the URL).

 (b) Storing the logs in a .tar.gz archive, and adding that archive as
a blob note. Reachability is implicit/automatic (by unpacking the
archive).

 (c) Storing the logs on some ref in an external repo, and putting the
repo URL + ref in a notes blob. Reachability is manual/on-demand (by
cloning/fetching the repo).

 (d) Storing the logs on some ref/commit in the same repo, and putting
the ref/commit name in a notes blob. Reachability depends on the
application/user to sync the ref/commit along with the notes.

 (e) Storing the logs in a commit, putting the commit name in a blob
note, and then creating/rewriting the notes history to include the
commit in its ancestry. Reachability is automatic (i.e. follows the
notes), but the application must control/manipulate the notes history.

Whichever of these (or other) solutions is most appropriate depends on
the particular application/context, and (from Git's perspective), none
of them are inherently superior to any of the others. Even the question
of whether testrun logs should or should not be reachable by default,
depends on the surrounding application/context.

Now, the intention of Yann's RFC is to store the testrun logs directly
in a notes _tree_. This is not too different from alternative (e)
above, in that reachability is automatic. However, instead of having
the surrounding application manipulate the notes history to ensure
reachability, the RFC would rather teach Git's notes code to
accommodate the (likely rather special) case of having a note that is
BOTH structured like (or at least easily mapped to) a Git tree object,
AND that should be automatically reachable.

Even though there is a certain elegance to storing such a tree object
directly as a notes object, there is AFAICS no other inherent
advantage (e.g. performance- or functionality-wise) to following that
approach. I'm not at all sure that it justifies increasing the
complexity of the notes code.

Furthermore, considering the RFC's original intention of also making
commit and tag objects directly usable as notes, and realizing the
fundamental difficulties in teaching Git to handle this (outlined in
my previous email in this thread), I must conclude that the simplicity
and flexibility of something like alternative (e) above far outweighs
the added code complexity to support allowing any object type to be
used as a note.

Maybe we should instead consider making it easier to do alternative
(e), by providing a command-line option for supplying additional
parents to a notes commit?
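
To make the effect of such an additional parent concrete, here is a toy model of alternative (e). It is only a sketch: a dict of parent lists stands in for the commit graph, the commit names are invented, and no real Git command or option is implied.

```python
def reachable(commits, start):
    """Return the set of commit names reachable from `start` via parents."""
    seen = set()
    stack = [start]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(commits[name])
    return seen


# Hypothetical history: commit name -> list of parent names.
commits = {
    "notes1": [],          # earlier state of the notes ref
    "logs": [],            # commit holding the testrun logs
    "notes2": ["notes1"],  # new notes commit; a blob note names "logs"
}

# Naming "logs" inside a blob note does not make it reachable.
before = reachable(commits, "notes2")

# Alternative (e): record "logs" as an additional parent of the notes commit.
commits["notes2"] = ["notes1", "logs"]
after = reachable(commits, "notes2")

print("logs" in before, "logs" in after)  # prints: False True
```

Anything that walks the notes history, such as fetch or push, would then pick up the logs commit without knowing what a note is.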


...Johan

[1]: The only structure in notes contents expected by Git is the
text format expected when showing notes with git log, or when
editing/appending notes with your default text editor. However, these
are typically bypassed and/or customized by an external application
storing custom data in notes.

-- 
Johan Herland, jo...@herland.net
www.herland.net
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH] Supporting non-blob notes

2014-02-18 Thread Johan Herland
On Mon, Feb 17, 2014 at 11:48 AM,  yann.dir...@bertin.fr wrote:
 The recent "git-note -C changes commit type?" thread
 (http://thread.gmane.org/gmane.comp.version-control.git/241950) looks
 like a good occasion to discuss possible uses of non-blob notes.

 The use-case we're thinking about is the storage of testrun logs as
 notes (think: being able to justify that a given set of tests were
 successfully run on a given revision).

I think this is a good use of notes, and organizing the testrun logs
into a tree of files seems like a natural way to proceed.

 Here is a proof-of-concept patch (that applies to 1.8.4.2) I've been
 playing with.  Because of the -C behaviour described in this other
 thread, I opted for a new -o flag that would not mess with the object
 argument.  This patch is very minimalist, and just allows storing a
 tree note (currently any type of object, but that's easy to restrict
 if we want to), and retrieving it.

I think we must think _very_ carefully about which object types we
allow to be stored in notes trees.

As far as I can see, your use case (storing testrun logs) is covered
nicely by allowing tree objects as notes, and I think that's where we
should start. The note tree is itself a tree object, and storing
sub-trees of that is not new or unusual to Git at all. Reachability is
nicely covered by how Git already handles sub-trees. Obviously we must
flesh out how the notes-related parts of the code deal with trees (see
below), but that does not really affect the rest of Git, and should
therefore be relatively uncontroversial.

If we go on to _commit_ objects, they are currently only referenced
from tree objects as gitlinks (with a special 160000 mode). If you
were to put one of these in a notes tree, you would get the same
semantics as a gitlink, i.e. git handles that part of the tree as a
submodule where a different submodule repo is (to be) checked out. The
commit is NOT considered/required to be reachable, and would therefore
not be automatically communicated by a fetch or push.
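
As a small sketch of that distinction: the table below lists the entry modes found in Git tree objects (gitlinks use 160000 octal), and the helper mimics which entries a reachability walk would follow. The helper and the entry names are hypothetical and only model the behaviour described above.

```python
# Entry modes as they appear (in octal) in Git tree objects.
MODE_KINDS = {
    0o040000: "tree",
    0o100644: "blob",
    0o100755: "blob (executable)",
    0o120000: "symlink",
    0o160000: "gitlink",
}


def followed_for_reachability(entries):
    """Return names of entries whose target objects must be present.

    `entries` is a list of (mode, name) pairs. Gitlink targets are
    expected to live in another repository, so they are skipped.
    """
    return [name for mode, name in entries if MODE_KINDS[mode] != "gitlink"]


# Hypothetical notes tree holding one note of each kind under discussion.
entries = [
    (0o100644, "note-as-blob"),
    (0o040000, "note-as-tree"),
    (0o160000, "note-as-commit"),
]
print(followed_for_reachability(entries))
```

In this model the commit entry is the only one left out, which is the gap a commit-as-note would have to close.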

So if you want commits in a notes tree to be handled differently from
commits-as-gitlinks, you would have to tweak all the code in Git that
deals with gitlinks. You would have to introduce a differentiation
between your commits-as-gitlinks and commits-as-notes, either by
reserving another special mode number, or by otherwise making the rest
of Git notes-aware. All of this comes in addition to teaching the
notes-related code how to deal with commits (i.e. how to display them,
etc.).

In other words, before you embark on this, you need a convincing
argument for why allowing commits-as-notes is really necessary and
worth it in the end. Please also consider that you _can_ support
commits-as-notes by the mechanism I suggested in the previous thread:
Store the commit SHA1 in a note-as-blob, and then amend the notes
commit to include the commit SHA1 as an additional parent. It's not
very elegant, but it solves the reachability problem.

If we go even further and want to allow ANY git object as a note, then
we must also consider tag objects, which AFAIK have never before been
stored inside a tree. Here we are really entering uncharted
territory...

So for now (and in lieu of a convincing use case for
notes-as-commits), I suggest you only look at notes-as-trees. The
first consequence of this is probably that your added -o/--object
option should be renamed. -t/--tree is not taken, AFAICS...

 Johan Herland wrote:
 Obviously, it would not make sense to use refs/notes/history while
 displaying the commit log (git log --notes=history), as the raw
 commit object would be shown in the log.

 Currently, a non-blob note is just not displayed at all.  And rather
 than displaying the raw object, we have a number of options available,
 starting with the object's sha1, to more elaborate presentations depending
 on the type of object (commit info, tree hierarchy, etc, as git notes
 show already does).  This PoC shows that it can be dealt with later.

I'm only considering the notes-as-tree case here...

I assume that if you organize your notes in tree objects, then you
probably have more information in there than is useful to display in
the textual output from git log. Also, you probably have
special-purpose scripts for initially generating those trees, and
later digging into the information stored therein. Hence we should
concentrate on getting the basics covered, to allow those scripts to
do their thing, and adding bells and whistles to git log for
displaying notes-as-trees is much less important. For now, git log
should probably show a short summary when encountering a
notes-as-tree. Whether that summary consists of merely the tree SHA1,
or of a (relatively short) tree listing, I leave up to you.
I also agree that this can be dealt with later (as long as the default
behaviour is not actively harmful/confusing).
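
For the sake of discussion, such a short summary might look something like the sketch below. The function name, the output format and the cut-off of three entries are all invented here; this is not what git log does today.

```python
def summarize_tree_note(tree_sha1, names, max_entries=3):
    """Render a one-line summary of a tree note for log-style output."""
    shown = names[:max_entries]
    hidden = len(names) - len(shown)
    listing = ", ".join(shown)
    if hidden > 0:
        listing += f", ... ({hidden} more)"
    # Abbreviate the SHA-1 to 7 characters, as short object names often are.
    return f"tree {tree_sha1[:7]}: {listing}"


summary = summarize_tree_note(
    "0123456789abcdef0123456789abcdef01234567",
    ["a.log", "b.log", "c.log", "d.log"],
)
print(summary)  # prints: tree 0123456: a.log, b.log, c.log, ... (1 more)
```

Scripts that need the full contents would still read the tree object directly.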

 What I envision is that viewers like gitk would simply show the
 hyperlinked sha1, and (in the case of a tree 

Re: [RFC/PATCH] Supporting non-blob notes

2014-02-18 Thread Duy Nguyen
On Tue, Feb 18, 2014 at 9:46 PM, Johan Herland jo...@herland.net wrote:
 On Mon, Feb 17, 2014 at 11:48 AM,  yann.dir...@bertin.fr wrote:
 The recent "git-note -C changes commit type?" thread
 (http://thread.gmane.org/gmane.comp.version-control.git/241950) looks
 like a good occasion to discuss possible uses of non-blob notes.

 The use-case we're thinking about is the storage of testrun logs as
 notes (think: being able to justify that a given set of tests were
 successfully run on a given revision).

 I think this is a good use of notes, and organizing the testrun logs
 into a tree of files seems like a natural way to proceed.

Notes from the previous attempt to store trees as notes (something to
watch out for, maybe, when you do it again)

http://article.gmane.org/gmane.comp.version-control.git/197712
-- 
Duy