Re: [RFC/PATCH] Supporting non-blob notes

2014-02-24 Thread Johan Herland
On Mon, Feb 24, 2014 at 11:27 AM,   wrote:
> Johan Herland  wrote on 02/24/2014 02:29:10:
>> I've been thinking about this for a while now, and I find myself
>> agreeing more and more with Junio's argument in the linked thread.
>>
>> I think notes are fundamentally - like file contents from Git's POV -
>> an unstructured stream of bytes. Any real structure in a git note is
>> imposed by the surrounding application/context, and having Git impose
>> its own object model onto the contents of notes would likely be an
>> unnecessary distraction.
>
> OTOH, it looks like a good idea to allow the surrounding application/context
> to benefit from existing infrastructure. I identified so far:
>
> (i) diffing/grepping trees
> (ii) efficiency of indexing through notes fanout

All of my proposed alternatives store some sort of reference to the
"real" data in a notes object; even when using a tree object directly
as a note, the notes tree itself only stores a SHA1 reference to the
tree object. As such, all alternatives (a) through (e) (even including
your RFC) benefit from indexing through the notes fanout, and I'm not
sure what is gained by attaching the "real" data more directly to the
notes. In all of (a) through (e), the lookup of a specific commit's
testrun logs always starts with a lookup of the notes associated
with a given commit. Once that is done, the remainder of the work is
about resolving that reference and retrieving the associated resource.
Whether that consists of loading an HTTP URL, fetching a remote Git
repo, or looking up a local tree object is ultimately an
implementation detail, and does not affect the indexing itself.
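
To make the indexing step concrete, here is a minimal Python sketch of
how a fanned-out notes tree maps an annotated object's name to a path:
leading pairs of hex digits become directory levels. The function name
note_path and the explicit fanout parameter are illustrative
assumptions; Git chooses the fanout depth itself.

```python
def note_path(object_name: str, fanout: int) -> str:
    """Return the notes-tree path for `object_name` at a given fanout depth.

    `fanout` is the number of two-hex-digit directory levels. It is passed
    in explicitly here for illustration only.
    """
    hex_digits = set("0123456789abcdef")
    if len(object_name) != 40 or not set(object_name) <= hex_digits:
        raise ValueError("expected a 40-character lowercase hex SHA-1")
    if not 0 <= fanout <= 19:
        raise ValueError("fanout must leave at least one path component")
    # Each fanout level consumes two hex digits as a directory name.
    parts = [object_name[2 * i:2 * i + 2] for i in range(fanout)]
    # Whatever is left of the name becomes the final path component.
    parts.append(object_name[2 * fanout:])
    return "/".join(parts)
```

The lookup is the same whatever the note's blob contains (a URL, a ref
name, a commit name), which is why all of (a) through (e) share it.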

> (iii) reachability
> (iv) content packing

These four criteria/requirements apply to your specific use case, but
they do not necessarily apply to _all_ use cases. I can easily imagine
a slightly different scenario: For example, a company setting with
highly-available internal servers, and where testrun logs are
primarily interesting to a small subset of users (e.g. most developers
only look at them very occasionally). Now assume there is already a
(third-party) system in place for archiving and indexing the testrun
logs (i.e. providing (i), (ii) and (iv)), and direct reachability
(iii) is not desired as including the testrun logs in the repo would
add nothing but bloat for most users. In this scenario, simply adding
a note with the appropriate URL to the third-party service would be a
sufficient and preferable solution.

>> In Yann's example, the testrun logs are probably best structured as a
>> hierarchy of files, but that does not necessarily mean that they MUST
>> be stored as a Git tree object (with accompanying sub-trees and
>> blobs). For example, one could imagine many different solutions for
>> storing the testrun logs:
>>
>> (a) Storing the logs statically on some server, and putting the
>> corresponding URL in a notes blob. Reachability is manual/on-demand
>> (by retrieving the URL).
>
> Would require redoing (ii) and (iv) in a way that does not impair (i).
>
>> (b) Storing the logs in a .tar.gz archive, and adding that archive as
>> a blob note. Reachability is implicit/automatic (by unpacking the
>> archive).
>
> Interferes with (i) and (iv), i.e. does not allow benefiting from similarity
> between the contents of (unpacked) notes.
>
>> (c) Storing the logs on some ref in an external repo, and putting the
>> repo URL + ref in a notes blob. Reachability is manual/on-demand (by
>> cloning/fetching the repo).
>> (d) Storing the logs on some ref/commit in the same repo, and putting
>> the ref/commit name in a notes blob. Reachability depends on the
>> application/user to sync the ref/commit along with the notes.
>
> Better than (a), but still does not address (ii).
> And indeed, my intent was to let the notes live in a separate "fork" repo,
> so ordinary users need not fetch the testrun contents systematically with the
> code.

Just to clarify, my alternatives (except for (e) below) were not
intended to satisfy the exact criteria for your use case, but only to
demonstrate that there exist a variety of solutions for a variety of
slightly different problems. When we consider adding significant
complexity to the notes code, we must justify that with real and
tangible benefits, not only for your exact use case, but preferably
also for a larger group of related use cases. So far I don't see how
allowing the direct use of tree objects as notes benefits more than
your specific use case...

>> (e) Storing the logs in a commit, putting the commit name in a blob
>> note, and then creating/rewriting the notes history to include the
>> commit in its ancestry. Reachability is automatic (i.e. follows the
>> notes), but the application must control/manipulate the notes history.
>
> And finally, that one does address all points in my case.
>
>> Whichever of these (or other) solutions is most appropriate depends on
>> the particular application/context, and (from Git's perspective), none
>

Re: [RFC/PATCH] Supporting non-blob notes

2014-02-24 Thread ydirson
Johan Herland  wrote on 02/24/2014 02:29:10: 
> On Wed, Feb 19, 2014 at 12:10 AM, Duy Nguyen  wrote: 
> > On Tue, Feb 18, 2014 at 9:46 PM, Johan Herland  wrote: 
> >> On Mon, Feb 17, 2014 at 11:48 AM,  wrote: 
> >>> The recent "git-note -C changes commit type?" thread 
> >>> ( http://thread.gmane.org/gmane.comp.version-control.git/241950 ) looks 
> >>> like a good occasion to discuss possible uses of non-blob notes. 
> >>> 
> >>> The use-case we're thinking about is the storage of testrun logs as 
> >>> notes (think: being able to justify that a given set of tests were 
> >>> successfully run on a given revision). 
> >> 
> >> I think this is a good use of notes, and organizing the testrun logs 
> >> into a tree of files seems like a natural way to proceed. 
> > 
> > Notes from the previous attempt to store trees as notes (something to 
> > watch out maybe, when you do it again) 
> > 
> > http://article.gmane.org/gmane.comp.version-control.git/197712 
> 
> Thanks for that link. It is good to see that these issues have been 
> considered/discussed previously. 

Yes, it sheds some useful light on the problem, thanks. 

> I've been thinking about this for a while now, and I find myself 
> agreeing more and more with Junio's argument in the linked thread. 
> 
> I think notes are fundamentally - like file contents from Git's POV - 
> an unstructured stream of bytes. Any real structure in a git note is 
> imposed by the surrounding application/context, and having Git impose 
> its own object model onto the contents of notes would likely be an 
> unnecessary distraction. 

OTOH, it looks like a good idea to allow the surrounding application/context 
to benefit from existing infrastructure. I identified so far: 
(i) diffing/grepping trees 
(ii) efficiency of indexing through notes fanout 
(iii) reachability 
(iv) content packing 

> In Yann's example, the testrun logs are probably best structured as a 
> hierarchy of files, but that does not necessarily mean that they MUST 
> be stored as a Git tree object (with accompanying sub-trees and 
> blobs). For example, one could imagine many different solutions for 
> storing the testrun logs: 
> 
> (a) Storing the logs statically on some server, and putting the 
> corresponding URL in a notes blob. Reachability is manual/on-demand 
> (by retrieving the URL).

Would require redoing (ii) and (iv) in a way that does not impair (i).

> (b) Storing the logs in a .tar.gz archive, and adding that archive as 
> a blob note. Reachability is implicit/automatic (by unpacking the 
> archive). 

Interferes with (i) and (iv), i.e. does not allow benefiting from similarity
between the contents of (unpacked) notes.

> (c) Storing the logs on some ref in an external repo, and putting the 
> repo URL + ref in a notes blob. Reachability is manual/on-demand (by 
> cloning/fetching the repo). 
> (d) Storing the logs on some ref/commit in the same repo, and putting 
> the ref/commit name in a notes blob. Reachability depends on the 
> application/user to sync the ref/commit along with the notes. 

Better than (a), but still does not address (ii). 
And indeed, my intent was to let the notes live in a separate "fork" repo, 
so ordinary users need not fetch the testrun contents systematically with the 
code. 

> (e) Storing the logs in a commit, putting the commit name in a blob 
> note, and then creating/rewriting the notes history to include the 
> commit in its ancestry. Reachability is automatic (i.e. follows the
> notes), but the application must control/manipulate the notes history. 

And finally, that one does address all points in my case. 

> Whichever of these (or other) solutions is most appropriate depends on 
> the particular application/context, and (from Git's perspective), none 
> of them is inherently superior to any of the others. Even the question
> of whether testrun logs should or should not be reachable by default
> depends on the surrounding application/context.

Wouldn't it make sense to mention these possibilities in the git-notes
manpage, to help people use the mechanism as intended?

> Now, the intention of Yann's RFC is to store the testrun logs directly 
> in a notes _tree_. This is not too different from alternative (e) 
> above, in that reachability is automatic. However, instead of having 
> the surrounding application manipulate the notes history to ensure 
> reachability, the RFC would rather teach Git's notes code to 
> accommodate the (likely rather special) case of having a note that is
> BOTH structured like (or at least easily mapped to) a Git tree object, 
> AND that should be automatically reachable. 

Incidentally, proposal (e) would allow the use of commits, although
doing so would probably cause problems, since the children of a commit
used as an annotation would not all have the same relationship to
their parent.

Are you suggesting a slightly different mechanism than the "parent"
relationship?

> Even though there is a certain elegance to 

Re: [RFC/PATCH] Supporting non-blob notes

2014-02-23 Thread Johan Herland
On Wed, Feb 19, 2014 at 12:10 AM, Duy Nguyen  wrote:
> On Tue, Feb 18, 2014 at 9:46 PM, Johan Herland  wrote:
>> On Mon, Feb 17, 2014 at 11:48 AM,   wrote:
>>> The recent "git-note -C changes commit type?" thread
>>> (http://thread.gmane.org/gmane.comp.version-control.git/241950) looks
>>> like a good occasion to discuss possible uses of non-blob notes.
>>>
>>> The use-case we're thinking about is the storage of testrun logs as
>>> notes (think: being able to justify that a given set of tests were
>>> successfully run on a given revision).
>>
>> I think this is a good use of notes, and organizing the testrun logs
>> into a tree of files seems like a natural way to proceed.
>
> Notes from the previous attempt to store trees as notes (something to
> watch out maybe, when you do it again)
>
> http://article.gmane.org/gmane.comp.version-control.git/197712

Thanks for that link. It is good to see that these issues have been
considered/discussed previously.

I've been thinking about this for a while now, and I find myself
agreeing more and more with Junio's argument in the linked thread.

I think notes are fundamentally - like file contents from Git's POV -
an unstructured stream of bytes. Any real structure in a git note is
imposed by the surrounding application/context, and having Git impose
its own object model onto the contents of notes would likely be an
unnecessary distraction.

In Yann's example, the testrun logs are probably best structured as a
hierarchy of files, but that does not necessarily mean that they MUST
be stored as a Git tree object (with accompanying sub-trees and
blobs). For example, one could imagine many different solutions for
storing the testrun logs:

 (a) Storing the logs statically on some server, and putting the
corresponding URL in a notes blob. Reachability is manual/on-demand
(by retrieving the URL).

 (b) Storing the logs in a .tar.gz archive, and adding that archive as
a blob note. Reachability is implicit/automatic (by unpacking the
archive).

 (c) Storing the logs on some ref in an external repo, and putting the
repo URL + ref in a notes blob. Reachability is manual/on-demand (by
cloning/fetching the repo).

 (d) Storing the logs on some ref/commit in the same repo, and putting
the ref/commit name in a notes blob. Reachability depends on the
application/user to sync the ref/commit along with the notes.

 (e) Storing the logs in a commit, putting the commit name in a blob
note, and then creating/rewriting the notes history to include the
commit in its ancestry. Reachability is automatic (i.e. follows the
notes), but the application must control/manipulate the notes history.

Whichever of these (or other) solutions is most appropriate depends on
the particular application/context, and (from Git's perspective), none
of them is inherently superior to any of the others. Even the question
of whether testrun logs should or should not be reachable by default
depends on the surrounding application/context.

Now, the intention of Yann's RFC is to store the testrun logs directly
in a notes _tree_. This is not too different from alternative (e)
above, in that reachability is automatic. However, instead of having
the surrounding application manipulate the notes history to ensure
reachability, the RFC would rather teach Git's notes code to
accommodate the (likely rather special) case of having a note that is
BOTH structured like (or at least easily mapped to) a Git tree object,
AND that should be automatically reachable.

Even though there is a certain elegance to storing such a tree object
directly as a notes object, there is AFAICS no other inherent
advantage (e.g. performance- or functionality-wise) to following that
approach. I'm not at all sure that it justifies increasing the
complexity of the notes code.

Furthermore, considering the RFC's original intention of also making
commit and tag objects directly usable as notes, and realizing the
fundamental difficulties in teaching Git to handle this (outlined in
my previous email in this thread), I must conclude that the simplicity
and flexibility of something like alternative (e) above far outweighs
the added code complexity to support allowing any object type to be
used as a note.

Maybe we should instead consider making it easier to do alternative
(e), by providing a command-line option for supplying additional
parents to a notes commit?


...Johan

[1]: The only "structure" in notes contents expected by Git is the
text format expected when showing notes with "git log", or when
editing/appending notes with your default text editor. However, these
are typically bypassed and/or customized by an external application
storing custom data in notes.

-- 
Johan Herland, 
www.herland.net
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH] Supporting non-blob notes

2014-02-18 Thread Duy Nguyen
On Tue, Feb 18, 2014 at 9:46 PM, Johan Herland  wrote:
> On Mon, Feb 17, 2014 at 11:48 AM,   wrote:
>> The recent "git-note -C changes commit type?" thread
>> (http://thread.gmane.org/gmane.comp.version-control.git/241950) looks
>> like a good occasion to discuss possible uses of non-blob notes.
>>
>> The use-case we're thinking about is the storage of testrun logs as
>> notes (think: being able to justify that a given set of tests were
>> successfully run on a given revision).
>
> I think this is a good use of notes, and organizing the testrun logs
> into a tree of files seems like a natural way to proceed.

Notes from the previous attempt to store trees as notes (something to
watch out maybe, when you do it again)

http://article.gmane.org/gmane.comp.version-control.git/197712
-- 
Duy


Re: [RFC/PATCH] Supporting non-blob notes

2014-02-18 Thread Johan Herland
On Mon, Feb 17, 2014 at 11:48 AM,   wrote:
> The recent "git-note -C changes commit type?" thread
> (http://thread.gmane.org/gmane.comp.version-control.git/241950) looks
> like a good occasion to discuss possible uses of non-blob notes.
>
> The use-case we're thinking about is the storage of testrun logs as
> notes (think: being able to justify that a given set of tests were
> successfully run on a given revision).

I think this is a good use of notes, and organizing the testrun logs
into a tree of files seems like a natural way to proceed.

> Here is a proof-of-concept patch (that applies to 1.8.4.2) I've been
> playing with.  Because of the -C behaviour described in this other
> thread, I opted for a new -o flag that would not mess with the object
> argument.  This patch is very minimalist, and just allows storing a
> tree note (currently any type of object, but that's easy to restrict
> if we want to), and retrieving it.

I think we must think _very_ carefully about which object types we
allow to be stored in notes trees.

As far as I can see, your use case (storing testrun logs) is covered
nicely by allowing tree objects as notes, and I think that's where we
should start. The note tree is itself a tree object, and storing
sub-trees of that is not new or unusual to Git at all. Reachability is
nicely covered by how Git already handles sub-trees. Obviously we must
flesh out how the notes-related parts of the code deal with trees (see
below), but that does not really affect the rest of Git, and should
therefore be relatively uncontroversial.

If we go on to _commit_ objects, they are currently only referenced
from tree objects as "gitlink"s (with the special mode 160000). If you
were to put one of these in a notes tree, you would get the same
semantics as a "gitlink", i.e. git handles that part of the tree as a
submodule where a different submodule repo is (to be) checked out. The
commit is NOT considered/required to be reachable, and would therefore
not be automatically communicated by a fetch or push.
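
As an illustration of that distinction, the Python sketch below
classifies lines of "git ls-tree" output by their mode. The helper
names are hypothetical; the only facts relied on are the
"<mode> <type> <object><TAB><path>" line format and the mode 160000
that marks a gitlink.

```python
GITLINK_MODE = "160000"  # mode Git uses for a commit entry in a tree


def parse_ls_tree_line(line):
    """Split one `git ls-tree` line into (mode, type, object name, path)."""
    # Metadata and path are separated by a TAB; paths may contain spaces.
    meta, path = line.rstrip("\n").split("\t", 1)
    mode, obj_type, obj_name = meta.split(" ")
    return mode, obj_type, obj_name, path


def is_gitlink(line):
    """True if the entry is a commit referenced as a gitlink (submodule)."""
    mode, obj_type, _, _ = parse_ls_tree_line(line)
    return mode == GITLINK_MODE and obj_type == "commit"


def followed_by_reachability(line):
    """Trees and blobs are followed by fetch/push; gitlinked commits are not."""
    return not is_gitlink(line)
```

A script inspecting a notes tree could use such a check to spot
entries whose target commit is not guaranteed to be present locally.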

So if you want commits in a notes tree to be handled differently from
commits-as-gitlinks, you would have to tweak all the code in Git that
deals with gitlinks. You would have to introduce a differentiation
between your "commits-as-gitlinks" and "commits-as-notes", either by
reserving another special mode number, or by otherwise making the rest
of Git notes-aware. All of this comes in addition to teaching the
notes-related code how to deal with commits (i.e. how to display them,
etc.).

In other words, before you embark on this, you need a convincing
argument for why allowing commits-as-notes is really necessary and
worth it in the end. Please also consider that you _can_ support
commits-as-notes by the mechanism I suggested in the previous thread:
Store the commit SHA1 in a note-as-blob, and then amend the notes
commit to include the commit SHA1 as an additional parent. It's not
very elegant, but it solves the reachability problem.
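
For what it's worth, this can already be scripted with existing
plumbing. The following Python sketch is only that, a sketch: the
function names and the commit message are made up for illustration,
and instead of amending the notes commit in place it stacks a new
notes commit on top that reuses the same tree and lists the log commit
as a second parent. It assumes the default notes ref and a repository
with a committer identity configured.

```python
import subprocess

NOTES_REF = "refs/notes/commits"  # Git's default notes ref


def git(*args, cwd=None):
    """Run a git command and return its stripped standard output."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def commit_tree_args(tree, parents, message):
    """Build the arguments for `git commit-tree` with any number of parents."""
    args = ["commit-tree", tree]
    for parent in parents:
        args += ["-p", parent]
    return args + ["-m", message]


def attach_log_commit(target, log_commit, cwd=None):
    """Note `target` with the name of `log_commit` and keep that commit reachable."""
    # 1. Store the commit name as an ordinary blob note on `target`.
    git("notes", "--ref", NOTES_REF, "add", "-f", "-m", log_commit, target, cwd=cwd)
    # 2. Re-commit the unchanged notes tree with the log commit as extra parent.
    tip = git("rev-parse", NOTES_REF, cwd=cwd)
    tree = git("rev-parse", tip + "^{tree}", cwd=cwd)
    message = "notes: keep testrun logs reachable"  # hypothetical message
    new_tip = git(*commit_tree_args(tree, [tip, log_commit], message), cwd=cwd)
    # 3. Move the notes ref, failing if it changed since step 2.
    git("update-ref", NOTES_REF, new_tip, tip, cwd=cwd)
    return new_tip
```

After this, fetching or pushing the notes ref brings the log commit
along, since it is now part of the notes history.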

If we go even further and want to allow ANY git object as a note, then
we must also consider tag objects, which AFAIK have never before been
stored inside a tree. Here we are really entering uncharted
territory...

So for now (and in lieu of a convincing use case for
notes-as-commits), I suggest you only look at notes-as-trees. The
first consequence of this is probably that your added -o/--object
option should be renamed. -t/--tree is not taken, AFAICS...

> Johan Herland wrote:
>> Obviously, it would not make sense to use refs/notes/history while
>> displaying the commit log ("git log --notes=history"), as the raw
>> commit object would be shown in the log.
>
> Currently, a non-blob note is just not displayed at all.  And rather
> than displaying the raw object, we have a number of options available,
> starting with object's sha1, to more elaborate presentations depending
> on the type of object (commit info, tree hierarchy, etc, as "git notes
> show" already does).  This PoC shows that it can be dealt with later.

I'm only considering the notes-as-tree case here...

I assume that if you organize your notes in tree objects, then you
probably have more information in there than is useful to display in
the textual output from "git log". Also, you probably have
special-purpose scripts for initially generating those trees, and
later digging into the information stored therein. Hence we should
concentrate on getting the basics covered, to allow those scripts to
do their thing, and adding bells and whistles to "git log" for
displaying notes-as-trees is much less important. For now, "git log"
should probably show a short summary when encountering a
notes-as-tree. Whether that summary consists of merely the tree SHA1,
or in providing a (relatively short) tree listing, I leave up to you.
I also agree that this can be dealt with later (as long as the default
behaviour is not actively harmful/confusing).

> What I envision, would be viewers like gitk simply show the
> hyperlinked sha1, a