Re: [PATCH/RFC 0/7] Add possibility to clone specific subdirectories

2016-07-29 Thread Duy Nguyen
On Thu, Jul 28, 2016 at 10:33 PM, Junio C Hamano wrote:
> Duy Nguyen writes:
>
>>> 4. Fsck complains about missing blobs. Should be fairly easy to fix.
>>
>> Not really. You'll have to associate path information with blobs
>> before you decide that a blob should exist or not.
>
> Also the same blob or the tree can exist both inside and outside the
> narrowed area, as people reorganize their trees all the time.  I am
> not quite convinced a path-based approach (either yours or Robin's)
> is workable in the longer term.

I think it should be ok. What I meant was that when we traverse the trees
to find connected blobs, if a tree points to paths outside the narrowed
area, we do not add those blobs to our fsck list. Trees inside the
narrowed area work as usual, so those shared blobs are added to the fsck
list anyway. Object islands are going to be a problem (because we can't
assign paths to them)...
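
The walk described above can be sketched roughly as follows. This is a
toy Python model, not git code: the object ids, the in-memory TREES
store and the function names are all made up for illustration, and the
narrowed area is assumed to be a single path prefix.

```python
# Toy object store: each tree id maps names to (kind, object id).
# All ids and names are hypothetical.
TREES = {
    "root": {"sub": ("tree", "t_sub"), "other": ("tree", "t_other")},
    "t_sub": {"dir": ("tree", "t_dir")},
    "t_dir": {"a.txt": ("blob", "b_shared"), "b.txt": ("blob", "b_inside")},
    "t_other": {"copy.txt": ("blob", "b_shared"), "big.bin": ("blob", "b_outside")},
}


def inside(path, prefix):
    """True if path is the prefix itself or lies below it."""
    want = prefix.split("/")
    return path.split("/")[:len(want)] == want


def leads_to(path, prefix):
    """True if path is a directory on the way down to prefix."""
    parts = path.split("/")
    return prefix.split("/")[:len(parts)] == parts


def expected_blobs(tree_id, prefix, path=""):
    """Collect the blobs a check should expect for a clone narrowed to prefix."""
    found = set()
    for name, (kind, oid) in TREES[tree_id].items():
        child = f"{path}/{name}" if path else name
        if kind == "tree":
            # Descend only into trees inside the area or leading to it.
            if inside(child, prefix) or leads_to(child, prefix):
                found |= expected_blobs(oid, prefix, child)
        elif inside(child, prefix):
            found.add(oid)
    return found
```

With the prefix "sub/dir", the blob that also appears under "other" is
still expected, because it is reached through a path inside the narrowed
area, while "b_outside" is never added. Objects that are not reachable
from any tree have no path in this model, which is the object-island
problem mentioned above.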
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC 0/7] Add possibility to clone specific subdirectories

2016-07-28 Thread Junio C Hamano
Duy Nguyen writes:

>> 4. Fsck complains about missing blobs. Should be fairly easy to fix.
>
> Not really. You'll have to associate path information with blobs
> before you decide that a blob should exist or not.

Also the same blob or the tree can exist both inside and outside the
narrowed area, as people reorganize their trees all the time.  I am
not quite convinced a path-based approach (either yours or Robin's)
is workable in the longer term.


Re: [PATCH/RFC 0/7] Add possibility to clone specific subdirectories

2016-07-28 Thread Duy Nguyen
Corrections..

On Thu, Jul 28, 2016 at 6:59 PM, Duy Nguyen wrote:
> Ah.. this is what I call narrow checkout [1] (but gmane is down at the moment)

s/checkout/clone/

> [2] https://github.com/pclouds/git/commits/lanh/narrow-checkout

s,lanh/,,
-- 
Duy


Re: [PATCH/RFC 0/7] Add possibility to clone specific subdirectories

2016-07-28 Thread Duy Nguyen
On Thu, Jul 28, 2016 at 6:02 PM, Robin Ruede wrote:
> This patch series adds a `--sparse-prefix=` option to multiple commands,
> allowing fetching repository contents from only a subdirectory of a remote.
>
> This works along with sparse-checkout, and is especially useful for repositories
> where a subdirectory has meaning when standing alone.

Ah.. this is what I call narrow checkout [1] (but gmane is down at the moment)

[1] http://thread.gmane.org/gmane.comp.version-control.git/155427

> * Motivation (example use cases)
>
> ...

nods nods.. all good stuff

> * Open problems:
>
> 1. Currently all trees are still included. It would be possible to
> include only the trees relevant to the sparse files, which would significantly
> reduce the pack sizes for repositories containing a lot of small files changing
> often. For example package managers using git. Not sure in how many places all
> trees are presumed present.

You can limit some trees by passing a pathspec to "git rev-list" (in
your "list-objects" patch). All trees completely outside sub/dir will
be excluded. Trees leading to it (e.g. the root tree and "sub") are still
included. Not having all trees opens up a new set of problems. This is
what I did in narrow clone: pass some directories (as pathspecs) to
rev-list on the server side, then deal with the lack of trees on the
client side.
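
As a rough illustration of the inclusion rule just described (a Python
sketch only; it models the rule as stated here, not git's actual
pathspec matching, and tree_needed is a hypothetical helper):

```python
def tree_needed(tree_path, prefix):
    """Keep a tree if it lies inside prefix or on the path leading to it."""
    want = prefix.split("/")
    parts = tree_path.split("/") if tree_path else []  # "" stands for the root tree
    is_inside = parts[:len(want)] == want
    is_leading = want[:len(parts)] == parts
    return is_inside or is_leading


# Hypothetical tree paths in some repository.
trees = ["", "sub", "sub/dir", "sub/dir/deep", "sub/else", "other"]
kept = [t for t in trees if tree_needed(t, "sub/dir")]
```

Here the root tree, "sub", "sub/dir" and everything below it are kept,
while "sub/else" and "other" are dropped, so the client ends up with
tree entries that point at trees it does not have.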

> 2. This patch set implements it as a simple single prefix check command line
> option.
> Using the exclude_list format (same as in sparse-checkout) might be useful.
> The server needs to check these patterns for all files in history, so I'm not
> sure if allowing multiple/complex patterns is a good idea.

I would go with something other than sparse-checkout, which I call
narrow checkout: instead of flattening the entire tree in the index and
keeping only files there, we keep trees that we don't have as trees.
Those trees get the same "sparse checkout" attributes (e.g. ignore
worktree) and some of the submodule ones (e.g. don't bother checking the
associated hash). This approach [2] eliminates changes in cache-tree.c
(i.e. 3/7).

And you would need something like that, when you don't have all the
trees (from open problem 1), because you just can't flatten trees when
you don't have them.
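
A minimal Python sketch of that idea, under loud assumptions: the Entry
fields, the LOCAL_TREES store and build_index are invented for
illustration and do not reflect the real index format or the code on
the branch in [2].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    path: str
    oid: str
    is_tree: bool = False        # kept as an unexpanded tree
    skip_worktree: bool = False  # not checked out, like a sparse entry


# Trees available locally; "t_other" is deliberately missing.
LOCAL_TREES = {
    "root": {"sub": ("tree", "t_sub"), "other": ("tree", "t_other")},
    "t_sub": {"a.txt": ("blob", "b_a")},
}


def build_index(tree_id, path=""):
    """Flatten the trees we have; keep missing trees as single tree entries."""
    entries = []
    for name, (kind, oid) in sorted(LOCAL_TREES[tree_id].items()):
        child = f"{path}/{name}" if path else name
        if kind == "blob":
            entries.append(Entry(child, oid))
        elif oid in LOCAL_TREES:
            entries.extend(build_index(oid, child))
        else:
            # Cannot flatten a tree we do not have: record the tree itself.
            entries.append(Entry(child, oid, is_tree=True, skip_worktree=True))
    return entries
```

Calling build_index("root") yields a tree entry for "other" and a file
entry for "sub/a.txt": the missing tree stays a single entry whose hash
is recorded but never expanded or checked out.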

[2] https://github.com/pclouds/git/commits/lanh/narrow-checkout (I
think core functionality is in place, but narrow operation still needs
more work)

> 3. This patch set assumes the sparse-prefix and sparse-checkout does not change.
> running clone and fetch both need to have the --sparse-prefix= option, otherwise
> complete packs will be fetched. Not sure what the best way to store the
> information is, possibly create a new file `.git/sparse` similar to
> `.git/shallow` containing the path(s).

Something like .git/shallow, yes. It's similar in nature anyway
(shallow cuts depth, you cut the side).

> 3. Bitmap indices cannot be used, because they do not contain the paths of the
> objects. So for creating packs, the whole DAG has to be walked.

And shallow clones have this same problem. Something to be sorted out :)

> 4. Fsck complains about missing blobs. Should be fairly easy to fix.

Not really. You'll have to associate path information with blobs
before you decide that a blob should exist or not. Sparse patterns are
just not designed for that (tree walking). If you narrow (heh) down to
just a path prefix, not full-blown sparse patterns, then it's feasible to
walk the trees and filter. A subset of pathspec would be good because we
can already filter by pathspec, but I would not go full pathspec at
the first step.

> 5. Tests and documentation is missing.

Personally I would go with my narrow clone approach, but the ability
to selectively exclude some large blobs is still good, I think.
However, another approach to excluding some blobs is the external
object database [3]. It gives you what you need with a lot less code
impact (but you will not be able to work offline 100% of the time, as
you can now with git).

[3] https://public-inbox.org/git/20160613085546.11784-1-chriscool%40tuxfamily.org/
-- 
Duy