Re: [PATCH v4 2/5] dir_iterator: iterate over dir after its contents

2017-04-05 Thread Duy Nguyen
On Sun, Apr 2, 2017 at 11:25 AM, Daniel Ferreira (theiostream)
 wrote:
> Why exactly would it not be applicable to read_directory_recursively()?

Because that function is a beast (and probably should have "beast" in
the function name).

The function is supposed to read .gitignore, index file and whatever
else needed, traverse the directory and return the list of untracked
files, or ignored files. It even has options to return a directory
path, if all entries inside is ignored/untracked, instead of the list
of files inside. On top of that, it has "untracked cache" to skip
traversing directories to speed up.

Using dir_iterator might be possible but we'll need to improve it a
lot. read_directory_recursively() needs really fine control over
traversing: it can decide to not recurse in a subdirectory, it can
decide to ignore the rest of the entries in a directory and go back to
its parent, and in the case of untracked cache, it can switch to
traversing the cache instead of on-disk directories.

It's possible, probably, but it's going to need quite a lot of work
(and care, since I think there are corner cases that have been abused
in real world and we just cannot change those behaviors). Using
iterator interface would be a good improvement to clean up the "on
disk or on cache" directory traversal there though.

>
> On Thu, Mar 30, 2017 at 8:08 AM, Duy Nguyen  wrote:
>> On Thu, Mar 30, 2017 at 1:39 PM, Michael Haggerty  
>> wrote:
>>> * DIR_ITERATOR_RECURSE -- recurse into subdirectories
>>>
>>> would make the set of possible options complete. If this option is not
>>> set, then the iteration would be over the entries in a single directory
>>> without traversing its subdirectories.
>>
>> And would make it possible to use dir-iterator everywhere (except
>> read_directory_recursiveky). That would be really nice :)
>> --
>> Duy



-- 
Duy


Re: [PATCH v4 2/5] dir_iterator: iterate over dir after its contents

2017-04-01 Thread Daniel Ferreira (theiostream)
Why exactly would it not be applicable to read_directory_recursively()?

On Thu, Mar 30, 2017 at 8:08 AM, Duy Nguyen  wrote:
> On Thu, Mar 30, 2017 at 1:39 PM, Michael Haggerty  
> wrote:
>> * DIR_ITERATOR_RECURSE -- recurse into subdirectories
>>
>> would make the set of possible options complete. If this option is not
>> set, then the iteration would be over the entries in a single directory
>> without traversing its subdirectories.
>
> And would make it possible to use dir-iterator everywhere (except
> read_directory_recursiveky). That would be really nice :)
> --
> Duy


Re: [PATCH v4 2/5] dir_iterator: iterate over dir after its contents

2017-03-30 Thread Junio C Hamano
Michael Haggerty  writes:

> I don't think any of this needs to be implemented now, but maybe keep it
> in mind if/when `dir_iterator` gets more users.

OK.  One thing that was missing in your list was the opposite of "do
not show directories", i.e. "show only directories".  That should
also be easy to do (but we need to figure out a way to encode it in
the flag somehow) if it turns out to be useful later.




Re: [PATCH v4 2/5] dir_iterator: iterate over dir after its contents

2017-03-30 Thread Duy Nguyen
On Thu, Mar 30, 2017 at 1:39 PM, Michael Haggerty  wrote:
> * DIR_ITERATOR_RECURSE -- recurse into subdirectories
>
> would make the set of possible options complete. If this option is not
> set, then the iteration would be over the entries in a single directory
> without traversing its subdirectories.

And would make it possible to use dir-iterator everywhere (except
read_directory_recursiveky). That would be really nice :)
-- 
Duy


Re: [PATCH v4 2/5] dir_iterator: iterate over dir after its contents

2017-03-30 Thread Michael Haggerty
On 03/30/2017 08:08 AM, Junio C Hamano wrote:
> Michael Haggerty  writes:
> 
>> I think IN_ORDER really only applies to *binary* trees, not arbitrary
>> trees like a filesystem.
> 
> How true.  Even if we were giving a sorted output (and dir-iterator
> doesn't and there is no need for it to), dir/ should come before any
> of its contents, so for that application we can use pre-order, and
> there is no sensible and useful definition of in-order.

Your email got me thinking, though, that there is one generalization of
the concept of PRE_ORDER vs. POST_ORDER that would be both easy to
implement and potentially useful. Namely, flags could include the
following orthogonal options (instead of `DIR_ITERATOR_POST_ORDER)`:

* DIR_ITERATOR_DIRS_BEFORE -- when this is set, directories
  are included in the iteration *before* their contents.

* DIR_ITERATOR_DIRS_AFTER -- when this is set, directories
  are included in the iteration *after* their contents.

Enabling one or the other of these options would select pre-order or
post-order iteration.

Enabling neither of them would mean that directory entries themselves
are not included in the iteration at all, even though recursion would
happen *into* subdirectories. This option would surely be useful to some
caller somewhere (though it's easy for the caller to get the same effect
itself via

if (S_ISDIR(iter->base.st.st_mode))
continue;

).

It's even conceivable that enabling *both* options at the same time
would be useful, if the caller want to know when the processing of a
directory is begun and also when it is finished (e.g., because it needs
to load or unload a `.gitignore` file for that directory). If we wanted
to make it easier for the caller figure out whether it is seeing an
"entering directory" event vs. a "leaving directory" event, we could
expose something like the `dir_state` member in the iterator.

While we're blue-skying, a

* DIR_ITERATOR_RECURSE -- recurse into subdirectories

would make the set of possible options complete. If this option is not
set, then the iteration would be over the entries in a single directory
without traversing its subdirectories.

I don't think any of this needs to be implemented now, but maybe keep it
in mind if/when `dir_iterator` gets more users.

Michael



Re: [PATCH v4 2/5] dir_iterator: iterate over dir after its contents

2017-03-30 Thread Junio C Hamano
Michael Haggerty  writes:

> I think IN_ORDER really only applies to *binary* trees, not arbitrary
> trees like a filesystem.

How true.  Even if we were giving a sorted output (and dir-iterator
doesn't and there is no need for it to), dir/ should come before any
of its contents, so for that application we can use pre-order, and
there is no sensible and useful definition of in-order.

Thanks.


Re: [PATCH v4 2/5] dir_iterator: iterate over dir after its contents

2017-03-29 Thread Michael Haggerty
On 03/29/2017 06:46 PM, Junio C Hamano wrote:
> Michael Haggerty  writes:
> 
>> I also realize that I made a goof in my comments about v3 of this patch
>> series. Your new option is not choosing between "depth-first" and
>> "breadth-first". Both types of iteration are depth-first. Really it is
>> choosing between pre-order and post-order traversal. So I think it would
>> be better to name the option `DIR_ITERATOR_POST_ORDER`. Sorry about that.
> 
> That solicits a natural reaction from a bystander.  Would an
> IN_ORDER option also be useful?  I am not demanding it to be added
> to this series, especially if there is no immediate need, but if we
> foresee that it would also make sense for some other callers, we
> would at least want to make sure that the code after this addition
> of POST_ORDER is in a shape that is easy to add such an option
> later.

I think IN_ORDER really only applies to *binary* trees, not arbitrary
trees like a filesystem.

Michael



Re: [PATCH v4 2/5] dir_iterator: iterate over dir after its contents

2017-03-29 Thread Junio C Hamano
Michael Haggerty  writes:

> I also realize that I made a goof in my comments about v3 of this patch
> series. Your new option is not choosing between "depth-first" and
> "breadth-first". Both types of iteration are depth-first. Really it is
> choosing between pre-order and post-order traversal. So I think it would
> be better to name the option `DIR_ITERATOR_POST_ORDER`. Sorry about that.

That solicits a natural reaction from a bystander.  Would an
IN_ORDER option also be useful?  I am not demanding it to be added
to this series, especially if there is no immediate need, but if we
foresee that it would also make sense for some other callers, we
would at least want to make sure that the code after this addition
of POST_ORDER is in a shape that is easy to add such an option
later.


Re: [PATCH v4 2/5] dir_iterator: iterate over dir after its contents

2017-03-29 Thread Michael Haggerty
On 03/29/2017 11:56 AM, Michael Haggerty wrote:
> On 03/29/2017 02:32 AM, Daniel Ferreira wrote:
>> [...]
> [...]
> The disagreement is not a surprise, because there isn't a corresponding
> coding error in the code below that returns the directory itself in a
> post-order iteration. The net result appears to be that there is no
> recursion at all into subdirectories when `DIR_ITERATOR_DEPTH_FIRST` is
> set. So due to this bug, we get neither a correct post-order iteration
> nor a correct pre-order iteration with the new option.

Correction: the second-to-last sentence should read "there is no
recursion at all into subdirectories when `DIR_ITERATOR_DEPTH_FIRST` is
*NOT* set.

Michael



Re: [PATCH v4 2/5] dir_iterator: iterate over dir after its contents

2017-03-29 Thread Michael Haggerty
On 03/29/2017 02:32 AM, Daniel Ferreira wrote:
> Create an option for the dir_iterator API to iterate over subdirectories
> only after having iterated through their contents. This feature was
> predicted, although not implemented by 0fe5043 ("dir_iterator: new API
> for iterating over a directory tree", 2016-06-18).

I admire your ambition at taking on this project. Even though the
`dir_iterator` code is fairly short, it is quite intricate and it took
me quite a bit of thought to get it right. Don't be discouraged by how
many iterations it takes to get a change like this accepted.

When reviewing your changes, I realized that
`dir_iterator_level::initialized` and `dir_iterator_level::dir_state`
are somewhat redundant with each other in the pre-existing code. It
would be possible to add a third state, say `DIR_STATE_PUSH`, to the
enum, and use that to replace the state that we now call `!initialized`.

I'm not demanding that change, but it might make the bookkeeping in the
new code a little bit easier to understand, because the logic that
decides what to do based on setting of the new option would either
transition the level from `DIR_STATE_PUSH -> DIR_STATE_ITER ->
DIR_STATE_RECURSE` (for pre-order traversal) or `DIR_STATE_PUSH ->
DIR_STATE_RECURSE -> DIR_STATE_ITER` (for post-order traversal). I think
this could make the state machine clearer and thereby make it easier to
reason about the code.

> Add the "flags" parameter to dir_iterator_create, allowing for the
> aforementioned "depth-first" iteration mode to be enabled. Currently,
> the only acceptable flag is DIR_ITERATOR_DEPTH_FIRST.
> 
> This is useful for recursively removing a directory and calling rmdir()
> on a directory only after all of its contents have been wiped.

This patch changes the signature of `ref_iterator_begin()` without
adjusting the caller in `refs/files-backend.c`. This means that the code
doesn't even compile after this patch is applied.

The Git project insists that the code compile and all tests pass after
each and every commit. Among other things, this keeps the code
"bisectable" using `git bisect`, which is a very useful property.

I see that you adjust the caller later in the patch series, in
"files_reflog_iterator: amend use of dir_iterator". Please squash that
patch into this one.

I also realize that I made a goof in my comments about v3 of this patch
series. Your new option is not choosing between "depth-first" and
"breadth-first". Both types of iteration are depth-first. Really it is
choosing between pre-order and post-order traversal. So I think it would
be better to name the option `DIR_ITERATOR_POST_ORDER`. Sorry about that.

> ---
>  dir-iterator.c | 46 ++
>  dir-iterator.h | 14 +++---
>  2 files changed, 53 insertions(+), 7 deletions(-)
> 
> diff --git a/dir-iterator.c b/dir-iterator.c
> index 853c040..545d333 100644
> --- a/dir-iterator.c
> +++ b/dir-iterator.c
> @@ -48,6 +48,9 @@ struct dir_iterator_int {
>* that will be included in this iteration.
>*/
>   struct dir_iterator_level *levels;
> +
> + /* Holds the flags passed to dir_iterator_begin(). */
> + unsigned flags;
>  };
> 
>  static inline void push_dir_level(struct dir_iterator_int *iter, struct 
> dir_iterator_level *level)
> @@ -114,12 +117,14 @@ int dir_iterator_advance(struct dir_iterator 
> *dir_iterator)
>   }
> 
>   level->initialized = 1;
> - } else if (S_ISDIR(iter->base.st.st_mode)) {
> + } else if (S_ISDIR(iter->base.st.st_mode) &&
> + !iter->flags & DIR_ITERATOR_DEPTH_FIRST) {

You need parentheses on the previous line:

!(iter->flags & DIR_ITERATOR_DEPTH_FIRST)) {

because otherwise it is interpreted as

(!iter->flags) & DIR_ITERATOR_DEPTH_FIRST) {

BTW, you should compile with stricter compiler options so that your
compiler warns you about problems like this. This is documented in
`Documentation/CodingGuidelines` (along with lots more useful information):

 - As a Git developer we assume you have a reasonably modern compiler
   and we recommend you to enable the DEVELOPER makefile knob to
   ensure your patch is clear of all compiler warnings we care about,
   by e.g. "echo DEVELOPER=1 >>config.mak".

Also, this line is not indented far enough. It should probably line up
with the inside of the opening parenthesis of the preceding line, like

> + } else if (S_ISDIR(iter->base.st.st_mode) &&
> +!(iter->flags & DIR_ITERATOR_DEPTH_FIRST)) {

But let's dig a little bit deeper. Why don't the tests catch this coding
error? Currently there is only one option, `DIR_ITERATOR_DEPTH_FIRST`,
so if that constant were defined to be 1 as expected, the code would
accidentally work anyway (though it would break if another flag is ever
added). But in fact, you define `DIR_ITERATOR_DEPTH_FIRST` to be 2
(probably unintentionally; see