Re: [PATCH 02/13] list-objects-filter-map: extend oidmap to collect omitted objects

2017-10-25 Thread Junio C Hamano

Jeff Hostetler writes:

> Sorry, I meant a later commit in this patch series.  It is used by
> commits 4, 5, 6, and 10 to actually do the filtering and collect a
> list of omitted or missing objects.

I know you meant "later commits in the series" ;-).  

It does not change the fact that readers of 02/13 have not seen them
yet when they try to understand patch 02/13, whether the changes that
drove the design of this step are in the same series or not yet posted.

> I think of a "set" as a member? or not-member? class.
> I think of a "map" as a member? or not-member? class but where each
> member also has a value.  Sometimes map lookups just want to know
> membership and sometimes the lookup wants the value.
>
> Granted, having the key and value data stuffed into the same entry
> (from hashmap's point of view, rather than a key having a pointer
> to a value) does kind of blur the line, but I was thinking about
> a map here.  (And I was building on oidmap which builds on hashmap,
> so it seemed appropriate.)

My question was mostly about "if this is a map, then a caller that
queries the map with an oid does so because it wants to know the
data associated to the oid; if this is just a set, it is mostly
interested in the membership" and "I cannot quite tell which was
meant without the caller".  

It seems that some callers do care about the "path" name from your
response above, so calling this "map" sounds more appropriate.

The answer "it can be used to speed up 'is this path excluded?'
check" is a bit worrisome, though.  A blob can appear at more than
one path, and unless all the appearances of it are in an excluded
path, omitting the blob from the repository would lead to an aborted
"rev-list --objects" run, and this "map" can record at most one path
per each object; we need to wait until seeing the optimization code
to actually see how effectively this data helps optimization and
comment on the code ;-)
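
The concern about one blob appearing at several paths can be made concrete with a small sketch. It is not the sparse filter from this series; the names `toy_occurrence`, `toy_path_is_excluded` and `toy_blob_can_be_omitted`, and the rule that everything under "docs/" is excluded, are assumptions made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* One place where a blob shows up during traversal (hypothetical type). */
struct toy_occurrence {
	const char *blob;	/* stand-in for an object id */
	const char *path;
};

/* Hypothetical exclusion rule: everything under "docs/" is excluded. */
int toy_path_is_excluded(const char *path)
{
	return !strncmp(path, "docs/", 5);
}

/*
 * A blob may be omitted only when every path it appears at is
 * excluded.  Returns 1 if omitting is safe, 0 otherwise (including
 * when the blob is not seen at all).
 */
int toy_blob_can_be_omitted(const struct toy_occurrence *occ, size_t nr,
			    const char *blob)
{
	size_t i, seen = 0;

	for (i = 0; i < nr; i++) {
		if (strcmp(occ[i].blob, blob))
			continue;
		seen++;
		if (!toy_path_is_excluded(occ[i].path))
			return 0;	/* one included path is enough to keep it */
	}
	return seen > 0;
}
```

A map that remembers only the first path seen for an object cannot answer this question by itself, because the decision depends on all occurrences of the blob.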

>>> +   len = ((pathname && *pathname) ? strlen(pathname) : 0);
>>> +   size = (offsetof(struct list_objects_filter_map_entry, pathname) + len + 1);
>>> +   e = xcalloc(1, size);
>>> +
>>> +   oidcpy(&e->entry.oid, oid);
>>> +   e->type = type;
>>> +   if (pathname && *pathname)
>>> +       strcpy(e->pathname, pathname);
>>> +
>>> +   oidmap_put(map, e);
>>> +   return 0;
>>> +}
>>
>> The return value from the function needs to be documented in the
>> header to help callers.  It is not apparent why "we did already have
>> one" and "we now newly added" is interesting to the callers, for
>> example.  An obvious alternative implementation of this function
>> would return the pointer to an entry that records the object id
>> (i.e. either the one that was already there, or the one we created
>> because we saw this object for the first time), so that the caller
>> can do something interesting to it---again, because the reason why
>> we want this "filter map" is not explained at this stage, it is hard
>> to tell what that "something interesting" would be.
>
> good point.  thanks.

I am more confused by the response ;-) But as we established that
this is a map (not a set that borrows the implementation of map),
where the data recorded in 'e' is quite useful to the caller, it
probably makes sense to make 'e' available to the caller?  It is
still unclear if the caller finds "it is the first time I saw the
object you gave me" vs "I've seen that object before already"
useful.
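
A minimal sketch of the "return the entry" alternative, assuming a toy linked-list map rather than Git's oidmap. All `toy_*` names are hypothetical, and the optional `is_new` out-parameter is only one possible way to keep the "first time seen" information available to callers that want it.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; not Git's real oidmap or object_id. */
#define TOY_RAWSZ 20

struct toy_oid {
	unsigned char hash[TOY_RAWSZ];
};

struct toy_entry {
	struct toy_entry *next;	/* simple chain instead of a hash table */
	struct toy_oid oid;
	int type;
	char pathname[];	/* flexible array member, "" when no path */
};

struct toy_map {
	struct toy_entry *head;
};

struct toy_entry *toy_map_get(struct toy_map *map, const struct toy_oid *oid)
{
	struct toy_entry *e;

	for (e = map->head; e; e = e->next)
		if (!memcmp(e->oid.hash, oid->hash, TOY_RAWSZ))
			return e;
	return NULL;
}

/*
 * Return the entry for oid, creating it if needed.  If is_new is
 * non-NULL, it is set to 1 when this call created the entry and to 0
 * when the entry already existed.  Returns NULL if allocation fails.
 */
struct toy_entry *toy_map_insert(struct toy_map *map,
				 const struct toy_oid *oid,
				 const char *pathname, int type,
				 int *is_new)
{
	struct toy_entry *e = toy_map_get(map, oid);
	size_t len;

	if (e) {
		/* the first observed pathname and type are kept */
		if (is_new)
			*is_new = 0;
		return e;
	}

	len = pathname ? strlen(pathname) : 0;
	e = calloc(1, offsetof(struct toy_entry, pathname) + len + 1);
	if (!e)
		return NULL;
	e->oid = *oid;
	e->type = type;
	if (len)
		memcpy(e->pathname, pathname, len);	/* calloc left the NUL */
	e->next = map->head;
	map->head = e;
	if (is_new)
		*is_new = 1;
	return e;
}
```

A caller that only needs the data passes NULL for `is_new`; a caller that cares whether the object was seen before checks the flag.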

>>> +   for (k = 0; k < nr; k++)
>>> +       cb(k, nr, array[k], cb_data);
>>
>> Also it is not clear if you wanted to expose the type of the
>> entry to the callback function.
>
> The thought was that we would sort the OIDs so that things
> like rev-list could print the omitted/missing objects in OID
> order.  Not critical that we do it here, but I thought it would
> help callers.

I can foresee some callers would want sorted, while others do not.
I was primarily wondering why "my_cmp" is not a parameter that can
be NULL (in which case we do not sort at all).
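
One way to read the suggestion, as a sketch only: the comparator becomes a parameter, and NULL means "do not sort". The `toy_*` types and functions are hypothetical and stand in for the real map entry and iteration code; the callback here receives the whole entry and no k or nr.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_RAWSZ 20

/* Hypothetical entry type standing in for the filter-map entry. */
struct toy_entry {
	unsigned char oid[TOY_RAWSZ];
	int type;
	const char *pathname;
};

typedef int (*toy_cmp_fn)(const void *a, const void *b);
typedef void (*toy_each_fn)(const struct toy_entry *e, void *cb_data);

/* qsort() comparator over an array of entry pointers: order by id. */
int toy_cmp_by_oid(const void *a, const void *b)
{
	const struct toy_entry *ea = *(const struct toy_entry *const *)a;
	const struct toy_entry *eb = *(const struct toy_entry *const *)b;

	return memcmp(ea->oid, eb->oid, TOY_RAWSZ);
}

/*
 * Call fn once per entry.  With a non-NULL cmp the pointer array is
 * sorted in place first; with cmp == NULL the existing order is kept.
 */
void toy_foreach(struct toy_entry **array, size_t nr,
		 toy_cmp_fn cmp, toy_each_fn fn, void *cb_data)
{
	size_t k;

	if (cmp && nr > 1)
		qsort(array, nr, sizeof(*array), cmp);
	for (k = 0; k < nr; k++)
		fn(array[k], cb_data);
}

/* Example callback data and callback: record the first id byte seen. */
struct toy_trace {
	unsigned char seen[8];
	size_t nr;
};

void toy_trace_fn(const struct toy_entry *e, void *cb_data)
{
	struct toy_trace *t = cb_data;

	if (t->nr < sizeof(t->seen))
		t->seen[t->nr++] = e->oid[0];
}
```

Callers such as a rev-list style printer would pass `toy_cmp_by_oid`, while callers that do not care about order pass NULL and skip the sort.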

>> An obvious alternative
>>
>>  fn(&array[k].entry.oid, cb_data);
>>
>> would allow you to keep the type of map-entry private to the map,
>> and also the callback does not need to know about k or nr.
>> ...
> I included the {k, nr} so that the callback could dump header/trailer
> information when reporting the results or pre-allocate an array.
> I'll look at refactoring this -- I never quite liked how it turned
> out anyway -- especially with the oidmap simplifications.

And as we established that this is a map, where the data associated
with each oid is interesting to the caller, we do not want to hide
the type of array[] element by passing only &array[k].entry.oid, I
guess?

Thanks.

Re: [PATCH 02/13] list-objects-filter-map: extend oidmap to collect omitted objects

2017-10-25 Thread Jeff Hostetler




On 10/25/2017 3:10 AM, Junio C Hamano wrote:
> Jeff Hostetler writes:
>
>> From: Jeff Hostetler
>>
>> Create helper class to extend oidmap to collect a list of
>> omitted or missing objects during traversal.
>
> The reason why oidmap itself cannot be used is because the code
> wants to record not just the object name but something else about
> the object.  And attributes that the code may care about we can see
> in this patch are the object type and the path it found.

I recently simplified the code in this version to not completely
sub-class oidmap, but to just use it along with a custom
_insert method that takes care of allocating the _entry
data.  I should update the commit message to reflect that.



> Is the plan to extend this set of attributes over time as different
> "omitter"s are added?  Why was "path" chosen as a member of the
> initial set and how it will be useful (also, what path would we
> record for tags and commits)?

I envisioned this to let rev-list print the pathname of omitted
objects -- like "rev-list --objects" does for regular blobs.
I would leave the pathname NULL for tags and commits.

The pathname helps with debugging and testing, but also is
used by the sparse filter to avoid some expensive duplicate
is-excluded lookups.

Currently the 3 filters I have defined all use the same extra
data.  I suppose a future filter could want additional fields,
so maybe it would be better to refactor my "map-entry" to be
per-filter specific.

> These "future plans" need to be revealed upfront, instead of (or in
> addition to) "will be used in a later commit", as it is hard to
> judge if "filter map" is an appropriate name for this thing without
> knowing _how_ it is envisioned to be used.  "filter map" sounds more
> like a map function that is consulted when we decide if we want to
> drop the object, but from the looks of the code, it is used more to
> record what was done to these objects.

Sorry, I meant a later commit in this patch series.  It is used by
commits 4, 5, 6, and 10 to actually do the filtering and collect a
list of omitted or missing objects.



> Is it really a "map" (i.e. whose primary focus is to find out what
> an object name is "mapped to" when we get an object name---e.g. we
> notice an otherwise connected object is missing, and consult this
> "map" to learn what the type/path is because we want to do X)?  Or
> is it more like a "set of known-to-be-missing object" (i.e. whose
> primary point is to serve as a set of object names and what a name
> maps to is primarily for debugging)?  These are easier to answer if
> we know how it will be used.

I think of a "set" as a member? or not-member? class.
I think of a "map" as a member? or not-member? class but where each
member also has a value.  Sometimes map lookups just want to know
membership and sometimes the lookup wants the value.

Granted, having the key and value data stuffed into the same entry
(from hashmap's point of view, rather than a key having a pointer
to a value) does kind of blur the line, but I was thinking about
a map here.  (And I was building on oidmap which builds on hashmap,
so it seemed appropriate.)




>> This will be used in a later commit by the list-object filtering
>> code.
>>
>> Signed-off-by: Jeff Hostetler
>> ---
>> diff --git a/list-objects-filter-map.c b/list-objects-filter-map.c
>> new file mode 100644
>> index 000..7e496b3
>> --- /dev/null
>> +++ b/list-objects-filter-map.c
>> @@ -0,0 +1,63 @@
>> +#include "cache.h"
>> +#include "list-objects-filter-map.h"
>> +
>> +int list_objects_filter_map_insert(struct oidmap *map,
>> +  const struct object_id *oid,
>> +  const char *pathname, enum object_type type)
>> +{
>> +   size_t len, size;
>> +   struct list_objects_filter_map_entry *e;
>> +
>> +   if (oidmap_get(map, oid))
>> +       return 1;


> It is OK for the existing entry to record a path that is totally
> different from what the caller has.  It is hard to judge without
> knowing what pathname the callers are expected to call this function
> with, but I am guessing that it is similar to the path shown in the
> output from "rev-list --objects"---and if that is the case, it is
> correct that the same object may be reached at different paths
> depending on what tree the traversal begins at, so pathname recorded
> in the map is merely "there is one tree somewhere that has this
> object at this path".

Right, the first observed pathname is as good as any.

> For that matter, the caller may have a completely different type
> from the object we saw earlier; not checking and flagging it as a
> possible error makes me feel somewhat uneasy, but there probably is
> little you can do at this layer of the code if you noticed such a
> discrepancy so it may be OK to punt.

I could assert() that the types match, but right there's not much
we can do about it at this layer.

>> +   len = ((pathname && *pathname) ? strlen(pathname) : 0);
>> +   size = (offsetof(struct

Re: [PATCH 02/13] list-objects-filter-map: extend oidmap to collect omitted objects

2017-10-25 Thread Junio C Hamano

Jeff Hostetler writes:

> From: Jeff Hostetler 
>
> Create helper class to extend oidmap to collect a list of
> omitted or missing objects during traversal.

The reason why oidmap itself cannot be used is because the code
wants to record not just the object name but something else about
the object.  And attributes that the code may care about we can see
in this patch are the object type and the path it found.  

Is the plan to extend this set of attributes over time as different
"omitter"s are added?  Why was "path" chosen as a member of the
initial set and how it will be useful (also, what path would we
record for tags and commits)?

These "future plans" need to be revealed upfront, instead of (or in
addition to) "will be used in a later commit", as it is hard to
judge if "filter map" is an appropriate name for this thing without
knowing _how_ it is envisioned to be used.  "filter map" sounds more
like a map function that is consulted when we decide if we want to
drop the object, but from the looks of the code, it is used more to
record what was done to these objects.

Is it really a "map" (i.e. whose primary focus is to find out what
an object name is "mapped to" when we get an object name---e.g. we
notice an otherwise connected object is missing, and consult this
"map" to learn what the type/path is because we want to do X)?  Or
is it more like a "set of known-to-be-missing object" (i.e. whose
primary point is to serve as a set of object names and what a name
maps to is primarily for debugging)?  These are easier to answer if
we know how it will be used.

> This will be used in a later commit by the list-object filtering
> code.
>
> Signed-off-by: Jeff Hostetler 
> ---
> diff --git a/list-objects-filter-map.c b/list-objects-filter-map.c
> new file mode 100644
> index 000..7e496b3
> --- /dev/null
> +++ b/list-objects-filter-map.c
> @@ -0,0 +1,63 @@
> +#include "cache.h"
> +#include "list-objects-filter-map.h"
> +
> +int list_objects_filter_map_insert(struct oidmap *map,
> +const struct object_id *oid,
> +const char *pathname, enum object_type type)
> +{
> + size_t len, size;
> + struct list_objects_filter_map_entry *e;
> +
> + if (oidmap_get(map, oid))
> +     return 1;

It is OK for the existing entry to record a path that is totally
different from what the caller has.  It is hard to judge without
knowing what pathname the callers are expected to call this function
with, but I am guessing that it is similar to the path shown in the
output from "rev-list --objects"---and if that is the case, it is
correct that the same object may be reached at different paths
depending on what tree the traversal begins at, so pathname recorded
in the map is merely "there is one tree somewhere that has this
object at this path".

For that matter, the caller may have a completely different type
from the object we saw earlier; not checking and flagging it as a
possible error makes me feel somewhat uneasy, but there probably is
little you can do at this layer of the code if you noticed such a
discrepancy so it may be OK to punt.

> + len = ((pathname && *pathname) ? strlen(pathname) : 0);
> + size = (offsetof(struct list_objects_filter_map_entry, pathname) + len + 1);
> + e = xcalloc(1, size);
> +
> + oidcpy(&e->entry.oid, oid);
> + e->type = type;
> + if (pathname && *pathname)
> +     strcpy(e->pathname, pathname);
> +
> + oidmap_put(map, e);
> + return 0;
> +}

The return value from the function needs to be documented in the
header to help callers.  It is not apparent why "we did already have
one" and "we now newly added" is interesting to the callers, for
example.  An obvious alternative implementation of this function
would return the pointer to an entry that records the object id
(i.e. either the one that was already there, or the one we created
because we saw this object for the first time), so that the caller
can do something interesting to it---again, because the reason why
we want this "filter map" is not explained at this stage, it is hard
to tell what that "something interesting" would be.

> +static int my_cmp(const void *a, const void *b)
> +{
> + const struct oidmap_entry *ea, *eb;
> +
> + ea = *(const struct oidmap_entry **)a;
> + eb = *(const struct oidmap_entry **)b;
> +
> + return oidcmp(&ea->oid, &eb->oid);
> +}
> +
> +void list_objects_filter_map_foreach(struct oidmap *map,
> +  list_objects_filter_map_foreach_cb cb,

Name a typedef of a function as something_fn, not something_cb;
something_cb is often the type of a struct to be fed to the callback
function.  And call such a parameter of type something_fn just fn.
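
To illustrate the naming convention only, a small sketch with made-up names: the function type carries the _fn suffix and its parameter is called fn, while a _cb name is used for the struct passed through cb_data. None of the `toy_*` identifiers come from the patch.

```c
#include <stddef.h>

/* Hypothetical names illustrating the convention only. */
struct toy_object_id {
	unsigned char hash[20];
};

/* The function type gets the _fn suffix ... */
typedef void (*toy_each_object_fn)(const struct toy_object_id *oid,
				   void *cb_data);

/* ... while _cb names the struct handed to the callback as cb_data. */
struct toy_count_cb {
	size_t nr;
};

/* A callback of type toy_each_object_fn: count the objects visited. */
void toy_count_object(const struct toy_object_id *oid, void *cb_data)
{
	struct toy_count_cb *cb = cb_data;

	(void)oid;	/* the id itself is not needed for counting */
	cb->nr++;
}

/* The parameter of the function type is simply called fn. */
void toy_for_each_object(const struct toy_object_id *oids, size_t nr,
			 toy_each_object_fn fn, void *cb_data)
{
	size_t i;

	for (i = 0; i < nr; i++)
		fn(&oids[i], cb_data);
}
```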

> +  void *cb_data)
> +{
> + struct hashmap_iter iter;
> + struct list_objects_filter_map_entry **array;
> +
