Re: [PATCH v2 25/27] attr: store attribute stack in attr_check structure

2017-01-25 Thread Brandon Williams
On 01/25, Brandon Williams wrote:
> On 01/25, Junio C Hamano wrote:
> > Brandon Williams  writes:
> > 
> > >> In my mind, the value of having a constant check_attr is primarily
> > >> that it gives us a stable pointer to serve as a hashmap key,
> > >> i.e. the identifier for each call site, in a later iteration.
> > >
> > > We didn't really discuss this notion of having the pointer be a key into
> > > a hashmap, what sort of information are you envisioning being stored in
> > > this sort of hashmap?
> > 
> > The "entries relevant to this attr_check() call, that is specific to
> > the  tuple" (aka "what used to be
> > called the global attr_stack") we discussed would be the primary
> > example.  A thread is likely be looping in a caller that has many
> > paths inside a directory, calling a function that has a call to
> > attr_check() for each path.  Having something that can use to
> > identify the check_attr instance in a stable way, even when the
> > inner function is called and returns many times, would allow us to
> > populate the "attr stack" just once for the thread when it enters a
> > directory for the first time (remember, another thread may be
> > executing the same codepath, checking for paths in a different
> > directory) and keep using it.  There may be other mechanisms you can
> > come up with, so I wouldn't say it is the only valid way, but it is
> > a way.  That is why I said:
> > 
> > >> But all of the above comes from my intuition, and I'll very much
> > >> welcome to be proven wrong with an alternative design, or better
> > >> yet, a working code based on an alternative design ;-).
> > 
> > near the end of my message.
> > 
> > > One issue I can see with this is that the
> > > functions which have a static attr_check struct would still not be thread
> > > safe if the initialization of the structure isn't surrounded by a mutex
> > > itself. ie
> > 
> > Yes, that goes without saying.  That is why I suggested Stefan to do
> > not this:
> > 
> > > static struct attr_check *check;
> > >
> > > if (!check)
> > >   init(check);
> > >
> > > would need to be:
> > >
> > > lock()
> > > if (!check)
> > >   init(check);
> > > unlock();
> > 
> > but this:
> > 
> > static struct attr_check *check;
> > init();
> > 
> > and hide the lock/unlock gymnastics inside the API.  I thought that
> > already was in what you inherited from him and started your work
> > on top of?
> 
> I essentially built off of the series you had while using Stefan's
> patches as inspiration, but I don't believe the kind of mechanism you
> are describing existed in Stefan's series.  His series had a single lock
> for the entire system, only allowing a single caller to be in it at any
> given time.  This definitely isn't ideal, hence why I picked it up.
> 
> Implementation aside I want to try and nail down what the purpose of
> this refactor is.  There are roughly two notions of being "thread-safe".
> 
> 1. The first is that the subsystem itself is thread safe, that is
>multiple threads can be executing inside the subsystem without stepping
>on each others work.
> 
> 2. The second is that the object itself is thread safe or that multiple
>threads can use the same object.
> 
> I thought that the main purpose of this was to achieve (1) since
> currently that is not the case.

Ok, so I discovered a very good reason why we should do as Stefan
originally did and split the question and answer (beyond the reasoning
for using the reference as a hashkey).

One motivation behind making this API thread-safe is that we can use it
in pathspec code to match against attributes.  This means that a
pathspec structure will contain an attr_check member describing the
attributes that a pathspec item is interested in.  Then the pathspec
structure is passed to match_pathspec() as a const pointer.  To me, when
passing something as 'const' I expect none of the members should change
at all.  The struct should remain exactly in the same form as before I
invoked the function.

Requiring the attr_check structure to be modified in the process of a
check_attrs() call would violate this "contract" when calling
match_pathspec() as the attr_check structure would have modified state.
The compiler wouldn't catch this as the "const" modifier isn't passed on
to struct members.

-- 
Brandon Williams


Re: [PATCH v2 25/27] attr: store attribute stack in attr_check structure

2017-01-25 Thread Brandon Williams
On 01/25, Junio C Hamano wrote:
> Brandon Williams  writes:
> 
> >> In my mind, the value of having a constant check_attr is primarily
> >> that it gives us a stable pointer to serve as a hashmap key,
> >> i.e. the identifier for each call site, in a later iteration.
> >
> > We didn't really discuss this notion of having the pointer be a key into
> > a hashmap, what sort of information are you envisioning being stored in
> > this sort of hashmap?
> 
> The "entries relevant to this attr_check() call, that is specific to
> the  tuple" (aka "what used to be
> called the global attr_stack") we discussed would be the primary
> example.  A thread is likely be looping in a caller that has many
> paths inside a directory, calling a function that has a call to
> attr_check() for each path.  Having something that can use to
> identify the check_attr instance in a stable way, even when the
> inner function is called and returns many times, would allow us to
> populate the "attr stack" just once for the thread when it enters a
> directory for the first time (remember, another thread may be
> executing the same codepath, checking for paths in a different
> directory) and keep using it.  There may be other mechanisms you can
> come up with, so I wouldn't say it is the only valid way, but it is
> a way.  That is why I said:
> 
> >> But all of the above comes from my intuition, and I'll very much
> >> welcome to be proven wrong with an alternative design, or better
> >> yet, a working code based on an alternative design ;-).
> 
> near the end of my message.
> 
> > One issue I can see with this is that the
> > functions which have a static attr_check struct would still not be thread
> > safe if the initialization of the structure isn't surrounded by a mutex
> > itself. ie
> 
> Yes, that goes without saying.  That is why I suggested Stefan to do
> not this:
> 
> > static struct attr_check *check;
> >
> > if (!check)
> >   init(check);
> >
> > would need to be:
> >
> > lock()
> > if (!check)
> >   init(check);
> > unlock();
> 
> but this:
> 
>   static struct attr_check *check;
>   init();
> 
> and hide the lock/unlock gymnastics inside the API.  I thought that
> already was in what you inherited from him and started your work
> on top of?

I essentially built off of the series you had while using Stefan's
patches as inspiration, but I don't believe the kind of mechanism you
are describing existed in Stefan's series.  His series had a single lock
for the entire system, only allowing a single caller to be in it at any
given time.  This definitely isn't ideal, hence why I picked it up.

Implementation aside I want to try and nail down what the purpose of
this refactor is.  There are roughly two notions of being "thread-safe".

1. The first is that the subsystem itself is thread safe, that is
   multiple threads can be executing inside the subsystem without stepping
   on each others work.

2. The second is that the object itself is thread safe or that multiple
   threads can use the same object.

I thought that the main purpose of this was to achieve (1) since
currently that is not the case.

-- 
Brandon Williams


Re: [PATCH v2 25/27] attr: store attribute stack in attr_check structure

2017-01-25 Thread Junio C Hamano
Brandon Williams  writes:

>> In my mind, the value of having a constant check_attr is primarily
>> that it gives us a stable pointer to serve as a hashmap key,
>> i.e. the identifier for each call site, in a later iteration.
>
> We didn't really discuss this notion of having the pointer be a key into
> a hashmap, what sort of information are you envisioning being stored in
> this sort of hashmap?

The "entries relevant to this attr_check() call, that is specific to
the  tuple" (aka "what used to be
called the global attr_stack") we discussed would be the primary
example.  A thread is likely be looping in a caller that has many
paths inside a directory, calling a function that has a call to
attr_check() for each path.  Having something that can use to
identify the check_attr instance in a stable way, even when the
inner function is called and returns many times, would allow us to
populate the "attr stack" just once for the thread when it enters a
directory for the first time (remember, another thread may be
executing the same codepath, checking for paths in a different
directory) and keep using it.  There may be other mechanisms you can
come up with, so I wouldn't say it is the only valid way, but it is
a way.  That is why I said:

>> But all of the above comes from my intuition, and I'll very much
>> welcome to be proven wrong with an alternative design, or better
>> yet, a working code based on an alternative design ;-).

near the end of my message.

> One issue I can see with this is that the
> functions which have a static attr_check struct would still not be thread
> safe if the initialization of the structure isn't surrounded by a mutex
> itself. ie

Yes, that goes without saying.  That is why I suggested Stefan to do
not this:

> static struct attr_check *check;
>
> if (!check)
>   init(check);
>
> would need to be:
>
> lock()
> if (!check)
>   init(check);
> unlock();

but this:

static struct attr_check *check;
init();

and hide the lock/unlock gymnastics inside the API.  I thought that
already was in what you inherited from him and started your work
on top of?



Re: [PATCH v2 25/27] attr: store attribute stack in attr_check structure

2017-01-25 Thread Stefan Beller
On Wed, Jan 25, 2017 at 11:57 AM, Brandon Williams  wrote:
> On 01/23, Junio C Hamano wrote:
>> Brandon Williams  writes:
>>
>> > ... It seems like breaking the question and answer up
>> > doesn't buy you much in terms of reducing allocation churn and instead
>> > complicates the API with needing to keep track of two structures instead
>> > of a one.
>>
>> In my mind, the value of having a constant check_attr is primarily
>> that it gives us a stable pointer to serve as a hashmap key,
>> i.e. the identifier for each call site, in a later iteration.
>
> We didn't really discuss this notion of having the pointer be a key into
> a hashmap, what sort of information are you envisioning being stored in
> this sort of hashmap?  One issue I can see with this is that the
> functions which have a static attr_check struct would still not be thread
> safe if the initialization of the structure isn't surrounded by a mutex
> itself. ie
>
> static struct attr_check *check;
>
> if (!check)
>   init(check);
>
> would need to be:
>
> lock()
> if (!check)
>   init(check);
> unlock();
>
> inorder to prevent a race to initialize the structure.  Which is
> something that the attr system itself can't be refactored to fix (at
> least I can't see how at the moment).

By passing the check pointer into the attr system (using a double pointer)

extern void git_attr_check_initl( \
struct git_attr_check out**, \
const char *, ...)
{
// get the global lock, as construction of new check structs
// is not expected to produce contention

// parse the list of things & construct the thing

*out = /* I made a thing */
// unlock globally
}

>
>> Of course, in order to populate the "question" array, we'd need the
>> interning of attribute names to attr objects, which need to be
>> protected by mutex, and you would probably not want to do that every
>> time the control hits the codepath.
>
> While true that doesn't prevent the mutex needed to create/check that
> the all_attr array that is used to collect attributes is the correct
> size/initialized properly.
>
>> But all of the above comes from my intuition, and I'll very much
>> welcome to be proven wrong with an alternative design, or better
>> yet, a working code based on an alternative design ;-).
>
> Yeah, after working through the problem the two simple solutions I can
> think of are either my v1 or v2 of the series, neither of which allows
> for the attr_check structure to be shared.  If we truly want the
> "question" array to be const then that can be done, it would just
> require a bit more boilerplate and making the all_attr array to be
> local to the check_attrs() function itself.  An API like this would look
> like:
>
> static const struct attr_check *check;
> struct attr_result result;
>
> if (!check)
>   init_check(check);
>
> // Result struct needs to be initialized based on the size of check
> init_result(, check);

Behind the scenes we may have a pool that caches result allocations,
such that we avoid memory allocation churn in here


>
> check_attrs(path, check, );
>
> // use result
>
> attr_result_clear();

Instead of clearing here, we'd give it back to the pool, which then can keep
parts of the result intact.


Re: [PATCH v2 25/27] attr: store attribute stack in attr_check structure

2017-01-25 Thread Brandon Williams
On 01/23, Junio C Hamano wrote:
> Brandon Williams  writes:
> 
> > ... It seems like breaking the question and answer up
> > doesn't buy you much in terms of reducing allocation churn and instead
> > complicates the API with needing to keep track of two structures instead
> > of a one.
> 
> In my mind, the value of having a constant check_attr is primarily
> that it gives us a stable pointer to serve as a hashmap key,
> i.e. the identifier for each call site, in a later iteration.

We didn't really discuss this notion of having the pointer be a key into
a hashmap, what sort of information are you envisioning being stored in
this sort of hashmap?  One issue I can see with this is that the
functions which have a static attr_check struct would still not be thread
safe if the initialization of the structure isn't surrounded by a mutex
itself. ie

static struct attr_check *check;

if (!check)
  init(check);

would need to be:

lock()
if (!check)
  init(check);
unlock();

inorder to prevent a race to initialize the structure.  Which is
something that the attr system itself can't be refactored to fix (at
least I can't see how at the moment).

> Of course, in order to populate the "question" array, we'd need the
> interning of attribute names to attr objects, which need to be
> protected by mutex, and you would probably not want to do that every
> time the control hits the codepath.

While true that doesn't prevent the mutex needed to create/check that
the all_attr array that is used to collect attributes is the correct
size/initialized properly.

> But all of the above comes from my intuition, and I'll very much
> welcome to be proven wrong with an alternative design, or better
> yet, a working code based on an alternative design ;-).

Yeah, after working through the problem the two simple solutions I can
think of are either my v1 or v2 of the series, neither of which allows
for the attr_check structure to be shared.  If we truly want the
"question" array to be const then that can be done, it would just
require a bit more boilerplate and making the all_attr array to be
local to the check_attrs() function itself.  An API like this would look
like:

static const struct attr_check *check;
struct attr_result result;

if (!check)
  init_check(check);

// Result struct needs to be initialized based on the size of check
init_result(, check);

check_attrs(path, check, );

// use result

attr_result_clear();

>

It still doesn't handle an initialization race on the check structure
but the check pointer would be const and could be used for some future
optimization.  It also will have a bit more allocation churn than either
v1 or v2 of the series.  If this is the route you want to take I'll get
working on it, I just want to make sure we're on the same page before
doing a larger refactor like this.

Thanks for the guidance on this, someday we'll get this right :)

-- 
Brandon Williams


Re: [PATCH v2 25/27] attr: store attribute stack in attr_check structure

2017-01-23 Thread Junio C Hamano
Brandon Williams  writes:

> ... It seems like breaking the question and answer up
> doesn't buy you much in terms of reducing allocation churn and instead
> complicates the API with needing to keep track of two structures instead
> of a one.

In my mind, the value of having a constant check_attr is primarily
that it gives us a stable pointer to serve as a hashmap key,
i.e. the identifier for each call site, in a later iteration.

Of course, in order to populate the "question" array, we'd need the
interning of attribute names to attr objects, which need to be
protected by mutex, and you would probably not want to do that every
time the control hits the codepath.

But all of the above comes from my intuition, and I'll very much
welcome to be proven wrong with an alternative design, or better
yet, a working code based on an alternative design ;-).


Re: [PATCH v2 25/27] attr: store attribute stack in attr_check structure

2017-01-23 Thread Brandon Williams
On 01/23, Brandon Williams wrote:
> As we discussed off-line I'll also do the rework to break up the
> question and result.  That way two threads can be executing using the
> same attr_check structure.

Thinking about this I don't really see what we would gain by breaking
them up.

Right now most callers have a static attr_check struct which holds the
question and answer (and in my series a buffer of all_attrs used during
the collection process).  If this struct is broken up into question and
answer then the only part of it that can be shared with multiple threads
is the question, which ends up being an array with 2 maybe 3 entries on
average.  The result and the array of all_attrs would then need to be
allocated each time calling into the attribute system since they can't
be shared.  Since this allocation is already going to happen wouldn't it
just make sense to drop the static modifier on the check structure (or
have a per-thread check structure) if you really wanted a particular
function thread safe?  It seems like breaking the question and answer up
doesn't buy you much in terms of reducing allocation churn and instead
complicates the API with needing to keep track of two structures instead
of a one.

-- 
Brandon Williams


Re: [PATCH v2 25/27] attr: store attribute stack in attr_check structure

2017-01-23 Thread Brandon Williams
On 01/23, Junio C Hamano wrote:
> Brandon Williams  writes:
> 
> > The last big hurdle towards a thread-safe API for the attribute system
> > is the reliance on a global attribute stack that is modified during each
> > call into the attribute system.
> 
> The same comment as 22/27 applies here.  
> 
> It is not an immediate problem we need to solve in the scope of this
> series, in the sense that a Big Subsystem Lock for the attribute
> subsystem around git_check_attr() function can make it thread-safe.
> 
> But if we want to make it truly threadable without a Big Subsystem
> Lock, this and the other one would need to become per-thread at
> least.  I think the check_all_attrs scoreboard, which is the topic
> of 22/27, should become per git_check_attr() invocation (immediately
> before making a call to collect_some_attrs(), prepare an array with
> map.size elements and use that as a scoreboard, for example).  I do
> not think we can be sure that the "slimmed down attr stack" 15/27
> envisions would help performance without benchmarking, but if it
> does, then the "attr stack that holds entries that are relevant to
> the current query" would have to become per  pair, as
> two threads may be executing the same codepath looking for the same
> set of attributes (i.e. sharing a single attr_check instance), but
> working on two different parts of a tree structure.
> 
> > This patch removes this global stack and instead a stack is stored
> > locally in each attr_check instance.  This opens up the opportunity for
> > future optimizations to customize the attribute stack for the attributes
> > that a particular attr_check struct is interested in.
> 
> This is still true.  But two threads hitting the same attr_check
> would make the stack thrash between the paths they are working on to
> hurt performance once we go multi-threaded.
> 
> Perhaps, provided if the "slimmed down attr stack" is indeed a good
> idea, we should keep the global hashmap that holds everything we
> read from .gitattributes tree-wide (i.e. as in your v1), _and_
> introduce a mechanism to keep the slimmed down version that is
> relevant to check[] for each thread somehow.

Sounds good,  I'll reintroduce the hashmap of stacks that I had in v1
and instead make the all_attrs array that is used the in collection
process allocated at invocation time.  That will cause a bit of
allocation churn but in reality shouldn't make that much of an impact.

As we discussed off-line I'll also do the rework to break up the
question and result.  That way two threads can be executing using the
same attr_check structure.

-- 
Brandon Williams


Re: [PATCH v2 25/27] attr: store attribute stack in attr_check structure

2017-01-23 Thread Junio C Hamano
Brandon Williams  writes:

> The last big hurdle towards a thread-safe API for the attribute system
> is the reliance on a global attribute stack that is modified during each
> call into the attribute system.

The same comment as 22/27 applies here.  

It is not an immediate problem we need to solve in the scope of this
series, in the sense that a Big Subsystem Lock for the attribute
subsystem around git_check_attr() function can make it thread-safe.

But if we want to make it truly threadable without a Big Subsystem
Lock, this and the other one would need to become per-thread at
least.  I think the check_all_attrs scoreboard, which is the topic
of 22/27, should become per git_check_attr() invocation (immediately
before making a call to collect_some_attrs(), prepare an array with
map.size elements and use that as a scoreboard, for example).  I do
not think we can be sure that the "slimmed down attr stack" 15/27
envisions would help performance without benchmarking, but if it
does, then the "attr stack that holds entries that are relevant to
the current query" would have to become per  pair, as
two threads may be executing the same codepath looking for the same
set of attributes (i.e. sharing a single attr_check instance), but
working on two different parts of a tree structure.

> This patch removes this global stack and instead a stack is stored
> locally in each attr_check instance.  This opens up the opportunity for
> future optimizations to customize the attribute stack for the attributes
> that a particular attr_check struct is interested in.

This is still true.  But two threads hitting the same attr_check
would make the stack thrash between the paths they are working on to
hurt performance once we go multi-threaded.

Perhaps, provided if the "slimmed down attr stack" is indeed a good
idea, we should keep the global hashmap that holds everything we
read from .gitattributes tree-wide (i.e. as in your v1), _and_
introduce a mechanism to keep the slimmed down version that is
relevant to check[] for each thread somehow.