> On Feb 26, 2015, at 3:38 PM, Richard Smith <[email protected]> wrote:
> 
> On Thu, Feb 26, 2015 at 3:17 PM, Ben Langmuir <[email protected]> wrote:
>> On Feb 26, 2015, at 2:49 PM, Richard Smith <[email protected]> wrote:
>> 
>> On Thu, Feb 26, 2015 at 8:30 AM, Ben Langmuir <[email protected]> wrote:
>> 
>>> On Feb 25, 2015, at 6:20 PM, Richard Smith <[email protected]> wrote:
>>> 
>>> On Tue, Feb 24, 2015 at 7:55 PM, Ben Langmuir <[email protected]> wrote:
>>> 
>>>> On Feb 24, 2015, at 6:57 PM, Richard Smith <[email protected]> wrote:
>>>> 
>>>> On Mon, Feb 9, 2015 at 12:35 PM, Ben Langmuir <[email protected]> wrote:
>>>> Author: benlangmuir
>>>> Date: Mon Feb  9 14:35:13 2015
>>>> New Revision: 228604
>>>> 
>>>> URL: http://llvm.org/viewvc/llvm-project?rev=228604&view=rev
>>>> Log:
>>>> Diagnose timeouts in the LockFileManager and delete the dead lock file
>>>> 
>>>> If the lock file manager times out, we should give an error rather than
>>>> silently trying to load the existing module.  And delete the
>>>> (presumably) dead lock file, since it will otherwise prevent progress in
>>>> future invocations. This is unsound since we have no way to prove that
>>>> the lock file we are deleting is the same one we timed out on, but since
>>>> the lock is only to avoid excessive rebuilding anyway it should be okay.
>>>> Depends on llvm r228603.
>>>> 
>>>> This diagnostic fires a *lot* when, say, bootstrapping Clang with modules 
>>>> enabled at -j16. Is the timeout perhaps too short?
>>> 
>>> I lowered the timeout to 1 minute in my llvm commit,
>>> 
>>> Well, technically, it went from 8m 44s to 65.5s due to the exponential 
>>> backoff... :-/
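(For what it's worth, those numbers are consistent with the waiter's poll interval starting at 1 ms and doubling until it reaches the cap, which is how I read the old loop: the total sleep is then (2^19 - 1) ms ≈ 524 s ≈ 8m 44s with the old 300 s cap, and (2^16 - 1) ms ≈ 65.5 s with the 60 s cap, so neither limit is honoured exactly.)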
>> 
>> Heh, good point.
>> 
>>> 
>>> so if it helps, feel free to bump it back up.  I’m not particularly 
>>> concerned about the actual number now that we will actually honour the 
>>> limit, but the old value of 5 minutes seemed really long to me.
>>> 
>>> Really? Consider a Debug+Asserts clang, building a large module (such as 
>>> the Clang_AST module). Requiring that to succeed in 65.5 seconds doesn't 
>>> seem particularly reasonable (and, indeed, some such modules do not build 
>>> in that environment in 65.5s on my machine, at least when it's heavily 
>>> loaded).
>> 
>> That sucks.
>> 
>>> 
>>> But I'm not really sure that a timeout is the right approach here at all. 
>>> What we want to track is whether the creating process is still alive. 
>>> Perhaps we should have that process periodically bump the timestamp on the 
>>> lock file, and assume it's died if the timestamp is too far in the past? 
>>> This'd mean that in the (scary) case of sharing a module cache across a 
>>> build farm, you'd need your machines to not have too much clock skew, but 
>>> that's already necessary for build systems that use timestamps to determine 
>>> what to rebuild, so it doesn't seem like a ridiculous restriction.
>>> 
>>> Thoughts?
>> 
>> I think what you’re proposing is a better way to track whether the creator 
>> is alive than what we currently do (looking at the hostname+pid).  We could 
>> eliminate the clock-sync dependence if the waiting process set up its own 
>> local timer and then reset it if it sees the timestamp of the lock file has 
>> changed (i.e., we don’t care what the value is, only whether it changes).  But 
>> your suggestion is simpler and automatically handles a new process waiting 
>> on a long-dead file in a good way.
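To make that concrete, a minimal sketch of the two halves of the idea. touchLockFile and ownerPresumedDead are hypothetical names rather than existing LockFileManager API, the 120 s grace period is arbitrary, and I'm using std::filesystem instead of llvm::sys::fs purely for brevity:

#include <chrono>
#include <filesystem>
#include <system_error>

using namespace std::chrono_literals;

// Builder side: call periodically while compiling the module so that waiters
// can see the build is still making progress.
inline void touchLockFile(const std::filesystem::path &LockFile) {
  std::error_code EC;
  std::filesystem::last_write_time(
      LockFile, std::filesystem::file_time_type::clock::now(), EC);
  // Errors are ignored; the lock file may legitimately be gone already.
}

// Waiter side: treat the owner as dead if the lock file has not been touched
// recently (clock skew across a shared module cache makes this approximate).
inline bool ownerPresumedDead(const std::filesystem::path &LockFile,
                              std::chrono::seconds Grace = 120s) {
  std::error_code EC;
  auto Last = std::filesystem::last_write_time(LockFile, EC);
  if (EC)
    return true; // No lock file: the owner finished or died.
  return std::filesystem::file_time_type::clock::now() - Last > Grace;
}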
>> 
>> At some level I like that the timeout gives you feedback when a compilation 
>> takes much longer than expected.  How about this: in addition to tracking 
>> process liveness, we also use a timeout.  When the timeout is reached, we 
>> emit a warning that the module build is taking a long time.  We then 
>> forcibly delete the lock file and try again to acquire the lock ourselves. 
>> This has the added advantage of combatting priority inversion.  If we reach 
>> the timeout again (maybe several times?), we emit an error and fail.  What 
>> do you think? 
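For reference, a minimal sketch of the flow I was proposing, with stub types standing in for LockFileManager; the soft/hard timeout split, the retry count, and the remark are all hypothetical, and a real version would go back and try to acquire the lock itself after clearing it rather than just waiting again:

#include <cstdio>

enum class WaitResult { Success, OwnerDied, Timeout };

// Stand-ins for LockFileManager::waitForUnlock() and unsafeRemoveLockFile().
WaitResult waitForUnlock(unsigned /*SoftTimeoutSeconds*/) { return WaitResult::Timeout; }
void removeLockFile() {}

// Proposed flow: remark and clear the lock on a soft timeout, retry a few
// times, and only error out once we give up for good.
bool waitForOtherBuilder(const char *ModuleName, unsigned SoftTimeoutSeconds,
                         unsigned MaxAttempts) {
  for (unsigned I = 0; I != MaxAttempts; ++I) {
    switch (waitForUnlock(SoftTimeoutSeconds)) {
    case WaitResult::Success:
      return true; // The other builder finished; load its module.
    case WaitResult::OwnerDied:
      removeLockFile(); // Stale lock; retry (and maybe build it ourselves).
      continue;
    case WaitResult::Timeout:
      std::fprintf(stderr, "remark: module '%s' is taking a long time to build\n",
                   ModuleName); // the proposed soft-timeout remark
      removeLockFile();         // also helps with priority inversion
      continue;
    }
  }
  std::fprintf(stderr, "error: timed out waiting for module '%s'\n", ModuleName);
  return false; // the proposed hard failure
}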
>> 
>> For the most part, this sounds great to me. I'm not entirely convinced that 
>> it makes sense to remove the lock and retry if we can see that another 
>> process is periodically touching it.
> 
> Okay, but right now any time we delete a lock file we may accidentally delete 
> a valid lock because of races.  I haven’t found a good way to avoid the 
> window between checking that a lock file is invalid and deleting that file (which 
> is done by name).
> 
> So long as we're conservatively correct in the case of a race, and the 
> probability is low enough, I don't think this is a huge problem (IIUC, we 
> should get a signature mismatch in any downstream modules if this goes wrong, 
> and end up triggering a bunch of additional rebuilds, but we should converge 
> to a stable set of module files in the end). The proposed direction reduces 
> the probability of a race (we only delete the file and retry if the producing 
> end appears to not be updating it), so it seems like an improvement.

Right.

> 
> I think this is just part of the cost we pay by providing an implicit module 
> building system and a transparent upgrade path; we do also provide the 
> ability to explicitly build modules, which avoids all these issues. (That 
> said, we don't yet have a non-cc1 way of building a module explicitly, and we 
> really should add one.)
>> I think issuing a remark after some timeout (1 minute, maybe?), and possibly 
>> failing after a longer timeout, may be reasonable.
> 
> I was thinking a warning, but I guess -Werror makes that a bad idea.  Failing 
> instead of clearing the lock file is okay with me.
> 
>> 
>> We'd need a mechanism to turn off the second timeout (for cases where 
>> modules simply take a long time to build for whatever reason) and possibly 
>> to turn off the first timeout (to ensure we provide deterministic output).
> 
> -Rmodule-timeout-soft
> -Wmodule-timeout (DefaultError)
> 
> We could theoretically make a -W group that included both for convenience, 
> although that might be weird with a remark...
> 
> We want to bail out of compilation when we hit the timeout, so a -W flag 
> doesn't seem like the right model to me (it would need to control more than 
> merely whether we emit the diagnostic). How about 
> -fimplicit-modules-build-timeout=<time in s>, following the usual convention 
> that 0 means "no limit"?

Sounds good.  If we’re going to customize the timeout, it would be nice if the 
remark timeout were relative to the error timeout.  Maybe a 1:5 ratio, with a 
default of 1 or 2 minutes for the remark (so 5 or 10 minutes for the error).  
I’m not sure how well that would come across in the -help text for this option 
though...
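(To make that concrete: under such a scheme, -fimplicit-modules-build-timeout=600 would emit the remark at 120 s and the error at 600 s, and =0 would disable both. Purely an illustration of the proposed ratio, not an existing flag.)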

> 
> Another question: where do you think we should touch the file? In a consumer 
> of top-level decls?
> 
> An ASTConsumer could be a reasonable place. That should also let us watch for 
> template instantiations performed at end of TU, which are sometimes a 
> sizeable portion of the parsing time.

Sounds good.
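Something along these lines is what I'm picturing; it's only a sketch (LockFileToucher doesn't exist, and touchLockFile is the hypothetical helper from the sketch further up). In practice it would presumably be chained in front of the real consumer via a MultiplexConsumer:

#include "clang/AST/ASTConsumer.h"
#include "clang/AST/DeclGroup.h"
#include <chrono>
#include <filesystem>

// Hypothetical helper from the earlier sketch.
void touchLockFile(const std::filesystem::path &LockFile);

class LockFileToucher : public clang::ASTConsumer {
  std::filesystem::path LockFile;
  std::chrono::steady_clock::time_point LastTouch;

public:
  explicit LockFileToucher(std::filesystem::path LockFilePath)
      : LockFile(std::move(LockFilePath)),
        LastTouch(std::chrono::steady_clock::now()) {}

  bool HandleTopLevelDecl(clang::DeclGroupRef DG) override {
    // Refresh the lock file's timestamp at most every 10 seconds (arbitrary)
    // so that waiters can tell this build is still alive.
    auto Now = std::chrono::steady_clock::now();
    if (Now - LastTouch > std::chrono::seconds(10)) {
      touchLockFile(LockFile);
      LastTouch = Now;
    }
    return true; // Keep parsing.
  }
};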

>  
>>>> Modified:
>>>>     cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td
>>>>     cfe/trunk/lib/Frontend/CompilerInstance.cpp
>>>> 
>>>> Modified: cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td
>>>> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td?rev=228604&r1=228603&r2=228604&view=diff
>>>> ==============================================================================
>>>> --- cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td (original)
>>>> +++ cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td Mon Feb  9 14:35:13 2015
>>>> @@ -83,6 +83,8 @@ def err_module_not_found : Error<"module
>>>>  def err_module_not_built : Error<"could not build module '%0'">, DefaultFatal;
>>>>  def err_module_lock_failure : Error<
>>>>    "could not acquire lock file for module '%0'">, DefaultFatal;
>>>> +def err_module_lock_timeout : Error<
>>>> +  "timed out waiting to acquire lock file for module '%0'">, DefaultFatal;
>>>>  def err_module_cycle : Error<"cyclic dependency in module '%0': %1">,
>>>>    DefaultFatal;
>>>>  def note_pragma_entered_here : Note<"#pragma entered here">;
>>>> 
>>>> Modified: cfe/trunk/lib/Frontend/CompilerInstance.cpp
>>>> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Frontend/CompilerInstance.cpp?rev=228604&r1=228603&r2=228604&view=diff
>>>> ==============================================================================
>>>> --- cfe/trunk/lib/Frontend/CompilerInstance.cpp (original)
>>>> +++ cfe/trunk/lib/Frontend/CompilerInstance.cpp Mon Feb  9 14:35:13 2015
>>>> @@ -1022,9 +1022,19 @@ static bool compileAndLoadModule(Compile
>>>>      case llvm::LockFileManager::LFS_Shared:
>>>>        // Someone else is responsible for building the module. Wait for them to
>>>>        // finish.
>>>> -      if (Locked.waitForUnlock() == llvm::LockFileManager::Res_OwnerDied)
>>>> +      switch (Locked.waitForUnlock()) {
>>>> +      case llvm::LockFileManager::Res_Success:
>>>> +        ModuleLoadCapabilities |= ASTReader::ARR_OutOfDate;
>>>> +        break;
>>>> +      case llvm::LockFileManager::Res_OwnerDied:
>>>>          continue; // try again to get the lock.
>>>> -      ModuleLoadCapabilities |= ASTReader::ARR_OutOfDate;
>>>> +      case llvm::LockFileManager::Res_Timeout:
>>>> +        Diags.Report(ModuleNameLoc, diag::err_module_lock_timeout)
>>>> +            << Module->Name;
>>>> +        // Clear the lock file so that future invocations can make progress.
>>>> +        Locked.unsafeRemoveLockFile();
>>>> +        return false;
>>>> +      }
>>>>        break;
>>>>      }
>>>> 
>>>> 
>>>> 
_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
