Re: [PATCH] Allow multiple modules with the same name to coexist in the module cache

Ben Langmuir Tue, 01 Apr 2014 16:40:22 -0700

On Apr 1, 2014, at 3:28 PM, Richard Smith <[email protected]> wrote:


> On Tue, Apr 1, 2014 at 12:43 PM, Ben Langmuir <[email protected]> wrote:
> 
> On Mar 31, 2014, at 7:59 PM, Richard Smith <[email protected]> wrote:
> 
>> Sorry, I think we're getting off-track here, my bad. I have a couple of 
>> in-line comments then I'm going to try to pull this discussion together at 
>> the bottom...
>> 
>> On Mon, Mar 31, 2014 at 4:08 PM, Ben Langmuir <[email protected]> wrote:
>> 
>> On Mar 31, 2014, at 3:24 PM, Richard Smith <[email protected]> wrote:
>> 
>>> On Mon, Mar 31, 2014 at 11:40 AM, Ben Langmuir <[email protected]> wrote:
>>> 
>>> On Mar 28, 2014, at 4:42 PM, Richard Smith <[email protected]> wrote:
>>> 
>>>> On Fri, Mar 28, 2014 at 3:27 PM, Ben Langmuir <[email protected]> wrote:
>>>> Hi Richard,
>>>> 
>>>> 
>>>> On Mar 28, 2014, at 2:45 PM, Richard Smith <[email protected]> wrote:
>>>> 
>>>>> On Fri, Mar 28, 2014 at 1:12 PM, Ben Langmuir <[email protected]> wrote:
>>>>> This patch allows multiple modules that have the same name to coexist in 
>>>>> the module cache.  To differentiate between two modules with the same 
>>>>> name, we will consider the path the module map file that they are defined 
>>>>> by* part of the ‘key’ for looking up the precompiled module (pcm file).  
>>>>> Specifically, this patch renames the precompiled module (pcm) files from
>>>>> 
>>>>> cache-path/<module hash>/Foo.pcm
>>>>> 
>>>>> to
>>>>> 
>>>>> cache-path/<module hash>/Foo-<hash of module map path>.pcm
>>>>> 
>>>>> From a high level, I don't really see why we need a second hash here. 
>>>>> Shouldn't the -I options be included in the <module hash>? If I build the 
>>>>> same module with different -I flags, that should resolve to different 
>>>>> .pcm files, regardless of whether it makes the module name resolve to a 
>>>>> different module.map file.
>>>> 
>>>> If we include the -I options in the module hash, we will explode the 
>>>> number of module compilations needed.  The following should all be able to 
>>>> share a module ‘A’.
>>>> 
>>>> I really don't think they should, if that second -I path is used in any 
>>>> way when building module 'A’.
>>>> 
>>>> clang -fmodules -I /path/to/A -I /path/to/B some_file.c
>>>> clang -fmodules -I /path/to/A -I /path/to/C some_file2.c
>>>> clang -fmodules -I /path/to/A -I /path/to/D some_file3.c
>>>> 
>>>> I would think a better solution here would be to not have that second -I 
>>>> path in header search when building module A (and thus not include it in 
>>>> the hash for a principled reason).
>>>> 
>>>> Ultimately, each module should have its own header search path -- the 
>>>> model of one set of include paths for the entire TU (with that include 
>>>> path potentially causing some libraries to find the wrong files) is a 
>>>> broken, antiquated legacy of the non-modules world. 
>>> 
>>> I agree with this in principle - having a predictable context for building 
>>> the modules would solve a lot of problems.  However, that seems to only 
>>> work for semantic import (@import, not #include), which is the less common 
>>> case for us at the moment.  
>>> 
>>> Can you explain why? Suppose every module.map file lists the include paths 
>>> for that module -- what problems does that create? (Is this problem 
>>> specific to us generating implicit module.map files for frameworks?)
>> 
>> If I write #include, or #import then I expect it to obey the include paths I 
>> set, since that is how it always worked before (for better or worse). For 
>> @import, and whatever other new syntax we introduce, I’m fully in favour of 
>> isolating the module build from the happenstance of the command line and 
>> using a header search derived from the module map file.
>> 
>> I think it's reasonable that adding a module.map file affects the semantics 
>> of a #include uses the module. In the case where someone has written a 
>> module map that specifies an include search path, it seems natural that that 
>> might (and should!) change the behavior of a #include mapped to that module. 
>>  
>> 
>> Conversely, I think it'd be surprising for a module build to pick up headers 
>> from some directory in the user of the module's code, whether I used @import 
>> or #include syntax to import the module. This is legacy brokenness from the 
>> pre-modules era that I think we should not bring with us.
>> 
>>>> In the short term, we should probably drop all header search paths that 
>>>> are before the path in which the module map was found.
>>> 
>>> To be clear, this has the same correctness problem as my approach - the 
>>> header search paths that come before the module map may still change the 
>>> content of the module, especially if the module depends on other modules.
>>> 
>>> What I'm suggesting is:
>>> 
>>> 1) Drop the -I paths that are earlier than the module in the header search 
>>> path when building the module
>>> 2) Include the rest of the header search paths in the configuration hash 
>>> for the module
>>> 
>>> I don't see how this still leads to collisions (other than the madness that 
>>> results from having multiple module.maps on your header search path that 
>>> define the same module, where the order in which we happen to load them 
>>> affects our semantics).
>> 
>> It doesn’t introduce collisions, but it may change the actual content of the 
>> module if the module includes headers that are affected by those include 
>> paths.
>> 
>> I agree, but as noted above, I consider that to be a bug fix. No-one writes 
>> libraries that expect the library user to provide headers for them to 
>> include. If they did, I feel happy telling them that their library is 
>> non-modular and needs to be fixed before it can be used with modules. Can 
>> you give an example of a situation where you imagine this being an issue?
>> 
>>> In general, adding the -I options is brittle when precompiled headers are 
>>> involved. Since you likely don’t have exactly the same -I options when 
>>> building your PCH as when you use it (you likely have fewer -I paths when 
>>> building the PCH if it is being widely used) you could potentially have 
>>> different module hashes in the PCH and in the main file, which leads to 
>>> trying to load multiple copies of a module from different paths.  I ran 
>>> into this when I tried to naively add all of the -I options to the module 
>>> hash.  The heuristic you suggested for dropping the earlier paths might 
>>> help here, but it would be required for correctness, not merely as an 
>>> optimization.
>>> 
>>> I'll need to think about this one some more. My initial reaction is that if 
>>> a PCH loads a module in one configuration and a user of the PCH tries to 
>>> load the same module in a different configuration (because it's being built 
>>> with different include paths), that should be a hard error. It seems to me 
>>> that this is a separate issue from ensuring that we don't have collisions 
>>> in the cache?
>> 
>> This patch does make it an error if the user of the PCH does not provide a 
>> header search context in which  the modules resolve the same way as when the 
>> pch was built.  I don’t think expecting the PCH to be created with exactly 
>> the same -I options is valuable, because it prevents legitimate reuse.
>> 
>> OK. I think that's great as far as it goes, but it doesn't solve the whole 
>> problem, because non-modular #includes can still resolve in different ways.
>> 
>>> I am also not sure what effect this would have on the global module index, 
>>> since the set of modules to load would span multiple hash-directories and 
>>> we would need to avoid looking at incompatible modules.  I’m not sure if 
>>> this is a problem or not.
>>> 
>>> Yes, that's an interesting problem =)
>>>  
>>> Any thoughts?
>>> 
>>> I don't think I've got enough information to suggest something concrete 
>>> yet, but this strikes me as a possible avenue to explore:
>>> 
>>> 1) Teach module.map files to specify the include paths for the module
>>> 2) When implicitly generating a module.map file, inject the include paths 
>>> from the host build (probably skipping ones prior to the directory 
>>> containing the framework)
>>> 3) Include a hash of the contents of the implicit module.map file in the 
>>> cache entry for any module that has an implicit module.map
>>> 4) Don't include the header search paths in the configuration hash, since 
>>> they're not used when determining how to build a module any more.
>> 
>> Does this help us with multiple modules having the same name?  I think we 
>> would still have the problem that when we go to load a pcm file it may have 
>> been built from a different module.map file than the one we found in this 
>> compilation (e.g. I have a module MyFoo, and so does another user on the 
>> system and we happen to share a module cache directory).  According to 4 we 
>> don’t hash the -I paths, so these modules would (potentially) resolve to the 
>> same location on disk.
>> 
>> This is convincing me more than ever that  regardless of what else we do, we 
>> need to store the path to the module map file with the pcm so that we can 
>> verify that the module we are trying to import resolves to the same location 
>> that it did when it was built. To me, the filename of the pcm is the most 
>> obvious place for this information, since that allows us to restrict 
>> ourselves to one module cache directory (ModuleCache/<module hash>) just 
>> like now, and it allows modules with the same name to live in the same cache 
>> without trampling each other or the possibility of accidentally loading the 
>> wrong one.
>> 
>> I'd like to take a step back at this point and summarize what I believe the 
>> state to be (sorry that this is rather long). We have three pieces of data 
>> that between them uniquely identify the contents of a module:
> 
> Thanks for the summary :)
> 
>> 
>> Build environment configuration:
>> --------------------------------
>> 
>>   This covers things like the language options (and, currently, 
>> configuration macros, though those probably belong in the module identity)
>>   We have one global module index per configuration, and we would like each 
>> compilation action to only touch a single configuration.
>>   We want to reuse the same build environment configuration as much as 
>> possible, to maximize cache hits.
>>   A hash of this data is used as the subdirectory in the module cache.
>> 
>> Module identity:
>> ----------------
>> 
>>   This covers things like the name of the module and the module.map file in 
>> which it was defined.
>>   The proposed patch includes a hash of this data in the name of the .pcm 
>> file.
>> 
>> Module header:
>> --------------
>> 
>>   This covers things that we discovered when building the module, such as 
>> the module's dependencies and their timestamps.
>>   The proposed patch adds the list of imported modules to this, such that 
>> header search must find the same module.
>>   If we have a mismatch here, we rebuild the module.
>> 
>> So far, this division seems entirely reasonable to me. Observations:
>>  * We only want to use one global index per build
>>  * We want to minimize the amount of irrelevant stuff in the global index 
>> for efficiency reasons (otherwise we could just have one global index for 
>> all configurations), and
>>  * We want to minimize the number of times we rebuild modules
>> 
>> From the above, I think that we should use the following guidelines:
>>  * For correctness, the (build environment + module identity + module 
>> header) must uniquely identify the module contents (because that's how we 
>> determine whether we have a cache hit).
>>  * We want to pick the data that goes into the build environment such that 
>> two module builds have the same build environment configuration if they 
>> could conceivably be used in the same TU, and we should try to keep their 
>> build environments different otherwise.
>>  * We only want to include things in the module header if changes to those 
>> things mean we'll probably never want to use the old .pcm file again.
> 
> As long as the list of imported modules lives in the module header, I don’t 
> think this last guideline makes sense.  For example, suppose I have a library 
> A that depends on library B.  I also have two users of library A, and they 
> each require a different version of B.  Or like the case above, I am the 
> developer of B and want to test against A using my modified version.
> 
> What I'm suggesting is that we should strive to put that data somewhere other 
> than the module header, since putting it in the module header will cause 
> unnecessary module rebuilds. If the set of module imports changes, the 
> optimal course of action depends on *why* it changes: if it's due to a 
> filesystem difference, we should rebuild the module (since the old pcm is 
> probably not useful any more), but if it's due to an include path change, we 
> should probably keep the existing .pcm file, since we might want to use it 
> again.
> 
> But that's only the optimal course of action, and we could certainly do 
> something locally suboptimal here if it makes other factors better.
> 
>> The main point of contention seems to be how header search paths fit into 
>> the above taxonomy. I think we've discussed a few options here:
>> 
>> 1) They are a property of the module, and are independent of the header 
>> search path of the module user. This means they are not part of the above 
>> data.
>> 
>> 2) They are a property of the build environment. (This was my original 
>> position.) This seems to mean that the module identity can be just the 
>> module name, if we enforce that the same module name can't be found multiple 
>> times in the include path.
> 
> There is a hole here that Argyrios pointed out in an earlier email.  You can 
> add a new module map file to a path already in the search, and that may 
> change the module identity without changing its name or configuration.  Now 
> if we are enforcing that the module only be found once, you would presumably 
> have to remove/modify the module map it was originally found in.  But to be 
> 100% correct we would still need to include the module map file in the module 
> identity.
> 
> Yes, I agree, and I think you've fully convinced me that we should re-resolve 
> of imported modules on module load (even if the include path is unchanged).
> 
> In fact, you've convinced me that all the changes your patch makes are going 
> in the right direction, so consider this an LGTM on the direction of the 
> patch. Thanks for your perseverance! I'd still like to keep hashing out the 
> details here, because I think we've not yet reached consensus on whether your 
> patch is a complete solution or just a partial one.

Great :) Do you also agree with using the module header to store and then 
re-resolve the imports or is that still contentious?

> 
>> 3) They are a property of the module identity.
>> 
>> 4) They're not part of anything. The names and resolved paths of imported 
>> modules are included in the module header. We hope that the include path 
>> didn't affect anything else. (This is what the patch we're discussing does.)
>> 
>> Option 4 seems to be only a partial solution to the include path issue, 
>> because we still have cache collisions between modules with different 
>> contents due to differing -I paths. Your suggested flag to make non-modular 
>> #includes inside a module an error would fix this, but I don't think that is 
>> going to work for much real-world code -- I don't think that non-modular 
>> #includes are going away.
> 
> I agree this won’t work for all code, but if the standard library is 
> modularized I think it does cover a lot of cases.
> 
>> 
>> I think we can fix option 4 by also including the set of files #included by 
>> non-modular includes in the module header (and anything else that depends on 
>> the include paths), and re-resolving those files too when we import a 
>> module. If the lookup differs, we rebuild the module. This has the 
>> unfortunate effect that we might rebuild a module when the prior .pcm is 
>> still useful, but that seems to be a rare case. I could probably live with 
>> that. Does that sound OK to you?
> 
> When we discussed this internally, we felt that it would be expensive 
> (potentially *lots* of header search lookups), and also hairy to implement 
> (you need not only the #include spelling, but also the “includes” context - 
> maybe more?).
> 
> OK, but if we don't do it, then we can get bogus hits in the module cache. 
> Perhaps those are rare enough that we don't need to worry, but they don't 
> seem fundamentally different from the cache hits you're trying to avoid by 
> re-resolving modules.

Right, I’m not trying to solve this with my patch.  Maybe re-resolving the 
non-modular includes is the best we can do for modules that don’t have their 
own search context and will not abide by the “it’s modules all the way down” 
rule.

> 
>> Another question is what include paths we should use for a module:
>> 
>> 1) They inherit all the include paths from the host build. (We keep the 
>> legacy broken model.)
>> 2) They inherit the include paths for themselves and anything later in the 
>> header search path.
>> 3) They get to specify their own include path.
>> 
>> (1) is the status quo. I suggest we allow (3) and fall back to (2) if no 
>> include path is specified.
> 
> 
> So what exactly does (3) entail? Does it provide the *entire* search context, 
> even for resolving dependent modules?  If so, we are making the module map 
> dependent on the system it is installed on.  If not, then we still need to 
> resolve module names during build to ensure they haven’t changed, regardless 
> of the mechanism we use to find them.
> 
> I think a module should be able to isolate itself from the header search 
> configuration of its users entirely, so I'd expect this to be the entire 
> search context. Another thing we've been discussing internally is the ability 
> for module map files to contain references to other module map files (so 
> library X can say "I depend on module Y, whose module map is over there"), 
> which largely removes the need for include paths.
> 
> I’m also worried that (2) is so close to (1) that it will subtly break users’ 
> expectations.
> 
> I think that's a fair concern.

_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits

Re: [PATCH] Allow multiple modules with the same name to coexist in the module cache

Reply via email to