On Tue, Apr 1, 2014 at 4:37 PM, Ben Langmuir <[email protected]> wrote:
> > On Apr 1, 2014, at 3:28 PM, Richard Smith <[email protected]> wrote: > > On Tue, Apr 1, 2014 at 12:43 PM, Ben Langmuir <[email protected]> wrote: > >> >> On Mar 31, 2014, at 7:59 PM, Richard Smith <[email protected]> wrote: >> >> Sorry, I think we're getting off-track here, my bad. I have a couple of >> in-line comments then I'm going to try to pull this discussion together at >> the bottom... >> >> On Mon, Mar 31, 2014 at 4:08 PM, Ben Langmuir <[email protected]> >> wrote: >> >>> >>> On Mar 31, 2014, at 3:24 PM, Richard Smith <[email protected]> >>> wrote: >>> >>> On Mon, Mar 31, 2014 at 11:40 AM, Ben Langmuir <[email protected]> >>> wrote: >>> >>>> >>>> On Mar 28, 2014, at 4:42 PM, Richard Smith <[email protected]> >>>> wrote: >>>> >>>> On Fri, Mar 28, 2014 at 3:27 PM, Ben Langmuir <[email protected]> >>>> wrote: >>>> >>>>> Hi Richard, >>>>> >>>>> >>>>> On Mar 28, 2014, at 2:45 PM, Richard Smith <[email protected]> >>>>> wrote: >>>>> >>>>> On Fri, Mar 28, 2014 at 1:12 PM, Ben Langmuir <[email protected]> >>>>> wrote: >>>>> >>>>>> This patch allows multiple modules that have the same name to coexist >>>>>> in the module cache. To differentiate between two modules with the same >>>>>> name, we will consider the path the module map file that they are defined >>>>>> by* part of the 'key' for looking up the precompiled module (pcm file). >>>>>> Specifically, this patch renames the precompiled module (pcm) files from >>>>>> >>>>>> cache-path/<module hash>/Foo.pcm >>>>>> >>>>>> to >>>>>> >>>>>> cache-path/<module hash>/Foo-<hash of module map path>.pcm >>>>>> >>>>> >>>>> From a high level, I don't really see why we need a second hash here. >>>>> Shouldn't the -I options be included in the <module hash>? If I build the >>>>> same module with different -I flags, that should resolve to different .pcm >>>>> files, regardless of whether it makes the module name resolve to a >>>>> different module.map file. >>>>> >>>>> >>>>> If we include the -I options in the module hash, we will explode the >>>>> number of module compilations needed. The following should all be able to >>>>> share a module 'A'. >>>>> >>>> >>>> I really don't think they should, if that second -I path is used in any >>>> way when building module 'A'. >>>> >>>> >>>> clang -fmodules -I /path/to/A -I /path/to/B some_file.c >>>>> clang -fmodules -I /path/to/A -I /path/to/C some_file2.c >>>>> clang -fmodules -I /path/to/A -I /path/to/D some_file3.c >>>>> >>>> >>>> I would think a better solution here would be to not have that second >>>> -I path in header search when building module A (and thus not include it in >>>> the hash for a principled reason). >>>> >>>> Ultimately, each module should have its own header search path -- the >>>> model of one set of include paths for the entire TU (with that include path >>>> potentially causing some libraries to find the wrong files) is a broken, >>>> antiquated legacy of the non-modules world. >>>> >>>> >>>> I agree with this in principle - having a predictable context for >>>> building the modules would solve a lot of problems. However, that seems to >>>> only work for semantic import (@import, not #include), which is the less >>>> common case for us at the moment. >>>> >>> >>> Can you explain why? Suppose every module.map file lists the include >>> paths for that module -- what problems does that create? (Is this problem >>> specific to us generating implicit module.map files for frameworks?) >>> >>> >>> If I write #include, or #import then I expect it to obey the include >>> paths I set, since that is how it always worked before (for better or >>> worse). For @import, and whatever other new syntax we introduce, I'm fully >>> in favour of isolating the module build from the happenstance of the >>> command line and using a header search derived from the module map file. >>> >> >> I think it's reasonable that adding a module.map file affects the >> semantics of a #include uses the module. In the case where someone has >> written a module map that specifies an include search path, it seems >> natural that that might (and should!) change the behavior of a #include >> mapped to that module. >> >> >> Conversely, I think it'd be surprising for a module build to pick up >> headers from some directory in the user of the module's code, whether I >> used @import or #include syntax to import the module. This is legacy >> brokenness from the pre-modules era that I think we should not bring with >> us. >> >> >> In the short term, we should probably drop all header search paths that >>>> are before the path in which the module map was found. >>>> >>>> >>>> To be clear, this has the same correctness problem as my approach - the >>>> header search paths that come before the module map may still change the >>>> content of the module, especially if the module depends on other modules. >>>> >>> >>> What I'm suggesting is: >>> >>> 1) Drop the -I paths that are earlier than the module in the header >>> search path when building the module >>> 2) Include the rest of the header search paths in the configuration hash >>> for the module >>> >>> I don't see how this still leads to collisions (other than the madness >>> that results from having multiple module.maps on your header search path >>> that define the same module, where the order in which we happen to load >>> them affects our semantics). >>> >>> >>> It doesn't introduce collisions, but it may change the actual content of >>> the module if the module includes headers that are affected by those >>> include paths. >>> >> >> I agree, but as noted above, I consider that to be a bug fix. No-one >> writes libraries that expect the library user to provide headers for them >> to include. If they did, I feel happy telling them that their library is >> non-modular and needs to be fixed before it can be used with modules. Can >> you give an example of a situation where you imagine this being an issue? >> >> >> In general, adding the -I options is brittle when precompiled headers are >>>> involved. Since you likely don't have exactly the same -I options when >>>> building your PCH as when you use it (you likely have fewer -I paths when >>>> building the PCH if it is being widely used) you could potentially have >>>> different module hashes in the PCH and in the main file, which leads to >>>> trying to load multiple copies of a module from different paths. I ran >>>> into this when I tried to naively add all of the -I options to the module >>>> hash. The heuristic you suggested for dropping the earlier paths might >>>> help here, but it would be required for correctness, not merely as an >>>> optimization. >>>> >>> >>> I'll need to think about this one some more. My initial reaction is that >>> if a PCH loads a module in one configuration and a user of the PCH tries to >>> load the same module in a different configuration (because it's being built >>> with different include paths), that should be a hard error. It seems to me >>> that this is a separate issue from ensuring that we don't have collisions >>> in the cache? >>> >>> >>> This patch does make it an error if the user of the PCH does not provide >>> a header search context in which the modules resolve the same way as when >>> the pch was built. I don't think expecting the PCH to be created with >>> exactly the same -I options is valuable, because it prevents legitimate >>> reuse. >>> >> >> OK. I think that's great as far as it goes, but it doesn't solve the >> whole problem, because non-modular #includes can still resolve in different >> ways. >> >> I am also not sure what effect this would have on the global module >>>> index, since the set of modules to load would span multiple >>>> hash-directories and we would need to avoid looking at incompatible >>>> modules. I'm not sure if this is a problem or not. >>>> >>> >>> Yes, that's an interesting problem =) >>> >>> >>>> Any thoughts? >>>> >>> >>> I don't think I've got enough information to suggest something concrete >>> yet, but this strikes me as a possible avenue to explore: >>> >>> 1) Teach module.map files to specify the include paths for the module >>> 2) When implicitly generating a module.map file, inject the include >>> paths from the host build (probably skipping ones prior to the directory >>> containing the framework) >>> 3) Include a hash of the contents of the implicit module.map file in the >>> cache entry for any module that has an implicit module.map >>> 4) Don't include the header search paths in the configuration hash, >>> since they're not used when determining how to build a module any more. >>> >>> >>> Does this help us with multiple modules having the same name? I think >>> we would still have the problem that when we go to load a pcm file it may >>> have been built from a different module.map file than the one we found in >>> this compilation (e.g. I have a module MyFoo, and so does another user on >>> the system and we happen to share a module cache directory). According to >>> 4 we don't hash the -I paths, so these modules would (potentially) resolve >>> to the same location on disk. >>> >>> This is convincing me more than ever that regardless of what else we >>> do, we need to store the path to the module map file with the pcm so that >>> we can verify that the module we are trying to import resolves to the same >>> location that it did when it was built. To me, the filename of the pcm is >>> the most obvious place for this information, since that allows us to >>> restrict ourselves to one module cache directory (ModuleCache/<module >>> hash>) just like now, and it allows modules with the same name to live in >>> the same cache without trampling each other or the possibility of >>> accidentally loading the wrong one. >>> >> >> I'd like to take a step back at this point and summarize what I believe >> the state to be (sorry that this is rather long). We have three pieces of >> data that between them uniquely identify the contents of a module: >> >> >> Thanks for the summary :) >> >> >> Build environment configuration: >> -------------------------------- >> >> This covers things like the language options (and, currently, >> configuration macros, though those probably belong in the module identity) >> We have one global module index per configuration, and we would like >> each compilation action to only touch a single configuration. >> We want to reuse the same build environment configuration as much as >> possible, to maximize cache hits. >> A hash of this data is used as the subdirectory in the module cache. >> >> Module identity: >> ---------------- >> >> This covers things like the name of the module and the module.map file >> in which it was defined. >> The proposed patch includes a hash of this data in the name of the .pcm >> file. >> >> Module header: >> -------------- >> >> This covers things that we discovered when building the module, such as >> the module's dependencies and their timestamps. >> The proposed patch adds the list of imported modules to this, such that >> header search must find the same module. >> >> If we have a mismatch here, we rebuild the module. >> >> So far, this division seems entirely reasonable to me. Observations: >> * We only want to use one global index per build >> * We want to minimize the amount of irrelevant stuff in the global index >> for efficiency reasons (otherwise we could just have one global index for >> all configurations), and >> * We want to minimize the number of times we rebuild modules >> >> From the above, I think that we should use the following guidelines: >> * For correctness, the (build environment + module identity + module >> header) must uniquely identify the module contents (because that's how we >> determine whether we have a cache hit). >> * We want to pick the data that goes into the build environment such >> that two module builds have the same build environment configuration if >> they could conceivably be used in the same TU, and we should try to keep >> their build environments different otherwise. >> * We only want to include things in the module header if changes to >> those things mean we'll probably never want to use the old .pcm file again. >> >> >> As long as the list of imported modules lives in the module header, I >> don't think this last guideline makes sense. For example, suppose I have a >> library A that depends on library B. I also have two users of library A, >> and they each require a different version of B. Or like the case above, I >> am the developer of B and want to test against A using my modified version. >> > > What I'm suggesting is that we should strive to put that data somewhere > other than the module header, since putting it in the module header will > cause unnecessary module rebuilds. If the set of module imports changes, > the optimal course of action depends on *why* it changes: if it's due to a > filesystem difference, we should rebuild the module (since the old pcm is > probably not useful any more), but if it's due to an include path change, > we should probably keep the existing .pcm file, since we might want to use > it again. > > But that's only the optimal course of action, and we could certainly do > something locally suboptimal here if it makes other factors better. > > The main point of contention seems to be how header search paths fit into >> the above taxonomy. I think we've discussed a few options here: >> >> 1) They are a property of the module, and are independent of the header >> search path of the module user. This means they are not part of the above >> data. >> >> 2) They are a property of the build environment. (This was my original >> position.) This seems to mean that the module identity can be just the >> module name, if we enforce that the same module name can't be found >> multiple times in the include path. >> >> >> There is a hole here that Argyrios pointed out in an earlier email. You >> can add a new module map file to a path already in the search, and that may >> change the module identity without changing its name or configuration. Now >> if we are enforcing that the module only be found once, you would >> presumably have to remove/modify the module map it was originally found in. >> But to be 100% correct we would still need to include the module map file >> in the module identity. >> > > Yes, I agree, and I think you've fully convinced me that we should > re-resolve of imported modules on module load (even if the include path is > unchanged). > > In fact, you've convinced me that all the changes your patch makes are > going in the right direction, so consider this an LGTM on the direction of > the patch. Thanks for your perseverance! I'd still like to keep hashing out > the details here, because I think we've not yet reached consensus on > whether your patch is a complete solution or just a partial one. > > > Great :) Do you also agree with using the module header to store and then > re-resolve the imports or is that still contentious? > That seems like the right approach for dealing with the possibility of a module map earlier in the include path introducing a module with the same name (even if we also wanted some other mechanism for dealing with changes to the include path). > 3) They are a property of the module identity. >> >> 4) They're not part of anything. The names and resolved paths of imported >> modules are included in the module header. We hope that the include path >> didn't affect anything else. (This is what the patch we're discussing does.) >> >> >> Option 4 seems to be only a partial solution to the include path issue, >> because we still have cache collisions between modules with different >> contents due to differing -I paths. Your suggested flag to make non-modular >> #includes inside a module an error would fix this, but I don't think that >> is going to work for much real-world code -- I don't think that non-modular >> #includes are going away. >> >> >> I agree this won't work for all code, but if the standard library is >> modularized I think it does cover a lot of cases. >> >> >> I think we can fix option 4 by also including the set of files #included >> by non-modular includes in the module header (and anything else that >> depends on the include paths), and re-resolving those files too when we >> import a module. If the lookup differs, we rebuild the module. This has the >> unfortunate effect that we might rebuild a module when the prior .pcm is >> still useful, but that seems to be a rare case. I could probably live with >> that. Does that sound OK to you? >> >> >> When we discussed this internally, we felt that it would be expensive >> (potentially *lots* of header search lookups), and also hairy to implement >> (you need not only the #include spelling, but also the "includes" context - >> maybe more?). >> > > OK, but if we don't do it, then we can get bogus hits in the module cache. > Perhaps those are rare enough that we don't need to worry, but they don't > seem fundamentally different from the cache hits you're trying to avoid by > re-resolving modules. > > > Right, I'm not trying to solve this with my patch. Maybe re-resolving the > non-modular includes is the best we can do for modules that don't have > their own search context and will not abide by the "it's modules all the > way down" rule. > That sounds like it might be a good tradeoff. And as you say, it's not something that this patch needs to deal with. Another question is what include paths we should use for a module: >> >> 1) They inherit all the include paths from the host build. (We keep the >> legacy broken model.) >> 2) They inherit the include paths for themselves and anything later in >> the header search path. >> 3) They get to specify their own include path. >> >> (1) is the status quo. I suggest we allow (3) and fall back to (2) if no >> include path is specified. >> >> >> So what exactly does (3) entail? Does it provide the *entire* search >> context, even for resolving dependent modules? If so, we are making the >> module map dependent on the system it is installed on. If not, then we >> still need to resolve module names during build to ensure they haven't >> changed, regardless of the mechanism we use to find them. >> > > I think a module should be able to isolate itself from the header search > configuration of its users entirely, so I'd expect this to be the entire > search context. Another thing we've been discussing internally is the > ability for module map files to contain references to other module map > files (so library X can say "I depend on module Y, whose module map is over > there"), which largely removes the need for include paths. > > > I'm also worried that (2) is so close to (1) that it will subtly break >> users' expectations. >> > > I think that's a fair concern. > > >
_______________________________________________ cfe-commits mailing list [email protected] http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
