On Apr 1, 2014, at 3:28 PM, Richard Smith <[email protected]> wrote:
> On Tue, Apr 1, 2014 at 12:43 PM, Ben Langmuir <[email protected]> wrote: > > On Mar 31, 2014, at 7:59 PM, Richard Smith <[email protected]> wrote: > >> Sorry, I think we're getting off-track here, my bad. I have a couple of >> in-line comments then I'm going to try to pull this discussion together at >> the bottom... >> >> On Mon, Mar 31, 2014 at 4:08 PM, Ben Langmuir <[email protected]> wrote: >> >> On Mar 31, 2014, at 3:24 PM, Richard Smith <[email protected]> wrote: >> >>> On Mon, Mar 31, 2014 at 11:40 AM, Ben Langmuir <[email protected]> wrote: >>> >>> On Mar 28, 2014, at 4:42 PM, Richard Smith <[email protected]> wrote: >>> >>>> On Fri, Mar 28, 2014 at 3:27 PM, Ben Langmuir <[email protected]> wrote: >>>> Hi Richard, >>>> >>>> >>>> On Mar 28, 2014, at 2:45 PM, Richard Smith <[email protected]> wrote: >>>> >>>>> On Fri, Mar 28, 2014 at 1:12 PM, Ben Langmuir <[email protected]> wrote: >>>>> This patch allows multiple modules that have the same name to coexist in >>>>> the module cache. To differentiate between two modules with the same >>>>> name, we will consider the path the module map file that they are defined >>>>> by* part of the ‘key’ for looking up the precompiled module (pcm file). >>>>> Specifically, this patch renames the precompiled module (pcm) files from >>>>> >>>>> cache-path/<module hash>/Foo.pcm >>>>> >>>>> to >>>>> >>>>> cache-path/<module hash>/Foo-<hash of module map path>.pcm >>>>> >>>>> From a high level, I don't really see why we need a second hash here. >>>>> Shouldn't the -I options be included in the <module hash>? If I build the >>>>> same module with different -I flags, that should resolve to different >>>>> .pcm files, regardless of whether it makes the module name resolve to a >>>>> different module.map file. >>>> >>>> If we include the -I options in the module hash, we will explode the >>>> number of module compilations needed. The following should all be able to >>>> share a module ‘A’. >>>> >>>> I really don't think they should, if that second -I path is used in any >>>> way when building module 'A’. >>>> >>>> clang -fmodules -I /path/to/A -I /path/to/B some_file.c >>>> clang -fmodules -I /path/to/A -I /path/to/C some_file2.c >>>> clang -fmodules -I /path/to/A -I /path/to/D some_file3.c >>>> >>>> I would think a better solution here would be to not have that second -I >>>> path in header search when building module A (and thus not include it in >>>> the hash for a principled reason). >>>> >>>> Ultimately, each module should have its own header search path -- the >>>> model of one set of include paths for the entire TU (with that include >>>> path potentially causing some libraries to find the wrong files) is a >>>> broken, antiquated legacy of the non-modules world. >>> >>> I agree with this in principle - having a predictable context for building >>> the modules would solve a lot of problems. However, that seems to only >>> work for semantic import (@import, not #include), which is the less common >>> case for us at the moment. >>> >>> Can you explain why? Suppose every module.map file lists the include paths >>> for that module -- what problems does that create? (Is this problem >>> specific to us generating implicit module.map files for frameworks?) >> >> If I write #include, or #import then I expect it to obey the include paths I >> set, since that is how it always worked before (for better or worse). For >> @import, and whatever other new syntax we introduce, I’m fully in favour of >> isolating the module build from the happenstance of the command line and >> using a header search derived from the module map file. >> >> I think it's reasonable that adding a module.map file affects the semantics >> of a #include uses the module. In the case where someone has written a >> module map that specifies an include search path, it seems natural that that >> might (and should!) change the behavior of a #include mapped to that module. >> >> >> Conversely, I think it'd be surprising for a module build to pick up headers >> from some directory in the user of the module's code, whether I used @import >> or #include syntax to import the module. This is legacy brokenness from the >> pre-modules era that I think we should not bring with us. >> >>>> In the short term, we should probably drop all header search paths that >>>> are before the path in which the module map was found. >>> >>> To be clear, this has the same correctness problem as my approach - the >>> header search paths that come before the module map may still change the >>> content of the module, especially if the module depends on other modules. >>> >>> What I'm suggesting is: >>> >>> 1) Drop the -I paths that are earlier than the module in the header search >>> path when building the module >>> 2) Include the rest of the header search paths in the configuration hash >>> for the module >>> >>> I don't see how this still leads to collisions (other than the madness that >>> results from having multiple module.maps on your header search path that >>> define the same module, where the order in which we happen to load them >>> affects our semantics). >> >> It doesn’t introduce collisions, but it may change the actual content of the >> module if the module includes headers that are affected by those include >> paths. >> >> I agree, but as noted above, I consider that to be a bug fix. No-one writes >> libraries that expect the library user to provide headers for them to >> include. If they did, I feel happy telling them that their library is >> non-modular and needs to be fixed before it can be used with modules. Can >> you give an example of a situation where you imagine this being an issue? >> >>> In general, adding the -I options is brittle when precompiled headers are >>> involved. Since you likely don’t have exactly the same -I options when >>> building your PCH as when you use it (you likely have fewer -I paths when >>> building the PCH if it is being widely used) you could potentially have >>> different module hashes in the PCH and in the main file, which leads to >>> trying to load multiple copies of a module from different paths. I ran >>> into this when I tried to naively add all of the -I options to the module >>> hash. The heuristic you suggested for dropping the earlier paths might >>> help here, but it would be required for correctness, not merely as an >>> optimization. >>> >>> I'll need to think about this one some more. My initial reaction is that if >>> a PCH loads a module in one configuration and a user of the PCH tries to >>> load the same module in a different configuration (because it's being built >>> with different include paths), that should be a hard error. It seems to me >>> that this is a separate issue from ensuring that we don't have collisions >>> in the cache? >> >> This patch does make it an error if the user of the PCH does not provide a >> header search context in which the modules resolve the same way as when the >> pch was built. I don’t think expecting the PCH to be created with exactly >> the same -I options is valuable, because it prevents legitimate reuse. >> >> OK. I think that's great as far as it goes, but it doesn't solve the whole >> problem, because non-modular #includes can still resolve in different ways. >> >>> I am also not sure what effect this would have on the global module index, >>> since the set of modules to load would span multiple hash-directories and >>> we would need to avoid looking at incompatible modules. I’m not sure if >>> this is a problem or not. >>> >>> Yes, that's an interesting problem =) >>> >>> Any thoughts? >>> >>> I don't think I've got enough information to suggest something concrete >>> yet, but this strikes me as a possible avenue to explore: >>> >>> 1) Teach module.map files to specify the include paths for the module >>> 2) When implicitly generating a module.map file, inject the include paths >>> from the host build (probably skipping ones prior to the directory >>> containing the framework) >>> 3) Include a hash of the contents of the implicit module.map file in the >>> cache entry for any module that has an implicit module.map >>> 4) Don't include the header search paths in the configuration hash, since >>> they're not used when determining how to build a module any more. >> >> Does this help us with multiple modules having the same name? I think we >> would still have the problem that when we go to load a pcm file it may have >> been built from a different module.map file than the one we found in this >> compilation (e.g. I have a module MyFoo, and so does another user on the >> system and we happen to share a module cache directory). According to 4 we >> don’t hash the -I paths, so these modules would (potentially) resolve to the >> same location on disk. >> >> This is convincing me more than ever that regardless of what else we do, we >> need to store the path to the module map file with the pcm so that we can >> verify that the module we are trying to import resolves to the same location >> that it did when it was built. To me, the filename of the pcm is the most >> obvious place for this information, since that allows us to restrict >> ourselves to one module cache directory (ModuleCache/<module hash>) just >> like now, and it allows modules with the same name to live in the same cache >> without trampling each other or the possibility of accidentally loading the >> wrong one. >> >> I'd like to take a step back at this point and summarize what I believe the >> state to be (sorry that this is rather long). We have three pieces of data >> that between them uniquely identify the contents of a module: > > Thanks for the summary :) > >> >> Build environment configuration: >> -------------------------------- >> >> This covers things like the language options (and, currently, >> configuration macros, though those probably belong in the module identity) >> We have one global module index per configuration, and we would like each >> compilation action to only touch a single configuration. >> We want to reuse the same build environment configuration as much as >> possible, to maximize cache hits. >> A hash of this data is used as the subdirectory in the module cache. >> >> Module identity: >> ---------------- >> >> This covers things like the name of the module and the module.map file in >> which it was defined. >> The proposed patch includes a hash of this data in the name of the .pcm >> file. >> >> Module header: >> -------------- >> >> This covers things that we discovered when building the module, such as >> the module's dependencies and their timestamps. >> The proposed patch adds the list of imported modules to this, such that >> header search must find the same module. >> If we have a mismatch here, we rebuild the module. >> >> So far, this division seems entirely reasonable to me. Observations: >> * We only want to use one global index per build >> * We want to minimize the amount of irrelevant stuff in the global index >> for efficiency reasons (otherwise we could just have one global index for >> all configurations), and >> * We want to minimize the number of times we rebuild modules >> >> From the above, I think that we should use the following guidelines: >> * For correctness, the (build environment + module identity + module >> header) must uniquely identify the module contents (because that's how we >> determine whether we have a cache hit). >> * We want to pick the data that goes into the build environment such that >> two module builds have the same build environment configuration if they >> could conceivably be used in the same TU, and we should try to keep their >> build environments different otherwise. >> * We only want to include things in the module header if changes to those >> things mean we'll probably never want to use the old .pcm file again. > > As long as the list of imported modules lives in the module header, I don’t > think this last guideline makes sense. For example, suppose I have a library > A that depends on library B. I also have two users of library A, and they > each require a different version of B. Or like the case above, I am the > developer of B and want to test against A using my modified version. > > What I'm suggesting is that we should strive to put that data somewhere other > than the module header, since putting it in the module header will cause > unnecessary module rebuilds. If the set of module imports changes, the > optimal course of action depends on *why* it changes: if it's due to a > filesystem difference, we should rebuild the module (since the old pcm is > probably not useful any more), but if it's due to an include path change, we > should probably keep the existing .pcm file, since we might want to use it > again. > > But that's only the optimal course of action, and we could certainly do > something locally suboptimal here if it makes other factors better. > >> The main point of contention seems to be how header search paths fit into >> the above taxonomy. I think we've discussed a few options here: >> >> 1) They are a property of the module, and are independent of the header >> search path of the module user. This means they are not part of the above >> data. >> >> 2) They are a property of the build environment. (This was my original >> position.) This seems to mean that the module identity can be just the >> module name, if we enforce that the same module name can't be found multiple >> times in the include path. > > There is a hole here that Argyrios pointed out in an earlier email. You can > add a new module map file to a path already in the search, and that may > change the module identity without changing its name or configuration. Now > if we are enforcing that the module only be found once, you would presumably > have to remove/modify the module map it was originally found in. But to be > 100% correct we would still need to include the module map file in the module > identity. > > Yes, I agree, and I think you've fully convinced me that we should re-resolve > of imported modules on module load (even if the include path is unchanged). > > In fact, you've convinced me that all the changes your patch makes are going > in the right direction, so consider this an LGTM on the direction of the > patch. Thanks for your perseverance! I'd still like to keep hashing out the > details here, because I think we've not yet reached consensus on whether your > patch is a complete solution or just a partial one. Great :) Do you also agree with using the module header to store and then re-resolve the imports or is that still contentious? > >> 3) They are a property of the module identity. >> >> 4) They're not part of anything. The names and resolved paths of imported >> modules are included in the module header. We hope that the include path >> didn't affect anything else. (This is what the patch we're discussing does.) >> >> Option 4 seems to be only a partial solution to the include path issue, >> because we still have cache collisions between modules with different >> contents due to differing -I paths. Your suggested flag to make non-modular >> #includes inside a module an error would fix this, but I don't think that is >> going to work for much real-world code -- I don't think that non-modular >> #includes are going away. > > I agree this won’t work for all code, but if the standard library is > modularized I think it does cover a lot of cases. > >> >> I think we can fix option 4 by also including the set of files #included by >> non-modular includes in the module header (and anything else that depends on >> the include paths), and re-resolving those files too when we import a >> module. If the lookup differs, we rebuild the module. This has the >> unfortunate effect that we might rebuild a module when the prior .pcm is >> still useful, but that seems to be a rare case. I could probably live with >> that. Does that sound OK to you? > > When we discussed this internally, we felt that it would be expensive > (potentially *lots* of header search lookups), and also hairy to implement > (you need not only the #include spelling, but also the “includes” context - > maybe more?). > > OK, but if we don't do it, then we can get bogus hits in the module cache. > Perhaps those are rare enough that we don't need to worry, but they don't > seem fundamentally different from the cache hits you're trying to avoid by > re-resolving modules. Right, I’m not trying to solve this with my patch. Maybe re-resolving the non-modular includes is the best we can do for modules that don’t have their own search context and will not abide by the “it’s modules all the way down” rule. > >> Another question is what include paths we should use for a module: >> >> 1) They inherit all the include paths from the host build. (We keep the >> legacy broken model.) >> 2) They inherit the include paths for themselves and anything later in the >> header search path. >> 3) They get to specify their own include path. >> >> (1) is the status quo. I suggest we allow (3) and fall back to (2) if no >> include path is specified. > > > So what exactly does (3) entail? Does it provide the *entire* search context, > even for resolving dependent modules? If so, we are making the module map > dependent on the system it is installed on. If not, then we still need to > resolve module names during build to ensure they haven’t changed, regardless > of the mechanism we use to find them. > > I think a module should be able to isolate itself from the header search > configuration of its users entirely, so I'd expect this to be the entire > search context. Another thing we've been discussing internally is the ability > for module map files to contain references to other module map files (so > library X can say "I depend on module Y, whose module map is over there"), > which largely removes the need for include paths. > > I’m also worried that (2) is so close to (1) that it will subtly break users’ > expectations. > > I think that's a fair concern.
_______________________________________________ cfe-commits mailing list [email protected] http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
