[PATCH] D106064: [clang][deps] Normalize paths in minimizing file system

Jan Svoboda via Phabricator via cfe-commits Mon, 19 Jul 2021 10:19:51 -0700

jansvoboda11 added a comment.

With the call to `llvm::sys::path::native` scoped only to `IgnoredFiles`, would 
this patch LGTY?




================
Comment at: 
clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp:161-162
+    const StringRef RawFilename) {
+  llvm::SmallString<256> Filename;
+  llvm::sys::path::native(RawFilename, Filename);
+
----------------
dexonsmith wrote:
> jansvoboda11 wrote:
> > dexonsmith wrote:
> > > I'm a bit nervous about the impact of modifying the input filename on 
> > > Windows before passing it into other APIs. This could change behaviour of 
> > > lower layers of the VFS (since they'll see a different filename than when 
> > > DependencyScanningWOrkerFileSystem is NOT on top of them).
> > > 
> > > Can we restrict this just to what's passed to IgnoredFiles? (Maybe add 
> > > `shouldIgnore()` API, which returns `false` if the set is empty, and then 
> > > locally converts to native and checks for membership...)
> > > 
> > > It also seems wasteful to be calling `sys::path::native` and the memcpy 
> > > all the time, when usually it has no effect. Have you checked whether 
> > > this affects performance of scanning something big?
> > Yeah, I can see that path changing between VFS layers can be problematic. 
> > I'm pretty sure we can get away with only converting `Filename` to its 
> > native form when interacting with `IgnoredFiles`.
> > 
> > I haven't checked the performance impact. If it ends up being measurable, I 
> > could implement something like `sys::path::is_native` and avoid the copy 
> > most of the time on unix-like OSes. WDYT?
> Probably it'll end up not being measurable, but if it is, something like 
> `is_native` might help... that said, if this will eventually be replaced with 
> logic relyin on fs::UniqueID it might not be worth optimizing.
Agreed.


================
Comment at: 
clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp:171-172
 
   bool KeepOriginalSource = IgnoredFiles.count(Filename) ||
                             !shouldMinimize(Filename);
   DependencyScanningFilesystemSharedCache::SharedFileSystemEntry
----------------
dexonsmith wrote:
> jansvoboda11 wrote:
> > dexonsmith wrote:
> > > Looking at this, makes me wonder if this is just fixing a specific 
> > > instance of a more general problem.
> > > 
> > > Maybe `IgnoredFiles` should be a set of `FileEntry`s instead of 
> > > `StringRef`s... but that'd create a different performance bottleneck when 
> > > the set is big, since creating the FileEntrys would be expensive. We'd 
> > > want the FileEntry lookup to be globally cached / etc. -- and FileManager 
> > > isn't quite safe to use globally.
> > > 
> > > Do you think IgnoredFiles as-is will work well enough for where it'll be 
> > > used for PCH? Or do we need to catch headers referenced in two different 
> > > ways somehow?
> > I think we could use `llvm::sys::fs::UniqueID` instead of the filename to 
> > refer to files. Since the VFS layer resolves symlinks when stat-ing a file, 
> > that should be a canonical file identifier. I can tackle that in a follow 
> > up patch.
> Yup, a unique ID should work for a file identifier.
> 
> I'm concerned about the cost of looking up the unique ID — avoiding stat 
> traffic was measured to be an important performance benefit in the dependency 
> scanner model.
> 
> To avoid a perf regression, I think you could use caches like:
> - ids: filename -> unique-id
> - originals: unique-id -> original file content
> - minimized: unique-id -> minimized file content
> 
> Where "ids" and "originals" are read/cached in lock-step when accessing a 
> filename, additionally computing "minimized" if not in the ignore-list. 
> (Adding a file to the ignore-list would put content in "ids" and "originals".)
> 
> The goal is to amortize the `stat` cost across the lifetime of the service 
> while ensuring a consistent view of the file content.
> 
> WDYT?
> 
> ... regardless I think all of this is out of scope for the current patch, 
> which is still useful for unblocking adding tests to the subsequent patches 
> in the stack.
Yes, this is the cache structure I had in mind.

I agree that this should be tackled in a follow-up patch. I'm going to create a 
patch with xfailing test case that demonstrates how one file with two different 
names (e.g. symlink) can cause issues with the current approach.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106064/new/

https://reviews.llvm.org/D106064

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D106064: [clang][deps] Normalize paths in minimizing file system

Reply via email to