[PATCH] D114966: [clang][deps] Split stat and file content caches

Duncan P. N. Exon Smith via Phabricator via cfe-commits Mon, 20 Dec 2021 10:31:48 -0800

dexonsmith requested changes to this revision.
dexonsmith added a comment.
This revision now requires changes to proceed.


I'm liking the new direction here; requesting changes since it looks like the 
filename is being used to pick the shard for UIDMap, which will lead to 
multiple opinions of what each UID means.



================
Comment at: 
clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h:332
+
+    TentativeEntry(llvm::vfs::Status Status,
+                   std::unique_ptr<llvm::MemoryBuffer> Contents)
----------------
I'd use `const&` to avoid copying the string on the way in... see below.


================
Comment at: 
clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h:334
+                   std::unique_ptr<llvm::MemoryBuffer> Contents)
+        : Status(std::move(Status)), Contents(std::move(Contents)) {}
+  };
----------------
I think, rather than move/copy the status name, the name should be wiped out to 
ensure no one relies on it. Every access should use `copyWithNewName()` since 
this is shared across all things that point to the same UID... so let's use 
`copyWithNewName()` here to drop the ignored name.


================
Comment at: 
clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h:359-360
+  findSharedEntryByUID(llvm::vfs::Status Stat) const {
+    return SharedCache.getShard(Stat.getName())
+        .findEntryByUID(Stat.getUniqueID());
+  }
----------------
This doesn't look right to me. UIDs should be sharded independently of the 
filename they happen to have been reached by; otherwise each filename shard is 
developing its own idea of what each UID means. Since UID distribution is not 
uniform, probably the UID shard should be chosen by 
`hash_value(Stat.getUniqueID()) % NumShards`.

You could use the same sets of shards for UIDMap and FilenameMap, but since 
they're independent I'd probably do:
- UIDCache: sharded by UID: UIDMap and BumpPtrAllocator for entries (and likely 
anything else tied to content)
- FilenameCache: sharded by filename: FilenameMap (and perhaps other things 
tied to filename?)


================
Comment at: 
clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp:23-24
+  auto MaybeFile = getUnderlyingFS().openFileForRead(Filename);
   if (!MaybeFile)
     return MaybeFile.getError();
   auto File = std::move(*MaybeFile);
----------------
In what circumstances should this return a cached-error TentativeEntry? Any?


================
Comment at: 
clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp:28-29
   auto MaybeStat = File->status();
   if (!MaybeStat)
     return MaybeStat.getError();
   auto Stat = std::move(*MaybeStat);
----------------
Since the file was opened, should we return cached-error TentativeEntry here, 
rather than an error?


================
Comment at: 
clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp:33-34
   auto MaybeBuffer = File->getBuffer(Stat.getName());
   if (!MaybeBuffer)
     return MaybeBuffer.getError();
   auto Buffer = std::move(*MaybeBuffer);
----------------
After a successful stat on the same file descriptor, it definitely feels like 
this is an error that should be cached, and a TentativeEntry that is in an 
error state should be returned.


================
Comment at: 
clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp:275
+  const auto &Entry1 = getOrEmplaceSharedEntryForUID(std::move(TEntry));
+  const auto &Entry2 = getOrInsertSharedEntryForFilename(Filename, Entry1);
+  return insertLocalEntryForFilename(Filename, Entry2);
----------------
jansvoboda11 wrote:
> I'm not sure these should be separate. We could end up in situation where the 
> Filename map contains different entry than the UID map for the same directory 
> entry. I'm tempted to merge these functions into one and perform the updates 
> in a single critical section...
> I'm not sure these should be separate. We could end up in situation where the 
> Filename map contains different entry than the UID map for the same directory 
> entry.

I'm also sure precisely what you mean by "for the same directory entry" in this 
context; and I don't see what's wrong with the situation I think you're 
outlining.

> I'm tempted to merge these functions into one and perform the updates in a 
> single critical section...

A single critical section for setting UID and filename at the same time would 
be hard to get right (and efficient), since UIDs have aliases through other 
filenames due to different directory paths (dir/../x.h vs x.h) and filesystem 
links (hard and symbolic).

Here's the race that I think(?) you're worried about:

- Worker1 does a tentative stat of "x.h", finds a UID that isn't mapped (UIDX1, 
but it's ignored...).
- Worker2 does a tentative stat of "x.h", finds a UID that isn't mapped (UIDX1, 
but it's ignored...).
- Worker1 opens "x.h", finds ContentX1+StatX1 (with UIDX1), saves mapping UIDX1 
-> ContentX1+StatX1.
- "x.h" changes.
- Worker2 opens "x.h", finds ContentX2+StatX2 (with UIDX2), saves mapping UIDX2 
-> ContentX2+StatX2.
- Worker2 saves mapping "x.h" -> ContentX2+StatX2.
- Both workers move forward with "x.h" -> ContentX2+StatX2.

IIUC, you're concerned that the mapping UIDX1 -> ContentX1+StatX1 was saved. 
The side effect is that if a future tentative stat of (e.g.) "y.h" returns 
UIDX1, then "y.h" will be mapped to ContentX1+StatX1. Is this what concerns 
you? Why? (Is it something else?)

The concern I have is that some filesystems recycle UIDs (maybe "x.h" *was* a 
symbolic link to "y.h" and then became its own file... or maybe "x.h" and "y.h" 
were hard links... or maybe "y.h" is just a new file!). But that's a problem 
with using UIDs to detect equivalent filesystem links / content in general. I 
don't see any reason to be more concerned here than elsewhere, and to avoid 
depending on UID we'd need a pretty different design (e.g., lazily detect and 
model directory structure and symbolic links).


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114966/new/

https://reviews.llvm.org/D114966

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D114966: [clang][deps] Split stat and file content caches

Reply via email to