https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106014

            Bug ID: 106014
           Summary: Overload std::distance for
                    filesystem::recursive_directory_iterator
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: redi at gcc dot gnu.org
  Target Milestone: ---

See
https://www.reddit.com/r/cpp/comments/vdtlzp/comment/icn3fva/?utm_source=share&utm_medium=web2x&context=3

A user reported that a C++ program (using GCC) to count the number of files
under a directory is 5x slower than the Rust equivalent. The reason is that
dereferencing a directory iterator has to construct a filesystem::path, which
is expensive. Avoiding the dereference improves it a lot, but it's still 2x
slower.

By defining a specialized overload of std::distance for
filesystem::recursive_iterator we can double the speed of the C++ version and
slightly beat the Rust code that uses the walkdir crate (although the jwalk
crate is still faster).


The custom overload looks like this:

--- a/libstdc++-v3/include/bits/fs_dir.h
+++ b/libstdc++-v3/include/bits/fs_dir.h
@@ -44,6 +44,10 @@ namespace std _GLIBCXX_VISIBILITY(default)
 {
 _GLIBCXX_BEGIN_NAMESPACE_VERSION

+  ptrdiff_t
+  distance(filesystem::recursive_directory_iterator,
+          filesystem::recursive_directory_iterator);
+
 namespace filesystem
 {
   /** @addtogroup filesystem
@@ -545,6 +549,9 @@ _GLIBCXX_BEGIN_NAMESPACE_CXX11
     filesystem::remove_all(const path&, error_code&);
     friend uintmax_t
     filesystem::remove_all(const path&);
+
+    friend ptrdiff_t
+    std::distance(recursive_directory_iterator, recursive_directory_iterator);
   };

   /// @relates std::filesystem::recursive_directory_iterator @{


--- a/libstdc++-v3/src/c++17/fs_dir.cc
+++ b/libstdc++-v3/src/c++17/fs_dir.cc
@@ -559,3 +564,25 @@ fs::recursive_directory_iterator::__erase(error_code*
ecptr)

   return *this;
 }
+
+std::ptrdiff_t
+std::distance(fs::recursive_directory_iterator first,
+             fs::recursive_directory_iterator last)
+{
+  if (first == last)
+    return 0;
+
+#if _GLIBCXX_HAVE_DIRFD
+  if (last == fs::recursive_directory_iterator())
+    {
+      first._M_dirs->options |= __directory_iterator_filename_only;
+      first._M_dirs->top().path.clear();
+    }
+#endif
+
+  ptrdiff_t d = 0;
+  do
+    ++d;
+  while (++first != last);
+  return d;
+}


(Additional changes are needed to export the symbols from the shared lib).

The libstdc++ code for actually walking directories is reasonably efficient,
all the overhead is in creating the path objects that are required for the
public API. So the trick is to avoid storing the path when we increment the
iterator, because we know that it will never be needed by std::distance. This
is only possible when using ::dirfd and ::openat to recurse into directories,
because without those we do need the path. For an OS without those POSIX
functions the custom std::distance will be no faster than the naive handwritten
loop (but no slower).

It should be worth doing for filesystem::directory_iterator too, but I haven't
benchmarked that. I think that would just require clearing the path stored in
the _Dir.

Reply via email to