https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106014
Bug ID: 106014 Summary: Overload std::distance for filesystem::recursive_directory_iterator Product: gcc Version: 13.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: redi at gcc dot gnu.org Target Milestone: --- See https://www.reddit.com/r/cpp/comments/vdtlzp/comment/icn3fva/?utm_source=share&utm_medium=web2x&context=3 A user reported that a C++ program (using GCC) to count the number of files under a directory is 5x slower than the Rust equivalent. The reason is that dereferencing a directory iterator has to construct a filesystem::path, which is expensive. Avoiding the dereference improves it a lot, but it's still 2x slower. By defining a specialized overload of std::distance for filesystem::recursive_iterator we can double the speed of the C++ version and slightly beat the Rust code that uses the walkdir crate (although the jwalk crate is still faster). The custom overload looks like this: --- a/libstdc++-v3/include/bits/fs_dir.h +++ b/libstdc++-v3/include/bits/fs_dir.h @@ -44,6 +44,10 @@ namespace std _GLIBCXX_VISIBILITY(default) { _GLIBCXX_BEGIN_NAMESPACE_VERSION + ptrdiff_t + distance(filesystem::recursive_directory_iterator, + filesystem::recursive_directory_iterator); + namespace filesystem { /** @addtogroup filesystem @@ -545,6 +549,9 @@ _GLIBCXX_BEGIN_NAMESPACE_CXX11 filesystem::remove_all(const path&, error_code&); friend uintmax_t filesystem::remove_all(const path&); + + friend ptrdiff_t + std::distance(recursive_directory_iterator, recursive_directory_iterator); }; /// @relates std::filesystem::recursive_directory_iterator @{ --- a/libstdc++-v3/src/c++17/fs_dir.cc +++ b/libstdc++-v3/src/c++17/fs_dir.cc @@ -559,3 +564,25 @@ fs::recursive_directory_iterator::__erase(error_code* ecptr) return *this; } + +std::ptrdiff_t +std::distance(fs::recursive_directory_iterator first, + fs::recursive_directory_iterator last) +{ + if (first == last) + return 0; + +#if _GLIBCXX_HAVE_DIRFD + if (last == fs::recursive_directory_iterator()) + { + first._M_dirs->options |= __directory_iterator_filename_only; + first._M_dirs->top().path.clear(); + } +#endif + + ptrdiff_t d = 0; + do + ++d; + while (++first != last); + return d; +} (Additional changes are needed to export the symbols from the shared lib). The libstdc++ code for actually walking directories is reasonably efficient, all the overhead is in creating the path objects that are required for the public API. So the trick is to avoid storing the path when we increment the iterator, because we know that it will never be needed by std::distance. This is only possible when using ::dirfd and ::openat to recurse into directories, because without those we do need the path. For an OS without those POSIX functions the custom std::distance will be no faster than the naive handwritten loop (but no slower). It should be worth doing for filesystem::directory_iterator too, but I haven't benchmarked that. I think that would just require clearing the path stored in the _Dir.