lidavidm commented on a change in pull request #9670: URL: https://github.com/apache/arrow/pull/9670#discussion_r591666392
########## File path: cpp/src/arrow/dataset/file_base.cc ########## @@ -137,16 +132,174 @@ std::string FileSystemDataset::ToString() const { return repr; } -Result<FragmentIterator> FileSystemDataset::GetFragmentsImpl(Expression predicate) { - FragmentVector fragments; +namespace { - for (const auto& fragment : fragments_) { - ARROW_ASSIGN_OR_RAISE( - auto simplified, - SimplifyWithGuarantee(predicate, fragment->partition_expression())); - if (simplified.IsSatisfiable()) { - fragments.push_back(fragment); +// Helper class for efficiently detecting subtrees given fragment partition expressions. +// Partition expressions are broken into conjunction members and each member dictionary +// encoded to impose a sortable ordering. In addition, subtrees are generated which span +// groups of fragments and nested subtrees. After encoding each fragment is guaranteed to +// be a descendant of at least one subtree. For example, given fragments in a +// HivePartitioning with paths: +// +// /num=0/al=eh/dat.par +// /num=0/al=be/dat.par +// /num=1/al=eh/dat.par +// /num=1/al=be/dat.par +// +// The following subtrees will be introduced: +// +// /num=0/ +// /num=0/al=eh/ +// /num=0/al=eh/dat.par +// /num=0/al=be/ +// /num=0/al=be/dat.par +// /num=1/ +// /num=1/al=eh/ +// /num=1/al=eh/dat.par +// /num=1/al=be/ +// /num=1/al=be/dat.par +struct SubtreeImpl { + using expression_code = char32_t; Review comment: I guess it's to work with std::basic_string, but then I'm curious why you're favoring that over std::vector. ########## File path: cpp/src/arrow/dataset/file_base.cc ########## @@ -137,16 +132,174 @@ std::string FileSystemDataset::ToString() const { return repr; } -Result<FragmentIterator> FileSystemDataset::GetFragmentsImpl(Expression predicate) { - FragmentVector fragments; +namespace { - for (const auto& fragment : fragments_) { - ARROW_ASSIGN_OR_RAISE( - auto simplified, - SimplifyWithGuarantee(predicate, fragment->partition_expression())); - if (simplified.IsSatisfiable()) { - fragments.push_back(fragment); +// Helper class for efficiently detecting subtrees given fragment partition expressions. +// Partition expressions are broken into conjunction members and each member dictionary +// encoded to impose a sortable ordering. In addition, subtrees are generated which span +// groups of fragments and nested subtrees. After encoding each fragment is guaranteed to +// be a descendant of at least one subtree. For example, given fragments in a +// HivePartitioning with paths: +// +// /num=0/al=eh/dat.par +// /num=0/al=be/dat.par +// /num=1/al=eh/dat.par +// /num=1/al=be/dat.par +// +// The following subtrees will be introduced: +// +// /num=0/ +// /num=0/al=eh/ +// /num=0/al=eh/dat.par +// /num=0/al=be/ +// /num=0/al=be/dat.par +// /num=1/ +// /num=1/al=eh/ +// /num=1/al=eh/dat.par +// /num=1/al=be/ +// /num=1/al=be/dat.par +struct SubtreeImpl { + using expression_code = char32_t; Review comment: nit: any particular reason for char32_t over say uint32_t or size_t (since it is a vector index)? And in any case, it doesn't match the static_cast below on line 176. ########## File path: cpp/src/arrow/dataset/file_base.cc ########## @@ -137,16 +132,174 @@ std::string FileSystemDataset::ToString() const { return repr; } -Result<FragmentIterator> FileSystemDataset::GetFragmentsImpl(Expression predicate) { - FragmentVector fragments; +namespace { - for (const auto& fragment : fragments_) { - ARROW_ASSIGN_OR_RAISE( - auto simplified, - SimplifyWithGuarantee(predicate, fragment->partition_expression())); - if (simplified.IsSatisfiable()) { - fragments.push_back(fragment); +// Helper class for efficiently detecting subtrees given fragment partition expressions. +// Partition expressions are broken into conjunction members and each member dictionary +// encoded to impose a sortable ordering. In addition, subtrees are generated which span +// groups of fragments and nested subtrees. After encoding each fragment is guaranteed to +// be a descendant of at least one subtree. For example, given fragments in a +// HivePartitioning with paths: +// +// /num=0/al=eh/dat.par +// /num=0/al=be/dat.par +// /num=1/al=eh/dat.par +// /num=1/al=be/dat.par +// +// The following subtrees will be introduced: +// +// /num=0/ +// /num=0/al=eh/ +// /num=0/al=eh/dat.par +// /num=0/al=be/ +// /num=0/al=be/dat.par +// /num=1/ +// /num=1/al=eh/ +// /num=1/al=eh/dat.par +// /num=1/al=be/ +// /num=1/al=be/dat.par +struct SubtreeImpl { + using expression_code = char32_t; Review comment: Ah - to get the lexicographic sort down below. I think future readers might appreciate a note about that since it's not immediately obvious + is an unusual choice of types otherwise. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org