bkietz commented on a change in pull request #9670:
URL: https://github.com/apache/arrow/pull/9670#discussion_r591895955



##########
File path: cpp/src/arrow/dataset/file_base.cc
##########
@@ -137,16 +132,174 @@ std::string FileSystemDataset::ToString() const {
   return repr;
 }
 
-Result<FragmentIterator> FileSystemDataset::GetFragmentsImpl(Expression 
predicate) {
-  FragmentVector fragments;
+namespace {
 
-  for (const auto& fragment : fragments_) {
-    ARROW_ASSIGN_OR_RAISE(
-        auto simplified,
-        SimplifyWithGuarantee(predicate, fragment->partition_expression()));
-    if (simplified.IsSatisfiable()) {
-      fragments.push_back(fragment);
+// Helper class for efficiently detecting subtrees given fragment partition 
expressions.
+// Partition expressions are broken into conjunction members and each member 
dictionary
+// encoded to impose a sortable ordering. In addition, subtrees are generated 
which span
+// groups of fragments and nested subtrees. After encoding each fragment is 
guaranteed to
+// be a descendant of at least one subtree. For example, given fragments in a
+// HivePartitioning with paths:
+//
+//   /num=0/al=eh/dat.par
+//   /num=0/al=be/dat.par
+//   /num=1/al=eh/dat.par
+//   /num=1/al=be/dat.par
+//
+// The following subtrees will be introduced:
+//
+//   /num=0/
+//   /num=0/al=eh/
+//   /num=0/al=eh/dat.par
+//   /num=0/al=be/
+//   /num=0/al=be/dat.par
+//   /num=1/
+//   /num=1/al=eh/
+//   /num=1/al=eh/dat.par
+//   /num=1/al=be/
+//   /num=1/al=be/dat.par
+struct SubtreeImpl {
+  using expression_code = char32_t;

Review comment:
       Ideally this hack would be replaced by a single buffer of expression 
codes indexed into by `SubtreeImpl::Encoded::partition_expression` and friends. 
This would give us a guarantee that even "deep" partitionings would not require 
lots of small buffers at the cost of complicating hashing and comparison




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to