bkietz commented on a change in pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#discussion_r626623312



##########
File path: cpp/src/arrow/dataset/file_ipc.cc
##########
@@ -173,6 +173,20 @@ Result<ScanTaskIterator> IpcFileFormat::ScanFile(
   return IpcScanTaskIterator::Make(options, fragment);
 }
 
+Future<util::optional<int64_t>> IpcFileFormat::CountRows(
+    const std::shared_ptr<FileFragment>& file, compute::Expression predicate,
+    std::shared_ptr<ScanOptions> options) {
+  if (FieldsInExpression(predicate).size() > 0) {

Review comment:
       Nit: let's add a `ExpressionHasNoFieldRefs()` or 
`FieldsInExpressionCount()` function so we're not constructing a vector of 
strings just to check it for emptiness

##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -385,6 +386,55 @@ Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
   return MakeVectorIterator(std::move(tasks));
 }
 
+Future<util::optional<int64_t>> ParquetFileFormat::CountRows(
+    const std::shared_ptr<FileFragment>& file, compute::Expression predicate,
+    std::shared_ptr<ScanOptions> options) {
+  auto parquet_file = 
internal::checked_pointer_cast<ParquetFileFragment>(file);
+
+  auto evaluate = [=]() -> Result<util::optional<int64_t>> {
+    if (FieldsInExpression(predicate).size() > 0) {
+      ARROW_ASSIGN_OR_RAISE(auto expressions,
+                            parquet_file->TestRowGroups(std::move(predicate)));
+      int64_t rows = 0;
+      for (size_t i = 0; i < parquet_file->row_groups_->size(); i++) {
+        // Row group can definitely be eliminated
+        if (!expressions[i].IsSatisfiable()) continue;
+        // Unknown, bail out of fast path
+        if (FieldsInExpression(expressions[i]).size() > 0) return 
util::nullopt;
+
+        compute::ExecContext exec_context{options->pool};
+        ARROW_ASSIGN_OR_RAISE(
+            Datum mask, ExecuteScalarExpression(expressions[i], Datum(), 
&exec_context));
+        // Evaluated to something other than a scalar
+        if (!mask.is_scalar()) return util::nullopt;
+        const auto& mask_scalar = mask.scalar_as<BooleanScalar>();
+        // Evaluated to something other than a Boolean
+        if (!mask_scalar.is_valid) return util::nullopt;
+        // Row group can definitely be eliminated
+        if (!mask_scalar.value) continue;

Review comment:
       `TestRowGroups()` should return expressions which are as constant-folded 
as possible, and `!IsSatisfiable()` will catch `literal(false)` and (should, 
anyway) anything which is guaranteed to evaluate to null, so I think this can 
be simplified to:
   ```suggestion
           // Unless the row group is entirely included, bail out of fast path
           if (expressions[i] != literal(true)) return util::nullopt;
   ```

##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -385,6 +386,55 @@ Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
   return MakeVectorIterator(std::move(tasks));
 }
 
+Future<util::optional<int64_t>> ParquetFileFormat::CountRows(
+    const std::shared_ptr<FileFragment>& file, compute::Expression predicate,
+    std::shared_ptr<ScanOptions> options) {
+  auto parquet_file = 
internal::checked_pointer_cast<ParquetFileFragment>(file);
+
+  auto evaluate = [=]() -> Result<util::optional<int64_t>> {

Review comment:
       Nit: long lambdas are less readable than a free function




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to