litao3rd opened a new issue, #37840:
URL: https://github.com/apache/arrow/issues/37840
### Describe the bug, including details regarding any error messages, version, and platform.
I'm learning to use Arrow from C++. This may not be a bug but simply
incorrect usage on my part; I'm not certain.
The code below uses the "tlc-trip-record-data" dataset, which consists of
264 Parquet files that I've downloaded. My goal is to count the total number
of rows in the dataset. However, two ways of counting give different totals
on the same scanner: `CountRows()` reports 1526807659 rows, while summing
`num_rows()` over the batches visited by `Scan()` gives 57410540. Any help
would be greatly appreciated.
```
#include <iostream>
#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>
#include <arrow/dataset/file_base.h>
#include <arrow/dataset/file_parquet.h>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;
namespace cp = arrow::compute;

int main(int argc, char **argv)
{
    auto filesystem = std::make_shared<fs::LocalFileSystem>();
    auto format = std::make_shared<ds::ParquetFileFormat>();
    const std::string base_dir = "/home/wulitao/data/tlc-trip-record-data";
    arrow::Status status;

    fs::FileSelector selector;
    selector.base_dir = base_dir;
    selector.recursive = true;

    auto factory = ds::FileSystemDatasetFactory::Make(
                       filesystem, selector, format, ds::FileSystemFactoryOptions())
                       .ValueOrDie();
    auto dataset = factory->Finish().ValueOrDie();

    auto sb = dataset->NewScan().ValueOrDie();
    sb->UseThreads(false);
    auto scanner = sb->Finish().ValueOrDie();

    {
        // In this block I got total 1526807659 rows
        std::cout << "total count rows() = "
                  << scanner->CountRows().ValueOrDie() << "\n";
    }

    {
        // In this block I got total 57410540 rows
        int64_t total_rows = 0;
        status = scanner->Scan([&](ds::TaggedRecordBatch batch) -> arrow::Status {
            total_rows += batch.record_batch->num_rows();
            return arrow::Status::OK();
        });
        std::cout << "total count rows in visitor mode = " << total_rows << "\n";
    }

    return 0;
}
```
Output:
```
total count rows() = 1526807659
total count rows in visitor mode = 57410540
```
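One thing the code above does not do is inspect the `status` returned by `Scan()`. The following is a minimal sketch, using hypothetical stand-in types (`MockStatus`, `MockFragment`, `CountRowsFromMetadata`, `ScanWithVisitor`) rather than the Arrow API, of how a visitor-style scan that stops early on an error could end up with a smaller total than a count that never reads the data. Whether this is what actually happens with this dataset is an assumption, not something confirmed here.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a status object; NOT arrow::Status.
struct MockStatus {
    bool ok_flag;
    std::string message;
    bool ok() const { return ok_flag; }
    static MockStatus OK() { return MockStatus{true, ""}; }
    static MockStatus Error(const std::string& msg) { return MockStatus{false, msg}; }
};

// Hypothetical stand-in for one file: its declared row count and
// whether reading its data succeeds.
struct MockFragment {
    int64_t num_rows;
    bool readable;
};

// Sums the declared row counts without reading any data, so an
// unreadable fragment still contributes to the total.
int64_t CountRowsFromMetadata(const std::vector<MockFragment>& fragments) {
    int64_t total = 0;
    for (const auto& f : fragments) total += f.num_rows;
    return total;
}

// Visits fragments in order and returns at the first failure, so the
// visitor never sees the rows of any later fragment.
MockStatus ScanWithVisitor(const std::vector<MockFragment>& fragments,
                           const std::function<MockStatus(int64_t)>& visitor) {
    for (const auto& f : fragments) {
        if (!f.readable) return MockStatus::Error("failed to read fragment");
        MockStatus st = visitor(f.num_rows);
        if (!st.ok()) return st;
    }
    return MockStatus::OK();
}
```

With fragments of 10, 20 (unreadable), and 30 rows, `CountRowsFromMetadata` gives 60, while a visitor that sums row counts stops at 10 and the returned status is not OK. In the real program, the analogous check would be to test `status.ok()` after `Scan()` and print `status.ToString()` when it fails.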
Thanks for any reply.
### Component(s)
C++