pitrou commented on a change in pull request #10270:
URL: https://github.com/apache/arrow/pull/10270#discussion_r643145474
##########
File path: cpp/src/arrow/csv/reader.h
##########
@@ -96,5 +96,11 @@ class ARROW_EXPORT StreamingReader : public RecordBatchReader {
const ConvertOptions& convert_options);
};
+/// \brief Count the rows in a CSV file.
+ARROW_EXPORT
+Future<int64_t> CountRows(io::IOContext io_context,
Review comment:
Nit: perhaps call this `CountRowsAsync`?
##########
File path: cpp/src/arrow/csv/reader.cc
##########
@@ -1081,6 +1145,16 @@ Future<std::shared_ptr<StreamingReader>> StreamingReader::MakeAsync(
parse_options, convert_options);
}
+Future<int64_t> CountRows(io::IOContext io_context,
+ std::shared_ptr<io::InputStream> input,
+ const ReadOptions& read_options,
+ const ParseOptions& parse_options) {
+ auto cpu_executor = internal::GetCpuThreadPool();
Review comment:
Is there a reason the CPU executor isn't a `CountRows` parameter?
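To make the question concrete, here is a minimal sketch of an asynchronous counting function that receives its executor from the caller instead of fetching a global one. It is not Arrow's API: `Task`, `Executor`, `CountLinesAsync`, and the callback-based result are hypothetical stand-ins built only on the standard library, and the function simply counts newline characters.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-ins (not Arrow types): a task is any nullary callable,
// and an executor is anything that accepts a task and decides where it runs.
using Task = std::function<void()>;
using Executor = std::function<void(Task)>;

// Counts newline characters in `text` on the executor supplied by the caller.
// The result is delivered through `on_done` rather than a future, to keep the
// sketch free of threading dependencies.
inline void CountLinesAsync(std::string text, const Executor& executor,
                            std::function<void(int64_t)> on_done) {
  executor([text = std::move(text), on_done = std::move(on_done)]() {
    int64_t count = 0;
    for (char c : text) {
      if (c == '\n') ++count;
    }
    on_done(count);
  });
}
```

With this shape, a test can pass an executor that runs the task inline, while production code could pass a thread pool; the function itself does not need to know which.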
##########
File path: cpp/src/arrow/dataset/file_csv.h
##########
@@ -61,6 +61,10 @@ class ARROW_DS_EXPORT CsvFileFormat : public FileFormat {
const std::shared_ptr<ScanOptions>& scan_options,
const std::shared_ptr<FileFragment>& file) const override;
+ Future<util::optional<int64_t>> CountRows(
+ const std::shared_ptr<FileFragment>& file, compute::Expression predicate,
+ std::shared_ptr<ScanOptions> options) override;
Review comment:
Is there a reason that some methods take `const std::shared_ptr<ScanOptions>&` while this one takes `std::shared_ptr<ScanOptions>` by value?
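For context, one common reason the two signatures can differ is ownership: a function that only reads the object during the call can take a const reference, while a function that stores the pointer for later (for example in an asynchronous continuation) often takes it by value and moves it. The sketch below illustrates that general C++ idiom only; `Options`, `ReadNow`, and `ReadLater` are hypothetical names, and nothing here claims to be the PR author's actual rationale.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in for an options struct such as ScanOptions.
struct Options {
  int batch_size = 0;
};

// Used only for the duration of the call: a const reference avoids an
// atomic reference-count increment and decrement.
inline int ReadNow(const std::shared_ptr<Options>& options) {
  return options->batch_size;
}

// Stored for later use: taking the pointer by value and moving it into the
// callback makes the callback a co-owner, so the object outlives the caller's
// own copy.
inline std::function<int()> ReadLater(std::shared_ptr<Options> options) {
  return [options = std::move(options)]() { return options->batch_size; };
}
```

In the by-value version a caller that no longer needs its copy can also write `ReadLater(std::move(opts))` and skip the reference-count increment entirely.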
##########
File path: cpp/src/arrow/dataset/file_csv_test.cc
##########
@@ -50,7 +50,10 @@ class CsvFormatHelper {
}
static std::shared_ptr<CsvFileFormat> MakeFormat() {
- return std::make_shared<CsvFileFormat>();
+ auto format = std::make_shared<CsvFileFormat>();
+ // Required for CountRows
+ format->parse_options.ignore_empty_lines = false;
Review comment:
I'm not sure I understand this change. Is `CountRows` supposed to count
logical rows of data, or physical rows inside the file (even if they may be
ignored as empty)?
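The distinction can be illustrated with a small standalone sketch. It is not Arrow's CSV parser: `CountCsvRows` is a hypothetical helper that splits on newlines only, counts the header line like any other line, and does not handle quoted fields containing newlines.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper (not Arrow's implementation): counts newline-delimited
// lines in `csv`. With `ignore_empty_lines` set, zero-length lines are skipped
// (logical rows); otherwise every line is counted (physical rows).
inline int64_t CountCsvRows(const std::string& csv, bool ignore_empty_lines) {
  std::istringstream in(csv);
  std::string line;
  int64_t count = 0;
  while (std::getline(in, line)) {
    // Tolerate CRLF line endings.
    if (!line.empty() && line.back() == '\r') line.pop_back();
    if (ignore_empty_lines && line.empty()) continue;
    ++count;
  }
  return count;
}
```

For the input `"a,b\n1,2\n\n3,4\n"` the two modes disagree: four physical lines, but three once the empty line is ignored.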