LinGeLin opened a new issue, #13403:
URL: https://github.com/apache/arrow/issues/13403
I am using Arrow to build a TensorFlow dataset for training. The structured data is
stored on S3 as Parquet files. I used Arrow to construct a TFIO dataset, but an
end-to-end test showed that the reading speed is slower than with Alluxio.
The test results are as follows:
- dataset based on Alluxio (data stored as TFRecord): 47700 rows/s, global_steps/sec: 6.5
- dataset based on Arrow: 15100 rows/s, global_steps/sec: 3.7
Reading with Arrow alone, without converting to tensors, the test gives 31000 rows/s.
Does this seem slow, or is it expected? Is there any way to speed it up?
While reading, the network throughput peaks at about 150 MB/s, but the machine can
reach 2500 MB/s, so the available bandwidth is not fully utilized.
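The only knobs I could think of are the readahead settings on `ScanOptions` and the
capacity of Arrow's global I/O thread pool. A rough sketch of what I mean is below
(the numbers are my guesses, and I don't know whether these are the right levers or
whether the bottleneck is elsewhere):
```
#include <iostream>
#include <memory>

#include <arrow/dataset/scanner.h>  // arrow::dataset::ScanOptions
#include <arrow/io/interfaces.h>    // arrow::io::SetIOThreadPoolCapacity
#include <arrow/status.h>

// Hypothetical tuning: enlarge readahead and the process-wide I/O pool.
void TuneScan(const std::shared_ptr<arrow::dataset::ScanOptions>& scan_options) {
  // The process-wide I/O thread pool (used for S3 reads) defaults to 8 threads.
  arrow::Status st = arrow::io::SetIOThreadPoolCapacity(32);
  if (!st.ok()) {
    std::cerr << "SetIOThreadPoolCapacity failed: " << st.ToString() << std::endl;
  }

  // Read further ahead so Parquet decoding overlaps with S3 downloads.
  scan_options->batch_readahead = 64;     // record batches to prefetch
  scan_options->fragment_readahead = 16;  // files to scan concurrently
}
```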
My code looks something like this; for more detail, see this
PR: [PR](https://github.com/tensorflow/io/pull/1685/files#diff-7133d540dc86c9bb9e552655025061798314e226695c00b4e1d8cecb178a2920)
```
// Build an arrow::dataset::Dataset over the Parquet files on S3; the K_*
// constants are credential/endpoint/bucket placeholders (see the sketch of
// GetDatasetFromS3 after this block).
auto dataset = GetDatasetFromS3(K_ACCESS_KEY1, K_SECRET_KEY1,
                                K_ENDPOINT_OVERRIDE1, K_BZZP);

// Dedicated 16-thread pool used as the I/O context for the scan.
auto arrow_thread_pool_ =
    arrow::internal::ThreadPool::MakeEternal(16).ValueOrDie();
auto scan_options_ = std::make_shared<arrow::dataset::ScanOptions>();
scan_options_->use_threads = true;
scan_options_->io_context = arrow::io::IOContext(arrow_thread_pool_.get());

// Scanner that projects only the training columns (column_names is defined
// elsewhere) and reads 60K-row batches.
auto scanner_builder =
    std::make_shared<arrow::dataset::ScannerBuilder>(dataset, scan_options_);
scanner_builder->Project(column_names);
scanner_builder->BatchSize(60 * 1024);
scanner_builder->UseThreads(true);
auto scanner = scanner_builder->Finish().ValueOrDie();
auto reader_ = scanner->ToRecordBatchReader().ValueOrDie();

// Drain the reader and measure throughput (start/end are declared here so the
// snippet is self-contained).
clock_t start = clock();
std::shared_ptr<arrow::RecordBatch> current_batch_ = nullptr;
reader_->ReadNext(&current_batch_);
int64_t total_count = 0;
total_count += current_batch_->num_rows();
while (current_batch_) {
  reader_->ReadNext(&current_batch_);
  if (current_batch_) {
    total_count += current_batch_->num_rows();
    std::cout << "row: " << current_batch_->num_rows();
  }
}
std::cout << std::endl;
std::cout << "Total rows: " << total_count << std::endl;
clock_t end = clock();
double endtime = static_cast<double>(end - start) / CLOCKS_PER_SEC;
std::cout << "Total time: " << endtime << "s" << std::endl;
std::cout << "Speed: " << total_count / endtime << " rows/s" << std::endl;
```
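For context, `GetDatasetFromS3` does roughly the following (a minimal sketch rather
than the exact code; the function signature, error handling, and bucket path handling
are simplified):
```
#include <cstdlib>
#include <memory>
#include <string>

#include <arrow/dataset/api.h>
#include <arrow/filesystem/s3fs.h>

std::shared_ptr<arrow::dataset::Dataset> GetDatasetFromS3(
    const std::string& access_key, const std::string& secret_key,
    const std::string& endpoint_override, const std::string& bucket_path) {
  // One-time S3 subsystem initialization.
  arrow::fs::S3GlobalOptions global_options;
  global_options.log_level = arrow::fs::S3LogLevel::Fatal;
  if (!arrow::fs::InitializeS3(global_options).ok()) std::abort();

  // Connect to the (S3-compatible) object store.
  auto s3_options = arrow::fs::S3Options::FromAccessKey(access_key, secret_key);
  s3_options.endpoint_override = endpoint_override;
  auto fs = arrow::fs::S3FileSystem::Make(s3_options).ValueOrDie();

  // Discover all Parquet files under the given prefix.
  arrow::fs::FileSelector selector;
  selector.base_dir = bucket_path;
  selector.recursive = true;

  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  auto factory = arrow::dataset::FileSystemDatasetFactory::Make(
                     fs, selector, format,
                     arrow::dataset::FileSystemFactoryOptions{})
                     .ValueOrDie();
  return factory->Finish().ValueOrDie();
}
```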