LinGeLin opened a new issue, #13403:
URL: https://github.com/apache/arrow/issues/13403

   I am using Arrow to build a TensorFlow dataset for training. The structured data is 
stored on S3 as Parquet files. I used Arrow to construct a TFIO dataset, but 
after an end-to-end test I found that the reading speed was slower than with Alluxio. 
The specific test numbers are as follows:
   
   Alluxio stores the data as TFRecord:
   - dataset based on Alluxio: 47,700 rows/s, global_steps/sec: 6.5
   - dataset based on Arrow:   15,100 rows/s, global_steps/sec: 3.7
   
   Reading with Arrow alone, without converting to tensors, the test reaches 
31,000 rows/s.
   Does this seem slow, or is it expected? Is there any way to speed it up?
   
   While reading data, the maximum network throughput I observe is 150 MB/s, but the 
machine's network supports up to 2,500 MB/s, so the bandwidth is not fully utilized.
       
   My code looks something like this. If you want to go into more detail, you 
can look at this 
PR: [PR](https://github.com/tensorflow/io/pull/1685/files#diff-7133d540dc86c9bb9e552655025061798314e226695c00b4e1d8cecb178a2920)
   
   
   ```cpp
   auto dataset = GetDatasetFromS3(K_ACCESS_KEY1, K_SECRET_KEY1,
                                   K_ENDPOINT_OVERRIDE1, K_BZZP);

   // Dedicated I/O thread pool for the scan.
   auto arrow_thread_pool_ =
       arrow::internal::ThreadPool::MakeEternal(16).ValueOrDie();
   auto scan_options_ = std::make_shared<arrow::dataset::ScanOptions>();
   scan_options_->use_threads = true;
   scan_options_->io_context = arrow::io::IOContext(arrow_thread_pool_.get());

   auto scanner_builder =
       std::make_shared<arrow::dataset::ScannerBuilder>(dataset, scan_options_);
   scanner_builder->Project(column_names);
   scanner_builder->BatchSize(60 * 1024);
   scanner_builder->UseThreads(true);
   auto scanner = scanner_builder->Finish().ValueOrDie();
   auto reader_ = scanner->ToRecordBatchReader().ValueOrDie();

   // Drain the reader batch by batch and count rows.
   clock_t start = clock();
   long total_count = 0;
   std::shared_ptr<arrow::RecordBatch> current_batch_ = nullptr;
   reader_->ReadNext(&current_batch_);
   while (current_batch_) {
     total_count += current_batch_->num_rows();
     std::cout << "row: " << current_batch_->num_rows();
     reader_->ReadNext(&current_batch_);
   }
   std::cout << std::endl;
   std::cout << "Total rows: " << total_count << std::endl;

   clock_t end = clock();
   double endtime = (double)(end - start) / CLOCKS_PER_SEC;
   std::cout << "Total time: " << endtime << "s" << std::endl;
   std::cout << "Speed: " << total_count / endtime << " rows/s" << std::endl;
   ```
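
   One caveat about the measurement itself: `clock()` reports CPU time summed 
across all threads, so with a 16-thread scan the reported time can be several 
times the real elapsed time, which would understate the rows/s figure. A 
minimal sketch of wall-clock timing with `std::chrono` (the sleep below only 
stands in for the scan loop):

   ```cpp
   #include <chrono>
   #include <iostream>
   #include <thread>

   int main() {
     // steady_clock measures wall-clock elapsed time, which is the right
     // basis for a rows/s throughput figure on a multithreaded scan.
     auto start = std::chrono::steady_clock::now();

     // ... run the scan loop here (simulated with a short sleep) ...
     std::this_thread::sleep_for(std::chrono::milliseconds(50));

     auto end = std::chrono::steady_clock::now();
     double seconds = std::chrono::duration<double>(end - start).count();
     std::cout << "Total time: " << seconds << "s" << std::endl;
     return 0;
   }
   ```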


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
