YoungRX opened a new issue, #35000:
URL: https://github.com/apache/arrow/issues/35000

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I previously used `ParquetFileReader` from `parquet/file_reader.h` to read Parquet files, and I implemented predicate push-down myself.
   
   Now I am using 8.0.0, and I have updated the code to use `AsyncScanner::ToRecordBatchReader()` and `ScannerRecordBatchReader::ReadNext()` to read the Parquet files, so that I can use the predicate push-down implemented internally by Arrow.
   
   However, my environment does not support multithreading, so I set the following in `ScanOptions`:
   > use_threads = false;
   > batch_readahead = 0;
   > batch_size = 1000;
   > Other settings such as filter, projection, and dataset_schema are set as required.
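   For reference, here is a rough sketch of how I apply these options and read batches. Dataset construction is elided, and the use of `ScannerBuilder` here (`NewScan()`, `UseThreads()`, `BatchSize()`) is my illustration of the setup, not the exact production code:

   ```cpp
   #include <arrow/dataset/scanner.h>
   #include <arrow/record_batch.h>
   #include <arrow/status.h>

   // `dataset` is assumed to be an already-constructed arrow::dataset::Dataset.
   arrow::Status ScanSingleThreaded(
       const std::shared_ptr<arrow::dataset::Dataset>& dataset) {
     ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
     ARROW_RETURN_NOT_OK(builder->UseThreads(false));  // use_threads = false
     ARROW_RETURN_NOT_OK(builder->BatchSize(1000));    // batch_size = 1000
     // batch_readahead = 0, filter, projection, and dataset_schema are
     // set on ScanOptions the same way as described above.
     ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
     ARROW_ASSIGN_OR_RAISE(auto reader, scanner->ToRecordBatchReader());

     std::shared_ptr<arrow::RecordBatch> batch;
     while (true) {
       ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
       if (batch == nullptr) break;  // end of stream
       // ... process batch ...
     }
     return arrow::Status::OK();
   }
   ```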
   
   As a result, when scanning the same Parquet file with the same SQL statement, the new code takes 1.5 to 2.0 times longer to execute than the old code, which seems unreasonable.
   
   Is there an option I have not set correctly?
   Or is it slower because multithreading and readahead are disabled?
   Is there a way to make `Scanner` faster?
   
   
   
   
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
