2010YOUY01 commented on PR #6801:
URL: 
https://github.com/apache/arrow-datafusion/pull/6801#issuecomment-1622714624

   > Hi @2010YOUY01 -- I am having trouble reproducing the benchmark results 
you reported
   
   @alamb Thank you for the feedback!
   My initial benchmark ran the query under different `target_partition` 
settings, and I just realized that approach was not effective 😰 
   I reproduced your benchmark. Since streaming byte-range get on the local FS 
is not implemented in Arrow yet, 
   >Some issue:
   >1. Range get not working for local filesystem 
https://github.com/apache/arrow-rs/blob/0d4e6a727f113f42d58650d2dbecab89b22d4e28/object_store/src/lib.rs#L355,
 need to update implementation after it's fixed
   
   the alternative `get_range()` is used (which copies the whole range into 
memory at once instead of streaming it). It is called to find the first 
newline after the start/end of each partition, and the multiple unnecessarily 
large disk reads caused the performance issue.
   This should be solved once the `Range` option of `get_opts()` is supported 
for the local FS. For now, I worked around the issue with a preset max line 
length and re-ran the benchmark:
   This PR:
   ```
   ❯ select count(*) from lineitem where l_quantity < 10;
   1 row in set. Query took 0.894 seconds.
   1 row in set. Query took 0.513 seconds.
   1 row in set. Query took 0.532 seconds.
   ```
   Main branch:
   ```
   ❯ select count(*) from lineitem where l_quantity < 10;
   1 row in set. Query took 1.757 seconds.
   1 row in set. Query took 1.496 seconds.
   1 row in set. Query took 1.498 seconds.
   ```
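
The newline alignment and the max line length workaround described above can be sketched roughly as follows. This is only an illustration over an in-memory byte slice, not the code in this PR: the names `next_line_start`, `aligned_range` and `max_line_len` are hypothetical, and the real implementation reads from the object store instead of a slice.

```rust
/// Hypothetical helper: returns the offset just past the first b'\n' found at
/// or after `from`, scanning at most `max_line_len` bytes. Returns `None` when
/// no newline is found within that bound before the end of the data.
fn next_line_start(data: &[u8], from: usize, max_line_len: usize) -> Option<usize> {
    if from >= data.len() {
        return Some(data.len());
    }
    // Bounded read: this stands in for a small range request instead of
    // fetching everything up to the end of the file.
    let end = (from + max_line_len).min(data.len());
    match data[from..end].iter().position(|&b| b == b'\n') {
        Some(i) => Some(from + i + 1),
        // Reached the end of the data: the last line has no trailing newline.
        None if end == data.len() => Some(data.len()),
        // The line is longer than the preset bound.
        None => None,
    }
}

/// Hypothetical helper: moves a raw byte range `[start, end)` forward so that
/// it begins and ends on line boundaries. A partition owns every line whose
/// first byte falls inside its raw range.
fn aligned_range(
    data: &[u8],
    start: usize,
    end: usize,
    max_line_len: usize,
) -> Option<(usize, usize)> {
    // Scanning from `start - 1` keeps `start` unchanged when it already sits
    // right after a newline.
    let s = if start == 0 {
        0
    } else {
        next_line_start(data, start - 1, max_line_len)?
    };
    let e = if end >= data.len() {
        data.len()
    } else {
        next_line_start(data, end - 1, max_line_len)?
    };
    Some((s, e))
}
```

With a 15-byte input of three lines split at byte 7, the two raw ranges `0..7` and `7..15` become `0..9` and `9..15`, so each line is read by exactly one partition. The bound keeps each boundary lookup to a small read, at the cost of failing on lines longer than `max_line_len`.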
   


