[GitHub] [arrow-datafusion] 2010YOUY01 commented on pull request #6801: parallel csv scan

via GitHub Wed, 05 Jul 2023 17:10:17 -0700


2010YOUY01 commented on PR #6801:
URL: 
https://github.com/apache/arrow-datafusion/pull/6801#issuecomment-1622714624

> Hi @2010YOUY01 -- I am having trouble reproducing the benchmark results you reported

@alamb Thank you for the feedback!
My initial benchmark ran the query under different `target_partition` settings;
I just realized that was not effective 😰
I reproduced your benchmark. Streaming byte-range gets on the local FS are
not implemented in Arrow yet:
> Some issue:
> 1. Range get not working for local filesystem
> https://github.com/apache/arrow-rs/blob/0d4e6a727f113f42d58650d2dbecab89b22d4e28/object_store/src/lib.rs#L355,
> need to update implementation after it's fixed

So the alternative `get_range()` is used instead, which copies the whole range
into memory at once rather than streaming it. It is called to find the first
newline after the start/end of each partition, and the multiple unnecessarily
large disk reads caused the performance issue.
This should be solved once the `Range` option of `get_opts()` is supported for
local FS. For now, I worked around the issue with a preset max line length and
re-ran the benchmark:
This PR:
```
❯ select count(*) from lineitem where l_quantity < 10;
1 row in set. Query took 0.894 seconds.
1 row in set. Query took 0.513 seconds.
1 row in set. Query took 0.532 seconds.
```
Main branch:
```
❯ select count(*) from lineitem where l_quantity < 10;
1 row in set. Query took 1.757 seconds.
1 row in set. Query took 1.496 seconds.
1 row in set. Query took 1.498 seconds.
```
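
For reference, here is a rough sketch of the boundary-alignment idea (finding the first newline after the start/end of each partition, but reading at most a preset max line length each time). This is not the code in this PR: it uses only `std` on a generic `Read + Seek` source rather than `object_store`, and the names `align_to_line_start`, `line_aligned_partitions` and `max_line_len` are made up for illustration.

```rust
use std::io::{self, Read, Seek, SeekFrom};
use std::ops::Range;

/// Move a nominal byte boundary forward to the start of the next line,
/// reading at most `max_line_len` bytes (hypothetical cap, not from the PR).
fn align_to_line_start<R: Read + Seek>(
    src: &mut R,
    boundary: u64,
    file_len: u64,
    max_line_len: usize,
) -> io::Result<u64> {
    // The file start and end are always valid partition edges.
    if boundary == 0 || boundary >= file_len {
        return Ok(boundary.min(file_len));
    }
    // Start one byte early so a boundary that already sits on a line start is kept.
    let from = boundary - 1;
    src.seek(SeekFrom::Start(from))?;
    let mut window = Vec::with_capacity(max_line_len);
    src.by_ref().take(max_line_len as u64).read_to_end(&mut window)?;
    match window.iter().position(|&b| b == b'\n') {
        Some(i) => Ok(from + i as u64 + 1),
        // Hit EOF before any newline: the last line has no trailing newline.
        None if window.len() < max_line_len => Ok(file_len),
        None => Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "line longer than max_line_len",
        )),
    }
}

/// Split `[0, file_len)` into up to `n` byte ranges whose edges fall on line starts.
fn line_aligned_partitions<R: Read + Seek>(
    src: &mut R,
    file_len: u64,
    n: u64,
    max_line_len: usize,
) -> io::Result<Vec<Range<u64>>> {
    assert!(n > 0, "need at least one partition");
    let chunk = (file_len + n - 1) / n; // ceiling division
    let mut parts = Vec::new();
    for i in 0..n {
        let start = align_to_line_start(src, (i * chunk).min(file_len), file_len, max_line_len)?;
        let end = align_to_line_start(src, ((i + 1) * chunk).min(file_len), file_len, max_line_len)?;
        // A partition that contains no line start collapses to an empty range; skip it.
        if start < end {
            parts.push(start..end);
        }
    }
    Ok(parts)
}
```

Because the same alignment is applied to both the start and the end of every range, adjacent partitions share an edge and each line lands in exactly one partition. The cost per boundary is bounded by `max_line_len` instead of the size of the remaining range, which is the point of the workaround; the trade-off is that a line longer than the cap becomes an error in this sketch.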
