Ted-Jiang commented on PR #3769:
URL:
https://github.com/apache/arrow-datafusion/pull/3769#issuecomment-1272454236
@thinkharderdev thanks for the great benchmark.
I ran parquet-tools locally on the data (1.0 GB) and got:
```
(venv) yangjiang@LM-SHC-15009782 data % parquet-tools column-index ./logs.parquet
row group 0:
column index for column service:
Boudary order: UNORDERED
        null count  min      max
page-0  0           backend  frontend
offset index for column service:
        offset  compressed size  first row index
page-0  62      117              0
column index for column host:
Boudary order: UNORDERED
        null count  min                              max
page-0  0           i-1ec3ca3151468928.ec2.internal  i-1ec408f54dbd3750.ec2.internal
offset index for column host:
        offset  compressed size  first row index
page-0  566     125              0
column index for column pod:
Boudary order: UNORDERED
        null count  min                                    max
page-0  0           aejowuublavflbbsvlfozigwpmrxldvhaollk  zxxlzhdrucrhpicpdgxtfpyuknvviimggtq
offset index for column pod:
        offset  compressed size  first row index
page-0  6689    602              0
column index for column container:
Boudary order: UNORDERED
        null count  min                  max
page-0  0           backend_container_0  frontend_container_1
offset index for column container:
        offset  compressed size  first row index
page-0  7602    593              0
```
Each column has at most two pages. I think that if we adjust the data to get
more pages per column (for example by reducing the page size), turning on
`enable_page_index` will give a bigger performance gain, because there will be
more opportunities to skip whole pages without decoding them! 🤔
FYI, I see that Impala uses a fixed number of rows per page in its
benchmarks to get good performance.
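To illustrate why more pages per column could help, here is a toy Python model of min/max page pruning with a fixed number of rows per page. It is only a sketch and not DataFusion or parquet code: the helper names (`split_into_pages`, `pages_to_decode`, `rows_skipped`) and the sorted integer column are assumptions made up for this example.

```python
def split_into_pages(values, rows_per_page):
    """Split a column into fixed-row-count pages and record per-page min/max."""
    pages = []
    for start in range(0, len(values), rows_per_page):
        chunk = values[start:start + rows_per_page]
        pages.append({
            "first_row": start,
            "rows": len(chunk),
            "min": min(chunk),
            "max": max(chunk),
        })
    return pages


def pages_to_decode(pages, target):
    """Keep only the pages whose [min, max] range could contain `target`."""
    return [p for p in pages if p["min"] <= target <= p["max"]]


def rows_skipped(values, rows_per_page, target):
    """Rows that never need decoding for the predicate `col == target`."""
    pages = split_into_pages(values, rows_per_page)
    kept = pages_to_decode(pages, target)
    return len(values) - sum(p["rows"] for p in kept)


if __name__ == "__main__":
    # Hypothetical column: sorted integers, so page min/max ranges are narrow.
    column = list(range(10_000))
    for rows_per_page in (10_000, 5_000, 100):
        skipped = rows_skipped(column, rows_per_page, 4_242)
        print(f"{rows_per_page} rows/page -> {skipped} rows skipped")
```

With a single page nothing can be skipped, while smaller pages let the filter skip most rows. The gain depends on how the data is clustered: for values spread randomly across pages, every page's min/max range may still match the predicate.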
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.