[
https://issues.apache.org/jira/browse/HIVE-28650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903922#comment-17903922
]
Sungwoo Park commented on HIVE-28650:
-------------------------------------
I ran a simple experiment on the same small cluster, varying the parameters
fs.s3a.vectored.read.min.seek.size and fs.s3a.vectored.read.max.merged.size.
The experiment used ORC 2.0.3 and ran either TPC-DS queries 1 to 4 or the
entire set of 99 queries. The dataset is TPC-DS 1TB, but the running times are
not a meaningful measure because our cluster uses just a single MinIO server.
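To illustrate why these two parameters should affect the number of GET requests, here is a minimal sketch of range coalescing. It is a simplified model written for this comment, not Hadoop's actual implementation: the function merge_ranges, the exact merge condition, and the sample ranges are all hypothetical, and K/M are assumed to be binary suffixes.

```python
# Simplified model of vectored-read range merging (NOT Hadoop's real code).
# Two neighbouring ranges are coalesced when the gap between them is smaller
# than min_seek and the combined range stays within max_merged.

K, M = 1024, 1024 * 1024


def merge_ranges(ranges, min_seek, max_merged):
    """ranges: iterable of (offset, length). Returns merged (offset, length) list."""
    merged = []
    for offset, length in sorted(ranges):
        end = offset + length
        if merged:
            cur_off, cur_len = merged[-1]
            cur_end = cur_off + cur_len
            gap = offset - cur_end
            new_end = max(cur_end, end)
            if gap < min_seek and new_end - cur_off <= max_merged:
                # Close enough and small enough: extend the current request.
                merged[-1] = (cur_off, new_end - cur_off)
                continue
        # Otherwise this range becomes a separate request.
        merged.append((offset, length))
    return merged


# Hypothetical column-chunk reads within one ORC file.
ranges = [(0, 100 * K), (300 * K, 100 * K), (800 * K, 100 * K), (3 * M, 100 * K)]

default_cfg = merge_ranges(ranges, min_seek=4 * K, max_merged=1 * M)
larger_cfg = merge_ranges(ranges, min_seek=512 * K, max_merged=4 * M)

print(len(default_cfg), "requests with 4K/1M")
print(len(larger_cfg), "requests with 512K/4M")
```

In this model, larger values let nearby ranges collapse into fewer, bigger requests, at the cost of reading the bytes in the gaps.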
=== Experiment 1 (default configuration)
fs.s3a.vectored.read.min.seek.size=4K
fs.s3a.vectored.read.max.merged.size=1M
TPC-DS queries 1 to 4: 225.15s
TPC-DS 99 queries: 7454s, 7442s, 7372s
# of s3.ListObjectsV2 = 10039
# of s3.HeadObject = 11275
# of s3.GetObject = 100665
Average data size in s3.GetObject: 664215.97
=== Experiment 2
fs.s3a.vectored.read.min.seek.size=256K
fs.s3a.vectored.read.max.merged.size=2M
TPC-DS queries 1 to 4: 230.172s
# of s3.ListObjectsV2 = 10036
# of s3.HeadObject = 11263
# of s3.GetObject = 93936
Average data size in s3.GetObject: 711811.89
=== Experiment 3
fs.s3a.vectored.read.min.seek.size=512K
fs.s3a.vectored.read.max.merged.size=4M
TPC-DS queries 1 to 4: 222.783s
TPC-DS 99 queries: 7649.588s, 7333.503s
# of s3.ListObjectsV2 = 10036
# of s3.HeadObject = 11266
# of s3.GetObject = 76013
Average data size in s3.GetObject: 880055.84
As expected, increasing fs.s3a.vectored.read.min.seek.size and
fs.s3a.vectored.read.max.merged.size reduces the number of s3.GetObject
operations (from 100665 in Experiment 1 to 76013 in Experiment 3), while
increasing the average data size in each s3.GetObject operation (from
664215.97 to 880055.84). So, what I can confirm from the experiment is that
Vectored IO seems to work correctly.
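The trend can be checked with a short calculation over the figures reported above. This is only a sketch: it assumes the reported average data size is in bytes, and the names experiments and summarize are made up for illustration.

```python
# GetObject counts and average sizes copied from Experiments 1-3 above.
# Assumption: "Average data size" is measured in bytes.
experiments = {
    "Experiment 1": {"get_object": 100665, "avg_bytes": 664215.97},
    "Experiment 2": {"get_object": 93936, "avg_bytes": 711811.89},
    "Experiment 3": {"get_object": 76013, "avg_bytes": 880055.84},
}


def summarize(data, baseline="Experiment 1"):
    """Return {name: (GetObject reduction in % vs. baseline, total GB read)}."""
    base = data[baseline]["get_object"]
    result = {}
    for name, e in data.items():
        reduction_pct = 100.0 * (base - e["get_object"]) / base
        total_gb = e["get_object"] * e["avg_bytes"] / 1e9
        result[name] = (reduction_pct, total_gb)
    return result


summary = summarize(experiments)
for name, (reduction_pct, total_gb) in summary.items():
    print(f"{name}: {reduction_pct:.1f}% fewer GetObject calls, "
          f"{total_gb:.2f} GB read in total")
```

The product of the request count and the average size stays at roughly 66.9 GB in all three experiments, while the number of s3.GetObject calls drops by about 6.7% and 24.5% relative to the default configuration.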
> Upgrade Apache ORC version to 2.0.3
> -----------------------------------
>
> Key: HIVE-28650
> URL: https://issues.apache.org/jira/browse/HIVE-28650
> Project: Hive
> Issue Type: Improvement
> Reporter: Butao Zhang
> Priority: Major
>
> ORC 2.0.x added the Hadoop Vectored IO feature in ORC-1251.
> We can try to upgrade ORC to the latest 2.0.x version to make this feature
> work in Hive.
> But ORC 2.0.x is built on JDK 17+, so we need to upgrade Hive's JDK to 17+
> first. This depends on HIVE-26473, which upgrades Hive to JDK 17.