[jira] [Commented] (HIVE-28650) Upgrade Apache ORC version to 2.0.3

Sungwoo Park (Jira) Sun, 08 Dec 2024 07:26:32 -0800


    [ 
https://issues.apache.org/jira/browse/HIVE-28650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903922#comment-17903922
 ]


Sungwoo Park commented on HIVE-28650:
-------------------------------------

I ran a simple experiment on the same small cluster, varying the parameters 
fs.s3a.vectored.read.min.seek.size and fs.s3a.vectored.read.max.merged.size.
The experiment used ORC 2.0.3 and ran either TPC-DS queries 1 to 4 or the 
entire set of 99 queries. The dataset is TPC-DS 1TB, but the running times 
should not be considered significant because our cluster uses just a single 
MinIO server.

=== Experiment 1 (default configuration)
fs.s3a.vectored.read.min.seek.size=4K
fs.s3a.vectored.read.max.merged.size=1M

TPC-DS query 1 to 4: 225.15s
TPC-DS 99 queries: 7454s, 7442s, 7372s

# of s3.ListObjectsV2 = 10039
# of s3.HeadObject = 11275
# of s3.GetObject = 100665 
Average data size in s3.GetObject: 664215.97

=== Experiment 2
fs.s3a.vectored.read.min.seek.size=256K
fs.s3a.vectored.read.max.merged.size=2M

TPC-DS query 1 to 4: 230.172s

# of s3.ListObjectsV2 = 10036 
# of s3.HeadObject = 11263
# of s3.GetObject = 93936 
Average data size in s3.GetObject: 711811.89

=== Experiment 3
fs.s3a.vectored.read.min.seek.size=512K
fs.s3a.vectored.read.max.merged.size=4M

TPC-DS query 1 to 4: 222.783s
TPC-DS 99 queries: 7649.588s, 7333.503s

# of s3.ListObjectsV2 = 10036
# of s3.HeadObject = 11266
# of s3.GetObject = 76013
Average data size in s3.GetObject: 880055.84
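
As a quick cross-check of the counters above, the following sketch multiplies 
each GetObject count by its average size to estimate the total volume fetched, 
and computes the reduction in GetObject calls relative to Experiment 1. It 
assumes the average sizes are in bytes (the unit is not stated above) and uses 
decimal gigabytes; the variable and function names are only illustrative.

```python
# Counters copied from the three experiments above:
# label -> (number of s3.GetObject calls, average data size per call).
# Assumption: the average data size is reported in bytes.
experiments = {
    "Experiment 1 (4K / 1M)": (100665, 664215.97),
    "Experiment 2 (256K / 2M)": (93936, 711811.89),
    "Experiment 3 (512K / 4M)": (76013, 880055.84),
}

baseline_calls = experiments["Experiment 1 (4K / 1M)"][0]


def summarize(calls, avg_size, baseline):
    """Return (estimated total GB fetched, % fewer calls than the baseline)."""
    total_gb = calls * avg_size / 1e9
    reduction_pct = 100.0 * (1.0 - calls / baseline)
    return total_gb, reduction_pct


for label, (calls, avg_size) in experiments.items():
    total_gb, reduction_pct = summarize(calls, avg_size, baseline_calls)
    print(f"{label}: ~{total_gb:.2f} GB total, "
          f"{reduction_pct:.2f}% fewer GetObject calls than Experiment 1")
```

Under that assumption the estimated total stays close to 66.9 GB in all three 
experiments, while the number of GetObject calls drops by roughly 6.7% and 
24.5% in Experiments 2 and 3. The same data is being fetched in fewer, larger 
requests.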

As expected, increasing fs.s3a.vectored.read.min.seek.size and 
fs.s3a.vectored.read.max.merged.size reduces the number of s3.GetObject 
operations (100665, 93936, and 76013 in Experiments 1, 2, and 3) while 
increasing the average data size in each s3.GetObject operation (664215.97, 
711811.89, and 880055.84). So, what I can confirm from the experiment is 
that Vectored IO seems to work correctly. 
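
To illustrate why larger values lead to fewer and bigger requests, here is a 
minimal sketch of range coalescing. It is a simplified model, not the actual 
Hadoop code: it assumes two sorted ranges are merged when the gap between them 
is smaller than the minimum seek size and the merged range does not exceed the 
maximum merged size. The function name and the sample ranges are hypothetical.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (offset, length) in bytes


def coalesce(ranges: List[Range], min_seek: int, max_merged: int) -> List[Range]:
    """Merge nearby ranges under a simplified model of vectored-read merging.

    Assumption: a range joins the previous merged range when the gap between
    them is below min_seek and the combined span stays within max_merged.
    """
    merged: List[Tuple[int, int]] = []  # (start, end) pairs
    for offset, length in sorted(ranges):
        if merged:
            start, end = merged[-1]
            gap = offset - end
            new_end = max(end, offset + length)
            if gap < min_seek and new_end - start <= max_merged:
                merged[-1] = (start, new_end)
                continue
        merged.append((offset, offset + length))
    return [(start, end - start) for start, end in merged]


KB = 1024
MB = 1024 * KB

# Hypothetical read ranges, e.g. column chunks requested from one file.
ranges = [(0, 100_000), (150_000, 100_000), (600_000, 100_000), (1_500_000, 200_000)]

# The three parameter pairs used in the experiments above.
for min_seek, max_merged in [(4 * KB, 1 * MB), (256 * KB, 2 * MB), (512 * KB, 4 * MB)]:
    result = coalesce(ranges, min_seek, max_merged)
    print(f"min_seek={min_seek}, max_merged={max_merged}: "
          f"{len(result)} requests -> {result}")
```

With these made-up ranges the number of requests drops from 4 to 3 to 2 as the 
two values grow, at the cost of also reading the bytes in the gaps. This 
matches the direction of the counters above, although the real merge logic may 
differ in its details.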


> Upgrade Apache ORC version to 2.0.3
> -----------------------------------
>
>                 Key: HIVE-28650
>                 URL: https://issues.apache.org/jira/browse/HIVE-28650
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Butao Zhang
>            Priority: Major
>
> ORC 2.0.x added the Hadoop Vectored IO feature in ORC-1251.
> We can try upgrading ORC to the latest 2.0.x version to make this feature 
> work in Hive.
> However, ORC 2.0.x is built on JDK 17+, so we first need to upgrade Hive's 
> JDK to 17+. This depends on ticket HIVE-26473 (upgrading to JDK 17).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
