[ 
https://issues.apache.org/jira/browse/HIVE-28650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17902049#comment-17902049
 ] 

Sungwoo Park commented on HIVE-28650:
-------------------------------------

I ran a small experiment to see the effect of upgrading to ORC 2.0.3 (using MR3 
as the execution engine). The experiment uses a small cluster of 3 workers with 
60GB of memory each and a 1TB TPC-DS dataset stored on MinIO S3 (because Hadoop 
Vectored IO is useful mainly for S3). Because of the cluster configuration, the 
result is far from conclusive, but it could be useful when ORC is upgraded 
later.

Total running times of all the TPC-DS queries:
1) ORC 1.9.4: 7782s, 7982s
2) ORC 2.0.3: 7454s, 7442s, 7372s
So ORC 2.0.3 seems to be faster, by roughly 6% on average. This is unsurprising.
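
The comparison of running times can be checked with a few lines of Python. This is only a sketch over the totals listed above; the names `runs` and `relative_change` are made up for illustration, and with two and three runs per version the averages say nothing about statistical significance.

```python
from statistics import mean

# Total running times in seconds, copied from the runs listed above.
runs = {
    "ORC 1.9.4": [7782, 7982],
    "ORC 2.0.3": [7454, 7442, 7372],
}

def relative_change(old, new):
    """Return the change from old to new as a fraction of old."""
    return (new - old) / old

# Average the runs per version, then compare 2.0.3 against 1.9.4.
means = {version: mean(times) for version, times in runs.items()}
change = relative_change(means["ORC 1.9.4"], means["ORC 2.0.3"])

for version, value in means.items():
    print(f"{version}: mean {value:.0f}s over {len(runs[version])} runs")
print(f"relative change: {change:+.1%}")
```

On these numbers the mean total time drops from 7882s to about 7423s, a bit under 6%.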

Number of S3 operations when executing query 1 to query 4:
1) ORC 1.9.4:
s3.ListObjectsV2 --> 10036 times
s3.HeadObject --> 12408 times
s3.GetObject --> 68338 times
2) ORC 2.0.3:
s3.ListObjectsV2 --> 10036 times (no difference)
s3.HeadObject --> 12408 times (no difference)
s3.GetObject --> 100758 times (huge increase, about 47%)
So ORC 2.0.3 actually issues more S3 GetObject requests than ORC 1.9.4, which 
is a bit surprising. The files in the 1TB TPC-DS dataset are relatively small, 
so Vectored IO may not be very effective, but I expected that the number of S3 
operations would at least not increase.
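
To put a number on the increase, here is a small Python sketch over the counts reported above. The names `counts` and `summarize` are made up for illustration. ListObjectsV2 is left out because it is reported as unchanged.

```python
# (ORC 1.9.4, ORC 2.0.3) request counts for queries 1 to 4, as reported above.
counts = {
    "s3.HeadObject": (12408, 12408),
    "s3.GetObject": (68338, 100758),
}

def summarize(before, after):
    """Return the absolute and relative change between two counts."""
    delta = after - before
    return delta, delta / before

changes = {op: summarize(*pair) for op, pair in counts.items()}
for op, (delta, ratio) in changes.items():
    print(f"{op}: {delta:+d} requests ({ratio:+.1%})")
```

On these counts, ORC 2.0.3 issues 32420 more GetObject requests, about 47% more than ORC 1.9.4, while HeadObject is unchanged.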



> Upgrade Apache ORC version to 2.0.3
> -----------------------------------
>
>                 Key: HIVE-28650
>                 URL: https://issues.apache.org/jira/browse/HIVE-28650
>             Project: Hive
>          Issue Type: Improvement
>      Security Level: Public(Viewable by anyone) 
>            Reporter: Butao Zhang
>            Priority: Major
>
> ORC 2.0.x version added the Hadoop Vectored IO feature in ORC-1251.
> We can try to upgrade ORC to latest version 2.0.x to make this feature work 
> in Hive.
> But ORC 2.0.x is built on JDK17+, so we need to upgrade Hive jdk to 17+ 
> first.  This depends on this ticket HIVE-26473 upgrading jdk17.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
