[ 
https://issues.apache.org/jira/browse/HIVE-27465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stamatis Zampetakis updated HIVE-27465:
---------------------------------------
    Attachment: all_lib_dup_classes_sorted.txt

> Binary distribution contains multiple identical classes in different jars
> -------------------------------------------------------------------------
>
>                 Key: HIVE-27465
>                 URL: https://issues.apache.org/jira/browse/HIVE-27465
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Stamatis Zampetakis
>            Priority: Major
>         Attachments: all_lib_dup_classes_sorted.txt
>
>
> The problem exists in current master but is also present in previous Hive 
> releases.
> Consider for example the 4.0.0-alpha-2 release.
> Download and untar the respective archive:
> https://dlcdn.apache.org/hive/hive-4.0.0-alpha-2/apache-hive-4.0.0-alpha-2-bin.tar.gz
> Inspect the lib directory and observe that some classes can be found both 
> inside the hive-exec-4.0.0-alpha-2.jar and also in the original jar of the 
> dependency.
> For instance check the AvaticaUtils.class:
> {noformat}
> jar tf hive-exec-4.0.0-alpha-2.jar | grep org/apache/calcite/avatica/AvaticaUtils.class
> jar tf avatica-1.12.0.jar | grep org/apache/calcite/avatica/AvaticaUtils.class
> {noformat}
> This is not specific to avatica; it also affects other dependencies. It 
> happens because the hive-exec module shades many of its dependencies while 
> the maven-assembly-plugin, which builds the binary distribution, also copies 
> the original jars of those dependencies into the lib directory.
> As long as we use the same version of the class in both places this shouldn't 
> be a big problem. However, there are still some inconveniences in having 
> duplicate classes:
> * Increases the size of the binary distro
> * Consumes more file descriptors
> * Increases likelihood of classpath problems
> In current master the problem can be seen by running:
> {noformat}
> mvn clean package -Pdist -Piceberg -Pitests -DskipTests
> {noformat}
> And then inspecting the jars in the generated bin directory:
> {noformat}
> find packaging/target/apache-hive-4.0.0-beta-1-SNAPSHOT-bin/apache-hive-4.0.0-beta-1-SNAPSHOT-bin \
>   -name "*.jar" -exec jar tf {} \; | grep ".class" | sort | uniq -c \
>   | grep -v "[[:space:]]\+1 " | sort -n -k1 -r > all_lib_dup_classes_sorted.txt
> {noformat}
> There are roughly 30K classes that appear more than once in various jars.
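
With roughly 30K entries, the attached listing is easier to read when grouped by package prefix. The following is a minimal sketch, not part of the original report: the function name summarize_dups and the default depth of 3 are illustrative assumptions, and the input is assumed to be in the "count path" format that uniq -c writes to all_lib_dup_classes_sorted.txt.

```shell
# Summarise a `uniq -c` style listing (count, then class path) by package prefix.
# Hypothetical helper for illustration; reads the listing from stdin.
# Output columns (tab separated): duplicated classes, total copies, package prefix.
summarize_dups() {
  depth="${1:-3}"   # number of leading path components to group by (assumed default)
  awk -v depth="$depth" '
    NF >= 2 {
      n = split($2, parts, "/")
      # The last component is the class file itself, so cap the prefix at n - 1.
      limit = (n - 1 < depth) ? n - 1 : depth
      prefix = parts[1]
      for (i = 2; i <= limit; i++) prefix = prefix "/" parts[i]
      if (limit < 1) prefix = "(default package)"
      classes[prefix]++        # distinct duplicated classes under this prefix
      copies[prefix] += $1     # total copies of those classes across all jars
    }
    END {
      for (p in classes) printf "%d\t%d\t%s\n", classes[p], copies[p], p
    }
  ' | sort -t "$(printf '\t')" -k1,1nr -k3,3
}
```

For example, `summarize_dups 3 < all_lib_dup_classes_sorted.txt | head -20` would show the twenty prefixes with the most duplicated classes, which may help identify which shaded dependencies contribute most to the duplication.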



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
