[
https://issues.apache.org/jira/browse/HIVE-27465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stamatis Zampetakis updated HIVE-27465:
---------------------------------------
Attachment: all_lib_dup_classes_sorted.txt
> Binary distribution contains multiple identical classes in different jars
> -------------------------------------------------------------------------
>
> Key: HIVE-27465
> URL: https://issues.apache.org/jira/browse/HIVE-27465
> Project: Hive
> Issue Type: Bug
> Reporter: Stamatis Zampetakis
> Priority: Major
> Attachments: all_lib_dup_classes_sorted.txt
>
>
> The problem exists in current master but is also present in previous Hive
> releases.
> Consider for example the 4.0.0-alpha-2 release.
> Download and untar the respective archive:
> https://dlcdn.apache.org/hive/hive-4.0.0-alpha-2/apache-hive-4.0.0-alpha-2-bin.tar.gz
> Inspect the lib directory and observe that some classes can be found both
> inside the hive-exec-4.0.0-alpha-2.jar and also in the original jar of the
> dependency.
> For instance check the AvaticaUtils.class:
> {noformat}
> jar tf hive-exec-4.0.0-alpha-2.jar | grep org/apache/calcite/avatica/AvaticaUtils.class
> jar tf avatica-1.12.0.jar | grep org/apache/calcite/avatica/AvaticaUtils.class
> {noformat}
> This is not specific to avatica; other dependencies are affected as well. The
> hive-exec module shades many of its dependencies, while the
> maven-assembly-plugin that builds the binary distribution also copies those
> same dependencies, as separate jars, into the lib directory.
> As long as we use the same version of the class in both places this shouldn't
> be a big problem. However, there are still some inconveniences in having
> duplicate classes:
> * Increases the size of the binary distro
> * Consumes more file descriptors
> * Increases likelihood of classpath problems
> In current master the problem can be seen by running:
> {noformat}
> mvn clean package -Pdist -Piceberg -Pitests -DskipTests
> {noformat}
> And then inspecting the jars in the generated bin directory:
> {noformat}
> find packaging/target/apache-hive-4.0.0-beta-1-SNAPSHOT-bin/apache-hive-4.0.0-beta-1-SNAPSHOT-bin \
>   -name "*.jar" -exec jar tf {} \; | grep ".class" | sort | uniq -c \
>   | grep -v "[[:space:]]\+1 " | sort -n -k1 -r > all_lib_dup_classes_sorted.txt
> {noformat}
> There are roughly 30K classes that appear more than once in various jars.
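
The listing above gives a count per class but does not say which jars hold each copy. A minimal sketch of one way to recover that mapping follows. The function names list_jar_classes and report_duplicates are hypothetical helpers, not part of Hive or its build, and the sketch assumes jar names contain no tab characters.

```shell
# Sketch only: list_jar_classes and report_duplicates are hypothetical helpers.

# Print one "class<TAB>jar" line for every .class entry in the jars under $1.
list_jar_classes() {
  find "$1" -name '*.jar' | while IFS= read -r jar; do
    jar tf "$jar" |
      awk -v jar="$(basename "$jar")" '/\.class$/ { print $0 "\t" jar }'
  done
}

# Read "class<TAB>jar" lines on stdin and print
# "count<TAB>class<TAB>jar1,jar2,..." for every class found in more than
# one jar, most duplicated first.
report_duplicates() {
  # Bytewise sort keeps all lines of the same class adjacent.
  LC_ALL=C sort -u | awk -F'\t' '
    $1 != prev {
      if (n > 1) print n "\t" prev "\t" jars
      prev = $1; jars = $2; n = 1
      next
    }
    { jars = jars "," $2; n++ }
    END { if (n > 1) print n "\t" prev "\t" jars }
  ' | LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

With both functions defined, running list_jar_classes on the generated bin directory and piping the result into report_duplicates would produce one line per duplicated class, which makes it easier to see whether a duplicate comes from hive-exec shading or from two unrelated dependencies.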
--
This message was sent by Atlassian Jira
(v8.20.10#820010)