Stamatis Zampetakis created HIVE-27465:
------------------------------------------
Summary: Binary distribution contains multiple identical classes
in different jars
Key: HIVE-27465
URL: https://issues.apache.org/jira/browse/HIVE-27465
Project: Hive
Issue Type: Bug
Reporter: Stamatis Zampetakis
Attachments: all_lib_dup_classes_sorted.txt
The problem exists in current master but is also present in previous Hive
releases.
Consider for example the 4.0.0-alpha-2 release.
Download and untar the respective archive:
https://dlcdn.apache.org/hive/hive-4.0.0-alpha-2/apache-hive-4.0.0-alpha-2-bin.tar.gz
Inspect the lib directory and observe that some classes can be found both
inside hive-exec-4.0.0-alpha-2.jar and in the original jar of the
dependency.
For instance, check AvaticaUtils.class:
{noformat}
jar tf hive-exec-4.0.0-alpha-2.jar | grep org/apache/calcite/avatica/AvaticaUtils.class
jar tf avatica-1.12.0.jar | grep org/apache/calcite/avatica/AvaticaUtils.class
{noformat}
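To see the full overlap between two jars rather than a single class, the
listings can be compared with comm. The following is a minimal sketch, not
part of the Hive build: common_classes is a hypothetical helper name, and it
assumes each argument is a file holding the output of "jar tf" for one jar.

```shell
# Print the .class entries that appear in both listing files.
# $1 and $2 are files containing the output of `jar tf <some.jar>`.
# (common_classes is a hypothetical helper, not an existing Hive script.)
common_classes() {
  a=$(mktemp)
  b=$(mktemp)
  # Keep only class entries; comm requires sorted input.
  grep '\.class$' "$1" | sort -u > "$a"
  grep '\.class$' "$2" | sort -u > "$b"
  # -12 suppresses lines unique to either file, leaving the intersection.
  comm -12 "$a" "$b"
  rm -f "$a" "$b"
}

# Example usage, run from the lib directory of the unpacked distribution:
#   jar tf hive-exec-4.0.0-alpha-2.jar > exec.txt
#   jar tf avatica-1.12.0.jar > avatica.txt
#   common_classes exec.txt avatica.txt | wc -l
```

Working on saved listings keeps the comparison independent of how the jars
are read, so "unzip -l" style output could be substituted after trimming it
to bare entry names.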
This is not specific to avatica; it also occurs for other dependencies. The
hive-exec module shades many of its dependencies, while the
maven-assembly-plugin that builds the binary distribution also copies those
same dependencies into the lib directory.
As long as we use the same version of the class in both places this shouldn't
be a big problem. However, there are still some inconveniences in having
duplicate classes:
* Increases the size of the binary distro
* Consumes more file descriptors
* Increases likelihood of classpath problems
In current master the problem can be seen by running:
{noformat}
mvn clean package -Pdist -Piceberg -Pitests -DskipTests
{noformat}
And then inspecting the jars in the generated bin directory:
{noformat}
find packaging/target/apache-hive-4.0.0-beta-1-SNAPSHOT-bin/apache-hive-4.0.0-beta-1-SNAPSHOT-bin \
  -name "*.jar" -exec jar tf {} \; | grep ".class" | sort | uniq -c \
  | grep -v "[[:space:]]\+1 " | sort -n -k1 -r > all_lib_dup_classes_sorted.txt
{noformat}
There are roughly 30K classes that appear more than once in various jars.
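The count alone does not say which jars carry each duplicate. A possible
follow-up, sketched here under assumptions, is to keep the jar name next to
every entry and group by class. report_dups is a hypothetical helper name,
and DIST in the usage comment is a hypothetical variable pointing at the
unpacked distribution directory.

```shell
# Read "<class-entry> <jar-name>" pairs on stdin and print every class that
# occurs in more than one jar, followed by the jars that contain it.
# (report_dups is a hypothetical helper, not an existing Hive script.)
report_dups() {
  # sort -u drops repeated pairs and orders the jar names per class.
  sort -u | awk '
    { count[$1]++; jars[$1] = jars[$1] " " $2 }
    END { for (c in count) if (count[c] > 1) print c ":" jars[c] }
  ' | sort
}

# Example usage (DIST is an assumed path to the unpacked distribution):
#   for j in "$DIST"/lib/*.jar; do
#     jar tf "$j" | grep '\.class$' | awk -v jar="$(basename "$j")" '{ print $0, jar }'
#   done | report_dups > dup_classes_by_jar.txt
```

The sketch assumes jar file names contain no spaces, since awk splits each
input line on whitespace.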