Stamatis Zampetakis created HIVE-27465:
------------------------------------------

             Summary: Binary distribution contains multiple identical classes in different jars
                 Key: HIVE-27465
                 URL: https://issues.apache.org/jira/browse/HIVE-27465
             Project: Hive
          Issue Type: Bug
            Reporter: Stamatis Zampetakis
         Attachments: all_lib_dup_classes_sorted.txt

The problem exists in current master but is also present in previous Hive 
releases.

Consider for example the 4.0.0-alpha-2 release.

Download and untar the respective archive:
https://dlcdn.apache.org/hive/hive-4.0.0-alpha-2/apache-hive-4.0.0-alpha-2-bin.tar.gz

Inspect the lib directory and observe that some classes can be found both 
inside the hive-exec-4.0.0-alpha-2.jar and also in the original jar of the 
dependency.

For instance, check AvaticaUtils.class:
{noformat}
jar tf hive-exec-4.0.0-alpha-2.jar | grep org/apache/calcite/avatica/AvaticaUtils.class
jar tf avatica-1.12.0.jar | grep org/apache/calcite/avatica/AvaticaUtils.class
{noformat}

This is not specific to avatica; it affects other dependencies as well. The 
hive-exec module shades many of its dependencies, while the 
maven-assembly-plugin that builds the binary distribution also copies those 
same dependencies into the lib directory.

As long as we use the same version of the class in both places this shouldn't 
be a big problem. However, there are still some inconveniences in having 
duplicate classes:
* Increases the size of the binary distro
* Consumes more file descriptors
* Increases likelihood of classpath problems
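
One way to check whether two copies of a class really are the same version is to compare checksums of the entry in each jar. The following is only a sketch: the function names class_digests and all_same are hypothetical, and it assumes unzip and sha256sum are available on the PATH.

```shell
# Print "<sha256>  <jar>" for one class entry in each of the given jars.
# Assumption: unzip and sha256sum are installed. If an entry is missing from
# a jar, the digest printed is that of an empty stream.
class_digests() {
  cls=$1; shift
  for jar in "$@"; do
    # unzip -p streams the entry to stdout without extracting it to disk
    digest=$(unzip -p "$jar" "$cls" | sha256sum | cut -d' ' -f1)
    printf '%s  %s\n' "$digest" "$jar"
  done
}

# Succeed only if every line on stdin starts with the same digest.
all_same() {
  awk '{ seen[$1] = 1 }
       END { n = 0; for (d in seen) n++; exit (n == 1 ? 0 : 1) }'
}
```

For the example above, this could be run from the lib directory as: class_digests org/apache/calcite/avatica/AvaticaUtils.class hive-exec-4.0.0-alpha-2.jar avatica-1.12.0.jar | all_same && echo identical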

In current master the problem can be seen by running:

{noformat}
mvn clean package -Pdist -Piceberg -Pitests -DskipTests
{noformat}

Then inspect the jars in the generated bin directory:
{noformat}
find packaging/target/apache-hive-4.0.0-beta-1-SNAPSHOT-bin/apache-hive-4.0.0-beta-1-SNAPSHOT-bin \
  -name "*.jar" -exec jar tf {} \; | grep ".class" | sort | uniq -c \
  | grep -v "[[:space:]]\+1 " | sort -n -k1 -r > all_lib_dup_classes_sorted.txt
{noformat}

There are roughly 30K classes that appear more than once in various jars.
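
The count alone does not say which jars hold each duplicate. A possible variation, sketched below, keeps the jar name next to every class entry and then groups by class. The function names list_jar_classes and report_duplicates are hypothetical, and the sketch assumes the JDK jar tool and awk are on the PATH.

```shell
# Emit "<class entry><TAB><jar file name>" for every .class entry in every
# jar under the given directory. Assumption: the JDK "jar" tool is on PATH.
list_jar_classes() {
  find "$1" -name '*.jar' | while IFS= read -r jar; do
    jar tf "$jar" | awk -v j="$(basename "$jar")" '/\.class$/ { print $0 "\t" j }'
  done
}

# Read "<class><TAB><jar>" pairs on stdin and print, for each class seen more
# than once, "<count><TAB><class><TAB><comma-separated jars>", highest count
# first.
report_duplicates() {
  awk -F '\t' '
    { count[$1]++; jars[$1] = (jars[$1] == "" ? $2 : jars[$1] "," $2) }
    END {
      for (c in count)
        if (count[c] > 1) printf "%d\t%s\t%s\n", count[c], c, jars[c]
    }
  ' | sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

Usage would be along the lines of: list_jar_classes <dist>/lib | report_duplicates, where <dist> stands for the unpacked binary distribution directory.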



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
