Hi,

To verify this, we have our current "no environment test" in Python, which verifies which packages are needed and which are not.
Unfortunately, our main branch currently fails because of missing packages, so I am making it bigger.

Ideally, we would not need to pack any of the Hadoop things into the Python package. Currently the system requires the Hadoop jars because we import Hadoop packages in many places in our code base where this could potentially be avoided.

Best regards,
Sebastian

________________________________
From: Janardhan <janard...@apache.org>
Sent: Thursday, June 23, 2022 5:14:24 PM
To: dev@systemds.apache.org
Subject: Re: [DISCUSS] PyPi packages are more than 100 MB.

Hi team,

In the list attached before, the following

19MB - hadoop-client-api-3.3.1.jar [1]
31MB - hadoop-client-runtime-3.3.1.jar [2]

are added, which were introduced in Hadoop 3.x.

These jars are added to the bin packaging by the `<include>*:hadoop-client*</include>` [3] line in bin.xml. It has not changed recently.

Are these libraries intentional and important for the binary release? Is it possible to remove them?

[1] https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client-api
[2] https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client-runtime
[3] https://github.com/apache/systemds/blame/main/src/assembly/bin.xml#L100

Thanks,
Janardhan

On Tue, Jun 21, 2022 at 11:40 PM arnab phani <phaniar...@gmail.com> wrote:
>
> I thought, we only include the libraries from SystemDS binary in the python
> package. If so, then hadoop-* libraries are not new additions.
> Unfortunately, test.pypi doesn't allow packages of more than 100MB, which
> means we won't be able to dry run our python releases.
> I would be a little more comfortable with a better explanation for why the
> python package size increased by 2x from the last release.
>
> Regards,
> Arnab..
>
> On Tue, Jun 21, 2022 at 6:55 PM Janardhan <janard...@apache.org> wrote:
>
> > Hi,
> >
> > PyPi packages are a little more than 100MB. Compared 2.2.1 which is ~56 MB.
> >
> > -- Added in the present release (library sizes after unzip)
> >
> > 70K Jun 21 15:08 commons-compiler-3.0.16.jar
> > 601K Jun 21 15:08 commons-compress-1.19.jar
> >
> > 193K Jun 21 15:08 commons-text-1.6.jar
> >
> > 19M Jun 21 15:08 hadoop-client-api-3.3.1.jar
> > 31M Jun 21 15:08 hadoop-client-runtime-3.3.1.jar
> >
> > 5.3M Jun 21 15:08 hadoop-hdfs-client-3.3.1.jar
> >
> > 1.5M Jun 21 15:08 htrace-core4-4.1.0-incubating.jar
> >
> > 126K Jun 21 15:08 re2j-1.1.jar
> >
> > 192K Jun 21 15:08 stax2-api-4.2.1.jar
> > 511K Jun 21 15:08 woodstox-core-5.3.0.jar
> >
> > Let us see if there is some optimization we can do?
> >
> > Best,
> > Janardhan
> >
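
The per-jar sizes quoted in this thread can be re-checked against a built wheel with a short script. This is only a minimal sketch: the function name `largest_jars` is hypothetical, and it assumes the package is a wheel (a zip archive) that bundles the jars somewhere inside it. It reports uncompressed sizes, which is what the "after unzip" listing shows. By that listing, the two hadoop-client jars alone come to about 19 + 31 = 50 MB.

```python
import zipfile


def largest_jars(archive_path, top=10):
    """Return (size_in_bytes, name) for the largest .jar entries, biggest first.

    archive_path is any zip-format archive, for example a built wheel.
    Sizes are the uncompressed sizes recorded in the zip directory.
    """
    with zipfile.ZipFile(archive_path) as archive:
        jars = [
            (info.file_size, info.filename)
            for info in archive.infolist()
            if info.filename.endswith(".jar")
        ]
    # Sort by size (then by name for ties), largest first, and keep the top entries.
    return sorted(jars, reverse=True)[:top]


def format_report(jars):
    """Render the (size, name) pairs as aligned lines with sizes in MB."""
    return "\n".join(f"{size / 2**20:8.1f} MB  {name}" for size, name in jars)
```

Calling `print(format_report(largest_jars(path_to_wheel)))` on the wheels of two releases and comparing the output should show which jars account for the size difference.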