That is a good point. The problem is relevant even for our binary release, as we package the same set of libraries for the binary and Python packages. For now, I suggest we include all the packages from bin.xml. For the next release, we need to run systematic experiments to remove unnecessary libraries from bin/python.
Regards,
Arnab..

On Thu, Jun 23, 2022 at 5:50 PM Baunsgaard, Sebastian <baunsga...@tugraz.at.invalid> wrote:

> Hi,
>
> To verify this we have our current "no environment test" in Python; this
> verifies which packages are needed and which are not.
>
> Unfortunately, right now our main branch fails because of missing packages.
> So I am making it bigger.
> Ideally we would not need to pack any of the Hadoop things into the Python
> package.
> Currently the system requires Hadoop jars because we import Hadoop packages
> in many places in our code base where it could potentially be avoided.
>
> Best regards,
> Sebastian
>
> ________________________________
> From: Janardhan <janard...@apache.org>
> Sent: Thursday, June 23, 2022 5:14:24 PM
> To: dev@systemds.apache.org
> Subject: Re: [DISCUSS] PyPi packages are more than 100 MB.
>
> Hi team,
>
> In the list attached before, the following
>
> 19MB - hadoop-client-api-3.3.1.jar [1]
> 31MB - hadoop-client-runtime-3.3.1.jar [2]
>
> are added, which are introduced in Hadoop 3.x.
>
> These jars are added to the bin packaging with the
> `<include>*:hadoop-client*</include>` [3] line in bin.xml. It has not
> changed recently.
>
> Are these libraries intentional and important for the binary release? Is
> it possible to remove them?
>
> [1] https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client-api
> [2] https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client-runtime
> [3] https://github.com/apache/systemds/blame/main/src/assembly/bin.xml#L100
>
> Thanks,
> Janardhan
>
> On Tue, Jun 21, 2022 at 11:40 PM arnab phani <phaniar...@gmail.com> wrote:
> >
> > I thought we only include the libraries from the SystemDS binary in the
> > Python package. If so, then the hadoop-* libraries are not new additions.
> > Unfortunately, test.pypi doesn't allow packages of more than 100MB, which
> > means we won't be able to dry run our Python releases.
> > I would be a little more comfortable with a better explanation for why
> > the Python package size increased by 2x from the last release.
> >
> > Regards,
> > Arnab..
> >
> > On Tue, Jun 21, 2022 at 6:55 PM Janardhan <janard...@apache.org> wrote:
> >
> > > Hi,
> > >
> > > PyPi packages are a little more than 100MB, compared to 2.2.1, which
> > > is ~56 MB.
> > >
> > > -- Added in the present release (library sizes after unzip)
> > >
> > >  70K Jun 21 15:08 commons-compiler-3.0.16.jar
> > > 601K Jun 21 15:08 commons-compress-1.19.jar
> > >
> > > 193K Jun 21 15:08 commons-text-1.6.jar
> > >
> > >  19M Jun 21 15:08 hadoop-client-api-3.3.1.jar
> > >  31M Jun 21 15:08 hadoop-client-runtime-3.3.1.jar
> > >
> > > 5.3M Jun 21 15:08 hadoop-hdfs-client-3.3.1.jar
> > >
> > > 1.5M Jun 21 15:08 htrace-core4-4.1.0-incubating.jar
> > >
> > > 126K Jun 21 15:08 re2j-1.1.jar
> > >
> > > 192K Jun 21 15:08 stax2-api-4.2.1.jar
> > > 511K Jun 21 15:08 woodstox-core-5.3.0.jar
> > >
> > > Let us see if there is some optimization we can do.
> > >
> > > Best,
> > > Janardhan
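
To put the listing above into numbers, here is a minimal sketch that sums the reported jar sizes and estimates how much of the added volume comes from the hadoop-* jars. It is only an illustration, not part of the SystemDS build: the function names are hypothetical, the K and M suffixes are assumed to mean 1024 and 1024^2 bytes, and the sizes are the rounded, unpacked values from the listing, so the totals are approximate and do not directly equal the compressed size of the PyPI package.

```python
import re

# Listing copied from the thread (sizes after unzip, as reported there).
LISTING = """\
70K Jun 21 15:08 commons-compiler-3.0.16.jar
601K Jun 21 15:08 commons-compress-1.19.jar
193K Jun 21 15:08 commons-text-1.6.jar
19M Jun 21 15:08 hadoop-client-api-3.3.1.jar
31M Jun 21 15:08 hadoop-client-runtime-3.3.1.jar
5.3M Jun 21 15:08 hadoop-hdfs-client-3.3.1.jar
1.5M Jun 21 15:08 htrace-core4-4.1.0-incubating.jar
126K Jun 21 15:08 re2j-1.1.jar
192K Jun 21 15:08 stax2-api-4.2.1.jar
511K Jun 21 15:08 woodstox-core-5.3.0.jar
"""

# Assumption: human-readable suffixes use binary multiples.
UNIT_BYTES = {"K": 1024, "M": 1024 ** 2}


def parse_listing(text):
    """Return (jar_name, size_in_bytes) pairs from the size-first lines."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        match = re.fullmatch(r"(\d+(?:\.\d+)?)([KM])", fields[0])
        if match is None:
            continue  # skip anything that does not start with a size
        size = float(match.group(1)) * UNIT_BYTES[match.group(2)]
        entries.append((fields[-1], size))
    return entries


def hadoop_share(entries):
    """Fraction of the listed bytes that belongs to hadoop-* jars."""
    total = sum(size for _, size in entries)
    hadoop = sum(size for name, size in entries if name.startswith("hadoop-"))
    return hadoop / total


entries = parse_listing(LISTING)
total_mb = sum(size for _, size in entries) / 1024 ** 2
print(f"listed jars: {len(entries)}, total: {total_mb:.1f} MB (unpacked)")
print(f"hadoop-* share of listed size: {hadoop_share(entries):.0%}")
```

Under these assumptions the three hadoop-* jars make up the large majority of the listed additions, which matches the focus on them earlier in the thread.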