Yik San Chan created FLINK-22519:
------------------------------------
Summary: Have python-archives also take tar.gz
Key: FLINK-22519
URL: https://issues.apache.org/jira/browse/FLINK-22519
Project: Flink
Issue Type: New Feature
Components: API / Python
Reporter: Yik San Chan
[python-archives|https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/python/python_config.html#python-archives]
currently only takes zip.
In our use case, we want to package the whole conda environment into
python-archives, similar to what the
[docs|https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/faq.html#cluster]
suggest for venv (Python virtual environment). As we use PyFlink for
ML, there are inevitably a few large dependencies (tensorflow, torch, pyarrow),
as well as a lot of small dependencies.
This pattern is not friendly to zip. According to this
[post|https://superuser.com/a/173825], zip compresses each file independently,
so it does not perform well on a lot of small files. tar, on the other hand,
simply bundles all files into a tarball, and gzip can then be applied to the
whole tarball to achieve a smaller size. This may explain why the official
packaging tool, [conda pack|https://conda.github.io/conda-pack/],
produces tar.gz by default, even though zip is available as an option.
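The effect described above can be reproduced at small scale with the Python standard library. The following is only a sketch: the generated file names and contents are made up for illustration and merely imitate a directory of many small, similar source files. It is not the experiment reported in this issue.

```python
import io
import tarfile
import zipfile


def make_files(count=500):
    """Build many small, similar files (hypothetical stand-ins for a package tree)."""
    return {
        f"pkg/module_{i:04d}.py": (
            f"# module {i}\n"
            "import os\nimport sys\n\n"
            f"def func_{i}(x):\n    return x + {i}\n"
        ).encode("utf-8")
        for i in range(count)
    }


def zip_size(files):
    """Size of a zip archive in which every entry is deflated on its own."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return len(buf.getvalue())


def tar_gz_size(files):
    """Size of a tarball where gzip runs over the whole stream of files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


files = make_files()
zip_bytes = zip_size(files)
tar_gz_bytes = tar_gz_size(files)
print(f"zip: {zip_bytes} bytes, tar.gz: {tar_gz_bytes} bytes")
```

Because gzip sees the files as one stream, redundancy shared between files is compressed away, while zip also pays a fixed header cost per entry.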
To test this idea, I ran an experiment with a conda env on my laptop.
My OS: macOS 10.15.7
# Create an environment.yaml as well as a requirements.txt
# Run `conda env create -f environment.yaml` to create the conda env
# Run `conda pack` to produce a tar.gz
# Run `conda pack -o featflow-ml-env.zip` to produce a zip
# environment.yaml
name: featflow-ml-env
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - pytorch=1.8.0
  - scikit-learn=0.23.2
  - pip
  - pip:
    - -r file:requirements.txt
# requirements.txt
apache-flink==1.12.0
deepctr-torch==0.2.6
black==20.8b1
confluent-kafka==1.6.0
pytest==6.2.2
testcontainers==3.4.0
kafka-python==2.0.2
End result: the tar.gz is 854M, the zip is 1.6G
So, long story short, python-archives only supports zip, while zip is not a good
choice for packaging ML libs. Let's change this by adding tar.gz support to
python-archives.
The proposed change: in ProcessPythonEnvironmentManager.java, check the
archive's suffix. If it is tar.gz, unarchive it by gzip-decompressing and then
untarring it.
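The suffix check could look roughly like the sketch below. The actual change would be made in Java inside ProcessPythonEnvironmentManager.java; this Python version only illustrates the dispatch logic, and the helper name `extract_archive` is hypothetical and not part of Flink.

```python
import os
import tarfile
import tempfile
import zipfile


def extract_archive(archive_path, target_dir):
    """Hypothetical helper: choose the extractor based on the file suffix."""
    os.makedirs(target_dir, exist_ok=True)
    if archive_path.endswith(".tar.gz"):
        # gzip-decompress and untar in one pass
        with tarfile.open(archive_path, mode="r:gz") as tf:
            tf.extractall(target_dir)
    elif archive_path.endswith(".zip"):
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(target_dir)
    else:
        raise ValueError(f"Unsupported archive type: {archive_path}")


# Usage: build a tiny tar.gz, extract it, and read the file back.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "hello.txt")
    with open(src, "w") as f:
        f.write("hello")
    archive = os.path.join(tmp, "env.tar.gz")
    with tarfile.open(archive, mode="w:gz") as tf:
        tf.add(src, arcname="hello.txt")
    out_dir = os.path.join(tmp, "out")
    extract_archive(archive, out_dir)
    with open(os.path.join(out_dir, "hello.txt")) as f:
        extracted = f.read()
```

Keeping zip as an explicit branch leaves the existing behavior unchanged, and rejecting unknown suffixes surfaces a clear error instead of a confusing extraction failure.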