SemyonSinchenko commented on issue #1079:
URL: https://github.com/apache/datafusion-comet/issues/1079#issuecomment-2504259760
We had a discussion with @MrPowers, and these are the options we found:
# The problem
- All the "plugin" configs in Apache Spark are static and cannot be changed
once the job is running;
- At the moment, Comet does not provide any additional functionality for end
users.
# Solution 1: fat python installation
In this case, all of Comet's JARs are packaged as resources in the Python
package, together with the Spark JARs themselves. `spark-submit`, `pyspark`,
`spark-shell`, etc. are wrapped to run jobs on Comet by default.
For example, `pip install pycomet[spark35]` will install:
- pyspark 3.5.x
- comet jars and resources
- wrapped scripts for `spark-submit`, `pyspark`, `spark-shell`
Under the hood, the wrapped `spark-submit` will, for example, add the Comet
JARs to the classpath and enable the plugin by default. All other parameters
are passed through to the original `spark-submit` (see the sketch below).
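A minimal sketch of what such a wrapper could look like, assuming the bundled JARs live in a `jars/` resource directory inside the package and that the plugin class is `org.apache.spark.CometPlugin` as in the Comet docs; the module layout (`pycomet/submit.py`) and helper names are illustrative only:
```python
# pycomet/submit.py -- hypothetical "fat" wrapper entry point (sketch only)
import os
import subprocess
import sys
from pathlib import Path


def _comet_jars_path() -> str:
    """Return a comma-separated list of the Comet JARs bundled with the package."""
    jars_dir = Path(__file__).parent / "jars"
    return ",".join(str(p) for p in jars_dir.glob("*.jar"))


def main() -> None:
    # Prepend the Comet JARs and enable the plugin, then delegate everything
    # else to the regular spark-submit (from SPARK_HOME if set, else from PATH).
    spark_submit = os.path.join(os.environ.get("SPARK_HOME", ""), "bin", "spark-submit")
    cmd = [
        spark_submit if os.path.exists(spark_submit) else "spark-submit",
        "--jars", _comet_jars_path(),
        "--conf", "spark.plugins=org.apache.spark.CometPlugin",
        "--conf", "spark.comet.enabled=true",
        *sys.argv[1:],  # user-provided arguments are passed through untouched
    ]
    sys.exit(subprocess.call(cmd))


if __name__ == "__main__":
    main()
```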
# Solution 2: thin python installation
In this case, `pycomet` is a tiny package that contains only the Maven
coordinates of the Comet JARs and, for example, `comet-submit` and `pycomet`
commands that are again just wrappers on top of Spark's `spark-submit`.
`SPARK_HOME` must be specified. At runtime, `comet-submit` / `pycomet` will
determine the Spark version, add the Comet JARs from Maven Central to the
classpath, and run the submit command with the plugin enabled (see the sketch
below).
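A rough sketch of the thin variant, assuming a binary Spark distribution that ships a `RELEASE` file in `SPARK_HOME`; the Maven coordinate pattern is illustrative only and would have to be taken from the actual Comet releases:
```python
# comet-submit -- hypothetical thin wrapper (sketch only)
import os
import re
import subprocess
import sys


def _spark_version(spark_home: str) -> str:
    # Binary Spark distributions ship a RELEASE file whose first line looks
    # like "Spark 3.5.1 built for Hadoop ...".
    with open(os.path.join(spark_home, "RELEASE")) as f:
        match = re.search(r"Spark (\d+\.\d+)", f.read())
    if not match:
        raise RuntimeError("Could not determine the Spark version from SPARK_HOME")
    return match.group(1)


def main() -> None:
    spark_home = os.environ["SPARK_HOME"]  # required for the thin installation
    version = _spark_version(spark_home)
    # Illustrative coordinate pattern; pycomet would keep a version -> coordinate map.
    coord = f"org.apache.datafusion:comet-spark-spark{version}_2.12:0.4.0"
    cmd = [
        os.path.join(spark_home, "bin", "spark-submit"),
        "--packages", coord,  # resolved from Maven Central at submit time
        "--conf", "spark.plugins=org.apache.spark.CometPlugin",
        *sys.argv[1:],  # everything else goes straight to spark-submit
    ]
    sys.exit(subprocess.call(cmd))


if __name__ == "__main__":
    main()
```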
# Solution 3: just a helper
In this case, `pycomet` contains only the Maven coordinates.
For example, it could be done in the same way as in python-deequ:
```python
import os
import re
from functools import lru_cache

# Maven coordinates of the Deequ artifact per supported Spark minor version
SPARK_TO_DEEQU_COORD_MAPPING = {
    "3.5": "com.amazon.deequ:deequ:2.0.7-spark-3.5",
    "3.3": "com.amazon.deequ:deequ:2.0.7-spark-3.3",
    "3.2": "com.amazon.deequ:deequ:2.0.7-spark-3.2",
    "3.1": "com.amazon.deequ:deequ:2.0.7-spark-3.1",
}


def _extract_major_minor_versions(full_version: str):
    major_minor_pattern = re.compile(r"(\d+\.\d+)\.*")
    match = re.match(major_minor_pattern, full_version)
    if match:
        return match.group(1)


@lru_cache(maxsize=None)
def _get_spark_version() -> str:
    try:
        spark_version = os.environ["SPARK_VERSION"]
    except KeyError:
        raise RuntimeError(
            "SPARK_VERSION environment variable is required. "
            f"Supported values are: {SPARK_TO_DEEQU_COORD_MAPPING.keys()}"
        )
    return _extract_major_minor_versions(spark_version)


def _get_deequ_maven_config():
    spark_version = _get_spark_version()
    try:
        return SPARK_TO_DEEQU_COORD_MAPPING[spark_version[:3]]
    except KeyError:
        raise RuntimeError(
            f"Found incompatible Spark version {spark_version}; use one of "
            f"the supported Spark versions for Deequ: {SPARK_TO_DEEQU_COORD_MAPPING.keys()}"
        )
```
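By analogy, the Comet helper would only need a similar mapping and lookup; the mapping name, helper, and coordinates below are illustrative and would have to match the actual Comet artifacts on Maven Central:
```python
# Hypothetical Comet analogue of the deequ mapping (coordinates are illustrative).
SPARK_TO_COMET_COORD_MAPPING = {
    "3.5": "org.apache.datafusion:comet-spark-spark3.5_2.12:0.4.0",
    "3.4": "org.apache.datafusion:comet-spark-spark3.4_2.12:0.4.0",
}


def get_comet_maven_coordinates() -> str:
    """Return the Comet Maven coordinates matching the detected Spark version."""
    spark_version = _get_spark_version()  # reuse the helper from the snippet above
    try:
        return SPARK_TO_COMET_COORD_MAPPING[spark_version]
    except KeyError:
        raise RuntimeError(
            f"No Comet artifact known for Spark {spark_version}; "
            f"supported versions: {list(SPARK_TO_COMET_COORD_MAPPING)}"
        )
```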
# Solution 4: all together
- Running `pip install pycomet` will install only helpers and wrappers
(`pycomet`, `comet-submit`);
- Running `pip install pycomet[spark35]` will install helpers, wrappers, and
also all the Comet resources + pyspark itself (a sketch of how the extras
could be declared follows below).
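A minimal `setup.py` sketch of how the extras and entry points could be wired up; the package, module, and extra names are the hypothetical ones used above:
```python
# setup.py -- hypothetical packaging sketch for Solution 4
from setuptools import find_packages, setup

setup(
    name="pycomet",
    version="0.1.0",
    packages=find_packages(),
    # The base install ships only the helpers and wrappers.
    entry_points={
        "console_scripts": [
            "pycomet=pycomet.shell:main",
            "comet-submit=pycomet.submit:main",
        ]
    },
    # Extras pin a concrete PySpark line for the "fat" installation.
    extras_require={
        "spark35": ["pyspark>=3.5,<3.6"],
        "spark34": ["pyspark>=3.4,<3.5"],
    },
    # For the fat variant, the Comet JARs would be shipped as package data, e.g.
    # package_data={"pycomet": ["jars/*.jar"]} together with include_package_data=True.
)
```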