SemyonSinchenko commented on issue #1079:
URL: https://github.com/apache/datafusion-comet/issues/1079#issuecomment-2504259760
We had a discussion with @MrPowers, and these are the options we found:
# The problem
- All the "plugin" configs in Apache Spark are static and cannot be changed
once the job is running;
- At the moment, Comet does not provide any additional functionality for end
users.
# Solution 1: fat python installation
In this case, all of Comet's JARs are packaged as resources in the Python
package, together with the Spark JARs themselves. `spark-submit`, `pyspark`,
`spark-shell`, etc. are wrapped to run jobs on Comet by default.
For example, `pip install pycomet[spark35]` will install:
- pyspark 3.5.x
- comet jars and resources
- wrapped scripts for `spark-submit`, `pyspark`, `spark-shell`
Under the hood, the wrapped `spark-submit` will, for example, add the Comet
JARs to the classpath and enable the plugin by default. All other parameters
are passed through to the original `spark-submit` (see the sketch below).
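A minimal sketch of what such a wrapper could look like, assuming the bundled JARs live in a `jars/` resource directory inside the package and that the plugin class is `org.apache.spark.CometPlugin` as in the Comet docs; the module layout (`pycomet/submit.py`) and helper names are illustrative only:
```python
# pycomet/submit.py -- hypothetical "fat" wrapper entry point (sketch only)
import os
import subprocess
import sys
from pathlib import Path


def _comet_jars_path() -> str:
    """Return a comma-separated list of the Comet JARs bundled with the package."""
    jars_dir = Path(__file__).parent / "jars"
    return ",".join(str(p) for p in jars_dir.glob("*.jar"))


def main() -> None:
    # Prepend the Comet JARs and enable the plugin, then delegate everything
    # else to the regular spark-submit (from SPARK_HOME if set, else from PATH).
    spark_submit = os.path.join(os.environ.get("SPARK_HOME", ""), "bin", "spark-submit")
    cmd = [
        spark_submit if os.path.exists(spark_submit) else "spark-submit",
        "--jars", _comet_jars_path(),
        "--conf", "spark.plugins=org.apache.spark.CometPlugin",
        "--conf", "spark.comet.enabled=true",
        *sys.argv[1:],  # user-provided arguments are passed through untouched
    ]
    sys.exit(subprocess.call(cmd))


if __name__ == "__main__":
    main()
```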
# Solution 2: thin python installation
In this case, `pycomet` is a tiny package that contains only the Maven
coordinates of the Comet JARs and, for example, `comet-submit` and `pycomet`
commands that are again just wrappers on top of Spark's `spark-submit`.
`SPARK_HOME` must be specified. At runtime, `comet-submit` / `pycomet` will
determine the Spark version, add the Comet JARs from Maven Central to the
classpath, and run the submit command with the plugin enabled (see the sketch
below).
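A rough sketch of the thin variant, assuming a binary Spark distribution that ships a `RELEASE` file in `SPARK_HOME`; the Maven coordinate pattern is illustrative only and would have to be taken from the actual Comet releases:
```python
# comet-submit -- hypothetical thin wrapper (sketch only)
import os
import re
import subprocess
import sys


def _spark_version(spark_home: str) -> str:
    # Binary Spark distributions ship a RELEASE file whose first line looks
    # like "Spark 3.5.1 built for Hadoop ...".
    with open(os.path.join(spark_home, "RELEASE")) as f:
        match = re.search(r"Spark (\d+\.\d+)", f.read())
    if not match:
        raise RuntimeError("Could not determine the Spark version from SPARK_HOME")
    return match.group(1)


def main() -> None:
    spark_home = os.environ["SPARK_HOME"]  # required for the thin installation
    version = _spark_version(spark_home)
    # Illustrative coordinate pattern; pycomet would keep a version -> coordinate map.
    coord = f"org.apache.datafusion:comet-spark-spark{version}_2.12:0.4.0"
    cmd = [
        os.path.join(spark_home, "bin", "spark-submit"),
        "--packages", coord,  # resolved from Maven Central at submit time
        "--conf", "spark.plugins=org.apache.spark.CometPlugin",
        *sys.argv[1:],  # everything else goes straight to spark-submit
    ]
    sys.exit(subprocess.call(cmd))


if __name__ == "__main__":
    main()
```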
# Solution 3: just a helper
In this case, `pycomet` contains only the Maven coordinates.
For example, it could be done in the same way as in python-deequ:
```python
import os
import re
from functools import lru_cache

# Maven coordinates of the Deequ artifact per supported Spark minor version
SPARK_TO_DEEQU_COORD_MAPPING = {
    "3.5": "com.amazon.deequ:deequ:2.0.7-spark-3.5",
    "3.3": "com.amazon.deequ:deequ:2.0.7-spark-3.3",
    "3.2": "com.amazon.deequ:deequ:2.0.7-spark-3.2",
    "3.1": "com.amazon.deequ:deequ:2.0.7-spark-3.1",
}


def _extract_major_minor_versions(full_version: str):
    major_minor_pattern = re.compile(r"(\d+\.\d+)\.*")
    match = re.match(major_minor_pattern, full_version)
    if match:
        return match.group(1)


@lru_cache(maxsize=None)
def _get_spark_version() -> str:
    try:
        spark_version = os.environ["SPARK_VERSION"]
    except KeyError:
        raise RuntimeError(
            "SPARK_VERSION environment variable is required. "
            f"Supported values are: {SPARK_TO_DEEQU_COORD_MAPPING.keys()}"
        )
    return _extract_major_minor_versions(spark_version)


def _get_deequ_maven_config():
    spark_version = _get_spark_version()
    try:
        return SPARK_TO_DEEQU_COORD_MAPPING[spark_version[:3]]
    except KeyError:
        raise RuntimeError(
            f"Found incompatible Spark version {spark_version}; use one of "
            f"the supported Spark versions for Deequ: {SPARK_TO_DEEQU_COORD_MAPPING.keys()}"
        )
```
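By analogy, the Comet helper would only need a similar mapping and lookup; the mapping name, helper, and coordinates below are illustrative and would have to match the actual Comet artifacts on Maven Central:
```python
# Hypothetical Comet analogue of the deequ mapping (coordinates are illustrative).
SPARK_TO_COMET_COORD_MAPPING = {
    "3.5": "org.apache.datafusion:comet-spark-spark3.5_2.12:0.4.0",
    "3.4": "org.apache.datafusion:comet-spark-spark3.4_2.12:0.4.0",
}


def get_comet_maven_coordinates() -> str:
    """Return the Comet Maven coordinates matching the detected Spark version."""
    spark_version = _get_spark_version()  # reuse the helper from the snippet above
    try:
        return SPARK_TO_COMET_COORD_MAPPING[spark_version]
    except KeyError:
        raise RuntimeError(
            f"No Comet artifact known for Spark {spark_version}; "
            f"supported versions: {list(SPARK_TO_COMET_COORD_MAPPING)}"
        )
```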
# Solution 4: all together
- Running `pip install pycomet` will install only helpers and wrappers
(`pycomet`, `comet-submit`);
- Running `pip install pycomet[spark35]` will install helpers, wrappers, and
also all the Comet resources + pyspark itself (a sketch of how the extras
could be declared follows below).
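A minimal `setup.py` sketch of how the extras and entry points could be wired up; the package, module, and extra names are the hypothetical ones used above:
```python
# setup.py -- hypothetical packaging sketch for Solution 4
from setuptools import find_packages, setup

setup(
    name="pycomet",
    version="0.1.0",
    packages=find_packages(),
    # The base install ships only the helpers and wrappers.
    entry_points={
        "console_scripts": [
            "pycomet=pycomet.shell:main",
            "comet-submit=pycomet.submit:main",
        ]
    },
    # Extras pin a concrete PySpark line for the "fat" installation.
    extras_require={
        "spark35": ["pyspark>=3.5,<3.6"],
        "spark34": ["pyspark>=3.4,<3.5"],
    },
    # For the fat variant, the Comet JARs would be shipped as package data, e.g.
    # package_data={"pycomet": ["jars/*.jar"]} together with include_package_data=True.
)
```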