This is an automated email from the ASF dual-hosted git repository.
jmalkin pushed a commit to branch python
in repository https://gitbox.apache.org/repos/asf/datasketches-spark.git
The following commit(s) were added to refs/heads/python by this push:
new 6357cdd Update readmes with build instructions
6357cdd is described below
commit 6357cdd44ca4450f65d4ac47321a53042462a2cc
Author: Jon Malkin <[email protected]>
AuthorDate: Tue Feb 18 13:34:05 2025 -0800
Update readmes with build instructions
---
README.md | 29 +++++++++++++++++++++++++++--
python/README.md | 31 +++++++++++++++++++++++++++++++
2 files changed, 58 insertions(+), 2 deletions(-)
diff --git a/README.md b/README.md
index c0749a4..8be7310 100644
--- a/README.md
+++ b/README.md
@@ -21,5 +21,30 @@
This repo is still an early-stage work in progress.
-There have been multiple attempts to help integrate Apache DataSketches into
-Apache Spark, including one built into Spark itself as of v3.5. All are useful
-work, but in comparing them, there are various limitations to each library.
-Whether limiting the type of sketches available
-(e.g. native Spark provides only HLL) or limiting flexibility and
-functionality (e.g. forcing HLL and Theta to use a common interface which
-precludes set operations HLL cannot support, or using global parameters to
-control the sizes of all sketch instances in the query), the other libraries
-place undesirable constraints on developers looking to use sketches in their
-queries or data systems. This library aims to restore that choice to develoeprs.
+There have been multiple attempts to help integrate Apache DataSketches
+into Apache Spark, including one built into Spark itself as of v3.5.
+All are useful work, but in comparing them, there are various limitations
+to each library. Whether limiting the type of sketches available
+(e.g. native Spark provides only HLL) or limiting flexibility and
+functionality (e.g. forcing HLL and Theta to use a common interface which
+precludes set operations HLL cannot support, or using global parameters
+to control the sizes of all sketch instances in the query), the other
+libraries place undesirable constraints on developers looking to use
+sketches in their queries or data systems. This library aims to restore
+that choice to developers.
+
+## Build and Test Instructions
+
+Building the library requires `sbt`, a commonly used build
+system for Scala projects. There are several environment variables
+that can be used to configure the project:
+
+* Java version, typically via `$JAVA_HOME`: Default is 11
+* `$SCALA_VERSION`: Default is 2.12.20
+* `$SPARK_VERSION`: Default is 3.5.4
+
+The package is built using `sbt package` and tests are
+run with `sbt test`.
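+
+As a minimal sketch of a non-default configuration (the JDK path and
+Scala version below are illustrative assumptions, not project defaults):
+
+```shell
+export JAVA_HOME=/path/to/jdk-11   # JDK used for the build
+export SCALA_VERSION=2.13.15       # override the 2.12.20 default
+export SPARK_VERSION=3.5.4
+sbt package   # build the library jar
+sbt test      # run the test suite
+```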
+
+If building for the pyspark package, we currently recommend using
+Java 11, even if the library will be used with later Java versions.
diff --git a/python/README.md b/python/README.md
index 7e7f2b9..370f799 100644
--- a/python/README.md
+++ b/python/README.md
@@ -22,3 +22,34 @@
This repo is still an early-stage work in progress.
This is the PySpark plugin component.
+
+## Usage
+
+There are several Spark config options needed to use the library.
+`tests/conftest.py` provides a basic example, and a sketch follows
+the list below. The key settings to note are:
+
+* `.config("spark.driver.userClassPathFirst", "true")`
+* `.config("spark.executor.userClassPathFirst", "true")`
+* `.config("spark.driver.extraClassPath", get_dependency_classpath())`
+* `.config("spark.executor.extraClassPath", get_dependency_classpath())`
+
+Starting with Spark 3.5, Spark itself includes an older version of the
+DataSketches java library, so Spark needs to be told to use the
+provided version.
+
+Initial testing with Java 17 indicates that there may be
+additional configuration options needed to enable the
+use of MemorySegment. For now we suggest using a base library
+compiled for Java 11 with pyspark.
+
+## Build and Test Instructions
+
+This component requires that the Scala library has already been built.
+The build process checks for the relevant jars and fails if they do
+not exist. It also updates the jars if the python module's copies
+are older.
+
+The easiest way to build the library is with the `build` package:
+`python -m build --wheel`. The resulting wheel can then be installed with
+`python -m pip install dist/datasketches_spark_<version-info>.whl`.
+
+Tests are run with `pytest` or `tox`.
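+
+As a minimal end-to-end sketch (assuming `sbt package` is run from the
+repository root and the remaining steps from this `python/` directory;
+the wheel filename placeholder is filled in from the actual file under
+`dist/`):
+
+```shell
+(cd .. && sbt package)    # build the Scala jars first
+python -m build --wheel   # produce the wheel under dist/
+python -m pip install dist/datasketches_spark_<version-info>.whl
+pytest                    # or: tox
+```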
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]