GitHub user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20089#discussion_r158796151
--- Diff: python/README.md ---
@@ -29,4 +29,4 @@ The Python packaging for Spark is not intended to replace all of the other use c
## Python Requirements
-At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy and pandas).
+At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages might have their own requirements declared as "Extras" (including numpy, pandas, and pyarrow). You can install the requirements by specifying their extra names.
--- End diff --
Ah, I see. How about simply:
```
At its core PySpark depends on Py4J (currently version 0.10.6), but some
additional sub-packages have their own extra requirements for certain
features (including numpy, pandas, and pyarrow).
```
for now? I just noticed we are a bit unclear on this (e.g., I have actually
been under the impression that NumPy is required for ML/MLlib), but I think
this roughly describes it correctly and is good enough.
I will maybe make a follow-up to describe it fully later; this PR targets
PyArrow anyway.
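For context, the "Extras" mentioned in the diff are setuptools'
`extras_require` entries. A minimal sketch of how such extras could be
declared; the package name, extra names, and version pins below are
hypothetical, not Spark's actual setup.py:
```
# setup.py -- minimal sketch; names and version pins are hypothetical,
# not the actual PySpark packaging configuration.
from setuptools import setup

setup(
    name="example-package",
    version="0.1",
    extras_require={
        # Optional dependencies pulled in only when the extra is requested.
        "sql": ["pandas>=0.19.2", "pyarrow>=0.8.0"],
        "ml": ["numpy>=1.7"],
    },
)
```
With such a declaration, `pip install example-package[sql]` installs the
package plus the dependencies listed under the "sql" extra, which is what
"specifying their extra names" refers to.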
---