HyukjinKwon commented on pull request #29703:
URL: https://github.com/apache/spark/pull/29703#issuecomment-690820353


   > 1. How does this interact with the ability to specify dependencies in a 
requirements file? It seems odd to have to do something like HADOOP_VERSION=3.2 
pip install -r requirements.txt, because, kinda like with --install-option, 
we've now modified pip's behavior across all the libraries it's going to 
install.
   >
   >    I also wonder if this plays well with tools like pip-tools that compile 
down requirements files into the full list of their transitive dependencies. 
I'm guessing users will need to manually preserve the environment variables, 
because they will not be reflected in the compiled requirements.
   
   I agree that it doesn't look very pip-friendly. That's why I had to 
investigate a lot and write down what I checked in the PR description. 
   
   `--install-option` is supported via `requirements.txt`, so once pip provides 
a proper way to configure this, we will switch to it (at SPARK-32837). We can't 
use this option for now due to https://github.com/pypa/pip/issues/1883. Given 
my investigation, there seems to be no other way to do it.
   
   We can keep this as an experimental mode for the time being, and switch to 
the proper pip installation option once they support one in the future.
   
   > 2. Have you considered publishing these alternate builds under different 
package names? e.g. pyspark-hadoop3.2. This avoids the need to mess with 
environment variables, and delivers a more vanilla install experience. But it 
will also push us to define upfront what combinations to publish builds for to 
PyPI.
   
   I have thought about this option too, but:
   - I think we'll end up having to publish one package per profile we support.
   - I still think using pip's native configuration is the ideal way. By using 
environment variables, we can easily switch to pip's option in the future.
   - Minor, but it will be more difficult to track the usage 
(https://pypistats.org/packages/pyspark).
   
   > 3. Are you sure it's OK to point at archive.apache.org? Everyone 
installing a non-current version of PySpark with alternate versions of Hadoop / 
Hive specified will hit the archive. Unlike PyPI, the Apache archive is not 
backed by a generous CDN:
   >
   >     Do note that a daily limit of 5GB per IP is being enforced on 
archive.apache.org, to prevent abuse.
   >
   >     In Flintrock, I never touch the archive out of fear of being an 
"abusive user". This is another argument for publishing alternate packages to 
PyPI.
   
   Yeah, I understand this is a valid concern. But the archive is already 
available and people already use it. It's also used in our own CI:
   
   
https://github.com/apache/spark/blob/b84ed4146d93b37adb2b83ca642c7978a1ac853e/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala#L82
   
   The PR just makes it easier to download old versions from the archive. We 
can also make the download location configurable by exposing an environment 
variable.
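   
   As a rough illustration of what that could look like, the sketch below falls 
back to archive.apache.org only when no mirror is specified. The variable name 
`PYSPARK_RELEASE_MIRROR` and the exact URL layout are assumptions for 
illustration, not something this PR ships.
   
   ```python
   # Hypothetical sketch: pick the download base URL from an environment variable,
   # falling back to the Apache archive. PYSPARK_RELEASE_MIRROR is an assumed name.
   import os

   DEFAULT_BASE = "https://archive.apache.org/dist/spark"


   def release_url(spark_version, hadoop_version):
       """Build a release tarball URL following the Apache Spark naming convention."""
       base = os.environ.get("PYSPARK_RELEASE_MIRROR", DEFAULT_BASE)
       return "%s/spark-%s/spark-%s-bin-hadoop%s.tgz" % (
           base, spark_version, spark_version, hadoop_version)


   if __name__ == "__main__":
       # A user could point at a faster mirror before running pip install, e.g.:
       #   PYSPARK_RELEASE_MIRROR=https://example.org/dist/spark pip install pyspark
       print(release_url("3.0.1", "3.2"))
   ```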

