HyukjinKwon commented on pull request #29703: URL: https://github.com/apache/spark/pull/29703#issuecomment-690820353
> 1. How does this interact with the ability to specify dependencies in a requirements file? It seems odd to have to do something like HADOOP_VERSION=3.2 pip install -r requirements.txt, because, kinda like with --install-option, we've now modified pip's behavior across all the libraries it's going to install.
>
> I also wonder if this plays well with tools like pip-tools that compile down requirements files into the full list of their transitive dependencies. I'm guessing users will need to manually preserve the environment variables, because they will not be reflected in the compiled requirements.

I agree that it doesn't look very pip-friendly. That's why I had to investigate a lot and write down what I checked in the PR description. `--install-option` is supported via `requirements.txt`, so once pip provides a proper way to configure this, we will switch to it (at SPARK-32837). We can't use that option for now due to https://github.com/pypa/pip/issues/1883. Given my investigation, there seems to be no other way. We can keep this as an experimental mode for the time being, and switch to the proper pip installation option once pip supports it in the future.

> 2. Have you considered publishing these alternate builds under different package names? e.g. pyspark-hadoop3.2. This avoids the need to mess with environment variables, and delivers a more vanilla install experience. But it will also push us to define upfront what combinations to publish builds for to PyPI.

I have thought about this option too, but:

- I think we'll end up with multiple packages per the profiles we support.
- I still think using pip's native configuration is the ideal way. By using environment variables, we can easily switch to pip's option in the future.
- Minor, but it will be difficult to track the usage (https://pypistats.org/packages/pyspark).

> 3. Are you sure it's OK to point at archive.apache.org? Everyone installing a non-current version of PySpark with alternate versions of Hadoop / Hive specified will hit the archive. Unlike PyPI, the Apache archive is not backed by a generous CDN:
>
> > Do note that a daily limit of 5GB per IP is being enforced on archive.apache.org, to prevent abuse.
>
> In Flintrock, I never touch the archive out of fear of being an "abusive user". This is another argument for publishing alternate packages to PyPI.

Yeah, I understand this can be a valid concern. But this is already available and people use it. It is also used in our own CI:

https://github.com/apache/spark/blob/b84ed4146d93b37adb2b83ca642c7978a1ac853e/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala#L82

The PR just makes it easier to download old versions. We can also make the download location configurable by exposing an environment variable.
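For context, here is a minimal sketch of how an environment-variable-driven install along these lines could look: an install-time helper that reads `HADOOP_VERSION` from the environment to select the distribution to download, with a mirror override to address the archive.apache.org rate-limit concern. The variable name `PYSPARK_RELEASE_MIRROR`, the defaults, and the URL layout are illustrative assumptions for this sketch, not necessarily what the PR implements.

```python
import os

# Illustrative names and defaults -- HADOOP_VERSION, PYSPARK_RELEASE_MIRROR,
# and the URL layout are assumptions for this sketch.
DEFAULT_HADOOP_VERSION = "2.7"
DEFAULT_MIRROR = "https://archive.apache.org/dist/spark"


def spark_dist_url(spark_version: str) -> str:
    """Build the URL of the Spark binary distribution to download,
    selecting the Hadoop profile from the environment at pip-install time."""
    hadoop_version = os.environ.get("HADOOP_VERSION", DEFAULT_HADOOP_VERSION)
    # A mirror override would keep every install from hitting archive.apache.org,
    # which is the 5GB-per-IP concern raised above.
    mirror = os.environ.get("PYSPARK_RELEASE_MIRROR", DEFAULT_MIRROR)
    # Published Spark distributions follow the spark-<ver>-bin-hadoop<ver>.tgz naming.
    return (
        f"{mirror}/spark-{spark_version}/"
        f"spark-{spark_version}-bin-hadoop{hadoop_version}.tgz"
    )


if __name__ == "__main__":
    # Example: HADOOP_VERSION=3.2 python sketch.py
    print(spark_dist_url("3.0.1"))
```

With this kind of approach, `HADOOP_VERSION=3.2 pip install pyspark` (or `pip install -r requirements.txt`) picks a different Hadoop profile without any change to the requirements file itself, which is why switching to a native per-requirement pip option later (SPARK-32837) would be a drop-in replacement.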