The team I'm on currently uses pip-installed PySpark for local development,
and we regularly access S3 directly from our laptops/workstations.

One of the benefits of Spark built against Hadoop 3.2 rather than 2.7 is
being able to use a recent version of hadoop-aws with mature s3a support.
With Hadoop 2.7, s3a support is buggy and incomplete, and incompatibilities
prevent pairing a Spark build based on Hadoop 2.7 with hadoop-aws 2.8 or
newer.
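
As a rough sketch of what that local setup looks like (the hadoop-aws
version, credentials provider, and bucket path below are just placeholders,
not anything prescribed by this thread):

    # Rough sketch: pip-installed PySpark reading S3 directly via s3a.
    # The hadoop-aws version must match the Hadoop version the PySpark
    # build was compiled against; 3.2.0 and the bucket path are placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("local-s3a-read")
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
        # Resolve AWS credentials from the standard provider chain
        # (environment variables, ~/.aws/credentials, instance profile, ...).
        .config(
            "spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
        )
        .getOrCreate()
    )

    df = spark.read.parquet("s3a://example-bucket/some/prefix/")
    df.show(5)

With the Hadoop 2.7 build you would be stuck pairing this with hadoop-aws
2.7.x, which is exactly where the s3a problems above come in.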

On Wed, Jun 24, 2020 at 10:15 AM Sean Owen <sro...@gmail.com> wrote:

> Will pyspark users care much about the Hadoop version? They won't if
> running locally. They will if connecting to a Hadoop cluster. Then again,
> in that context they're probably using a distro anyway that harmonizes it.
> Hadoop 3's installed base can't be that large yet; it's been around far
> less time.
>
> The bigger question is indeed dropping Hadoop 2.x / Hive 1.x etc.
> eventually, not now.
> But if the question now is build defaults, is it a big deal either way?
>
> On Tue, Jun 23, 2020 at 11:03 PM Xiao Li <lix...@databricks.com> wrote:
>
>> I think we just need to provide two options, Hadoop 3.2 or Hadoop 2.7, and
>> let end users choose the one they need. Thus, SPARK-32017 (Make Pyspark
>> Hadoop 3.2+ Variant available in PyPI) is a high-priority task for the
>> Spark 3.1 release to me.
>>
>> I do not know how to track the popularity of Hadoop 2 vs. Hadoop 3. Based
>> on this link
>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs, it
>> sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
