Thanks for jumping in here, Steve,

I agree with you; my only concern is that this is quite a jump. However,
judging by the ecosystem, it might not be such a problem. Starting with the
cloud providers:

AWS active EMR distributions:

   1. EMR 7.3.0
      <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-730-release.html>
      is at Hadoop 3.3.6
   2. EMR 6.15
      <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-6150-release.html>
      is at Hadoop 3.3.6 (<6.6.x is EOL
      <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-standard-support.html#emr-stadard-support-policy>)
   3. EMR 5.36
      <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-5362-release.html>
      is at Hadoop 2.10.1 (≤5.35 is EOL
      <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-standard-support.html#emr-stadard-support-policy>,
      so only bugfixes for 5.36.x)

GCP active Dataproc distributions:

   - Dataproc 2.2.x
     <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.2>
     is at Hadoop 3.3.6
   - Dataproc 2.1.x
     <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.1>
     is at Hadoop 3.3.6
   - Dataproc 2.0.x
     <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.0>
     is at Hadoop 3.2.4 (EOL 2024/07/31
     <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-version-clusters>)

Azure active HDInsight distributions:

   - HDInsight 5.x
     <https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-5x-component-versioning>
     is at Hadoop 3.3.4
   - HDInsight 4.0
     <https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-40-component-versioning>
     is at Hadoop 3.1.1 (they note that certain HDInsight 4.0 cluster types
     have retired or will retire soon)

Or, looking at query engines:

   - Spark 3.5.3
     <https://github.com/apache/spark/blob/d39f5ab99f67ce959b4379ecc3d6e262c10146cf/pom.xml#L125>
     is at Hadoop 3.3.4
   - Spark 3.4.4
     <https://github.com/apache/spark/blob/d3d84e045cc484cf7b70d36410a554238d7aef0e/pom.xml#L122>
     is at Hadoop 3.3.4

Hive 3.x has also been marked as EOL since October
<https://hive.apache.org/general/downloads/>, and Hive 4 is also at Hadoop
3.3.6
<https://github.com/apache/hive/blob/c29bab6ff780e6d1cea74e995a50528364ae383a/pom.xml#L143>.

Given where the ecosystem is, jumping to Hadoop 3.3.x seems reasonable to
me. Users on an older Hadoop version can still stay on Parquet 1.14.x.
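For downstream builds, the effect is just which minimum Hadoop artifact
gets paired with Parquet. A minimal Maven sketch of what a Hadoop 3.3.x
setup could look like (the version numbers here are illustrative
assumptions for discussion, not what a future release will actually pin):

```xml
<properties>
  <!-- Illustrative versions only: Hadoop 3.3.x line, per the proposal -->
  <hadoop.version>3.3.6</hadoop.version>
  <parquet.version>1.15.0</parquet.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>${parquet.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
</dependencies>
```

Users stuck on Hadoop 2.x would simply keep `parquet.version` at the
1.14.x line instead.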

Kind regards,
Fokko



On Thu, Nov 7, 2024 at 16:16, Steve Loughran
<ste...@cloudera.com.invalid> wrote:

> On Mon, 4 Nov 2024 at 09:02, Fokko Driesprong <fo...@apache.org> wrote:
>
> > Hi everyone,
> >
> > Breaking the radio silence from my end, I was enjoying paternity leave.
> >
> > I wanted to bring this up for a while. In Parquet we're still supporting
> > Hadoop 2.7.3, which was released in August 2016
> > <https://hadoop.apache.org/release/2.7.3.html>. For things like JDK21
> > support, we have to drop these old versions. I was curious about what
> > everyone thinks as a reasonable lower bound.
> >
> > My suggested route is to bump it to Hadoop 2.9.3
> > <https://github.com/apache/parquet-java/pull/2944/> (November 2019) for
> > Parquet 1.15.0, and then drop Hadoop 2 in the major release after that.
> > Any
> > thoughts, questions or concerns?
> >
> I'd be ruthless and say hadoop 3.3.x only.
>
> hadoop 2.x is nominally "java 7" only. really.
>
> hadoop 3.3.x is java8, but you really need to be on hadoop 3.4.x to get a
> set of dependencies which work OK with java 17+.
>
> Staying with older releases hampers parquet in terms of testing,
> maintenance, inability to use improvements written in the past five or more
> years, and more
>
> My proposal would be
>
>    - 1.14.x: move to 2.9.3
>    - 1.15.x hadoop 3.3.x only
>