From the replies here, and looking at the major cloud providers, I don't see any concerns regarding moving the lower bound to Hadoop 3.3.x. As suggested on the issue <https://github.com/apache/parquet-java/issues/2943>, it would be good to first get rid of the Hadoop 2 profile and all the error-prone reflection.
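To make concrete what would go away: supporting Hadoop 2 and 3 from a single artifact typically forces runtime probing of the newer APIs. The sketch below is illustrative only (the class name is made up, and it is not the actual parquet-java shim code); it probes for FileSystem.openFile(), which only exists on Hadoop 3.3+, and falls back to plain open() otherwise.

    import java.io.IOException;
    import java.lang.reflect.Method;
    import java.util.concurrent.CompletableFuture;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative sketch, not the real parquet-java code: the kind of
    // reflection shim that dual Hadoop 2/3 support tends to require.
    final class OpenFileShim {

      // Probe once: FileSystem.openFile(Path) is only present on Hadoop 3.3+.
      private static final Method OPEN_FILE = lookupOpenFile();

      private static Method lookupOpenFile() {
        try {
          return FileSystem.class.getMethod("openFile", Path.class);
        } catch (NoSuchMethodException e) {
          return null; // Hadoop 2.x: the method does not exist
        }
      }

      @SuppressWarnings("unchecked")
      static FSDataInputStream open(FileSystem fs, Path path) throws IOException {
        if (OPEN_FILE == null) {
          return fs.open(path); // Hadoop 2 fallback
        }
        try {
          Object builder = OPEN_FILE.invoke(fs, path);
          Class<?> builderClass =
              Class.forName("org.apache.hadoop.fs.FutureDataInputStreamBuilder");
          CompletableFuture<FSDataInputStream> future =
              (CompletableFuture<FSDataInputStream>)
                  builderClass.getMethod("build").invoke(builder);
          return future.join();
        } catch (ReflectiveOperationException e) {
          throw new IOException("openFile() reflection shim failed", e);
        }
      }
    }

Every probe of this kind is a place where shading, class loaders and API drift can go subtly wrong, which is why removing them first makes the 3.3.x bump cleaner.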
Thanks everyone!

Kind regards,
Fokko

On Mon, 11 Nov 2024 at 17:23, Steve Loughran <ste...@cloudera.com.invalid> wrote:

That's about what I expected.

HDInsight is probably a fork of Hadoop 3.1.x that is kept up to date; certainly they do almost all of the work on the abfs connector against trunk, with backports to the 3.4 branch, while AWS developers are contributing great stuff in the S3A codebase (while I get left with the mundane stuff, like libraries forgetting to close streams: https://github.com/apache/hadoop/pull/7151).

Cloudera's code is itself a 3.1.x fork but is more up to date w.r.t. Java 11 and CVEs; ~everything on Hadoop branch-3.4 for s3a and abfs is in, and ~all internal changes go into Apache trunk and branch-3.4 first. That's not just "community spirit": Microsoft, Amazon, Cloudera and many others sharing a common codebase means that we all benefit from the broader test coverage, especially of those "so rare you will never see them" failure conditions which actually happen a few times a day across the entire user bases of everyone's products (e.g. HADOOP-19221). Having Parquet on 3.3.0+ means that everyone will be using up-to-date code, meaning problems which surface in tests should be replicable in your own IDEs and tests.

Steve

* More testing is always welcome, especially: third-party stores, long and slow haul links, proxies, VPNs, customer-supplied encryption keys, heavy load, and more. It's those configurations which neither developers nor the CI builds test which can always benefit from extra coverage. And tests *through* Parquet are the way to be sure that Parquet's code isn't hitting regressions.

On Thu, 7 Nov 2024 at 19:36, Fokko Driesprong <fo...@apache.org> wrote:

Thanks for jumping in here, Steve.

I agree with you; my only concern is that this is quite a jump. However, looking at the ecosystem, it might not be such a problem. Looking at the cloud providers:

AWS active EMR distributions:

1. EMR 7.3.0 <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-730-release.html> is at Hadoop 3.3.6
2. EMR 6.15 <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-6150-release.html> is at Hadoop 3.3.6 (everything below 6.6.x is EOL <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-standard-support.html#emr-stadard-support-policy>)
3. EMR 5.36 <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-5362-release.html> is at Hadoop 2.10.1 (5.35 and earlier are EOL <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-standard-support.html#emr-stadard-support-policy>, so only bugfixes for 5.36.x)

GCP active Dataproc distributions:

- Dataproc 2.2.x <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.2> is at Hadoop 3.3.6
- Dataproc 2.1.x <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.1> is at Hadoop 3.3.6
- Dataproc 2.0.x <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.0> is at Hadoop 3.2.4 (EOL 2024/07/31 <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-version-clusters>)

Azure active HDInsight distributions:

- HDInsight 5.x <https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-5x-component-versioning> is at Hadoop 3.3.4
- HDInsight 4.0 <https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-40-component-versioning> is at Hadoop 3.1.1 (they call out that certain HDInsight 4.0 cluster types have retired or will be retiring soon)

Or, query engines:

- Spark 3.5.3 <https://github.com/apache/spark/blob/d39f5ab99f67ce959b4379ecc3d6e262c10146cf/pom.xml#L125> is at Hadoop 3.3.4
- Spark 3.4.4 <https://github.com/apache/spark/blob/d3d84e045cc484cf7b70d36410a554238d7aef0e/pom.xml#L122> is at Hadoop 3.3.4

Hive 3.x has also been marked as EOL since October <https://hive.apache.org/general/downloads/>, and Hive 4 is also at Hadoop 3.3.6 <https://github.com/apache/hive/blob/c29bab6ff780e6d1cea74e995a50528364ae383a/pom.xml#L143>.

Looking at where the ecosystem is, jumping to Hadoop 3.3.x seems reasonable to me. Users can still use 1.14.x if they are on an older Hadoop version.

Kind regards,
Fokko

On Thu, 7 Nov 2024 at 16:16, Steve Loughran <ste...@cloudera.com.invalid> wrote:

On Mon, 4 Nov 2024 at 09:02, Fokko Driesprong <fo...@apache.org> wrote:

> Hi everyone,
>
> Breaking the radio silence from my end, I was enjoying paternity leave.
>
> I have wanted to bring this up for a while. In Parquet we're still supporting Hadoop 2.7.3, which was released in August 2016 <https://hadoop.apache.org/release/2.7.3.html>. For things like JDK 21 support, we have to drop these old versions. I was curious about what everyone considers a reasonable lower bound.
>
> My suggested route is to bump it to Hadoop 2.9.3 <https://github.com/apache/parquet-java/pull/2944/> (November 2019) for Parquet 1.15.0, and then drop Hadoop 2 in the major release after that. Any thoughts, questions or concerns?

I'd be ruthless and say Hadoop 3.3.x only.

Hadoop 2.x is nominally "Java 7" only. Really.

Hadoop 3.3.x is Java 8, but you really need to be on Hadoop 3.4.x to get a set of dependencies which work OK with Java 17+.

Staying with older releases hampers Parquet in terms of testing, maintenance, the inability to use improvements written in the past five or more years, and more.

My proposal would be:

- 1.14.x: move to Hadoop 2.9.3
- 1.15.x: Hadoop 3.3.x only
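One example of the improvements a 3.3.x floor unlocks is the openFile() builder API (added in Hadoop 3.3.0): with it as the baseline, the reflection shim sketched earlier in the thread disappears and the builder can be called directly. A minimal sketch, assuming a path is passed on the command line; the fs.option.openfile.read.policy key is only standardised in later 3.3.x/3.4.x releases and is just an optional hint elsewhere.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public final class OpenFileExample {
      public static void main(String[] args) throws IOException {
        Path path = new Path(args[0]);
        FileSystem fs = path.getFileSystem(new Configuration());

        // Hadoop 3.3.0+: open through the builder API, no reflection needed.
        try (FSDataInputStream in = fs.openFile(path)
            // Optional read-policy hint; recognised by later 3.3.x/3.4.x
            // releases, treated as a no-op elsewhere.
            .opt("fs.option.openfile.read.policy", "random")
            .build()
            .join()) {
          byte[] magic = new byte[4];
          in.readFully(0, magic); // positioned read of the 4-byte header
          System.out.println("read " + magic.length + " bytes");
        }
      }
    }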