That's about what I expected; HDInsight is probably a fork of Hadoop 3.1.x that is kept up to date. Certainly they do almost all of the work on the abfs connector against trunk, with backports to the 3.4 branch, while AWS developers are contributing great stuff to the S3A codebase (while I get left with the mundane stuff, like libraries forgetting to close streams: https://github.com/apache/hadoop/pull/7151).
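To make that concrete for anyone who hasn't chased one of these: the fix is usually just try-with-resources around the stream. A minimal sketch of the pattern, assuming a generic read through the Hadoop FileSystem API (illustrative only, not the actual code from that PR):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HeaderReader {
    // Hypothetical helper: reads the 4-byte magic at the start of a file.
    static byte[] readMagic(Configuration conf, Path path) throws IOException {
      FileSystem fs = path.getFileSystem(conf);
      // try-with-resources closes the stream even when readFully() throws;
      // a forgotten close() against S3A/ABFS can leak a pooled HTTP connection.
      try (FSDataInputStream in = fs.open(path)) {
        byte[] magic = new byte[4];
        in.readFully(0, magic);
        return magic;
      }
    }
  }

Against a local filesystem a leaked stream is mostly invisible, which is why this class of bug tends to surface only under load against object stores.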
Cloudera code is itself a 3.1.x fork but is more up to date w.r.t. Java 11 and CVEs; ~everything on hadoop branch-3.4 for s3a and abfs is in, and ~all internal changes go into apache trunk and branch-3.4 first.

That's not just "community spirit": Microsoft, Amazon, Cloudera, and many others sharing a common codebase means that we all benefit from the broader test coverage, especially of those "so rare you will never see them" failure conditions which actually happen a few times a day across the entire user bases of everyone's products (e.g. HADOOP-19221).

Having parquet on 3.3.0+ means that everyone will be using up-to-date code, so problems which surface in tests should be replicable in your own IDEs and tests.

Steve

* More testing is always welcome, especially: third-party stores, long and slow haul links, proxies, VPNs, customer-supplied encryption keys, heavy load, and more. It's those configurations which neither developers nor the CI builds test which can always benefit from extra coverage. And tests *through* parquet are the way to be sure that parquet's code isn't hitting regressions.

On Thu, 7 Nov 2024 at 19:36, Fokko Driesprong <fo...@apache.org> wrote:

> Thanks for jumping in here Steve,
>
> I agree with you, my only concern is that this is quite a jump. However,
> looking at the ecosystem, it might not be such a problem. Looking at the
> cloud providers:
>
> AWS active EMR distributions:
>
>    1. EMR 7.3.0
>    <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-730-release.html>
>    is at Hadoop 3.3.6
>    2. EMR 6.15
>    <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-6150-release.html>
>    is at Hadoop 3.3.6 (<6.6.x is EOL
>    <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-standard-support.html#emr-stadard-support-policy>)
>    3. EMR 5.36
>    <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-5362-release.html>
>    is at Hadoop 2.10.1 (≤5.35 is EOL
>    <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-standard-support.html#emr-stadard-support-policy>,
>    so only bugfixes for 5.36.x)
>
> GCP active Dataproc distributions:
>
>    - Dataproc 2.2.x
>    <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.2>
>    is at Hadoop 3.3.6
>    - Dataproc 2.1.x
>    <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.1>
>    is at Hadoop 3.3.6
>    - Dataproc 2.0.x
>    <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.0>
>    is at Hadoop 3.2.4 (EOL 2024/07/31
>    <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-version-clusters>)
>
> Azure active HDInsight distributions:
>
>    - HDInsight 5.x
>    <https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-5x-component-versioning>
>    is at Hadoop 3.3.4
>    - HDInsight 4.0
>    <https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-40-component-versioning>
>    is at Hadoop 3.1.1 (they call out certain HDInsight 4.0 cluster types
>    that have retired or will be retiring soon).
>
> Or, query engines:
>
>    - Spark 3.5.3
>    <https://github.com/apache/spark/blob/d39f5ab99f67ce959b4379ecc3d6e262c10146cf/pom.xml#L125>
>    is at Hadoop 3.3.4
>    - Spark 3.4.4
>    <https://github.com/apache/spark/blob/d3d84e045cc484cf7b70d36410a554238d7aef0e/pom.xml#L122>
>    is at Hadoop 3.3.4
>
> Hive 3.x has also been marked as EOL since October
> <https://hive.apache.org/general/downloads/>, and Hive 4 is also at Hadoop
> 3.3.6
> <https://github.com/apache/hive/blob/c29bab6ff780e6d1cea74e995a50528364ae383a/pom.xml#L143>.
>
> Looking at where the ecosystem is, jumping to Hadoop 3.3.x seems reasonable
> to me. They can still use 1.14.x if they are on an older Hadoop version.
>
> Kind regards,
> Fokko
>
> On Thu, 7 Nov 2024 at 16:16, Steve Loughran <ste...@cloudera.com.invalid>
> wrote:
>
> > On Mon, 4 Nov 2024 at 09:02, Fokko Driesprong <fo...@apache.org> wrote:
> >
> > > Hi everyone,
> > >
> > > Breaking the radio silence from my end, I was enjoying paternity leave.
> > >
> > > I have wanted to bring this up for a while. In Parquet we're still
> > > supporting Hadoop 2.7.3, which was released in August 2016
> > > <https://hadoop.apache.org/release/2.7.3.html>. For things like JDK21
> > > support, we have to drop these old versions. I was curious about what
> > > everyone thinks is a reasonable lower bound.
> > >
> > > My suggested route is to bump it to Hadoop 2.9.3
> > > <https://github.com/apache/parquet-java/pull/2944/> (November 2019) for
> > > Parquet 1.15.0, and then drop Hadoop 2 in the major release after that.
> > > Any thoughts, questions or concerns?
> >
> > I'd be ruthless and say hadoop 3.3.x only.
> >
> > hadoop 2.x is nominally "java 7" only. really.
> >
> > hadoop 3.3.x is java8, but you really need to be on hadoop 3.4.x to get a
> > set of dependencies which work OK with java 17+.
> >
> > Staying with older releases hampers parquet in terms of testing,
> > maintenance, the inability to use improvements written in the past five
> > or more years, and more.
> >
> > My proposal would be:
> >
> > - 1.14.x: move to 2.9.3
> > - 1.15.x: hadoop 3.3.x only
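For downstream projects wanting to try the proposed combination ahead of a release, a hypothetical Maven fragment in the style of the pom.xml files linked above (the version numbers are just the ones under discussion in this thread, not a released pairing):

  <properties>
    <!-- Versions under discussion; adjust to whatever actually ships. -->
    <hadoop.version>3.3.6</hadoop.version>
    <parquet.version>1.15.0</parquet.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-hadoop</artifactId>
      <version>${parquet.version}</version>
    </dependency>
    <!-- hadoop-client covers the public Hadoop client classes; the S3A and
         ABFS connectors ship separately in hadoop-aws and hadoop-azure. -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
  </dependencies>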