That's about what I expected; HDInsight is probably a fork of Hadoop 3.1.x that is kept up to date. Certainly they do almost all of the work on the abfs connector against trunk, with backports to the 3.4 branch, while AWS developers are contributing great stuff to the S3A codebase (while I get left with the mundane stuff, like libraries forgetting to close streams: https://github.com/apache/hadoop/pull/7151).
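To make that concrete for anyone who hasn't chased one of these: the fix is usually just try-with-resources around the stream. A minimal sketch of the pattern, assuming a generic read through the Hadoop FileSystem API (illustrative only, not the actual code from that PR):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HeaderReader {
    // Hypothetical helper: reads the 4-byte magic at the start of a file.
    static byte[] readMagic(Configuration conf, Path path) throws IOException {
      FileSystem fs = path.getFileSystem(conf);
      // try-with-resources closes the stream even when readFully() throws;
      // a forgotten close() against S3A/ABFS can leak a pooled HTTP connection.
      try (FSDataInputStream in = fs.open(path)) {
        byte[] magic = new byte[4];
        in.readFully(0, magic);
        return magic;
      }
    }
  }

Against a local filesystem a leaked stream is mostly invisible, which is why this class of bug tends to surface only under load against object stores.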
Cloudera code is itself a 3.1.x fork but is more up to date w.r.t. Java 11 and CVEs; ~everything on hadoop branch-3.4 for s3a and abfs is in, and ~all internal changes go into apache trunk and branch-3.4 first.

That's not just "community spirit": Microsoft, Amazon, Cloudera, and many others sharing a common codebase means that we all benefit from the broader test coverage, especially of those "so rare you will never see them" failure conditions which actually happen a few times a day across the entire user bases of everyone's products (e.g. HADOOP-19221).

Having parquet on 3.3.0+ means that everyone will be using up-to-date code, so problems which surface in tests should be replicable in your own IDEs and tests.

Steve

* More testing is always welcome, especially: third-party stores, long and slow haul links, proxies, VPNs, customer-supplied encryption keys, heavy load, and more. It's those configurations which neither developers nor the CI builds test which can always benefit from extra coverage. And tests *through* parquet are the way to be sure that parquet's code isn't hitting regressions.

On Thu, 7 Nov 2024 at 19:36, Fokko Driesprong <fo...@apache.org> wrote:

> Thanks for jumping in here Steve,
>
> I agree with you, my only concern is that this is quite a jump. However,
> looking at the ecosystem, it might not be such a problem. Looking at the
> cloud providers:
>
> AWS active EMR distributions:
>
>    1. EMR 7.3.0
>    <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-730-release.html>
>    is at Hadoop 3.3.6
>    2. EMR 6.15
>    <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-6150-release.html>
>    is at Hadoop 3.3.6 (<6.6.x is EOL
>    <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-standard-support.html#emr-stadard-support-policy>)
>    3. EMR 5.36
>    <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-5362-release.html>
>    is at Hadoop 2.10.1 (≤5.35 is EOL
>    <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-standard-support.html#emr-stadard-support-policy>,
>    so only bugfixes for 5.36.x)
>
> GCP active Dataproc distributions:
>
>    - Dataproc 2.2.x
>    <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.2>
>    is at Hadoop 3.3.6
>    - Dataproc 2.1.x
>    <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.1>
>    is at Hadoop 3.3.6
>    - Dataproc 2.0.x
>    <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.0>
>    is at Hadoop 3.2.4 (EOL 2024/07/31
>    <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-version-clusters>)
>
> Azure active HDInsight distributions:
>
>    - HDInsight 5.x
>    <https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-5x-component-versioning>
>    is at Hadoop 3.3.4
>    - HDInsight 4.0
>    <https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-40-component-versioning>
>    is at Hadoop 3.1.1 (they call out certain HDInsight 4.0 cluster types
>    that have retired or will be retiring soon).
>
> Or, query engines:
>
>    - Spark 3.5.3
>    <https://github.com/apache/spark/blob/d39f5ab99f67ce959b4379ecc3d6e262c10146cf/pom.xml#L125>
>    is at Hadoop 3.3.4
>    - Spark 3.4.4
>    <https://github.com/apache/spark/blob/d3d84e045cc484cf7b70d36410a554238d7aef0e/pom.xml#L122>
>    is at Hadoop 3.3.4
>
> Hive 3.x has also been marked as EOL since October
> <https://hive.apache.org/general/downloads/>, and Hive 4 is also at Hadoop
> 3.3.6
> <https://github.com/apache/hive/blob/c29bab6ff780e6d1cea74e995a50528364ae383a/pom.xml#L143>.
>
> Looking at where the ecosystem is, jumping to Hadoop 3.3.x seems reasonable
> to me. They can still use 1.14.x if they are on an older Hadoop version.
>
> Kind regards,
> Fokko
>
> On Thu, 7 Nov 2024 at 16:16, Steve Loughran <ste...@cloudera.com.invalid>
> wrote:
>
> > On Mon, 4 Nov 2024 at 09:02, Fokko Driesprong <fo...@apache.org> wrote:
> >
> > > Hi everyone,
> > >
> > > Breaking the radio silence from my end, I was enjoying paternity leave.
> > >
> > > I have wanted to bring this up for a while. In Parquet we're still
> > > supporting Hadoop 2.7.3, which was released in August 2016
> > > <https://hadoop.apache.org/release/2.7.3.html>. For things like JDK21
> > > support, we have to drop these old versions. I was curious about what
> > > everyone thinks is a reasonable lower bound.
> > >
> > > My suggested route is to bump it to Hadoop 2.9.3
> > > <https://github.com/apache/parquet-java/pull/2944/> (November 2019) for
> > > Parquet 1.15.0, and then drop Hadoop 2 in the major release after that.
> > > Any thoughts, questions or concerns?
> >
> > I'd be ruthless and say hadoop 3.3.x only.
> >
> > hadoop 2.x is nominally "java 7" only. really.
> >
> > hadoop 3.3.x is java8, but you really need to be on hadoop 3.4.x to get a
> > set of dependencies which work OK with java 17+.
> >
> > Staying with older releases hampers parquet in terms of testing,
> > maintenance, the inability to use improvements written in the past five
> > or more years, and more.
> >
> > My proposal would be:
> >
> > - 1.14.x: move to 2.9.3
> > - 1.15.x: hadoop 3.3.x only
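For downstream projects wanting to try the proposed combination ahead of a release, a hypothetical Maven fragment in the style of the pom.xml files linked above (the version numbers are just the ones under discussion in this thread, not a released pairing):

  <properties>
    <!-- Versions under discussion; adjust to whatever actually ships. -->
    <hadoop.version>3.3.6</hadoop.version>
    <parquet.version>1.15.0</parquet.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-hadoop</artifactId>
      <version>${parquet.version}</version>
    </dependency>
    <!-- hadoop-client covers the public Hadoop client classes; the S3A and
         ABFS connectors ship separately in hadoop-aws and hadoop-azure. -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
  </dependencies>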