From the replies here, and looking at the major cloud providers, I don't see any concerns regarding moving the lower bound to Hadoop 3.3.x. As suggested on the issue <https://github.com/apache/parquet-java/issues/2943>, it would be good to first get rid of the Hadoop 2 profile and all the error-prone reflection.
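To make concrete what would go away: supporting Hadoop 2 and 3 from a single artifact typically forces runtime probing of the newer APIs. The sketch below is illustrative only (the class name is made up, and it is not the actual parquet-java shim code); it probes for FileSystem.openFile(), which only exists on Hadoop 3.3+, and falls back to plain open() otherwise.

    import java.io.IOException;
    import java.lang.reflect.Method;
    import java.util.concurrent.CompletableFuture;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative sketch, not the real parquet-java code: the kind of
    // reflection shim that dual Hadoop 2/3 support tends to require.
    final class OpenFileShim {

      // Probe once: FileSystem.openFile(Path) is only present on Hadoop 3.3+.
      private static final Method OPEN_FILE = lookupOpenFile();

      private static Method lookupOpenFile() {
        try {
          return FileSystem.class.getMethod("openFile", Path.class);
        } catch (NoSuchMethodException e) {
          return null; // Hadoop 2.x: the method does not exist
        }
      }

      @SuppressWarnings("unchecked")
      static FSDataInputStream open(FileSystem fs, Path path) throws IOException {
        if (OPEN_FILE == null) {
          return fs.open(path); // Hadoop 2 fallback
        }
        try {
          Object builder = OPEN_FILE.invoke(fs, path);
          Class<?> builderClass =
              Class.forName("org.apache.hadoop.fs.FutureDataInputStreamBuilder");
          CompletableFuture<FSDataInputStream> future =
              (CompletableFuture<FSDataInputStream>)
                  builderClass.getMethod("build").invoke(builder);
          return future.join();
        } catch (ReflectiveOperationException e) {
          throw new IOException("openFile() reflection shim failed", e);
        }
      }
    }

Every probe of this kind is a place where shading, class loaders and API drift can go subtly wrong, which is why removing them first makes the 3.3.x bump cleaner.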
Thanks everyone!

Kind regards,
Fokko

On Mon, 11 Nov 2024 at 17:23, Steve Loughran <ste...@cloudera.com.invalid> wrote:

That's about what I expected.

HDInsight is probably a fork of Hadoop 3.1.x that is kept up to date; certainly they do almost all of the work on the abfs connector against trunk, with backports to the 3.4 branch, while AWS developers are contributing great stuff in the S3A codebase (while I get left with the mundane stuff, like libraries forgetting to close streams: https://github.com/apache/hadoop/pull/7151).

Cloudera's code is itself a 3.1.x fork but is more up to date w.r.t. Java 11 and CVEs; ~everything on Hadoop branch-3.4 for s3a and abfs is in, and ~all internal changes go into Apache trunk and branch-3.4 first. That's not just "community spirit": Microsoft, Amazon, Cloudera and many others sharing a common codebase means that we all benefit from the broader test coverage, especially of those "so rare you will never see them" failure conditions which actually happen a few times a day across the entire user bases of everyone's products (e.g. HADOOP-19221). Having Parquet on 3.3.0+ means that everyone will be using up-to-date code, meaning problems which surface in tests should be replicable in your own IDEs and tests.

Steve

* More testing is always welcome, especially: third-party stores, long and slow haul links, proxies, VPNs, customer-supplied encryption keys, heavy load, and more. It's those configurations which neither developers nor the CI builds test which can always benefit from extra coverage. And tests *through* Parquet are the way to be sure that Parquet's code isn't hitting regressions.

On Thu, 7 Nov 2024 at 19:36, Fokko Driesprong <fo...@apache.org> wrote:

Thanks for jumping in here, Steve.

I agree with you; my only concern is that this is quite a jump. However, looking at the ecosystem, it might not be such a problem. Looking at the cloud providers:

AWS active EMR distributions:

1. EMR 7.3.0 <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-730-release.html> is at Hadoop 3.3.6
2. EMR 6.15 <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-6150-release.html> is at Hadoop 3.3.6 (everything below 6.6.x is EOL <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-standard-support.html#emr-stadard-support-policy>)
3. EMR 5.36 <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-5362-release.html> is at Hadoop 2.10.1 (5.35 and earlier are EOL <https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-standard-support.html#emr-stadard-support-policy>, so only bugfixes for 5.36.x)

GCP active Dataproc distributions:

- Dataproc 2.2.x <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.2> is at Hadoop 3.3.6
- Dataproc 2.1.x <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.1> is at Hadoop 3.3.6
- Dataproc 2.0.x <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.0> is at Hadoop 3.2.4 (EOL 2024/07/31 <https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-version-clusters>)

Azure active HDInsight distributions:

- HDInsight 5.x <https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-5x-component-versioning> is at Hadoop 3.3.4
- HDInsight 4.0 <https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-40-component-versioning> is at Hadoop 3.1.1 (they call out that certain HDInsight 4.0 cluster types have retired or will be retiring soon)

Or, query engines:

- Spark 3.5.3 <https://github.com/apache/spark/blob/d39f5ab99f67ce959b4379ecc3d6e262c10146cf/pom.xml#L125> is at Hadoop 3.3.4
- Spark 3.4.4 <https://github.com/apache/spark/blob/d3d84e045cc484cf7b70d36410a554238d7aef0e/pom.xml#L122> is at Hadoop 3.3.4

Hive 3.x has also been marked as EOL since October <https://hive.apache.org/general/downloads/>, and Hive 4 is also at Hadoop 3.3.6 <https://github.com/apache/hive/blob/c29bab6ff780e6d1cea74e995a50528364ae383a/pom.xml#L143>.

Looking at where the ecosystem is, jumping to Hadoop 3.3.x seems reasonable to me. Users can still use 1.14.x if they are on an older Hadoop version.

Kind regards,
Fokko

On Thu, 7 Nov 2024 at 16:16, Steve Loughran <ste...@cloudera.com.invalid> wrote:

On Mon, 4 Nov 2024 at 09:02, Fokko Driesprong <fo...@apache.org> wrote:

> Hi everyone,
>
> Breaking the radio silence from my end, I was enjoying paternity leave.
>
> I have wanted to bring this up for a while. In Parquet we're still supporting Hadoop 2.7.3, which was released in August 2016 <https://hadoop.apache.org/release/2.7.3.html>. For things like JDK 21 support, we have to drop these old versions. I was curious about what everyone considers a reasonable lower bound.
>
> My suggested route is to bump it to Hadoop 2.9.3 <https://github.com/apache/parquet-java/pull/2944/> (November 2019) for Parquet 1.15.0, and then drop Hadoop 2 in the major release after that. Any thoughts, questions or concerns?

I'd be ruthless and say Hadoop 3.3.x only.

Hadoop 2.x is nominally "Java 7" only. Really.

Hadoop 3.3.x is Java 8, but you really need to be on Hadoop 3.4.x to get a set of dependencies which work OK with Java 17+.

Staying with older releases hampers Parquet in terms of testing, maintenance, the inability to use improvements written in the past five or more years, and more.

My proposal would be:

- 1.14.x: move to Hadoop 2.9.3
- 1.15.x: Hadoop 3.3.x only
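One example of the improvements a 3.3.x floor unlocks is the openFile() builder API (added in Hadoop 3.3.0): with it as the baseline, the reflection shim sketched earlier in the thread disappears and the builder can be called directly. A minimal sketch, assuming a path is passed on the command line; the fs.option.openfile.read.policy key is only standardised in later 3.3.x/3.4.x releases and is just an optional hint elsewhere.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public final class OpenFileExample {
      public static void main(String[] args) throws IOException {
        Path path = new Path(args[0]);
        FileSystem fs = path.getFileSystem(new Configuration());

        // Hadoop 3.3.0+: open through the builder API, no reflection needed.
        try (FSDataInputStream in = fs.openFile(path)
            // Optional read-policy hint; recognised by later 3.3.x/3.4.x
            // releases, treated as a no-op elsewhere.
            .opt("fs.option.openfile.read.policy", "random")
            .build()
            .join()) {
          byte[] magic = new byte[4];
          in.readFully(0, magic); // positioned read of the 4-byte header
          System.out.println("read " + magic.length + " bytes");
        }
      }
    }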