Twitter just led the release of Hadoop 2.6.5 precisely because they wanted to keep a Java 6 cluster up: the bigger your cluster, the less of a rush to upgrade.
HDP? I believe we install & prefer (OpenJDK) Java 8, but the Hadoop branch-2 line is intended to build/run on Java 7 too. There's always a conflict between us developers ("shiny new features") and ops ("keep the cluster alive"). That's actually where Scala has an edge: no need to upgrade the cluster-wide JVM just for an update, or play games configuring your deployed application to use a different JVM from the Hadoop services (which you can do, after all: it's just path setup; there's a sketch at the end of this thread). Thinking about it, knowing what can be done there (including documenting it in the Spark docs) could be a good migration strategy.

Me? I look forward to when we can use Java 9 to isolate transitive dependencies; the bane of everyone's life. Someone needs to start preparing everything for that to work, though.

On 28 Oct 2016, at 11:47, Chris Fregly <ch...@fregly.com> wrote:

i seem to remember a large spark user (tencent, i believe) chiming in late during these discussions 6-12 months ago and squashing any sort of deprecation given the massive effort that would be required to upgrade their environment.

i just want to make sure these convos take into consideration large spark users - and reflect the real world versus ideal world.

otherwise, this is all for naught like last time.

On Oct 28, 2016, at 10:43 AM, Sean Owen <so...@cloudera.com> wrote:

If the subtext is vendors, then I'd have a look at what recent distros look like. I'll write about CDH as a representative example, but I think other distros are naturally similar.

CDH has been on Java 8, Hadoop 2.6, Python 2.7 for almost two years (CDH 5.3 / Dec 2014). Granted, this depends on installing on an OS with that Java / Python version. But Java 8 / Python 2.7 is available for all of the supported OSes. The population that isn't on CDH 4 (because that support was dropped a long time ago in Spark), and who is on a version released 2-2.5 years ago and won't update, is a couple percent of the installed base. They do not in general want anything to change at all.

I assure everyone that vendors too are aligned in wanting to cater to the crowd that wants the most recent version of everything. For example, CDH offers both Spark 2.0.1 and 1.6 at the same time.

I wouldn't dismiss support for these supporting components as a relevant proxy for whether they are worth supporting in Spark. Java 7 is long since EOL (no, I don't count paying Oracle for support). No vendor is supporting Hadoop < 2.6. Scala 2.10 was EOL at the end of 2014. Is there a criterion here that reaches a different conclusion about these things just for Spark?

This was roughly the same conversation that happened 6 months ago. I imagine we're going to find that in about 6 months it'll make more sense all around to remove these. If we can just give a heads-up with deprecation and then kick the can down the road a bit more, that sounds like enough for now.

On Fri, Oct 28, 2016 at 8:58 AM Matei Zaharia <matei.zaha...@gmail.com> wrote:

Deprecating them is fine (and I know they're already deprecated); the question is just whether to remove them. For example, what exactly is the downside of keeping Python 2.6 or Java 7 support right now? If it's high, then we can remove them, but I just haven't seen a ton of details. It also sounded like fairly recent versions of CDH, HDP, RHEL, etc. still have old versions of these.
Just talking with users, I've seen many people who say "we have a Hadoop cluster from $VENDOR, but we just download Spark from Apache and run newer versions of that". That's great for Spark IMO, and we need to stay compatible even with somewhat older Hadoop installs because they are time-consuming to update. Having the whole community on a small set of versions leads to a better experience for everyone and also to more of a "network effect": more people can battle-test new versions, answer questions about them online, write libraries that easily reach the majority of Spark users, etc.
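For reference, a rough sketch of the pattern Matei describes (a stock Apache Spark release on top of an existing vendor Hadoop cluster), assuming YARN and Spark's "Hadoop free" build; the paths, version numbers, and application class below are illustrative placeholders, not recommendations:

    # Unpack a stock Apache Spark release next to (not inside) the vendor install.
    tar xzf spark-2.0.1-bin-without-hadoop.tgz -C /opt

    # Point Spark at the cluster's existing Hadoop config and client jars
    # (per the "Hadoop free" build docs, these can also live in conf/spark-env.sh).
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export SPARK_DIST_CLASSPATH=$(hadoop classpath)

    # Submit with the newer Spark; the cluster-side Hadoop services stay untouched.
    /opt/spark-2.0.1-bin-without-hadoop/bin/spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.MyApp \
      myapp.jar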
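And a minimal sketch of the per-application JVM setup Steve mentions above (the "it's just path setup" approach), assuming YARN and a newer JDK already installed at the same path on every node; the JDK path and application class are assumptions for illustration only:

    # A JDK installed on every node, separate from the one the Hadoop daemons use.
    NEW_JDK=/usr/lib/jvm/java-8-openjdk

    # Driver/client side: make spark-submit itself run on the newer JDK.
    export JAVA_HOME=$NEW_JDK

    # AM and executor side: ship the JAVA_HOME override with the application,
    # so only this app (not the cluster-wide Hadoop services) runs on the new JVM.
    spark-submit \
      --master yarn \
      --conf spark.yarn.appMasterEnv.JAVA_HOME=$NEW_JDK \
      --conf spark.executorEnv.JAVA_HOME=$NEW_JDK \
      --class com.example.MyApp \
      myapp.jar

The point of the design is that the upgrade decision becomes per-application rather than cluster-wide, which is exactly the migration strategy Steve suggests documenting.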