Does the Apache project team have any ability to measure download counts for the various releases? That data could be useful when it comes time to sunset vendor-specific releases, like CDH4, for example.
On Mon, Mar 9, 2015 at 5:34 AM, Mridul Muralidharan <mri...@gmail.com> wrote:
> In an ideal situation, +1 on removing all vendor-specific builds and making them just Hadoop-version-specific - that is what we should depend on anyway.
> Though I hope Sean is correct in assuming that the vendor-specific builds for Hadoop 2.4 are just that, and not 2.4- or 2.4+, which would cause incompatibilities for us or our users!
>
> Regards,
> Mridul
>
> On Mon, Mar 9, 2015 at 2:50 AM, Sean Owen <so...@cloudera.com> wrote:
> > Yes, you should always find working bits at Apache no matter what -- though 'no matter what' really means 'as long as you use a Hadoop distro compatible with upstream Hadoop'. Even distros have a strong interest in that, since the market, the 'pie', is made large by this kind of freedom at the core.
> >
> > If so, then no vendor-specific builds are needed, only some Hadoop-release-specific ones. So a Hadoop 2.6-specific build could be good (although I'm not yet clear whether there's something about 2.5 or 2.6 that needs a different build).
> >
> > I take it that we already believe that, say, the "Hadoop 2.4" build works with CDH5, so no CDH5-specific build is provided by Spark.
> >
> > If a distro doesn't work with stock Spark, then it's either something Spark should fix (e.g. use of a private YARN API or something), or it's something the distro should really fix because it's incompatible.
> >
> > Could we maybe rename the "CDH4" build, then, since it doesn't really work with all of CDH4, to be a "Hadoop 2.0.x" build? That's been floated before. And can we remove the MapR builds -- or else can someone explain why these exist separately from a Hadoop 2.3 build? I hope it is not *because* they are somehow non-standard. And shall we first run down why Spark doesn't fully work on HDP and see if it's something that Spark or HDP needs to tweak, rather than contemplate another binary? Or, if so, can it simply be called a "Hadoop 2.7 + YARN whatever" build and not made specific to a vendor, even if the project has to field another tarball combo for a vendor?
> >
> > Maybe we are saying almost the same thing.
> >
> > On Mon, Mar 9, 2015 at 1:33 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >> Yeah, my concern is that people should get Apache Spark from *Apache*, not from a vendor. It helps everyone use the latest features no matter where they are. In the Hadoop distro case, Hadoop made all this effort to have standard APIs (e.g. YARN), so it should be easy. But it is a problem if we're not packaging for the newest versions of some distros; I think we just fell behind at Hadoop 2.4.
> >>
> >> Matei
> >>
> >>> On Mar 8, 2015, at 8:02 PM, Sean Owen <so...@cloudera.com> wrote:
> >>>
> >>> Yeah, it's not much overhead, but here's an example of where it causes a little issue.
> >>>
> >>> I like that reasoning. However, the released builds don't track the later versions of Hadoop that vendors would be distributing -- there's no Hadoop 2.6 build, for example. CDH4 is here, but not the far-more-used CDH5. HDP isn't present at all. The CDH4 build doesn't actually work with many CDH4 versions.
> >>>
> >>> I agree with the goal of maximizing the reach of Spark, but I don't know how much these builds advance that goal.
> >>>
> >>> Anyone can roll their own exactly-right build, and the docs and build have been set up to make that as simple as can be expected.
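[A roll-your-own build of that era looked roughly like the following -- a sketch based on the make-distribution.sh script and Maven profiles in the Spark 1.x build docs; the exact profile set and the --name value here are illustrative and vary by release:]

    # Build a distribution tarball against a specific Hadoop release,
    # e.g. stock Hadoop 2.4.x with YARN and Hive support:
    ./make-distribution.sh --tgz --name hadoop2.4 \
        -Phadoop-2.4 -Dhadoop.version=2.4.0 -Pyarn -Phive -Phive-thriftserver
    # Point -Dhadoop.version at the exact version a distro ships,
    # provided that distro's Hadoop is compatible with upstream Hadoop.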
> >>> So these aren't *required* to let me use the latest Spark on distribution X.
> >>>
> >>> I had thought these existed to sorta support 'legacy' distributions, like CDH4, and that that build was justified as a quasi-Hadoop-2.0.x-flavored build. But then I don't understand what the MapR profiles are for.
> >>>
> >>> I think it's too much work to correctly maintain, in parallel, any customizations necessary for any major distro, and it might be better to not do it at all than to do it incompletely. You could say it's also an enabler for distros to vary in ways that require special customization.
> >>>
> >>> Maybe there's a concern that, if lots of people consume Spark on Hadoop, and most people consume Hadoop through distros, and distros alone manage Spark distributions, then you de facto 'have to' go through a distro instead of getting bits from Spark? Different conversation, but I think this sort of effect does not end up being a negative.
> >>>
> >>> Well, anyway, I like the idea of seeing how far Hadoop-provided releases can help. It might kill several birds with one stone.
> >>>
> >>> On Sun, Mar 8, 2015 at 11:07 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >>>> Our goal is to let people use the latest Apache release even if vendors fall behind or don't want to package everything, so that's why we put out releases for vendors' versions. It's fairly low overhead.
> >>>>
> >>>> Matei
> >>>>
> >>>>> On Mar 8, 2015, at 5:56 PM, Sean Owen <so...@cloudera.com> wrote:
> >>>>>
> >>>>> Ah, I misunderstood -- Matei was referring to the Scala 2.11 tarball at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the Maven artifacts.
> >>>>>
> >>>>> Patrick, I see you just commented on SPARK-5134 and will follow up there. Sounds like this may accidentally not be a problem.
> >>>>>
> >>>>> On binary tarball releases, I wonder if anyone has an opinion on my view that these shouldn't be distributed for specific Hadoop *distributions* to begin with. (I won't repeat the argument here yet.) That resolves this N x M explosion too.
> >>>>>
> >>>>> Vendors already provide their own distributions; yes, that's their job.
> >>>>>
> >>>>> On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar <ksanka...@gmail.com> wrote:
> >>>>>> Yep, otherwise this will become an N^2 problem - Scala versions x Hadoop distributions x ...
> >>>>>>
> >>>>>> Maybe one option is to have a minimum basic set (which I know is what we are discussing) and move the rest to spark-packages.org. There the vendors can add the latest downloads - for example, when 1.4 is released, HDP can build a release of an HDP Spark 1.4 bundle.
> >>>>>>
> >>>>>> Cheers
> >>>>>> <k/>
> >>>>>>
> >>>>>> On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell <pwend...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> We probably want to revisit the way we do binaries in general for 1.4+. IMO, it's something worth forking a separate thread for.
> >>>>>>>
> >>>>>>> I've been hesitant to add new binaries because people (understandably) complain if you ever stop packaging older ones, but on the other hand the ASF has complained that we have too many binaries already and that we need to pare it down because of the large volume of files. Doubling the number of binaries we produce for Scala 2.11 seemed like it would be too much.
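[The Scala 2.11 cross-build under discussion was driven from the same Maven build -- roughly the following, per the version-switching script and -Dscala-2.11 property described in the Spark 1.3 build docs; treat it as a sketch:]

    # Switch the POMs to Scala 2.11, then build with the scala-2.11 property:
    dev/change-version-to-2.11.sh
    mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
    # Every such combination is one more tarball to publish -- the
    # Scala-versions x Hadoop-versions explosion raised in this thread.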
> >>>>>>>
> >>>>>>> One solution, potentially, is to actually package "Hadoop provided" binaries and encourage users to use these by simply setting HADOOP_HOME, or to have instructions for specific distros. I've heard that our existing packages don't work well on HDP, for instance, since there are some configuration quirks that differ from upstream Hadoop.
> >>>>>>>
> >>>>>>> If we cut down on the cross-building for Hadoop versions, then it is more tenable to cross-build for Scala versions without exploding the number of binaries.
> >>>>>>>
> >>>>>>> - Patrick
> >>>>>>>
> >>>>>>> On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen <so...@cloudera.com> wrote:
> >>>>>>>> Yeah, it's an interesting question what the better default is for the single set of artifacts published to Maven. I think there's an argument for Hadoop 2, and perhaps Hive, for the 2.10 build too. Pros and cons are discussed more at
> >>>>>>>>
> >>>>>>>> https://issues.apache.org/jira/browse/SPARK-5134
> >>>>>>>> https://github.com/apache/spark/pull/3917
> >>>>>>>>
> >>>>>>>> On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >>>>>>>>> +1
> >>>>>>>>>
> >>>>>>>>> Tested it on Mac OS X.
> >>>>>>>>>
> >>>>>>>>> One small issue I noticed is that the Scala 2.11 build is using Hadoop 1 without Hive, which is kind of weird because people will more likely want Hadoop 2 with Hive. So it would be good to publish a build for that configuration instead. We can do it if we do a new RC, or it might be that binary builds don't need to be voted on (I forget the details there).
> >>>>>>>>>
> >>>>>>>>> Matei
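[Patrick's "Hadoop provided" idea maps to building with Hadoop as a provided dependency and borrowing the cluster's own Hadoop jars at runtime -- a sketch assuming the hadoop-provided Maven profile and the SPARK_DIST_CLASSPATH hook that Spark's later 'Hadoop free' build docs describe:]

    # Build once, without bundling Hadoop classes:
    ./make-distribution.sh --tgz --name without-hadoop -Phadoop-provided -Pyarn
    # Then, on each cluster, point Spark at the distro's Hadoop jars,
    # e.g. in conf/spark-env.sh:
    export SPARK_DIST_CLASSPATH=$(hadoop classpath)

One such tarball could then serve any compatible Hadoop distro, which is what makes this approach attractive for cutting down the binary matrix.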