In an ideal situation, +1 on removing all vendor-specific builds and keeping only Hadoop-version-specific ones - that is what we should depend on anyway. Though I hope Sean is correct in assuming that the vendor builds for Hadoop 2.4 are just that, and not 2.4- or 2.4+, which would cause incompatibilities for us or our users!
Regards,
Mridul

On Mon, Mar 9, 2015 at 2:50 AM, Sean Owen <so...@cloudera.com> wrote:
> Yes, you should always find working bits at Apache no matter what --
> though 'no matter what' really means 'as long as you use a Hadoop distro
> compatible with upstream Hadoop'. Even distros have a strong interest
> in that, since the market, the 'pie', is made large by this kind of
> freedom at the core.
>
> If so, then no vendor-specific builds are needed, only some
> Hadoop-release-specific ones. So a Hadoop 2.6-specific build could be
> good (although I'm not yet clear whether there's something about 2.5 or
> 2.6 that needs a different build).
>
> I take it that we already believe that, say, the "Hadoop 2.4" build
> works with CDH5, so no CDH5-specific build is provided by Spark.
>
> If a distro doesn't work with stock Spark, then it's either something
> Spark should fix (e.g. use of a private YARN API or something), or
> it's something the distro should really fix because it's incompatible.
>
> Could we maybe rename the "CDH4" build then, as it doesn't really work
> with all of CDH4, to be a "Hadoop 2.0.x" build? That's been floated
> before. And can we remove the MapR builds -- or else can someone
> explain why these exist separately from a Hadoop 2.3 build? I hope it
> is not *because* they are somehow non-standard. And shall we first run
> down why Spark doesn't fully work on HDP and see if it's something
> that Spark or HDP needs to tweak, rather than contemplate another
> binary? Or, if so, can it simply be called a "Hadoop 2.7 + YARN
> whatever" build and not made specific to a vendor, even if the project
> has to field another tarball combo for a vendor?
>
> Maybe we are saying almost the same thing.
>
>
> On Mon, Mar 9, 2015 at 1:33 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>> Yeah, my concern is that people should get Apache Spark from *Apache*, not
>> from a vendor. It helps everyone use the latest features no matter where
>> they are. In the Hadoop distro case, Hadoop made all this effort to have
>> standard APIs (e.g. YARN), so it should be easy. But it is a problem if
>> we're not packaging for the newest versions of some distros; I think we
>> just fell behind at Hadoop 2.4.
>>
>> Matei
>>
>>> On Mar 8, 2015, at 8:02 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>> Yeah, it's not much overhead, but here's an example of where it causes
>>> a little issue.
>>>
>>> I like that reasoning. However, the released builds don't track the
>>> later versions of Hadoop that vendors would be distributing -- there's
>>> no Hadoop 2.6 build, for example. CDH4 is here, but not the
>>> far-more-used CDH5. HDP isn't present at all. The CDH4 build doesn't
>>> actually work with many CDH4 versions.
>>>
>>> I agree with the goal of maximizing the reach of Spark, but I don't
>>> know how much these builds advance that goal.
>>>
>>> Anyone can roll their own exactly-right build, and the docs and build
>>> have been set up to make that as simple as can be expected. So these
>>> aren't *required* to let me use the latest Spark on distribution X.
>>>
>>> I had thought these existed to sorta support 'legacy' distributions,
>>> like CDH4, and that build was justified as a
>>> quasi-Hadoop-2.0.x-flavored build. But then I don't understand what
>>> the MapR profiles are for.
>>>
>>> I think it's too much work to correctly maintain, in parallel, any
>>> customizations necessary for any major distro, and it might be better
>>> not to do it at all than to do it incompletely. You could say it's also
>>> an enabler for distros to vary in ways that require special
>>> customization.
>>>
>>> Maybe there's a concern that, if lots of people consume Spark on
>>> Hadoop, and most people consume Hadoop through distros, and distros
>>> alone manage Spark distributions, then you de facto 'have to' go
>>> through a distro instead of getting bits from Spark? Different
>>> conversation, but I think this sort of effect does not end up being a
>>> negative.
>>>
>>> Well anyway, I like the idea of seeing how far Hadoop-provided
>>> releases can help. It might kill several birds with one stone.
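As a point of reference on the roll-your-own option Sean mentions above: the Spark build already parameterizes the Hadoop dependency, so a user-built distribution for a specific Hadoop release is a single command. A rough sketch, assuming the 1.x-era make-distribution.sh script and a Hadoop 2.4 target; the profiles and version number shown are illustrative only and should be checked against the building-Spark docs for the release being built:

    # Build a distribution tarball against a chosen Hadoop version.
    # Profile names and hadoop.version below are examples, not a recommendation.
    ./make-distribution.sh --tgz --name hadoop2.4 \
      -Pyarn -Phive -Phive-thriftserver \
      -Phadoop-2.4 -Dhadoop.version=2.4.0

Swapping the Hadoop profile and version is essentially all that distinguishes most of the published tarball combinations discussed in this thread.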
>>>
>>> On Sun, Mar 8, 2015 at 11:07 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>> Our goal is to let people use the latest Apache release even if vendors
>>>> fall behind or don't want to package everything, so that's why we put
>>>> out releases for vendors' versions. It's fairly low overhead.
>>>>
>>>> Matei
>>>>
>>>>> On Mar 8, 2015, at 5:56 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>>
>>>>> Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball
>>>>> at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the
>>>>> Maven artifacts.
>>>>>
>>>>> Patrick, I see you just commented on SPARK-5134 and will follow up
>>>>> there. Sounds like this may accidentally not be a problem.
>>>>>
>>>>> On binary tarball releases, I wonder if anyone has an opinion on my
>>>>> opinion that these shouldn't be distributed for specific Hadoop
>>>>> *distributions* to begin with. (Won't repeat the argument here yet.)
>>>>> That resolves this n x m explosion too.
>>>>>
>>>>> Vendors already provide their own distribution, yes; that's their job.
>>>>>
>>>>>
>>>>> On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar <ksanka...@gmail.com> wrote:
>>>>>> Yep, otherwise this will become an N^2 problem - Scala versions X
>>>>>> Hadoop distributions X ...
>>>>>>
>>>>>> Maybe one option is to have a minimum basic set (which I know is what
>>>>>> we are discussing) and move the rest to spark-packages.org. There the
>>>>>> vendors can add the latest downloads - for example, when 1.4 is
>>>>>> released, HDP can build a release of an HDP Spark 1.4 bundle.
>>>>>>
>>>>>> Cheers
>>>>>> <k/>
>>>>>>
>>>>>> On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>>>>>>
>>>>>>> We probably want to revisit the way we do binaries in general for
>>>>>>> 1.4+. IMO, something worth forking a separate thread for.
>>>>>>>
>>>>>>> I've been hesitating to add new binaries because people
>>>>>>> (understandably) complain if you ever stop packaging older ones, but
>>>>>>> on the other hand the ASF has complained that we have too many
>>>>>>> binaries already and that we need to pare it down because of the
>>>>>>> large volume of files. Doubling the number of binaries we produce
>>>>>>> for Scala 2.11 seemed like it would be too much.
>>>>>>>
>>>>>>> One solution potentially is to actually package "Hadoop provided"
>>>>>>> binaries and encourage users to use these by simply setting
>>>>>>> HADOOP_HOME, or have instructions for specific distros. I've heard
>>>>>>> that our existing packages don't work well on HDP, for instance,
>>>>>>> since there are some configuration quirks that differ from the
>>>>>>> upstream Hadoop.
>>>>>>>
>>>>>>> If we cut down on the cross-building for Hadoop versions, then it is
>>>>>>> more tenable to cross-build for Scala versions without exploding the
>>>>>>> number of binaries.
>>>>>>>
>>>>>>> - Patrick
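To make the "Hadoop provided" idea above concrete: such a package would leave the Hadoop jars out of the Spark assembly and pick them up from whatever Hadoop client is already installed on the node. A minimal sketch, assuming a build made with the hadoop-provided profile and a hadoop command already on the PATH; the spark-env.sh setting shown is one mechanism Spark supports for this, and the distro-specific instructions Patrick mentions could amount to little more than it:

    # conf/spark-env.sh -- put the local Hadoop installation's jars and
    # configuration on Spark's classpath instead of bundling a copy.
    export SPARK_DIST_CLASSPATH=$(hadoop classpath)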
>>>>>>>
>>>>>>> On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>>> Yeah, interesting question of what is the better default for the
>>>>>>>> single set of artifacts published to Maven. I think there's an
>>>>>>>> argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
>>>>>>>> and cons discussed more at
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/SPARK-5134
>>>>>>>> https://github.com/apache/spark/pull/3917
>>>>>>>>
>>>>>>>> On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> Tested it on Mac OS X.
>>>>>>>>>
>>>>>>>>> One small issue I noticed is that the Scala 2.11 build is using
>>>>>>>>> Hadoop 1 without Hive, which is kind of weird because people will
>>>>>>>>> more likely want Hadoop 2 with Hive. So it would be good to publish
>>>>>>>>> a build for that configuration instead. We can do it if we do a new
>>>>>>>>> RC, or it might be that binary builds may not need to be voted on
>>>>>>>>> (I forgot the details there).
>>>>>>>>>
>>>>>>>>> Matei