In an ideal situation, +1 on removing all vendor-specific builds and keeping only Hadoop-version-specific ones - that is what we should depend on anyway. Though I hope Sean is correct in assuming that the vendor builds for Hadoop 2.4 are just that, and not 2.4- or 2.4+, which would cause incompatibilities for us or our users!
Regards,
Mridul

On Mon, Mar 9, 2015 at 2:50 AM, Sean Owen <so...@cloudera.com> wrote:
> Yes, you should always find working bits at Apache no matter what --
> though 'no matter what' really means 'as long as you use a Hadoop distro
> compatible with upstream Hadoop'. Even distros have a strong interest
> in that, since the market, the 'pie', is made large by this kind of
> freedom at the core.
>
> If so, then no vendor-specific builds are needed, only some
> Hadoop-release-specific ones. So a Hadoop 2.6-specific build could be
> good (although I'm not yet clear whether there's something about 2.5 or
> 2.6 that needs a different build).
>
> I take it that we already believe that, say, the "Hadoop 2.4" build
> works with CDH5, so no CDH5-specific build is provided by Spark.
>
> If a distro doesn't work with stock Spark, then it's either something
> Spark should fix (e.g. use of a private YARN API or something), or
> it's something the distro should really fix because it's incompatible.
>
> Could we maybe rename the "CDH4" build then, as it doesn't really work
> with all of CDH4, to be a "Hadoop 2.0.x" build? That's been floated
> before. And can we remove the MapR builds -- or else can someone
> explain why these exist separately from a Hadoop 2.3 build? I hope it
> is not *because* they are somehow non-standard. And shall we first run
> down why Spark doesn't fully work on HDP and see if it's something
> that Spark or HDP needs to tweak, rather than contemplate another
> binary? Or, if so, can it simply be called a "Hadoop 2.7 + YARN
> whatever" build and not made specific to a vendor, even if the project
> has to field another tarball combo for a vendor?
>
> Maybe we are saying almost the same thing.
>
>
> On Mon, Mar 9, 2015 at 1:33 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>> Yeah, my concern is that people should get Apache Spark from *Apache*, not
>> from a vendor. It helps everyone use the latest features no matter where
>> they are. In the Hadoop distro case, Hadoop made all this effort to have
>> standard APIs (e.g. YARN), so it should be easy. But it is a problem if
>> we're not packaging for the newest versions of some distros; I think we
>> just fell behind at Hadoop 2.4.
>>
>> Matei
>>
>>> On Mar 8, 2015, at 8:02 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>> Yeah, it's not much overhead, but here's an example of where it causes
>>> a little issue.
>>>
>>> I like that reasoning. However, the released builds don't track the
>>> later versions of Hadoop that vendors would be distributing -- there's
>>> no Hadoop 2.6 build, for example. CDH4 is here, but not the
>>> far-more-used CDH5. HDP isn't present at all. The CDH4 build doesn't
>>> actually work with many CDH4 versions.
>>>
>>> I agree with the goal of maximizing the reach of Spark, but I don't
>>> know how much these builds advance that goal.
>>>
>>> Anyone can roll their own exactly-right build, and the docs and build
>>> have been set up to make that as simple as can be expected. So these
>>> aren't *required* to let me use the latest Spark on distribution X.
>>>
>>> I had thought these existed to sorta support 'legacy' distributions,
>>> like CDH4, and that build was justified as a
>>> quasi-Hadoop-2.0.x-flavored build. But then I don't understand what
>>> the MapR profiles are for.
>>>
>>> I think it's too much work to correctly maintain, in parallel, any
>>> customizations necessary for any major distro, and it might be better
>>> not to do it at all than to do it incompletely. You could say it's also
>>> an enabler for distros to vary in ways that require special
>>> customization.
>>>
>>> Maybe there's a concern that, if lots of people consume Spark on
>>> Hadoop, and most people consume Hadoop through distros, and distros
>>> alone manage Spark distributions, then you de facto 'have to' go
>>> through a distro instead of getting bits from Spark? Different
>>> conversation, but I think this sort of effect does not end up being a
>>> negative.
>>>
>>> Well anyway, I like the idea of seeing how far Hadoop-provided
>>> releases can help. It might kill several birds with one stone.
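As a point of reference on the roll-your-own option Sean mentions above: the Spark build already parameterizes the Hadoop dependency, so a user-built distribution for a specific Hadoop release is a single command. A rough sketch, assuming the 1.x-era make-distribution.sh script and a Hadoop 2.4 target; the profiles and version number shown are illustrative only and should be checked against the building-Spark docs for the release being built:

    # Build a distribution tarball against a chosen Hadoop version.
    # Profile names and hadoop.version below are examples, not a recommendation.
    ./make-distribution.sh --tgz --name hadoop2.4 \
      -Pyarn -Phive -Phive-thriftserver \
      -Phadoop-2.4 -Dhadoop.version=2.4.0

Swapping the Hadoop profile and version is essentially all that distinguishes most of the published tarball combinations discussed in this thread.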
>>>
>>> On Sun, Mar 8, 2015 at 11:07 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>> Our goal is to let people use the latest Apache release even if vendors
>>>> fall behind or don't want to package everything, so that's why we put
>>>> out releases for vendors' versions. It's fairly low overhead.
>>>>
>>>> Matei
>>>>
>>>>> On Mar 8, 2015, at 5:56 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>>
>>>>> Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball
>>>>> at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the
>>>>> Maven artifacts.
>>>>>
>>>>> Patrick, I see you just commented on SPARK-5134 and will follow up
>>>>> there. Sounds like this may accidentally not be a problem.
>>>>>
>>>>> On binary tarball releases, I wonder if anyone has an opinion on my
>>>>> opinion that these shouldn't be distributed for specific Hadoop
>>>>> *distributions* to begin with. (Won't repeat the argument here yet.)
>>>>> That resolves this n x m explosion too.
>>>>>
>>>>> Vendors already provide their own distribution, yes; that's their job.
>>>>>
>>>>>
>>>>> On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar <ksanka...@gmail.com> wrote:
>>>>>> Yep, otherwise this will become an N^2 problem - Scala versions X
>>>>>> Hadoop distributions X ...
>>>>>>
>>>>>> Maybe one option is to have a minimum basic set (which I know is what
>>>>>> we are discussing) and move the rest to spark-packages.org. There the
>>>>>> vendors can add the latest downloads - for example, when 1.4 is
>>>>>> released, HDP can build a release of an HDP Spark 1.4 bundle.
>>>>>>
>>>>>> Cheers
>>>>>> <k/>
>>>>>>
>>>>>> On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>>>>>>
>>>>>>> We probably want to revisit the way we do binaries in general for
>>>>>>> 1.4+. IMO, something worth forking a separate thread for.
>>>>>>>
>>>>>>> I've been hesitating to add new binaries because people
>>>>>>> (understandably) complain if you ever stop packaging older ones, but
>>>>>>> on the other hand the ASF has complained that we have too many
>>>>>>> binaries already and that we need to pare it down because of the
>>>>>>> large volume of files. Doubling the number of binaries we produce
>>>>>>> for Scala 2.11 seemed like it would be too much.
>>>>>>>
>>>>>>> One solution potentially is to actually package "Hadoop provided"
>>>>>>> binaries and encourage users to use these by simply setting
>>>>>>> HADOOP_HOME, or have instructions for specific distros. I've heard
>>>>>>> that our existing packages don't work well on HDP, for instance,
>>>>>>> since there are some configuration quirks that differ from the
>>>>>>> upstream Hadoop.
>>>>>>>
>>>>>>> If we cut down on the cross-building for Hadoop versions, then it is
>>>>>>> more tenable to cross-build for Scala versions without exploding the
>>>>>>> number of binaries.
>>>>>>>
>>>>>>> - Patrick
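To make the "Hadoop provided" idea above concrete: such a package would leave the Hadoop jars out of the Spark assembly and pick them up from whatever Hadoop client is already installed on the node. A minimal sketch, assuming a build made with the hadoop-provided profile and a hadoop command already on the PATH; the spark-env.sh setting shown is one mechanism Spark supports for this, and the distro-specific instructions Patrick mentions could amount to little more than it:

    # conf/spark-env.sh -- put the local Hadoop installation's jars and
    # configuration on Spark's classpath instead of bundling a copy.
    export SPARK_DIST_CLASSPATH=$(hadoop classpath)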
>>>>>>>
>>>>>>> On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>>> Yeah, interesting question of what is the better default for the
>>>>>>>> single set of artifacts published to Maven. I think there's an
>>>>>>>> argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
>>>>>>>> and cons discussed more at
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/SPARK-5134
>>>>>>>> https://github.com/apache/spark/pull/3917
>>>>>>>>
>>>>>>>> On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> Tested it on Mac OS X.
>>>>>>>>>
>>>>>>>>> One small issue I noticed is that the Scala 2.11 build is using
>>>>>>>>> Hadoop 1 without Hive, which is kind of weird because people will
>>>>>>>>> more likely want Hadoop 2 with Hive. So it would be good to publish
>>>>>>>>> a build for that configuration instead. We can do it if we do a new
>>>>>>>>> RC, or it might be that binary builds may not need to be voted on
>>>>>>>>> (I forgot the details there).
>>>>>>>>>
>>>>>>>>> Matei