We could also do this, though it would be great if the Hadoop project provided this version number as at least a baseline. It's up to distributors to decide which version they report but I imagine they won't remove stuff that's in the reported version number.
Matei

On Jul 27, 2014, at 1:57 PM, Sean Owen <so...@cloudera.com> wrote:

> Good idea, although it gets difficult in the context of multiple
> distributions. Say change X is not present in version A, but present
> in version B. If you depend on X, what version can you look for to
> detect it? The distribution will return "A" or "A+X" or somesuch, but
> testing for "A" will give an incorrect answer, and the code can't be
> expected to look for everyone's "A+X" versions. Actually inspecting
> the code is more robust, if a bit messier.
>
> On Sun, Jul 27, 2014 at 9:50 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>> For this particular issue, it would be good to know if Hadoop provides an
>> API to determine the Hadoop version. If not, maybe that can be added to
>> Hadoop in its next release, and we can check for it with reflection. We
>> recently added a SparkContext.version() method in Spark to let you tell the
>> version.
>>
>> Matei
>>
>> On Jul 27, 2014, at 12:19 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>
>>> Hey Ted,
>>>
>>> We always intend Spark to work with newer Hadoop versions and
>>> encourage Spark users to use the newest Hadoop versions for best
>>> performance.
>>>
>>> We do try to be liberal in terms of supporting older versions as well.
>>> This is because many people run older HDFS versions and we want Spark
>>> to read and write data from them. So far we've been willing to do this
>>> despite some maintenance cost.
>>>
>>> The reason is that for many users it's very expensive to do a
>>> wholesale upgrade of HDFS, but trying out new versions of Spark is
>>> much easier. For instance, some of the largest-scale Spark users run
>>> fairly old or forked HDFS versions.
>>>
>>> - Patrick
>>>
>>> On Sun, Jul 27, 2014 at 12:01 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>> Thanks for replying, Patrick.
>>>>
>>>> The intention of my first email was to utilize newer Hadoop releases for
>>>> their bug fixes.
>>>> I am still looking for a clean way of passing the Hadoop release
>>>> version number to individual classes.
>>>> Using newer Hadoop releases would encourage pushing bug fixes / new
>>>> features upstream. Ultimately the Spark code would become cleaner.
>>>>
>>>> Cheers
>>>>
>>>> On Sun, Jul 27, 2014 at 8:52 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>>>>
>>>>> Ted - technically I think you are correct, although I wouldn't
>>>>> recommend disabling this lock. This lock is not expensive (acquired
>>>>> once per task, as are many other locks already). Also, we've seen some
>>>>> cases where Hadoop concurrency bugs ended up requiring multiple fixes
>>>>> - concurrency of client access is not well tested in the Hadoop
>>>>> codebase, since most of the Hadoop tools do not use concurrent access.
>>>>> So in general it's good to be conservative in what we expect of the
>>>>> Hadoop client libraries.
>>>>>
>>>>> If you'd like to discuss this further, please fork a new thread, since
>>>>> this is a vote thread. Thanks!
>>>>>
>>>>> On Fri, Jul 25, 2014 at 10:14 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>> HADOOP-10456 is fixed in Hadoop 2.4.1.
>>>>>>
>>>>>> Does this mean that synchronization
>>>>>> on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK can be bypassed for Hadoop
>>>>>> 2.4.1?
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On Fri, Jul 25, 2014 at 6:00 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>>>>>
>>>>>>> The most important issue in this release is actually an amendment to
>>>>>>> an earlier fix. The original fix caused a deadlock, which was a
>>>>>>> regression from 1.0.0 -> 1.0.1:
>>>>>>>
>>>>>>> Issue:
>>>>>>> https://issues.apache.org/jira/browse/SPARK-1097
>>>>>>>
>>>>>>> 1.0.1 fix:
>>>>>>> https://github.com/apache/spark/pull/1273/files (had a deadlock)
>>>>>>>
>>>>>>> 1.0.2 fix:
>>>>>>> https://github.com/apache/spark/pull/1409/files
>>>>>>>
>>>>>>> I failed to correctly label this on JIRA, but I've updated it!
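To make the CONFIGURATION_INSTANTIATION_LOCK discussion above concrete, here is a minimal, self-contained sketch of the guard pattern being debated: Configuration construction is serialized through one global lock because the constructor mutated shared static state before HADOOP-10456, and the lock is taken once per task, not per record. This is an illustration, not the actual HadoopRDD code; java.util.Properties stands in for org.apache.hadoop.conf.Configuration so the sketch compiles without Hadoop on the classpath.

```java
import java.util.Properties;

public class ConfigurationLock {

    // Global lock, analogous in spirit to HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.
    private static final Object CONFIGURATION_INSTANTIATION_LOCK = new Object();

    // Acquired once per task, so the cost is a single (usually uncontended)
    // lock acquisition per task rather than per record.
    static Properties newJobConf(String jobId) {
        synchronized (CONFIGURATION_INSTANTIATION_LOCK) {
            // Stand-in for `new Configuration()`, whose constructor was not
            // thread-safe before HADOOP-10456.
            Properties conf = new Properties();
            conf.setProperty("mapreduce.job.id", jobId);
            return conf;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Many concurrent "tasks" each create a configuration; the lock makes
        // the constructor calls sequential even under concurrency.
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            final String id = "job_" + i;
            workers[i] = new Thread(() -> newJobConf(id));
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("all tasks created their conf");
    }
}
```

This also illustrates Patrick's cost argument: even if a given Hadoop version no longer needs the guard, a once-per-task lock is cheap enough that removing it conditionally buys little.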
>>>>>>>
>>>>>>> On Fri, Jul 25, 2014 at 5:35 PM, Michael Armbrust
>>>>>>> <mich...@databricks.com> wrote:
>>>>>>>> That query is looking at "Fix Version", not "Target Version". The fact
>>>>>>>> that the first one is still open is only because the bug is not
>>>>>>>> resolved in master. It is fixed in 1.0.2. The second one is partially
>>>>>>>> fixed in 1.0.2, but is not worth blocking the release for.
>>>>>>>>
>>>>>>>> On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas
>>>>>>>> <nicholas.cham...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> TD, there are a couple of unresolved issues slated for 1.0.2
>>>>>>>>> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC>.
>>>>>>>>> Should they be edited somehow?
>>>>>>>>>
>>>>>>>>> On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das
>>>>>>>>> <tathagata.das1...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>>>>> version 1.0.2.
>>>>>>>>>>
>>>>>>>>>> This release fixes a number of bugs in Spark 1.0.1.
>>>>>>>>>> Some of the notable ones are:
>>>>>>>>>> - SPARK-2452: Known issue in Spark 1.0.1 caused by the attempted fix
>>>>>>>>>> for SPARK-1199. The fix was reverted for 1.0.2.
>>>>>>>>>> - SPARK-2576: NoClassDefFoundError when executing a Spark SQL query
>>>>>>>>>> on an HDFS CSV file.
>>>>>>>>>> The full list is at http://s.apache.org/9NJ
>>>>>>>>>>
>>>>>>>>>> The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
>>>>>>>>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
>>>>>>>>>>
>>>>>>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>>>>>>> http://people.apache.org/~tdas/spark-1.0.2-rc1/
>>>>>>>>>>
>>>>>>>>>> Release artifacts are signed with the following key:
>>>>>>>>>> https://people.apache.org/keys/committer/tdas.asc
>>>>>>>>>>
>>>>>>>>>> The staging repository for this release can be found at:
>>>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1024/
>>>>>>>>>>
>>>>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>>>> http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
>>>>>>>>>>
>>>>>>>>>> Please vote on releasing this package as Apache Spark 1.0.2!
>>>>>>>>>>
>>>>>>>>>> The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
>>>>>>>>>> a majority of at least 3 +1 PMC votes are cast.
>>>>>>>>>> [ ] +1 Release this package as Apache Spark 1.0.2
>>>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>>>>
>>>>>>>>>> To learn more about Apache Spark, please see
>>>>>>>>>> http://spark.apache.org/
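The reflection-based feature check that Sean and Matei discuss at the top of the thread could look roughly like the sketch below: instead of parsing a distribution's version string ("A", "A+X", and so on), probe the classpath directly for the class or method the code depends on. This is an illustrative sketch, not Spark's actual code; it demonstrates against JDK classes so it runs anywhere, with the Hadoop names (e.g. org.apache.hadoop.util.VersionInfo.getVersion()) mentioned only in a comment as the real-world target.

```java
import java.lang.reflect.Method;

public class HadoopFeatureCheck {

    // Returns true if the named class can be loaded on this classpath.
    static boolean hasClass(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    // Returns true if the named class exists and declares the given
    // public no-arg method.
    static boolean hasMethod(String className, String methodName) {
        try {
            Class.forName(className).getMethod(methodName);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Against a real Hadoop classpath one would probe Hadoop itself, e.g.:
        //   hasMethod("org.apache.hadoop.util.VersionInfo", "getVersion")
        // Here we probe JDK classes so the sketch is runnable standalone.
        System.out.println(hasClass("java.util.ArrayList"));
        System.out.println(hasClass("org.example.NoSuchClass"));
        System.out.println(hasMethod("java.lang.String", "trim"));
    }
}
```

This is exactly why Sean calls inspecting the code "more robust if a bit messier": the probe answers "is feature X actually present?" regardless of what version string a vendor distribution reports.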