Yeah, I agree. -1 (binding)
This vote fails, and I'll cut a new RC after #17749
<https://github.com/apache/spark/pull/17749> is merged.

On Mon, Apr 24, 2017 at 12:18 PM, Eric Liang <e...@databricks.com> wrote:

> -1 (non-binding)
>
> I also agree with using NEVER_INFER for 2.1.1. The migration cost is
> unexpected for a point release.
>
> On Mon, Apr 24, 2017 at 11:08 AM Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> Whoops, sorry, finger slipped on that last message.
>> It sounds like whatever we do is going to break some existing users
>> (either breaking their tables via case sensitivity or surprising them
>> with the unexpected scan).
>>
>> Personally I agree with Michael Allman on this; I believe we should
>> use NEVER_INFER for 2.1.1.
>>
>> On Mon, Apr 24, 2017 at 11:01 AM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>
>>> It
>>>
>>> On Mon, Apr 24, 2017 at 10:33 AM, Michael Allman <mich...@videoamp.com> wrote:
>>>
>>>> The trouble we ran into is that this upgrade was blocking access to
>>>> our tables, and we didn't know why. This sounds like a kind of migration
>>>> operation, but it was not apparent that this was the case. It took an
>>>> expert examining a stack trace and source code to figure this out. Would
>>>> a more naive end user be able to debug this issue? Maybe we're an
>>>> unusual case, but our particular experience was pretty bad. I have my
>>>> doubts that schema inference on our largest tables would ever complete
>>>> without throwing some kind of timeout (which we were in fact receiving)
>>>> or the end user giving up and killing the job. We ended up doing a
>>>> rollback while we investigated the source of the issue. In our case,
>>>> NEVER_INFER is clearly the best configuration. We're going to add that
>>>> to our default configuration files.
>>>>
>>>> My expectation is that a minor point release is a pretty safe bug-fix
>>>> release. We were a bit hasty in not doing better due diligence
>>>> pre-upgrade.
>>>>
>>>> One suggestion the Spark team might consider is releasing 2.1.1 with
>>>> NEVER_INFER and 2.2.0 with INFER_AND_SAVE. Clearly some kind of
>>>> up-front migration notes would help in identifying this new behavior
>>>> in 2.2.
>>>>
>>>> Thanks,
>>>>
>>>> Michael
>>>>
>>>> On Apr 24, 2017, at 2:09 AM, Wenchen Fan <wenc...@databricks.com> wrote:
>>>>
>>>> see https://issues.apache.org/jira/browse/SPARK-19611
>>>>
>>>> On Mon, Apr 24, 2017 at 2:22 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>
>>>>> What's the regression this fixed in 2.1 from 2.0?
>>>>>
>>>>> On Fri, Apr 21, 2017 at 7:45 PM, Wenchen Fan <wenc...@databricks.com> wrote:
>>>>>
>>>>>> IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" mechanism
>>>>>> scans all table files only once, and writes the inferred schema back
>>>>>> to the metastore so that we don't need to do the schema inference
>>>>>> again.
>>>>>>
>>>>>> So technically this introduces a performance regression for the
>>>>>> first query, but compared to branch-2.0 it's not a performance
>>>>>> regression. And this patch fixed a regression in branch-2.1 relative
>>>>>> to queries that ran fine in branch-2.0. Personally, I think we should
>>>>>> keep INFER_AND_SAVE as the default mode.
>>>>>>
>>>>>> + [Eric], what do you think?
>>>>>>
>>>>>> On Sat, Apr 22, 2017 at 1:37 AM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>>>
>>>>>>> Thanks for pointing this out, Michael. Based on the conversation on
>>>>>>> the PR
>>>>>>> <https://github.com/apache/spark/pull/16944#issuecomment-285529275>,
>>>>>>> this seems like a risky change to include in a release branch with
>>>>>>> a default other than NEVER_INFER.
>>>>>>>
>>>>>>> +Wenchen, what do you think?
>>>>>>>
>>>>>>> On Thu, Apr 20, 2017 at 4:14 PM, Michael Allman <mich...@videoamp.com> wrote:
>>>>>>>
>>>>>>>> We've identified the cause of the change in behavior. It is
>>>>>>>> related to the SQL conf key
>>>>>>>> "spark.sql.hive.caseSensitiveInferenceMode".
>>>>>>>> This key and its related functionality were absent from our
>>>>>>>> previous build. The default setting in the current build was
>>>>>>>> causing Spark to attempt to scan all table files during query
>>>>>>>> analysis. Changing this setting to NEVER_INFER disabled this
>>>>>>>> operation and resolved the issue we had.
>>>>>>>>
>>>>>>>> Michael
>>>>>>>>
>>>>>>>> On Apr 20, 2017, at 3:42 PM, Michael Allman <mich...@videoamp.com> wrote:
>>>>>>>>
>>>>>>>> I want to caution that in testing a build from this morning's
>>>>>>>> branch-2.1 we found that Hive partition pruning was not working.
>>>>>>>> We found that Spark SQL was fetching all Hive table partitions for
>>>>>>>> a very simple query, whereas in a build from several weeks ago it
>>>>>>>> was fetching only the required partitions. I cannot currently
>>>>>>>> think of a reason for the regression other than some difference
>>>>>>>> between branch-2.1 from our previous build and branch-2.1 from
>>>>>>>> this morning.
>>>>>>>>
>>>>>>>> That's all I know right now. We are actively investigating to find
>>>>>>>> the root cause of this problem, and specifically whether this is a
>>>>>>>> problem in the Spark codebase or not. I will report back when I
>>>>>>>> have an answer to that question.
>>>>>>>>
>>>>>>>> Michael
>>>>>>>>
>>>>>>>> On Apr 18, 2017, at 11:59 AM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>>>>>
>>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>>> version 2.1.1. The vote is open until Fri, April 21st, 2017 at
>>>>>>>> 13:00 PST and passes if a majority of at least 3 +1 PMC votes are
>>>>>>>> cast.
>>>>>>>>
>>>>>>>> [ ] +1 Release this package as Apache Spark 2.1.1
>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>>
>>>>>>>> To learn more about Apache Spark, please see
>>>>>>>> http://spark.apache.org/
>>>>>>>>
>>>>>>>> The tag to be voted on is v2.1.1-rc3
>>>>>>>> <https://github.com/apache/spark/tree/v2.1.1-rc3>
>>>>>>>> (2ed19cff2f6ab79a718526e5d16633412d8c4dd4)
>>>>>>>>
>>>>>>>> The list of JIRA tickets resolved can be found with this filter
>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>.
>>>>>>>>
>>>>>>>> The release files, including signatures, digests, etc. can be
>>>>>>>> found at:
>>>>>>>> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-bin/
>>>>>>>>
>>>>>>>> Release artifacts are signed with the following key:
>>>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>>>
>>>>>>>> The staging repository for this release can be found at:
>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1230/
>>>>>>>>
>>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-docs/
>>>>>>>>
>>>>>>>> *FAQ*
>>>>>>>>
>>>>>>>> *How can I help test this release?*
>>>>>>>>
>>>>>>>> If you are a Spark user, you can help us test this release by
>>>>>>>> taking an existing Spark workload and running it on this release
>>>>>>>> candidate, then reporting any regressions.
>>>>>>>>
>>>>>>>> *What should happen to JIRA tickets still targeting 2.1.1?*
>>>>>>>>
>>>>>>>> Committers should look at those and triage. Extremely important
>>>>>>>> bug fixes, documentation, and API tweaks that impact compatibility
>>>>>>>> should be worked on immediately. Everything else, please retarget
>>>>>>>> to 2.1.2 or 2.2.0.
>>>>>>>>
>>>>>>>> *But my bug isn't fixed!??!*
>>>>>>>>
>>>>>>>> In order to make timely releases, we will typically not hold the
>>>>>>>> release unless the bug in question is a regression from 2.1.0.
>>>>>>>>
>>>>>>>> *What happened to RC1?*
>>>>>>>>
>>>>>>>> There were issues with the release packaging and as a result it
>>>>>>>> was skipped.
>>>>>
>>>>> --
>>>>> Cell: 425-233-8271
>>>>> Twitter: https://twitter.com/holdenkarau
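[Editor's note] For anyone who lands on this thread with the same symptom, the workaround Michael Allman describes can be pinned cluster-wide via spark-defaults.conf. This is a minimal sketch using the conf key and value named in the thread; the comment about accepted values reflects the modes discussed above (NEVER_INFER, INFER_AND_SAVE) plus INFER_ONLY:

```
# spark-defaults.conf
# Skip the case-sensitive schema-inference scan added by SPARK-19611.
# Modes discussed in this thread: INFER_AND_SAVE (default), NEVER_INFER;
# INFER_ONLY infers without writing the schema back to the metastore.
spark.sql.hive.caseSensitiveInferenceMode  NEVER_INFER
```

The same setting can also be passed per-job with `--conf spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER` on spark-submit.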