Yeah, I agree. -1 (binding)
This vote fails, and I'll cut a new RC after #17749
<https://github.com/apache/spark/pull/17749> is merged.

On Mon, Apr 24, 2017 at 12:18 PM, Eric Liang <e...@databricks.com> wrote:

> -1 (non-binding)
>
> I also agree with using NEVER_INFER for 2.1.1. The migration cost is
> unexpected for a point release.
>
> On Mon, Apr 24, 2017 at 11:08 AM Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> Whoops, sorry, finger slipped on that last message.
>> It sounds like whatever we do is going to break some existing users
>> (either breaking their tables via case sensitivity or surprising them
>> with the unexpected scan).
>>
>> Personally I agree with Michael Allman on this; I believe we should
>> use NEVER_INFER for 2.1.1.
>>
>> On Mon, Apr 24, 2017 at 11:01 AM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>
>>> It
>>>
>>> On Mon, Apr 24, 2017 at 10:33 AM, Michael Allman <mich...@videoamp.com> wrote:
>>>
>>>> The trouble we ran into is that this upgrade was blocking access to
>>>> our tables, and we didn't know why. This sounds like a kind of migration
>>>> operation, but it was not apparent that this was the case. It took an
>>>> expert examining a stack trace and source code to figure this out. Would
>>>> a more naive end user be able to debug this issue? Maybe we're an
>>>> unusual case, but our particular experience was pretty bad. I have my
>>>> doubts that schema inference on our largest tables would ever complete
>>>> without throwing some kind of timeout (which we were in fact receiving)
>>>> or the end user giving up and killing the job. We ended up doing a
>>>> rollback while we investigated the source of the issue. In our case,
>>>> NEVER_INFER is clearly the best configuration. We're going to add that
>>>> to our default configuration files.
>>>>
>>>> My expectation is that a minor point release is a pretty safe bug-fix
>>>> release. We were a bit hasty in not doing better due diligence
>>>> pre-upgrade.
>>>>
>>>> One suggestion the Spark team might consider is releasing 2.1.1 with
>>>> NEVER_INFER and 2.2.0 with INFER_AND_SAVE. Clearly some kind of
>>>> up-front migration notes would help in identifying this new behavior
>>>> in 2.2.
>>>>
>>>> Thanks,
>>>>
>>>> Michael
>>>>
>>>> On Apr 24, 2017, at 2:09 AM, Wenchen Fan <wenc...@databricks.com> wrote:
>>>>
>>>> see https://issues.apache.org/jira/browse/SPARK-19611
>>>>
>>>> On Mon, Apr 24, 2017 at 2:22 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>
>>>>> What's the regression this fixed in 2.1 from 2.0?
>>>>>
>>>>> On Fri, Apr 21, 2017 at 7:45 PM, Wenchen Fan <wenc...@databricks.com> wrote:
>>>>>
>>>>>> IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" mechanism
>>>>>> scans all table files only once, and writes the inferred schema back
>>>>>> to the metastore so that we don't need to do the schema inference
>>>>>> again.
>>>>>>
>>>>>> So technically this introduces a performance regression for the
>>>>>> first query, but compared to branch-2.0 it's not a performance
>>>>>> regression. And this patch fixed a regression in branch-2.1 relative
>>>>>> to queries that ran fine in branch-2.0. Personally, I think we should
>>>>>> keep INFER_AND_SAVE as the default mode.
>>>>>>
>>>>>> + [Eric], what do you think?
>>>>>>
>>>>>> On Sat, Apr 22, 2017 at 1:37 AM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>>>
>>>>>>> Thanks for pointing this out, Michael. Based on the conversation on
>>>>>>> the PR
>>>>>>> <https://github.com/apache/spark/pull/16944#issuecomment-285529275>,
>>>>>>> this seems like a risky change to include in a release branch with
>>>>>>> a default other than NEVER_INFER.
>>>>>>>
>>>>>>> +Wenchen, what do you think?
>>>>>>>
>>>>>>> On Thu, Apr 20, 2017 at 4:14 PM, Michael Allman <mich...@videoamp.com> wrote:
>>>>>>>
>>>>>>>> We've identified the cause of the change in behavior. It is
>>>>>>>> related to the SQL conf key
>>>>>>>> "spark.sql.hive.caseSensitiveInferenceMode".
>>>>>>>> This key and its related functionality were absent from our
>>>>>>>> previous build. The default setting in the current build was
>>>>>>>> causing Spark to attempt to scan all table files during query
>>>>>>>> analysis. Changing this setting to NEVER_INFER disabled this
>>>>>>>> operation and resolved the issue we had.
>>>>>>>>
>>>>>>>> Michael
>>>>>>>>
>>>>>>>> On Apr 20, 2017, at 3:42 PM, Michael Allman <mich...@videoamp.com> wrote:
>>>>>>>>
>>>>>>>> I want to caution that in testing a build from this morning's
>>>>>>>> branch-2.1 we found that Hive partition pruning was not working.
>>>>>>>> We found that Spark SQL was fetching all Hive table partitions for
>>>>>>>> a very simple query, whereas in a build from several weeks ago it
>>>>>>>> was fetching only the required partitions. I cannot currently
>>>>>>>> think of a reason for the regression other than some difference
>>>>>>>> between branch-2.1 from our previous build and branch-2.1 from
>>>>>>>> this morning.
>>>>>>>>
>>>>>>>> That's all I know right now. We are actively investigating to find
>>>>>>>> the root cause of this problem, and specifically whether this is a
>>>>>>>> problem in the Spark codebase or not. I will report back when I
>>>>>>>> have an answer to that question.
>>>>>>>>
>>>>>>>> Michael
>>>>>>>>
>>>>>>>> On Apr 18, 2017, at 11:59 AM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>>>>>
>>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>>> version 2.1.1. The vote is open until Fri, April 21st, 2017 at
>>>>>>>> 13:00 PST and passes if a majority of at least 3 +1 PMC votes are
>>>>>>>> cast.
>>>>>>>>
>>>>>>>> [ ] +1 Release this package as Apache Spark 2.1.1
>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>>
>>>>>>>> To learn more about Apache Spark, please see
>>>>>>>> http://spark.apache.org/
>>>>>>>>
>>>>>>>> The tag to be voted on is v2.1.1-rc3
>>>>>>>> <https://github.com/apache/spark/tree/v2.1.1-rc3>
>>>>>>>> (2ed19cff2f6ab79a718526e5d16633412d8c4dd4)
>>>>>>>>
>>>>>>>> The list of JIRA tickets resolved can be found with this filter
>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>.
>>>>>>>>
>>>>>>>> The release files, including signatures, digests, etc. can be
>>>>>>>> found at:
>>>>>>>> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-bin/
>>>>>>>>
>>>>>>>> Release artifacts are signed with the following key:
>>>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>>>
>>>>>>>> The staging repository for this release can be found at:
>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1230/
>>>>>>>>
>>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-docs/
>>>>>>>>
>>>>>>>> *FAQ*
>>>>>>>>
>>>>>>>> *How can I help test this release?*
>>>>>>>>
>>>>>>>> If you are a Spark user, you can help us test this release by
>>>>>>>> taking an existing Spark workload and running it on this release
>>>>>>>> candidate, then reporting any regressions.
>>>>>>>>
>>>>>>>> *What should happen to JIRA tickets still targeting 2.1.1?*
>>>>>>>>
>>>>>>>> Committers should look at those and triage. Extremely important
>>>>>>>> bug fixes, documentation, and API tweaks that impact compatibility
>>>>>>>> should be worked on immediately. Everything else, please retarget
>>>>>>>> to 2.1.2 or 2.2.0.
>>>>>>>>
>>>>>>>> *But my bug isn't fixed!??!*
>>>>>>>>
>>>>>>>> In order to make timely releases, we will typically not hold the
>>>>>>>> release unless the bug in question is a regression from 2.1.0.
>>>>>>>>
>>>>>>>> *What happened to RC1?*
>>>>>>>>
>>>>>>>> There were issues with the release packaging and as a result it
>>>>>>>> was skipped.
>>>>>
>>>>> --
>>>>> Cell: 425-233-8271
>>>>> Twitter: https://twitter.com/holdenkarau
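[Editor's note] For anyone who lands on this thread with the same symptom, the workaround Michael Allman describes can be pinned cluster-wide via spark-defaults.conf. This is a minimal sketch using the conf key and value named in the thread; the comment about accepted values reflects the modes discussed above (NEVER_INFER, INFER_AND_SAVE) plus INFER_ONLY:

```
# spark-defaults.conf
# Skip the case-sensitive schema-inference scan added by SPARK-19611.
# Modes discussed in this thread: INFER_AND_SAVE (default), NEVER_INFER;
# INFER_ONLY infers without writing the schema back to the metastore.
spark.sql.hive.caseSensitiveInferenceMode  NEVER_INFER
```

The same setting can also be passed per-job with `--conf spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER` on spark-submit.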