Re: Thoughts on Spark 3 release, or a preview release

Xingbo Jiang Fri, 13 Sep 2019 11:31:24 -0700

Hi all,

I would like to volunteer to be the release manager of the Spark 3 preview.
Thanks!


On Fri, Sep 13, 2019 at 11:21 AM Sean Owen <[email protected]> wrote:

> Well, great to hear the unanimous support for a Spark 3 preview
> release. Now, I don't know how to make releases myself :) I would
> first open it up to our revered release managers: would anyone be
> interested in trying to make one? Sounds like it's not too soon to get
> what's in master out for evaluation, as there aren't any major
> deficiencies left, although there are a number of items to consider
> for the final release.
>
> I think we just need one release, targeting Hadoop 3.x / Hive 2.x in
> order to make it possible to test with JDK 11. (We're only on Scala
> 2.12 at this point.)
>
> On Thu, Sep 12, 2019 at 7:32 PM Reynold Xin <[email protected]> wrote:
> >
> > +1! Long overdue for a preview release.
> >
> >
> > On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau <[email protected]> wrote:
> >>
> >> I like the idea from the PoV of giving folks something to start testing
> >> against and exploring so they can raise issues with us earlier in the
> >> process and we have more time to make calls around this.
> >>
> >> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge <[email protected]> wrote:
> >>>
> >>> +1  Like the idea as a user and a DSv2 contributor.
> >>>
> >>> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim <[email protected]> wrote:
> >>>>
> >>>> +1 (as a contributor) from me to having a preview release of Spark 3, as
> >>>> it would help with testing the features. When to cut the preview release
> >>>> is an open question, as major work should ideally be done before that. If
> >>>> we intend to introduce new features before the official release, that
> >>>> should work regardless of the timing, but if we intend to have the
> >>>> opportunity to test earlier, then ideally the major work should be done
> >>>> first.
> >>>>
> >>>> As one of the contributors in the Structured Streaming area, I'd like to
> >>>> add some items for Spark 3.0, both "must be done" and "better to have".
> >>>> For "better to have", I picked some new-feature items which committers
> >>>> reviewed for a couple of rounds and then dropped off without a
> >>>> soft-reject (no valid reason to stop). For Spark 2.4 users, the only
> >>>> added feature for Structured Streaming is the Kafka delegation token
> >>>> (given we count the revised Kafka consumer pool as an improvement). I
> >>>> hope we provide some gifts for Structured Streaming users in the Spark
> >>>> 3.0 envelope.
> >>>>
> >>>> > must be done
> >>>> * SPARK-26154 Stream-stream joins - left outer join gives
> >>>> inconsistent output
> >>>> It's a correctness issue that multiple users have reported, first
> >>>> reported in Nov. 2018. There's a way to reproduce it consistently, and a
> >>>> patch to fix it was submitted in Jan. 2019.
> >>>>
> >>>> > better to have
> >>>> * SPARK-23539 Add support for Kafka headers in Structured Streaming
> >>>> * SPARK-26848 Introduce new option to Kafka source - specify
> >>>> timestamp to start and end offset
> >>>> * SPARK-20568 Delete files after processing in structured streaming
> >>>>
> >>>> There are some more new feature/improvement items in SS, but given
> >>>> we're talking about ramping down, the above list might be a realistic one.
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin <[email protected]> wrote:
> >>>>>
> >>>>> As a user/non committer, +1
> >>>>>
> >>>>> I love the idea of an early 3.0.0 so we can test current dev against
> >>>>> it. I know the final 3.x will probably need another round of testing
> >>>>> when it gets out, but less for sure... I know I could check out and
> >>>>> compile, but having a "packaged" pre-version is great if it does not
> >>>>> take too much of the team's time...
> >>>>>
> >>>>> jg
> >>>>>
> >>>>>
> >>>>> On Sep 11, 2019, at 20:40, Hyukjin Kwon <[email protected]> wrote:
> >>>>>
> >>>>> +1 from me too but I would like to know what other people think too.
> >>>>>
> >>>>> On Thu, Sep 12, 2019 at 9:07 AM, Dongjoon Hyun <[email protected]> wrote:
> >>>>>>
> >>>>>> Thank you, Sean.
> >>>>>>
> >>>>>> I'm also +1 for the following three.
> >>>>>>
> >>>>>> 1. Start to ramp down (by the official branch-3.0 cut)
> >>>>>> 2. Apache Spark 3.0.0-preview in 2019
> >>>>>> 3. Apache Spark 3.0.0 in early 2020
> >>>>>>
> >>>>>> For JDK11 clean-up, it will meet the timeline and `3.0.0-preview`
> >>>>>> helps it a lot.
> >>>>>>
> >>>>>> After this discussion, can we have some timeline for `Spark 3.0
> >>>>>> Release Window` in our versioning-policy page?
> >>>>>>
> >>>>>> - https://spark.apache.org/versioning-policy.html
> >>>>>>
> >>>>>> Bests,
> >>>>>> Dongjoon.
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer <[email protected]> wrote:
> >>>>>>>
> >>>>>>> I would love to see Spark + Hadoop + Parquet + Avro compatibility
> >>>>>>> problems resolved, e.g.
> >>>>>>>
> >>>>>>> https://issues.apache.org/jira/browse/SPARK-25588
> >>>>>>> https://issues.apache.org/jira/browse/SPARK-27781
> >>>>>>>
> >>>>>>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.
> >>>>>>> As far as I know, Parquet has not cut a release based on this new version.
> >>>>>>>
> >>>>>>> Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
> >>>>>>>
> >>>>>>> https://github.com/apache/spark/pull/24851
> >>>>>>> https://github.com/apache/spark/pull/24297
> >>>>>>>
> >>>>>>>    michael
> >>>>>>>
> >>>>>>>
> >>>>>>> On Sep 11, 2019, at 1:37 PM, Sean Owen <[email protected]> wrote:
> >>>>>>>
> >>>>>>> I'm curious what current feelings are about ramping down towards a
> >>>>>>> Spark 3 release. It feels close to ready. There is no fixed date,
> >>>>>>> though in the past we had informally tossed around "back end of
> >>>>>>> 2019". For reference, Spark 1 was May 2014, Spark 2 was July 2016.
> >>>>>>> I'd expect Spark 2 to last longer, so to speak, but feels like
> >>>>>>> Spark 3 is coming due.
> >>>>>>>
> >>>>>>> What are the few major items that must get done for Spark 3, in
> >>>>>>> your opinion? Below are all of the open JIRAs for 3.0 (which everyone
> >>>>>>> should feel free to update with things that aren't really needed
> >>>>>>> for Spark 3; I already triaged some).
> >>>>>>>
> >>>>>>> For me, it's:
> >>>>>>> - DSv2?
> >>>>>>> - Finishing touches on the Hive, JDK 11 update
> >>>>>>>
> >>>>>>> What about considering a preview release earlier, as happened for
> >>>>>>> Spark 2, to get feedback much earlier than the RC cycle? Could that
> >>>>>>> even happen ... about now?
> >>>>>>>
> >>>>>>> I'm also wondering what a realistic estimate of Spark 3 release
> >>>>>>> is. My guess is quite early 2020, from here.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> SPARK-29014 DataSourceV2: Clean up current, default, and session
> >>>>>>> catalog uses
> >>>>>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
> >>>>>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
> >>>>>>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
> >>>>>>> SPARK-28588 Build a SQL reference doc
> >>>>>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
> >>>>>>> SPARK-28684 Hive module support JDK 11
> >>>>>>> SPARK-28548 explain() shows wrong result for persisted DataFrames
> >>>>>>> after some operations
> >>>>>>> SPARK-28372 Document Spark WEB UI
> >>>>>>> SPARK-28476 Support ALTER DATABASE SET LOCATION
> >>>>>>> SPARK-28264 Revisiting Python / pandas UDF
> >>>>>>> SPARK-28301 fix the behavior of table name resolution with
> >>>>>>> multi-catalog
> >>>>>>> SPARK-28155 do not leak SaveMode to file source v2
> >>>>>>> SPARK-28103 Cannot infer filters from union table with empty local
> >>>>>>> relation table properly
> >>>>>>> SPARK-28024 Incorrect numeric values when out of range
> >>>>>>> SPARK-27936 Support local dependency uploading from --py-files
> >>>>>>> SPARK-27884 Deprecate Python 2 support in Spark 3.0
> >>>>>>> SPARK-27763 Port test cases from PostgreSQL to Spark SQL
> >>>>>>> SPARK-27780 Shuffle server & client should be versioned to enable
> >>>>>>> smoother upgrade
> >>>>>>> SPARK-27714 Support Join Reorder based on Genetic Algorithm when
> >>>>>>> the # of joined tables > 12
> >>>>>>> SPARK-27471 Reorganize public v2 catalog API
> >>>>>>> SPARK-27520 Introduce a global config system to replace
> >>>>>>> hadoopConfiguration
> >>>>>>> SPARK-24625 put all the backward compatible behavior change configs
> >>>>>>> under spark.sql.legacy.*
> >>>>>>> SPARK-24640 size(null) returns null
> >>>>>>> SPARK-24702 Unable to cast to calendar interval in spark sql.
> >>>>>>> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more
> >>>>>>> operators
> >>>>>>> SPARK-24941 Add RDDBarrier.coalesce() function
> >>>>>>> SPARK-25017 Add test suite for ContextBarrierState
> >>>>>>> SPARK-25083 remove the type erasure hack in data source scan
> >>>>>>> SPARK-25383 Image data source supports sample pushdown
> >>>>>>> SPARK-27272 Enable blacklisting of node/executor on fetch failures
> >>>>>>> by default
> >>>>>>> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major
> >>>>>>> efficiency problem
> >>>>>>> SPARK-25128 multiple simultaneous job submissions against k8s
> >>>>>>> backend cause driver pods to hang
> >>>>>>> SPARK-26731 remove EOLed spark jobs from jenkins
> >>>>>>> SPARK-26664 Make DecimalType's minimum adjusted scale configurable
> >>>>>>> SPARK-21559 Remove Mesos fine-grained mode
> >>>>>>> SPARK-24942 Improve cluster resource management with jobs
> >>>>>>> containing barrier stage
> >>>>>>> SPARK-25914 Separate projection from grouping and aggregate in
> >>>>>>> logical Aggregate
> >>>>>>> SPARK-26022 PySpark Comparison with Pandas
> >>>>>>> SPARK-20964 Make some keywords reserved along with the ANSI/SQL
> >>>>>>> standard
> >>>>>>> SPARK-26221 Improve Spark SQL instrumentation and metrics
> >>>>>>> SPARK-26425 Add more constraint checks in file streaming source to
> >>>>>>> avoid checkpoint corruption
> >>>>>>> SPARK-25843 Redesign rangeBetween API
> >>>>>>> SPARK-25841 Redesign window function rangeBetween API
> >>>>>>> SPARK-25752 Add trait to easily whitelist logical operators that
> >>>>>>> produce named output from CleanupAliases
> >>>>>>> SPARK-23210 Introduce the concept of default value to schema
> >>>>>>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and
> >>>>>>> window aggregate
> >>>>>>> SPARK-25531 new write APIs for data source v2
> >>>>>>> SPARK-25547 Pluggable jdbc connection factory
> >>>>>>> SPARK-20845 Support specification of column names in INSERT INTO
> >>>>>>> SPARK-24417 Build and Run Spark on JDK11
> >>>>>>> SPARK-24724 Discuss necessary info and access in barrier mode +
> >>>>>>> Kubernetes
> >>>>>>> SPARK-24725 Discuss necessary info and access in barrier mode +
> >>>>>>> Mesos
> >>>>>>> SPARK-25074 Implement maxNumConcurrentTasks() in
> >>>>>>> MesosFineGrainedSchedulerBackend
> >>>>>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
> >>>>>>> SPARK-25186 Stabilize Data Source V2 API
> >>>>>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for
> >>>>>>> barrier execution mode
> >>>>>>> SPARK-25390 data source V2 API refactoring
> >>>>>>> SPARK-7768 Make user-defined type (UDT) API public
> >>>>>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based
> >>>>>>> Partition Spec
> >>>>>>> SPARK-15691 Refactor and improve Hive support
> >>>>>>> SPARK-15694 Implement ScriptTransformation in sql/core
> >>>>>>> SPARK-16217 Support SELECT INTO statement
> >>>>>>> SPARK-16452 basic INFORMATION_SCHEMA support
> >>>>>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
> >>>>>>> SPARK-18245 Improving support for bucketed table
> >>>>>>> SPARK-19842 Informational Referential Integrity Constraints
> >>>>>>> Support in Spark
> >>>>>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in
> >>>>>>> nested list of structures
> >>>>>>> SPARK-22632 Fix the behavior of timestamp values for R's DataFrame
> >>>>>>> to respect session timezone
> >>>>>>> SPARK-22386 Data Source V2 improvements
> >>>>>>> SPARK-24723 Discuss necessary info and access in barrier mode +
> >>>>>>> YARN
> >>>>>>>
> >>>>>>>
> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe e-mail: [email protected]
> >>>>>>>
> >>>>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Name : Jungtaek Lim
> >>>> Blog : http://medium.com/@heartsavior
> >>>> Twitter : http://twitter.com/heartsavior
> >>>> LinkedIn : http://www.linkedin.com/in/heartsavior
> >>>
> >>>
> >>>
> >>> --
> >>> John Zhuge
> >>
> >>
> >>
> >> --
> >> Twitter: https://twitter.com/holdenkarau
> >> Books (Learning Spark, High Performance Spark, etc.):
> >> https://amzn.to/2MaRAG9
> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >
> >
