Hi Spark Aficionados-

On Fri, Sep 13, 2019 at 15:08 Ryan Blue <rb...@netflix.com.invalid> wrote:

> +1 for a preview release.
>
> DSv2 is quite close to being ready. I can only think of a couple of
> remaining fixes that we need to merge, like the one for stats estimation. I'll
> have a better idea once I've caught up from being away for ApacheCon and
> I'll add this to the agenda for our next DSv2 sync on Wednesday.
>

What does 3.0 mean for the DSv2 API? Does the API freeze at that point, or
would it still be allowed to change? I'm writing a DSv2 plug-in
(GitHub.com/spark-root/laurelin), and there are a couple of small API
additions I think could be useful; I just haven't had time to write them up
here or open a JIRA for them.
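
For context, a minimal batch-read source against the DSv2 interfaces looks
roughly like the sketch below. Signatures are approximate (the interfaces
have been moving, which is exactly why I'm asking about the freeze), and
all the names here are illustrative, not from laurelin itself:

    import java.util

    import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.connector.read.ScanBuilder
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    // Skeleton of a custom DSv2 batch source; names are illustrative.
    class ExampleProvider extends TableProvider {
      // Called when the user does not supply a schema.
      override def inferSchema(options: CaseInsensitiveStringMap): StructType =
        new StructType().add("value", "string")

      override def getTable(
          schema: StructType,
          partitioning: Array[Transform],
          properties: util.Map[String, String]): Table =
        new ExampleTable(schema)
    }

    class ExampleTable(tableSchema: StructType) extends Table with SupportsRead {
      override def name(): String = "example"
      override def schema(): StructType = tableSchema
      override def capabilities(): util.Set[TableCapability] =
        util.EnumSet.of(TableCapability.BATCH_READ)
      // The real work (planning partitions, reading rows) starts here.
      override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder = ???
    }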

Thanks
Andrew


> On Fri, Sep 13, 2019 at 12:26 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Ur, Sean.
>>
>> I prefer a full release like 2.0.0-preview.
>>
>> https://archive.apache.org/dist/spark/spark-2.0.0-preview/
>>
>> And, thank you, Xingbo!
>> Could you take a look at website generation? It seems to be broken on
>> `master`.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Fri, Sep 13, 2019 at 11:30 AM Xingbo Jiang <jiangxb1...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I would like to volunteer to be the release manager of Spark 3 preview,
>>> thanks!
>>>
>>> On Fri, Sep 13, 2019 at 11:21 AM, Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> Well, great to hear the unanimous support for a Spark 3 preview
>>>> release. Now, I don't know how to make releases myself :) I would
>>>> first open it up to our revered release managers: would anyone be
>>>> interested in trying to make one? It sounds like it's not too soon to
>>>> get what's in master out for evaluation, as there aren't any major
>>>> deficiencies left, although there are a number of items to consider
>>>> for the final release.
>>>>
>>>> I think we just need one release, targeting Hadoop 3.x / Hive 2.x in
>>>> order to make it possible to test with JDK 11. (We're only on Scala
>>>> 2.12 at this point.)
>>>>
>>>> On Thu, Sep 12, 2019 at 7:32 PM Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>> >
>>>> > +1! Long overdue for a preview release.
>>>> >
>>>> >
>>>> > On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau <hol...@pigscanfly.ca>
>>>> wrote:
>>>> >>
>>>> >> I like the idea from the PoV of giving folks something to start
>>>> testing against and exploring, so they can raise issues with us earlier in
>>>> the process and we have more time to make calls around them.
>>>> >>
>>>> >> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge <jzh...@apache.org>
>>>> wrote:
>>>> >>>
>>>> >>> +1  Like the idea as a user and a DSv2 contributor.
>>>> >>>
>>>> >>> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim <kabh...@gmail.com>
>>>> wrote:
>>>> >>>>
>>>> >>>> +1 (as a contributor) from me on having a preview release for
>>>> Spark 3, as it would help with testing the new features. When to cut the
>>>> preview is the open question, since the major work should ideally be done
>>>> before that: if the goal is to introduce new features before the official
>>>> release, the timing doesn't matter much, but if the goal is to give users
>>>> an earlier opportunity to test, the major work should land first.
>>>> >>>>
>>>> >>>> As one of the contributors in the structured streaming area, I'd
>>>> like to add some items for Spark 3.0, both "must be done" and "better to
>>>> have". For "better to have", I picked new-feature items that committers
>>>> reviewed for a couple of rounds and that then stalled without any soft
>>>> reject (no valid reason to stop). For Spark 2.4 users, the only feature
>>>> added to structured streaming was Kafka delegation tokens (counting the
>>>> Kafka consumer pool rework as an improvement rather than a feature). I
>>>> hope we can put some gifts for structured streaming users into the Spark
>>>> 3.0 envelope.
>>>> >>>>
>>>> >>>> > must be done
>>>> >>>> * SPARK-26154 Stream-stream joins - left outer join gives
>>>> inconsistent output
>>>> >>>> It's a correctness issue reported by multiple users, first
>>>> reported in Nov. 2018. There's a way to reproduce it consistently, and a
>>>> patch to fix it has been open since Jan. 2019.
>>>> >>>>
>>>> >>>> > better to have
>>>> >>>> * SPARK-23539 Add support for Kafka headers in Structured Streaming
>>>> >>>> * SPARK-26848 Introduce new option to Kafka source - specify
>>>> timestamp to start and end offset
>>>> >>>> * SPARK-20568 Delete files after processing in structured streaming
>>>> >>>>
>>>> >>>> There are a few more new-feature/improvement items in SS, but
>>>> given we're talking about ramping down, the list above is probably the
>>>> realistic one.
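>>>> >>>>
>>>> >>>> As a rough illustration of SPARK-26848: if it lands as currently
>>>> >>>> proposed, reading a range of Kafka offsets by timestamp could look
>>>> >>>> something like the sketch below. Option names come from the open
>>>> >>>> PR and may still change.
>>>> >>>>
>>>> >>>>   // Hypothetical usage, assuming SPARK-26848 merges as proposed:
>>>> >>>>   // per-partition epoch-millis timestamps, passed as JSON, are
>>>> >>>>   // resolved to starting/ending offsets by the Kafka source.
>>>> >>>>   // `spark` is an existing SparkSession.
>>>> >>>>   val df = spark.read
>>>> >>>>     .format("kafka")
>>>> >>>>     .option("kafka.bootstrap.servers", "broker:9092")
>>>> >>>>     .option("subscribe", "events")
>>>> >>>>     .option("startingOffsetsByTimestamp",
>>>> >>>>       """{"events": {"0": 1565000000000, "1": 1565000000000}}""")
>>>> >>>>     .option("endingOffsetsByTimestamp",
>>>> >>>>       """{"events": {"0": 1565100000000, "1": 1565100000000}}""")
>>>> >>>>     .load()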
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin <j...@jgp.net>
>>>> wrote:
>>>> >>>>>
>>>> >>>>> As a user/non-committer, +1
>>>> >>>>>
>>>> >>>>> I love the idea of an early 3.0.0 so we can test current dev
>>>> against it. I know the final 3.x will probably need another round of
>>>> testing when it gets out, but certainly less... I know I could check out
>>>> and compile, but having a “packaged” preview version is great if it does
>>>> not take too much of the team's time...
>>>> >>>>>
>>>> >>>>> jg
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Sep 11, 2019, at 20:40, Hyukjin Kwon <gurwls...@gmail.com>
>>>> wrote:
>>>> >>>>>
>>>> >>>>> +1 from me as well, but I would like to know what other people think
>>>> too.
>>>> >>>>>
>>>> >>>>> 2019년 9월 12일 (목) 오전 9:07, Dongjoon Hyun <dongjoon.h...@gmail.com>님이
>>>> 작성:
>>>> >>>>>>
>>>> >>>>>> Thank you, Sean.
>>>> >>>>>>
>>>> >>>>>> I'm also +1 for the following three.
>>>> >>>>>>
>>>> >>>>>> 1. Start to ramp down (by the official branch-3.0 cut)
>>>> >>>>>> 2. Apache Spark 3.0.0-preview in 2019
>>>> >>>>>> 3. Apache Spark 3.0.0 in early 2020
>>>> >>>>>>
>>>> >>>>>> For the JDK11 clean-up, it will meet the timeline, and
>>>> `3.0.0-preview` helps it a lot.
>>>> >>>>>>
>>>> >>>>>> After this discussion, can we put a timeline for the `Spark 3.0
>>>> Release Window` on our versioning-policy page?
>>>> >>>>>>
>>>> >>>>>> - https://spark.apache.org/versioning-policy.html
>>>> >>>>>>
>>>> >>>>>> Bests,
>>>> >>>>>> Dongjoon.
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer <
>>>> heue...@gmail.com> wrote:
>>>> >>>>>>>
>>>> >>>>>>> I would love to see Spark + Hadoop + Parquet + Avro
>>>> compatibility problems resolved, e.g.
>>>> >>>>>>>
>>>> >>>>>>> https://issues.apache.org/jira/browse/SPARK-25588
>>>> >>>>>>> https://issues.apache.org/jira/browse/SPARK-27781
>>>> >>>>>>>
>>>> >>>>>>> Note that Avro is now at 1.9.1, binary-incompatible with
>>>> 1.8.x.  As far as I know, Parquet has not cut a release based on this new
>>>> version.
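>>>> >>>>>>>
>>>> >>>>>>> Until the whole stack lines up on one Avro, downstream builds
>>>> >>>>>>> typically have to force a single version themselves. A minimal
>>>> >>>>>>> sbt sketch, with 1.8.2 as an illustrative choice since that is
>>>> >>>>>>> what Spark 2.4 ships:
>>>> >>>>>>>
>>>> >>>>>>>   // build.sbt: pin Avro so Spark, Parquet, and our own code all
>>>> >>>>>>>   // resolve to the same version. 1.8.2 is illustrative; pick
>>>> >>>>>>>   // whichever side of the 1.8/1.9 break your classpath needs.
>>>> >>>>>>>   dependencyOverrides += "org.apache.avro" % "avro" % "1.8.2"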
>>>> >>>>>>>
>>>> >>>>>>> Then out of curiosity, are the new Spark Graph APIs targeting
>>>> 3.0?
>>>> >>>>>>>
>>>> >>>>>>> https://github.com/apache/spark/pull/24851
>>>> >>>>>>> https://github.com/apache/spark/pull/24297
>>>> >>>>>>>
>>>> >>>>>>>    michael
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> On Sep 11, 2019, at 1:37 PM, Sean Owen <sro...@apache.org>
>>>> wrote:
>>>> >>>>>>>
>>>> >>>>>>> I'm curious what current feelings are about ramping down
>>>> towards a
>>>> >>>>>>> Spark 3 release. It feels close to ready. There is no fixed
>>>> date,
>>>> >>>>>>> though in the past we had informally tossed around "back end of
>>>> 2019".
>>>> >>>>>>> For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd
>>>> expect
>>>> >>>>>>> Spark 2 to last longer, so to speak, but it feels like Spark 3 is
>>>> coming
>>>> >>>>>>> due.
>>>> >>>>>>>
>>>> >>>>>>> What are the few major items that must get done for Spark 3, in
>>>> your
>>>> >>>>>>> opinion? Below are all of the open JIRAs for 3.0 (which everyone
>>>> >>>>>>> should feel free to update with things that aren't really
>>>> needed for
>>>> >>>>>>> Spark 3; I already triaged some).
>>>> >>>>>>>
>>>> >>>>>>> For me, it's:
>>>> >>>>>>> - DSv2?
>>>> >>>>>>> - Finishing touches on the Hive, JDK 11 update
>>>> >>>>>>>
>>>> >>>>>>> What about considering a preview release earlier, as happened
>>>> for
>>>> >>>>>>> Spark 2, to get feedback much earlier than the RC cycle? Could
>>>> that
>>>> >>>>>>> even happen ... about now?
>>>> >>>>>>>
>>>> >>>>>>> I'm also wondering what a realistic estimate of Spark 3 release
>>>> is. My
>>>> >>>>>>> guess is quite early 2020, from here.
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> SPARK-29014 DataSourceV2: Clean up current, default, and
>>>> session catalog uses
>>>> >>>>>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>>> >>>>>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>>> >>>>>>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog
>>>> API
>>>> >>>>>>> SPARK-28588 Build a SQL reference doc
>>>> >>>>>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>>> >>>>>>> SPARK-28684 Hive module support JDK 11
>>>> >>>>>>> SPARK-28548 explain() shows wrong result for persisted
>>>> DataFrames
>>>> >>>>>>> after some operations
>>>> >>>>>>> SPARK-28372 Document Spark WEB UI
>>>> >>>>>>> SPARK-28476 Support ALTER DATABASE SET LOCATION
>>>> >>>>>>> SPARK-28264 Revisiting Python / pandas UDF
>>>> >>>>>>> SPARK-28301 fix the behavior of table name resolution with
>>>> multi-catalog
>>>> >>>>>>> SPARK-28155 do not leak SaveMode to file source v2
>>>> >>>>>>> SPARK-28103 Cannot infer filters from union table with empty
>>>> local
>>>> >>>>>>> relation table properly
>>>> >>>>>>> SPARK-28024 Incorrect numeric values when out of range
>>>> >>>>>>> SPARK-27936 Support local dependency uploading from --py-files
>>>> >>>>>>> SPARK-27884 Deprecate Python 2 support in Spark 3.0
>>>> >>>>>>> SPARK-27763 Port test cases from PostgreSQL to Spark SQL
>>>> >>>>>>> SPARK-27780 Shuffle server & client should be versioned to
>>>> enable
>>>> >>>>>>> smoother upgrade
>>>> >>>>>>> SPARK-27714 Support Join Reorder based on Genetic Algorithm
>>>> when the #
>>>> >>>>>>> of joined tables > 12
>>>> >>>>>>> SPARK-27471 Reorganize public v2 catalog API
>>>> >>>>>>> SPARK-27520 Introduce a global config system to replace
>>>> hadoopConfiguration
>>>> >>>>>>> SPARK-24625 put all the backward compatible behavior change
>>>> configs
>>>> >>>>>>> under spark.sql.legacy.*
>>>> >>>>>>> SPARK-24640 size(null) returns null
>>>> >>>>>>> SPARK-24702 Unable to cast to calendar interval in spark sql.
>>>> >>>>>>> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more
>>>> operators
>>>> >>>>>>> SPARK-24941 Add RDDBarrier.coalesce() function
>>>> >>>>>>> SPARK-25017 Add test suite for ContextBarrierState
>>>> >>>>>>> SPARK-25083 remove the type erasure hack in data source scan
>>>> >>>>>>> SPARK-25383 Image data source supports sample pushdown
>>>> >>>>>>> SPARK-27272 Enable blacklisting of node/executor on fetch
>>>> failures by default
>>>> >>>>>>> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a
>>>> major
>>>> >>>>>>> efficiency problem
>>>> >>>>>>> SPARK-25128 multiple simultaneous job submissions against k8s
>>>> backend
>>>> >>>>>>> cause driver pods to hang
>>>> >>>>>>> SPARK-26731 remove EOLed spark jobs from jenkins
>>>> >>>>>>> SPARK-26664 Make DecimalType's minimum adjusted scale
>>>> configurable
>>>> >>>>>>> SPARK-21559 Remove Mesos fine-grained mode
>>>> >>>>>>> SPARK-24942 Improve cluster resource management with jobs
>>>> containing
>>>> >>>>>>> barrier stage
>>>> >>>>>>> SPARK-25914 Separate projection from grouping and aggregate in
>>>> logical Aggregate
>>>> >>>>>>> SPARK-26022 PySpark Comparison with Pandas
>>>> >>>>>>> SPARK-20964 Make some keywords reserved along with the ANSI/SQL
>>>> standard
>>>> >>>>>>> SPARK-26221 Improve Spark SQL instrumentation and metrics
>>>> >>>>>>> SPARK-26425 Add more constraint checks in file streaming source
>>>> to
>>>> >>>>>>> avoid checkpoint corruption
>>>> >>>>>>> SPARK-25843 Redesign rangeBetween API
>>>> >>>>>>> SPARK-25841 Redesign window function rangeBetween API
>>>> >>>>>>> SPARK-25752 Add trait to easily whitelist logical operators that
>>>> >>>>>>> produce named output from CleanupAliases
>>>> >>>>>>> SPARK-23210 Introduce the concept of default value to schema
>>>> >>>>>>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and
>>>> window aggregate
>>>> >>>>>>> SPARK-25531 new write APIs for data source v2
>>>> >>>>>>> SPARK-25547 Pluggable jdbc connection factory
>>>> >>>>>>> SPARK-20845 Support specification of column names in INSERT INTO
>>>> >>>>>>> SPARK-24417 Build and Run Spark on JDK11
>>>> >>>>>>> SPARK-24724 Discuss necessary info and access in barrier mode +
>>>> Kubernetes
>>>> >>>>>>> SPARK-24725 Discuss necessary info and access in barrier mode +
>>>> Mesos
>>>> >>>>>>> SPARK-25074 Implement maxNumConcurrentTasks() in
>>>> >>>>>>> MesosFineGrainedSchedulerBackend
>>>> >>>>>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>>> >>>>>>> SPARK-25186 Stabilize Data Source V2 API
>>>> >>>>>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for
>>>> barrier
>>>> >>>>>>> execution mode
>>>> >>>>>>> SPARK-25390 data source V2 API refactoring
>>>> >>>>>>> SPARK-7768 Make user-defined type (UDT) API public
>>>> >>>>>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based
>>>> Partition Spec
>>>> >>>>>>> SPARK-15691 Refactor and improve Hive support
>>>> >>>>>>> SPARK-15694 Implement ScriptTransformation in sql/core
>>>> >>>>>>> SPARK-16217 Support SELECT INTO statement
>>>> >>>>>>> SPARK-16452 basic INFORMATION_SCHEMA support
>>>> >>>>>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>>>> >>>>>>> SPARK-18245 Improving support for bucketed table
>>>> >>>>>>> SPARK-19842 Informational Referential Integrity Constraints
>>>> Support in Spark
>>>> >>>>>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in
>>>> nested
>>>> >>>>>>> list of structures
>>>> >>>>>>> SPARK-22632 Fix the behavior of timestamp values for R's
>>>> DataFrame to
>>>> >>>>>>> respect session timezone
>>>> >>>>>>> SPARK-22386 Data Source V2 improvements
>>>> >>>>>>> SPARK-24723 Discuss necessary info and access in barrier mode +
>>>> YARN
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> --
>>>> >>>> Name : Jungtaek Lim
>>>> >>>> Blog : http://medium.com/@heartsavior
>>>> >>>> Twitter : http://twitter.com/heartsavior
>>>> >>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> John Zhuge
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Twitter: https://twitter.com/holdenkarau
>>>> >> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9
>>>> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>> >
>>>> >
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
-- 
It's dark in this basement.
