+1 as both a contributor and a user.
From: John Zhuge <jzh...@apache.org>
Date: Thursday, September 12, 2019 at 4:15 PM
To: Jungtaek Lim <kabh...@gmail.com>
Cc: Jean Georges Perrin <j...@jgp.net>, Hyukjin Kwon <gurwls...@gmail.com>, Dongjoon Hyun <dongjoon.h...@gmail.com>, dev <dev@spark.apache.org>
Subject: Re: Thoughts on Spark 3 release, or a preview release

+1 Like the idea as a user and a DSv2 contributor.

On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim <kabh...@gmail.com> wrote:

+1 (as a contributor) from me to have a preview release of Spark 3, as it would help to test the features. When to cut the preview release is questionable, as major work should ideally be done before that. If we intend to introduce new features before the official release, that should work regardless, but if we intend to give people the opportunity to test earlier, ideally it should.

As one of the contributors in the structured streaming area, I'd like to add some items for Spark 3.0, both "must be done" and "better to have". For "better to have", I picked some new-feature items which committers reviewed for a couple of rounds and then dropped off without a soft-reject (no valid reason to stop). For Spark 2.4 users, the only added feature for structured streaming is Kafka delegation token (assuming we count the Kafka consumer pool revision as an improvement). I hope we provide some gifts for structured streaming users in the Spark 3.0 envelope.

> must be done
* SPARK-26154 Stream-stream joins - left outer join gives inconsistent output
It's a correctness issue reported by multiple users, first reported in Nov. 2018. There's a way to reproduce it consistently, and a patch to fix it was submitted in Jan. 2019.
> better to have
* SPARK-23539 Add support for Kafka headers in Structured Streaming
* SPARK-26848 Introduce new option to Kafka source - specify timestamp to start and end offset
* SPARK-20568 Delete files after processing in structured streaming

There are some more new feature/improvement items in SS, but given we're talking about ramping down, the above list might be a realistic one.

On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin <j...@jgp.net> wrote:

As a user/non committer, +1

I love the idea of an early 3.0.0 so we can test current dev against it. I know the final 3.x will probably need another round of testing when it gets out, but less for sure... I know I could check out and compile, but having a “packaged” pre-version is great if it does not take too much of the team's time...

jg

On Sep 11, 2019, at 20:40, Hyukjin Kwon <gurwls...@gmail.com> wrote:

+1 from me too but I would like to know what other people think too.

On Thu, Sep 12, 2019 at 9:07 AM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

Thank you, Sean.

I'm also +1 for the following three.

1. Start to ramp down (by the official branch-3.0 cut)
2. Apache Spark 3.0.0-preview in 2019
3. Apache Spark 3.0.0 in early 2020

For JDK11 clean-up, it will meet the timeline and `3.0.0-preview` helps it a lot.

After this discussion, can we have some timeline for `Spark 3.0 Release Window` in our versioning-policy page?

- https://spark.apache.org/versioning-policy.html

Bests,
Dongjoon.

On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer <heue...@gmail.com> wrote:

I would love to see Spark + Hadoop + Parquet + Avro compatibility problems resolved, e.g.

https://issues.apache.org/jira/browse/SPARK-25588
https://issues.apache.org/jira/browse/SPARK-27781

Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x. As far as I know, Parquet has not cut a release based on this new version.

Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
https://github.com/apache/spark/pull/24851
https://github.com/apache/spark/pull/24297

michael

On Sep 11, 2019, at 1:37 PM, Sean Owen <sro...@apache.org> wrote:

I'm curious what current feelings are about ramping down towards a Spark 3 release. It feels close to ready. There is no fixed date, though in the past we had informally tossed around "back end of 2019". For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect Spark 2 to last longer, so to speak, but feels like Spark 3 is coming due.

What are the few major items that must get done for Spark 3, in your opinion? Below are all of the open JIRAs for 3.0 (which everyone should feel free to update with things that aren't really needed for Spark 3; I already triaged some).

For me, it's:
- DSv2?
- Finishing touches on the Hive, JDK 11 update

What about considering a preview release earlier, as happened for Spark 2, to get feedback much earlier than the RC cycle? Could that even happen ... about now?

I'm also wondering what a realistic estimate of Spark 3 release is. My guess is quite early 2020, from here.
SPARK-29014 DataSourceV2: Clean up current, default, and session catalog uses
SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
SPARK-28717 Update SQL ALTER TABLE RENAME to use TableCatalog API
SPARK-28588 Build a SQL reference doc
SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
SPARK-28684 Hive module support JDK 11
SPARK-28548 explain() shows wrong result for persisted DataFrames after some operations
SPARK-28372 Document Spark WEB UI
SPARK-28476 Support ALTER DATABASE SET LOCATION
SPARK-28264 Revisiting Python / pandas UDF
SPARK-28301 fix the behavior of table name resolution with multi-catalog
SPARK-28155 do not leak SaveMode to file source v2
SPARK-28103 Cannot infer filters from union table with empty local relation table properly
SPARK-28024 Incorrect numeric values when out of range
SPARK-27936 Support local dependency uploading from --py-files
SPARK-27884 Deprecate Python 2 support in Spark 3.0
SPARK-27763 Port test cases from PostgreSQL to Spark SQL
SPARK-27780 Shuffle server & client should be versioned to enable smoother upgrade
SPARK-27714 Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12
SPARK-27471 Reorganize public v2 catalog API
SPARK-27520 Introduce a global config system to replace hadoopConfiguration
SPARK-24625 put all the backward compatible behavior change configs under spark.sql.legacy.*
SPARK-24640 size(null) returns null
SPARK-24702 Unable to cast to calendar interval in spark sql.
SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more operators
SPARK-24941 Add RDDBarrier.coalesce() function
SPARK-25017 Add test suite for ContextBarrierState
SPARK-25083 remove the type erasure hack in data source scan
SPARK-25383 Image data source supports sample pushdown
SPARK-27272 Enable blacklisting of node/executor on fetch failures by default
SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major efficiency problem
SPARK-25128 multiple simultaneous job submissions against k8s backend cause driver pods to hang
SPARK-26731 remove EOLed spark jobs from jenkins
SPARK-26664 Make DecimalType's minimum adjusted scale configurable
SPARK-21559 Remove Mesos fine-grained mode
SPARK-24942 Improve cluster resource management with jobs containing barrier stage
SPARK-25914 Separate projection from grouping and aggregate in logical Aggregate
SPARK-26022 PySpark Comparison with Pandas
SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
SPARK-26221 Improve Spark SQL instrumentation and metrics
SPARK-26425 Add more constraint checks in file streaming source to avoid checkpoint corruption
SPARK-25843 Redesign rangeBetween API
SPARK-25841 Redesign window function rangeBetween API
SPARK-25752 Add trait to easily whitelist logical operators that produce named output from CleanupAliases
SPARK-23210 Introduce the concept of default value to schema
SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window aggregate
SPARK-25531 new write APIs for data source v2
SPARK-25547 Pluggable jdbc connection factory
SPARK-20845 Support specification of column names in INSERT INTO
SPARK-24417 Build and Run Spark on JDK11
SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
SPARK-25074 Implement maxNumConcurrentTasks() in MesosFineGrainedSchedulerBackend
SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
SPARK-25186 Stabilize Data Source V2 API
SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier execution mode
SPARK-25390 data source V2 API refactoring
SPARK-7768 Make user-defined type (UDT) API public
SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition Spec
SPARK-15691 Refactor and improve Hive support
SPARK-15694 Implement ScriptTransformation in sql/core
SPARK-16217 Support SELECT INTO statement
SPARK-16452 basic INFORMATION_SCHEMA support
SPARK-18134 SQL: MapType in Group BY and Joins not working
SPARK-18245 Improving support for bucketed table
SPARK-19842 Informational Referential Integrity Constraints Support in Spark
SPARK-22231 Support of map, filter, withColumn, dropColumn in nested list of structures
SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to respect session timezone
SPARK-22386 Data Source V2 improvements
SPARK-24723 Discuss necessary info and access in barrier mode + YARN

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

--
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior

--
John Zhuge