I'm curious what current feelings are about ramping down towards a Spark 3 release. It feels close to ready. There is no fixed date, though in the past we had informally tossed around "back end of 2019". For reference, Spark 1 was released in May 2014 and Spark 2 in July 2016. I'd expect Spark 2 to last longer, so to speak, but it feels like Spark 3 is coming due.
What are the few major items that must get done for Spark 3, in your opinion? Below are all of the open JIRAs targeted at 3.0 (everyone should feel free to update these, un-targeting anything that isn't really needed for Spark 3; I already triaged some). For me, it's:

- DSv2?
- Finishing touches on the Hive and JDK 11 updates

What about considering a preview release earlier, as happened for Spark 2, to get feedback much earlier than the RC cycle? Could that even happen ... about now?

I'm also wondering what a realistic estimate of the Spark 3 release is. My guess, from here, is quite early 2020.

SPARK-29014 DataSourceV2: Clean up current, default, and session catalog uses
SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
SPARK-28717 Update SQL ALTER TABLE RENAME to use TableCatalog API
SPARK-28588 Build a SQL reference doc
SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
SPARK-28684 Hive module support JDK 11
SPARK-28548 explain() shows wrong result for persisted DataFrames after some operations
SPARK-28372 Document Spark WEB UI
SPARK-28476 Support ALTER DATABASE SET LOCATION
SPARK-28264 Revisiting Python / pandas UDF
SPARK-28301 fix the behavior of table name resolution with multi-catalog
SPARK-28155 do not leak SaveMode to file source v2
SPARK-28103 Cannot infer filters from union table with empty local relation table properly
SPARK-28024 Incorrect numeric values when out of range
SPARK-27936 Support local dependency uploading from --py-files
SPARK-27884 Deprecate Python 2 support in Spark 3.0
SPARK-27763 Port test cases from PostgreSQL to Spark SQL
SPARK-27780 Shuffle server & client should be versioned to enable smoother upgrade
SPARK-27714 Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12
SPARK-27471 Reorganize public v2 catalog API
SPARK-27520 Introduce a global config system to replace hadoopConfiguration
SPARK-24625 put all the backward compatible behavior change configs under spark.sql.legacy.* (see the sketch at the end of this mail)
SPARK-24640 size(null) returns null
SPARK-24702 Unable to cast to calendar interval in spark sql.
SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more operators
SPARK-24941 Add RDDBarrier.coalesce() function
SPARK-25017 Add test suite for ContextBarrierState
SPARK-25083 remove the type erasure hack in data source scan
SPARK-25383 Image data source supports sample pushdown
SPARK-27272 Enable blacklisting of node/executor on fetch failures by default
SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major efficiency problem
SPARK-25128 multiple simultaneous job submissions against k8s backend cause driver pods to hang
SPARK-26731 remove EOLed spark jobs from jenkins
SPARK-26664 Make DecimalType's minimum adjusted scale configurable
SPARK-21559 Remove Mesos fine-grained mode
SPARK-24942 Improve cluster resource management with jobs containing barrier stage
SPARK-25914 Separate projection from grouping and aggregate in logical Aggregate
SPARK-26022 PySpark Comparison with Pandas
SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
SPARK-26221 Improve Spark SQL instrumentation and metrics
SPARK-26425 Add more constraint checks in file streaming source to avoid checkpoint corruption
SPARK-25843 Redesign rangeBetween API
SPARK-25841 Redesign window function rangeBetween API
SPARK-25752 Add trait to easily whitelist logical operators that produce named output from CleanupAliases
SPARK-23210 Introduce the concept of default value to schema
SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window aggregate
SPARK-25531 new write APIs for data source v2
SPARK-25547 Pluggable jdbc connection factory
SPARK-20845 Support specification of column names in INSERT INTO
SPARK-24417 Build and Run Spark on JDK11
SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
SPARK-25074 Implement maxNumConcurrentTasks() in MesosFineGrainedSchedulerBackend
SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
SPARK-25186 Stabilize Data Source V2 API
SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier execution mode
SPARK-25390 data source V2 API refactoring
SPARK-7768 Make user-defined type (UDT) API public
SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition Spec
SPARK-15691 Refactor and improve Hive support
SPARK-15694 Implement ScriptTransformation in sql/core
SPARK-16217 Support SELECT INTO statement
SPARK-16452 basic INFORMATION_SCHEMA support
SPARK-18134 SQL: MapType in Group BY and Joins not working
SPARK-18245 Improving support for bucketed table
SPARK-19842 Informational Referential Integrity Constraints Support in Spark
SPARK-22231 Support of map, filter, withColumn, dropColumn in nested list of structures
SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to respect session timezone
SPARK-22386 Data Source V2 improvements
SPARK-24723 Discuss necessary info and access in barrier mode + YARN
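Since SPARK-24625 and SPARK-24640 both appear above, here is a minimal sketch of how the spark.sql.legacy.* escape-hatch pattern looks from the user's side, using the existing spark.sql.legacy.sizeOfNull flag as the example. The flag itself exists today; treat the 3.0 default implied below as an open question for the release, not settled behavior:

    import org.apache.spark.sql.SparkSession

    object LegacyFlagDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("legacy-flag-demo")
          .master("local[*]")
          .getOrCreate()

        // Proposed 3.0 behavior (SPARK-24640): size(NULL) returns NULL.
        spark.conf.set("spark.sql.legacy.sizeOfNull", "false")
        spark.sql("SELECT size(CAST(NULL AS array<int>))").show()

        // Legacy escape hatch: restore the pre-3.0 result of -1.
        spark.conf.set("spark.sql.legacy.sizeOfNull", "true")
        spark.sql("SELECT size(CAST(NULL AS array<int>))").show()

        spark.stop()
      }
    }

The point of SPARK-24625 is just that every behavior change like this ships with a flag under one predictable namespace, so users can unblock an upgrade per-workload instead of all at once.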