Re: Thoughts on Spark 3 release, or a preview release

Xiao Li Tue, 17 Sep 2019 00:00:15 -0700

SPARK-28264 <https://issues.apache.org/jira/browse/SPARK-28264> (Revisiting
Python / pandas UDF) sounds critical for the 3.0 preview.


Xiao

On Mon, Sep 16, 2019 at 12:22 PM Erik Erlandson <eerla...@redhat.com> wrote:

>
> I'm in favor of adding SPARK-25299
> <https://issues.apache.org/jira/browse/SPARK-25299> - Use remote storage
> for persisting shuffle data
>
> If that is far enough along to get onto the roadmap.
>
>
> On Wed, Sep 11, 2019 at 11:37 AM Sean Owen <sro...@apache.org> wrote:
>
>> I'm curious what current feelings are about ramping down towards a
>> Spark 3 release. It feels close to ready. There is no fixed date,
>> though in the past we had informally tossed around "back end of 2019".
>> For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect
>> Spark 2 to last longer, so to speak, but it feels like Spark 3 is coming
>> due.
>>
>> What are the few major items that must get done for Spark 3, in your
>> opinion? Below are all of the open JIRAs for 3.0 (which everyone
>> should feel free to update with things that aren't really needed for
>> Spark 3; I already triaged some).
>>
>> For me, it's:
>> - DSv2?
>> - Finishing touches on the Hive and JDK 11 updates
>>
>> What about considering a preview release earlier, as happened for
>> Spark 2, to get feedback much earlier than the RC cycle? Could that
>> even happen ... about now?
>>
>> I'm also wondering what a realistic estimate of Spark 3 release is. My
>> guess is quite early 2020, from here.
>>
>>
>>
>> SPARK-29014 DataSourceV2: Clean up current, default, and session catalog
>> uses
>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>> SPARK-28717 Update SQL ALTER TABLE RENAME to use TableCatalog API
>> SPARK-28588 Build a SQL reference doc
>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>> SPARK-28684 Hive module support JDK 11
>> SPARK-28548 explain() shows wrong result for persisted DataFrames
>> after some operations
>> SPARK-28372 Document Spark WEB UI
>> SPARK-28476 Support ALTER DATABASE SET LOCATION
>> SPARK-28264 Revisiting Python / pandas UDF
>> SPARK-28301 fix the behavior of table name resolution with multi-catalog
>> SPARK-28155 do not leak SaveMode to file source v2
>> SPARK-28103 Cannot infer filters from union table with empty local
>> relation table properly
>> SPARK-28024 Incorrect numeric values when out of range
>> SPARK-27936 Support local dependency uploading from --py-files
>> SPARK-27884 Deprecate Python 2 support in Spark 3.0
>> SPARK-27763 Port test cases from PostgreSQL to Spark SQL
>> SPARK-27780 Shuffle server & client should be versioned to enable
>> smoother upgrade
>> SPARK-27714 Support Join Reorder based on Genetic Algorithm when the #
>> of joined tables > 12
>> SPARK-27471 Reorganize public v2 catalog API
>> SPARK-27520 Introduce a global config system to replace
>> hadoopConfiguration
>> SPARK-24625 put all the backward compatible behavior change configs
>> under spark.sql.legacy.*
>> SPARK-24640 size(null) returns null
>> SPARK-24702 Unable to cast to calendar interval in spark sql.
>> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more operators
>> SPARK-24941 Add RDDBarrier.coalesce() function
>> SPARK-25017 Add test suite for ContextBarrierState
>> SPARK-25083 remove the type erasure hack in data source scan
>> SPARK-25383 Image data source supports sample pushdown
>> SPARK-27272 Enable blacklisting of node/executor on fetch failures by
>> default
>> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major
>> efficiency problem
>> SPARK-25128 multiple simultaneous job submissions against k8s backend
>> cause driver pods to hang
>> SPARK-26731 remove EOLed spark jobs from jenkins
>> SPARK-26664 Make DecimalType's minimum adjusted scale configurable
>> SPARK-21559 Remove Mesos fine-grained mode
>> SPARK-24942 Improve cluster resource management with jobs containing
>> barrier stage
>> SPARK-25914 Separate projection from grouping and aggregate in logical
>> Aggregate
>> SPARK-26022 PySpark Comparison with Pandas
>> SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
>> SPARK-26221 Improve Spark SQL instrumentation and metrics
>> SPARK-26425 Add more constraint checks in file streaming source to
>> avoid checkpoint corruption
>> SPARK-25843 Redesign rangeBetween API
>> SPARK-25841 Redesign window function rangeBetween API
>> SPARK-25752 Add trait to easily whitelist logical operators that
>> produce named output from CleanupAliases
>> SPARK-23210 Introduce the concept of default value to schema
>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window
>> aggregate
>> SPARK-25531 new write APIs for data source v2
>> SPARK-25547 Pluggable jdbc connection factory
>> SPARK-20845 Support specification of column names in INSERT INTO
>> SPARK-24417 Build and Run Spark on JDK11
>> SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
>> SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
>> SPARK-25074 Implement maxNumConcurrentTasks() in
>> MesosFineGrainedSchedulerBackend
>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>> SPARK-25186 Stabilize Data Source V2 API
>> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier
>> execution mode
>> SPARK-25390 data source V2 API refactoring
>> SPARK-7768 Make user-defined type (UDT) API public
>> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition
>> Spec
>> SPARK-15691 Refactor and improve Hive support
>> SPARK-15694 Implement ScriptTransformation in sql/core
>> SPARK-16217 Support SELECT INTO statement
>> SPARK-16452 basic INFORMATION_SCHEMA support
>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>> SPARK-18245 Improving support for bucketed table
>> SPARK-19842 Informational Referential Integrity Constraints Support in
>> Spark
>> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested
>> list of structures
>> SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to
>> respect session timezone
>> SPARK-22386 Data Source V2 improvements
>> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
