+1 as a contributor and as a user. Given the amount of testing required for all the cool new stuff, like Java 11 support, major refactorings/deprecations, etc., a preview version would help the community a lot and make adoption smoother in the long term. I would also add Scala 2.13 support (https://issues.apache.org/jira/browse/SPARK-25075) to the list of issues, assuming things move forward faster over the next few months.
On Fri, Sep 13, 2019 at 11:08 AM Driesprong, Fokko <fo...@driesprong.frl> wrote:

> Michael Heuer, that's an interesting issue.
>
> 1.8.2 to 1.9.0 is almost binary compatible (94%):
> http://people.apache.org/~busbey/avro/1.9.0-RC4/1.8.2_to_1.9.0RC4_compat_report.html.
> Most of the stuff is removing the Jackson and Netty API from Avro's public
> API and deprecating the Joda library. I would strongly advise moving to
> 1.9.1 since there are some regression issues, for Java most important:
> https://jira.apache.org/jira/browse/AVRO-2400
>
> I'd love to dive into the issue that you describe and I'm curious if the
> issue is still there with Avro 1.9.1. I'm a bit busy at the moment but
> might have some time this weekend to dive into it.
>
> Cheers, Fokko Driesprong
>
>
> On Fri, Sep 13, 2019 at 02:32, Reynold Xin <r...@databricks.com> wrote:
>
>> +1! Long due for a preview release.
>>
>>
>> On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> I like the idea from the PoV of giving folks something to start testing
>>> against and exploring so they can raise issues with us earlier in the
>>> process and we have more time to make calls around this.
>>>
>>> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge <jzh...@apache.org> wrote:
>>>
>>> +1 Like the idea as a user and a DSv2 contributor.
>>>
>>> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim <kabh...@gmail.com> wrote:
>>>
>>> +1 (as a contributor) from me to have preview release on Spark 3 as it
>>> would help to test the feature. When to cut preview release is
>>> questionable, as major works are ideally to be done before that - if we are
>>> intended to introduce new features before official release, that should
>>> work regardless of this, but if we are intended to have opportunity to test
>>> earlier, ideally it should.
>>>
>>> As a one of contributors in structured streaming area, I'd like to add
>>> some items for Spark 3.0, both "must be done" and "better to have".
>>> For "better to have", I pick some items for new features which committers
>>> reviewed couple of rounds and dropped off without soft-reject (No valid
>>> reason to stop). For Spark 2.4 users, only added feature for structured
>>> streaming is Kafka delegation token. (given we assume revising Kafka
>>> consumer pool as improvement) I hope we provide some gifts for structured
>>> streaming users in Spark 3.0 envelope.
>>>
>>> > must be done
>>> * SPARK-26154 Stream-stream joins - left outer join gives inconsistent output
>>> It's a correctness issue with multiple users reported, being reported at
>>> Nov. 2018. There's a way to reproduce it consistently, and we have a patch
>>> submitted at Jan. 2019 to fix it.
>>>
>>> > better to have
>>> * SPARK-23539 Add support for Kafka headers in Structured Streaming
>>> * SPARK-26848 Introduce new option to Kafka source - specify timestamp to start and end offset
>>> * SPARK-20568 Delete files after processing in structured streaming
>>>
>>> There're some more new features/improvements items in SS, but given
>>> we're talking about ramping-down, above list might be realistic one.
>>>
>>>
>>>
>>> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin <j...@jgp.net> wrote:
>>>
>>> As a user/non committer, +1
>>>
>>> I love the idea of an early 3.0.0 so we can test current dev against it,
>>> I know the final 3.x will probably need another round of testing when it
>>> gets out, but less for sure... I know I could checkout and compile, but
>>> having a “packaged” preversion is great if it does not take too much time
>>> to the team...
>>>
>>> jg
>>>
>>>
>>> On Sep 11, 2019, at 20:40, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>
>>> +1 from me too but I would like to know what other people think too.
>>>
>>> On Thu, Sep 12, 2019 at 9:07 AM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>
>>> Thank you, Sean.
>>>
>>> I'm also +1 for the following three.
>>>
>>> 1. Start to ramp down (by the official branch-3.0 cut)
>>> 2. Apache Spark 3.0.0-preview in 2019
>>> 3. Apache Spark 3.0.0 in early 2020
>>>
>>> For JDK11 clean-up, it will meet the timeline and `3.0.0-preview` helps
>>> it a lot.
>>>
>>> After this discussion, can we have some timeline for `Spark 3.0 Release
>>> Window` in our versioning-policy page?
>>>
>>> - https://spark.apache.org/versioning-policy.html
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer <heue...@gmail.com>
>>> wrote:
>>>
>>> I would love to see Spark + Hadoop + Parquet + Avro compatibility
>>> problems resolved, e.g.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-25588
>>> https://issues.apache.org/jira/browse/SPARK-27781
>>>
>>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x. As far
>>> as I know, Parquet has not cut a release based on this new version.
>>>
>>> Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
>>>
>>> https://github.com/apache/spark/pull/24851
>>> https://github.com/apache/spark/pull/24297
>>>
>>> michael
>>>
>>>
>>> On Sep 11, 2019, at 1:37 PM, Sean Owen <sro...@apache.org> wrote:
>>>
>>> I'm curious what current feelings are about ramping down towards a
>>> Spark 3 release. It feels close to ready. There is no fixed date,
>>> though in the past we had informally tossed around "back end of 2019".
>>> For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect
>>> Spark 2 to last longer, so to speak, but feels like Spark 3 is coming
>>> due.
>>>
>>> What are the few major items that must get done for Spark 3, in your
>>> opinion? Below are all of the open JIRAs for 3.0 (which everyone
>>> should feel free to update with things that aren't really needed for
>>> Spark 3; I already triaged some).
>>>
>>> For me, it's:
>>> - DSv2?
>>> - Finishing touches on the Hive, JDK 11 update
>>>
>>> What about considering a preview release earlier, as happened for
>>> Spark 2, to get feedback much earlier than the RC cycle? Could that
>>> even happen ... about now?
>>>
>>> I'm also wondering what a realistic estimate of Spark 3 release is. My
>>> guess is quite early 2020, from here.
>>>
>>>
>>>
>>> SPARK-29014 DataSourceV2: Clean up current, default, and session catalog uses
>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>> SPARK-28717 Update SQL ALTER TABLE RENAME to use TableCatalog API
>>> SPARK-28588 Build a SQL reference doc
>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>> SPARK-28684 Hive module support JDK 11
>>> SPARK-28548 explain() shows wrong result for persisted DataFrames after some operations
>>> SPARK-28372 Document Spark WEB UI
>>> SPARK-28476 Support ALTER DATABASE SET LOCATION
>>> SPARK-28264 Revisiting Python / pandas UDF
>>> SPARK-28301 fix the behavior of table name resolution with multi-catalog
>>> SPARK-28155 do not leak SaveMode to file source v2
>>> SPARK-28103 Cannot infer filters from union table with empty local relation table properly
>>> SPARK-28024 Incorrect numeric values when out of range
>>> SPARK-27936 Support local dependency uploading from --py-files
>>> SPARK-27884 Deprecate Python 2 support in Spark 3.0
>>> SPARK-27763 Port test cases from PostgreSQL to Spark SQL
>>> SPARK-27780 Shuffle server & client should be versioned to enable smoother upgrade
>>> SPARK-27714 Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12
>>> SPARK-27471 Reorganize public v2 catalog API
>>> SPARK-27520 Introduce a global config system to replace hadoopConfiguration
>>> SPARK-24625 put all the backward compatible behavior change configs under spark.sql.legacy.*
>>> SPARK-24640 size(null) returns null
>>> SPARK-24702 Unable to cast to calendar interval in spark sql.
>>> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more operators
>>> SPARK-24941 Add RDDBarrier.coalesce() function
>>> SPARK-25017 Add test suite for ContextBarrierState
>>> SPARK-25083 remove the type erasure hack in data source scan
>>> SPARK-25383 Image data source supports sample pushdown
>>> SPARK-27272 Enable blacklisting of node/executor on fetch failures by default
>>> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major efficiency problem
>>> SPARK-25128 multiple simultaneous job submissions against k8s backend cause driver pods to hang
>>> SPARK-26731 remove EOLed spark jobs from jenkins
>>> SPARK-26664 Make DecimalType's minimum adjusted scale configurable
>>> SPARK-21559 Remove Mesos fine-grained mode
>>> SPARK-24942 Improve cluster resource management with jobs containing barrier stage
>>> SPARK-25914 Separate projection from grouping and aggregate in logical Aggregate
>>> SPARK-26022 PySpark Comparison with Pandas
>>> SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
>>> SPARK-26221 Improve Spark SQL instrumentation and metrics
>>> SPARK-26425 Add more constraint checks in file streaming source to avoid checkpoint corruption
>>> SPARK-25843 Redesign rangeBetween API
>>> SPARK-25841 Redesign window function rangeBetween API
>>> SPARK-25752 Add trait to easily whitelist logical operators that produce named output from CleanupAliases
>>> SPARK-23210 Introduce the concept of default value to schema
>>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window aggregate
>>> SPARK-25531 new write APIs for data source v2
>>> SPARK-25547 Pluggable jdbc connection factory
>>> SPARK-20845 Support specification of column names in INSERT INTO
>>> SPARK-24417 Build and Run Spark on JDK11
>>> SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
>>> SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
>>> SPARK-25074 Implement maxNumConcurrentTasks() in MesosFineGrainedSchedulerBackend
>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>> SPARK-25186 Stabilize Data Source V2 API
>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier execution mode
>>> SPARK-25390 data source V2 API refactoring
>>> SPARK-7768 Make user-defined type (UDT) API public
>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition Spec
>>> SPARK-15691 Refactor and improve Hive support
>>> SPARK-15694 Implement ScriptTransformation in sql/core
>>> SPARK-16217 Support SELECT INTO statement
>>> SPARK-16452 basic INFORMATION_SCHEMA support
>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>>> SPARK-18245 Improving support for bucketed table
>>> SPARK-19842 Informational Referential Integrity Constraints Support in Spark
>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested list of structures
>>> SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to respect session timezone
>>> SPARK-22386 Data Source V2 improvements
>>> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>> --
>>> Name : Jungtaek Lim
>>> Blog : http://medium.com/@heartsavior
>>> Twitter : http://twitter.com/heartsavior
>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>
>>>
>>> --
>>> John Zhuge
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau