I'd like to add SPARK-24296, replicating large blocks over 2GB. It's been up for review for a while, and it would end the 2GB block limit (well ... subject to a couple of caveats on SPARK-6235).
On Mon, Jul 30, 2018 at 9:01 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
> I went through the open JIRA tickets, and here is a list that we should
> consider for Spark 2.4:
>
> *High Priority*:
> SPARK-24374 <https://issues.apache.org/jira/browse/SPARK-24374>: Support
> Barrier Execution Mode in Apache Spark
> This one is critical to the Spark ecosystem for deep learning. It only has
> a few remaining work items, and I think we should have it in Spark 2.4.
>
> *Middle Priority*:
> SPARK-23899 <https://issues.apache.org/jira/browse/SPARK-23899>: Built-in
> SQL Function Improvement
> We've already added a lot of built-in functions in this release, but there
> are a few useful higher-order functions in progress, like `array_except`,
> `transform`, etc. It would be great if we could get them into Spark 2.4.
>
> SPARK-14220 <https://issues.apache.org/jira/browse/SPARK-14220>: Build
> and test Spark against Scala 2.12
> Very close to finishing; great to have it in Spark 2.4.
>
> SPARK-4502 <https://issues.apache.org/jira/browse/SPARK-4502>: Spark SQL
> reads unnecessary nested fields from Parquet
> This one has been open for years (thanks for your patience, Michael!), and
> is also close to finishing. Great to have it in 2.4.
>
> SPARK-24882 <https://issues.apache.org/jira/browse/SPARK-24882>: Data
> source v2 API improvement
> This is to improve the data source v2 API based on what we learned during
> this release. From the migration of existing sources and the design of new
> features, we found some problems in the API and want to address them. I
> believe this should be the last significant API change to data source v2,
> so it would be great to have it in Spark 2.4. I'll send a discuss email
> about it later.
>
> SPARK-24252 <https://issues.apache.org/jira/browse/SPARK-24252>: Add
> catalog support in Data Source V2
> This is a very important feature for data source v2, and it is currently
> being discussed on the dev list.
>
> SPARK-24768 <https://issues.apache.org/jira/browse/SPARK-24768>: Have a
> built-in AVRO data source implementation
> Most of it is done, but date/timestamp support is still missing. Great to
> have it in 2.4.
>
> SPARK-23243 <https://issues.apache.org/jira/browse/SPARK-23243>:
> Shuffle+Repartition on an RDD could lead to incorrect answers
> This is a long-standing correctness bug; great to have it in 2.4.
>
> There are some other important features, like adaptive execution and
> streaming SQL, that are not in the list, since I think we will not be able
> to finish them before 2.4.
>
> Feel free to add more things if you think they are important to Spark 2.4
> by replying to this email.
>
> Thanks,
> Wenchen
>
> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen <sro...@apache.org> wrote:
>
>> In theory, releases happen on a time-based cadence, so it's pretty much
>> wrap up what's ready by the code freeze and ship it. In practice, the
>> cadence slips frequently, and it's very much a negotiation about which
>> features should push the code freeze out a few weeks each time. So, it's
>> kind of a hybrid approach that works OK.
>>
>> Certainly speak up if you think there's something that really needs to
>> get into 2.4. This is that discuss thread.
>>
>> (BTW, I updated the page you mention just yesterday, to reflect the plan
>> suggested in this thread.)
>>
>> On Mon, Jul 30, 2018 at 9:51 AM Tom Graves <tgraves...@yahoo.com.invalid>
>> wrote:
>>
>>> Shouldn't this be a discuss thread?
>>>
>>> I'm also happy to see more release managers, and I agree the time is
>>> getting close, but we should see which features are in progress, see how
>>> close they are, and propose a date based on that. Cutting a branch too
>>> soon just creates more work for committers, who have to push to more
>>> branches.
>>>
>>> http://spark.apache.org/versioning-policy.html mentions a code freeze
>>> and release branch cut in mid-August.
>>>
>>> Tom