Hi all,

Here is the updated feature list:
SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple columns support added to various Transformers: StringIndexer
SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150> Implement Dynamic Partition Pruning
SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support Tree-Based Feature Transformation
SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add MultilabelClassificationEvaluator
SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add sample weights to decision trees
SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API for Power Iteration Clustering
SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve logic for timing out executors in dynamic allocation
SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636> Eliminate unnecessary shuffle with adjacent Window expressions
SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire new executors to avoid hang because of blacklisting
SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796> Multiple columns support added to various Transformers: PySpark QuantileDiscretizer
SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new approach to do adaptive execution in Spark SQL
SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> Apply custom log URL pattern for executor log URLs in SHS
SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add support for Kafka headers
SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add Spark ML Listener for Tracking ML Pipeline Status
SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit with validation set to Gradient Boosted Trees: Python API
SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build and Run Spark on JDK11
SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615> Accelerator-aware task scheduling for Spark
SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow sharing Netty's memory pool allocators
SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix race condition where tasks from an earlier attempt of a stage cause tasks for the same partition id in the next attempt to be retried multiple times
SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support rolling back a shuffle map stage and re-generate the shuffle files
SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data source for binary files
SPARK-25390 <https://issues.apache.org/jira/browse/SPARK-25390> Data source V2 API refactoring
SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add Kafka delegation token support
SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603> Generalize Nested Column Pruning
SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove support for Scala 2.11 in Spark 3.0.0
SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> Define reserved keywords after SQL standard
SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow Pandas UDF to take an iterator of pd.DataFrames
SPARK-26651 <https://issues.apache.org/jira/browse/SPARK-26651> Use Proleptic Gregorian calendar
SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow optimization in SparkR's interoperability
SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848> Introduce new option to Kafka source: offset by timestamp (starting/ending)
SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> Create StreamingWrite at the beginning of streaming execution
SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not infer schema when reading Hive serde table with native data source
SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225> Implement join strategy hints
SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use pandas DataFrame for struct type argument in Scalar Pandas UDF
SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public APIs for extended Columnar Processing Support
SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support Dataframe Cogroup via Pandas UDFs
SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589> Re-implement file sources with data source V2 API
SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677> Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation
SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699> Partially push down disjunctive predicates in Parquet/ORC
SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port test cases from PostgreSQL to Spark SQL
SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884> Deprecate Python 2 support
SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert applicable *.sql tests into UDF integrated test base
SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow dynamic allocation without an external shuffle service
SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust post shuffle partition number in adaptive execution
SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move Trigger implementations to Triggers.scala and avoid exposing these to the end users
SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372> Document Spark WEB UI
SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399> RobustScaler feature transformer
SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426> Metadata Handling in Thrift Server
SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a SQL reference doc
SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve test coverage of ThriftServer
SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753> Dynamically reuse subqueries in AQE
SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove outdated Experimental, Evolving annotations
SPARK-29345 <https://issues.apache.org/jira/browse/SPARK-29345> Add an API that allows a user to define and observe arbitrary metrics on streaming queries
SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908> SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove deprecated items since <= 2.2.0

Cheers,

Xingbo

Sean Owen <sro...@gmail.com> wrote on Thu, Oct 10, 2019 at 12:50 PM:

> See the JIRA - this is too open-ended and not obviously just due to
> choices in data representation, what you're trying to do, etc. It's
> correctly closed IMHO.
> However, identifying the issue more narrowly, and something that looks
> ripe for optimization, would be useful.
>
> On Thu, Oct 10, 2019 at 12:30 PM antonkulaga <antonkul...@gmail.com> wrote:
> >
> > I think for sure SPARK-28547
> > <https://issues.apache.org/jira/projects/SPARK/issues/SPARK-28547>
> > At the moment there are some flaws in Spark's architecture, and it
> > performs miserably or even freezes wherever the column count exceeds
> > 10-15K (even a simple describe call takes ages, while the same operation
> > with pandas and no Spark takes seconds). In many fields (like
> > bioinformatics), wide datasets with large numbers of both rows and
> > columns are very common (gene expression data is a good example here),
> > and Spark is totally useless there.
> >
> > --
> > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org