Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-15 Thread Liang-Chi Hsieh
Thank you so much, Takeshi!
Takeshi Yamamuro wrote
> Hi, viirya
>
> I'm looking now into "SPARK-34607: Add `Utils.isMemberClass` to fix a
> malformed class name error on jdk8u".
>
> Bests,
> Takeshi

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-15 Thread Takeshi Yamamuro
Hi, viirya
I'm looking now into "SPARK-34607: Add `Utils.isMemberClass` to fix a malformed class name error on jdk8u".
Bests, Takeshi
On Tue, Mar 16, 2021 at 4:45 AM Liang-Chi Hsieh wrote:
> To update with current status.
>
> There are three tickets targeting 2.4 that are still ongoing.

Re: Observable Metrics on Spark Datasets

2021-03-15 Thread Jungtaek Lim
If I remember correctly, the major audience of the "observe" API is Structured Streaming in micro-batch mode. Judging from the example, the abstraction in 2 would not work with Structured Streaming. It could still be done with a callback, but the question remains how much complexity is hidden from

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-15 Thread Liang-Chi Hsieh
To update with current status. There are three tickets targeting 2.4 that are still ongoing.
SPARK-34719: Correctly resolve the view query with duplicated column names
SPARK-34607: Add `Utils.isMemberClass` to fix a malformed class name error on jdk8u
SPARK-34726: Fix collectToPython timeouts

[RESULT] [VOTE] SPIP: Add FunctionCatalog

2021-03-15 Thread Ryan Blue
This SPIP is adopted with the following +1 votes and no -1 or +0 votes:
Holden Karau*
John Zhuge
Chao Sun
Dongjoon Hyun*
Russell Spitzer
DB Tsai*
Wenchen Fan*
Kent Yao
Huaxin Gao
Liang-Chi Hsieh
Jungtaek Lim
Hyukjin Kwon*
Gengliang Wang
kordex
Takeshi Yamamuro
Ryan Blue
* = binding
On Mon, Mar

Re: [VOTE] SPIP: Add FunctionCatalog

2021-03-15 Thread Ryan Blue
And a late +1 from me.
On Fri, Mar 12, 2021 at 5:46 AM Takeshi Yamamuro wrote:
> +1, too.
>
> On Fri, Mar 12, 2021 at 8:51 PM kordex wrote:
>
>> +1 (for what it's worth). It will definitely help our efforts.
>>
>> On Fri, Mar 12, 2021 at 12:14 PM Gengliang Wang wrote:
>> >
>> > +1

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Ismaël Mejía
+1. Bringing a pandas API for PySpark into upstream Spark will benefit everyone (more eyes to use/see/fix/improve the API) and will keep it better aligned with core Spark improvements; the extra weight looks manageable.
On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas wrote:
>
> On

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Nicholas Chammas
On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin wrote:
> I don't think we should deprecate existing APIs.
+1 I strongly prefer Spark's immutable DataFrame API to the Pandas API. I could be wrong, but I wager most people who have worked with both Spark and Pandas feel the same way. For the large

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Maciej
I concur. These two don't have the same target audience or expressiveness. I cannot imagine most of the PySpark projects I've seen switching to a Pandas-style API. If this is to be included, it would be great if we could model it on SQLAlchemy, with its core and ORM components being equally

Observable Metrics on Spark Datasets

2021-03-15 Thread Enrico Minack
Hi Spark-Devs, the observable metrics that have been added to the Dataset API in 3.0.0 are a great improvement over the Accumulator APIs that seem to have much weaker guarantees. I have two questions regarding follow-up contributions:
1. Add observe to Python DataFrame
As I can see

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Reynold Xin
I don't think we should deprecate existing APIs. Spark's own Python API is relatively stable and not difficult to support. It has a pretty large number of users and a lot of existing code. It is also pretty easy for data engineers to learn. The pandas API is great for data science, but isn't that great for some

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-15 Thread Dongjoon Hyun
Thank you for the proposal. It looks like a good addition. BTW, what is the future plan for the existing APIs? Are we going to deprecate them eventually in favor of Koalas (because we don't remove the existing APIs in general)?
> Fourthly, PySpark is still not Pythonic enough. For example, I hear