Thank you :)

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
On Sat, Jul 13, 2024 at 1:37 AM Hyukjin Kwon <gurwls...@apache.org> wrote:

> Reverted, and opened a new one: https://github.com/apache/spark/pull/47341.
>
> On Sat, 13 Jul 2024 at 15:40, Hyukjin Kwon <gurwls...@apache.org> wrote:
>
>> Yeah, that's fine. I'll revert and open a fresh PR, including my own follow-up, when I get back home later today.
>>
>> On Sat, Jul 13, 2024 at 3:08 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>
>>> Even if the change is reasonable (and I can see arguments both ways), it's important that we follow the process we agreed on. Merging a PR without discussion* within ~2 hours of the initial proposal is not enough time to reach a lazy consensus. If it were a small bug fix I could understand, but this was a non-trivial change.
>>>
>>> * It was approved by another committer, but without any discussion, and the approver and the code author work for the same employer that is mentioned as the justification for the change.
>>>
>>> On Fri, Jul 12, 2024 at 6:42 PM Hyukjin Kwon <gurwls...@apache.org> wrote:
>>>
>>>> I think we should not have mentioned a specific vendor there. The change also shouldn't repartition; we should write a single partition instead.
>>>>
>>>> But in general, leveraging the Catalyst optimizer and SQL engine there is a good idea, as we can take advantage of all the optimizations there. For example, it will use UTF-8 encoding instead of plain-string ser/de. We made similar changes in JSON and CSV schema inference (it was RDD-based before).
>>>>
>>>> On Sat, Jul 13, 2024 at 10:33 AM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>
>>>>> My bad, I meant to say I believe the provided justification is inappropriate.
>>>>>
>>>>> On Fri, Jul 12, 2024 at 5:14 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>
>>>>>> Looking at the PR, it does not appear to remove any RDD APIs, but the justification provided for changing the ML backend to use the DataFrame APIs is indeed concerning.
>>>>>>
>>>>>> This PR appears to have been merged without proper review (or providing an opportunity for review).
>>>>>>
>>>>>> I'd like to remind people of the expectations we decided on together: https://spark.apache.org/committers.html
>>>>>>
>>>>>> I believe the provided justification for the change [is inappropriate] and would ask that we revert this PR so that a proper discussion can take place.
>>>>>>
>>>>>> "In databricks runtime, RDD read / write API has some issue for certain storage types that requires the account key, but Dataframe read / write API works."
>>>>>>
>>>>>> On Fri, Jul 12, 2024 at 1:02 PM Martin Grund <mar...@databricks.com.invalid> wrote:
>>>>>>
>>>>>>> I took a quick look at the PR and would like to understand your concern better regarding:
>>>>>>>
>>>>>>> > SparkSession is heavier than SparkContext
>>>>>>>
>>>>>>> It looks like the PR is using the active SparkSession, not creating a new one. I would highly appreciate it if you could help me understand this situation better.
>>>>>>>
>>>>>>> Thanks a lot!
>>>>>>>
>>>>>>> On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, All.
>>>>>>>>
>>>>>>>> Apache Spark's RDD API has played an essential and invaluable role from the beginning, and it will continue to do so even if it's not supported by Spark Connect.
>>>>>>>>
>>>>>>>> I have a concern about recent activity that replaces RDD usage with SparkSession blindly.
>>>>>>>>
>>>>>>>> For instance:
>>>>>>>>
>>>>>>>> https://github.com/apache/spark/pull/47328
>>>>>>>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with Dataframe read / write API
>>>>>>>>
>>>>>>>> This PR doesn't look proper to me in two ways:
>>>>>>>> - SparkSession is heavier than SparkContext.
>>>>>>>> - According to the following PR description, the background is also hidden from the community.
>>>>>>>>
>>>>>>>> > # Why are the changes needed?
>>>>>>>> > In databricks runtime, RDD read / write API has some issue for certain storage types that requires the account key, but Dataframe read / write API works.
>>>>>>>>
>>>>>>>> In addition, we don't know whether this PR actually fixes the mentioned unknown storage's issue, because it's not testable in the community test coverage.
>>>>>>>>
>>>>>>>> I'm wondering if the Apache Spark community aims to move away from RDD usage in favor of `Spark Connect`. Isn't it too early, given that `Spark Connect` is not even GA in the community?
>>>>>>>>
>>>>>>>> Dongjoon.