Thank you :)

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
On Sat, Jul 13, 2024 at 1:37 AM Hyukjin Kwon <gurwls...@apache.org> wrote:

> Reverted, and opened a new one: https://github.com/apache/spark/pull/47341.
>
> On Sat, 13 Jul 2024 at 15:40, Hyukjin Kwon <gurwls...@apache.org> wrote:
>
>> Yeah, that's fine. I'll revert and open a fresh PR, including my own follow-up, when I get back home later today.
>>
>> On Sat, Jul 13, 2024 at 3:08 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>
>>> Even if the change is reasonable (and I can see arguments both ways), it's important that we follow the process we agreed on. Merging a PR without discussion* within ~2 hours of the initial proposal is not enough time to reach a lazy consensus. If it were a small bug fix I could understand, but this was a non-trivial change.
>>>
>>> * It was approved by another committer, but without any discussion, and the approver and the code author work for the same employer that is mentioned as the justification for the change.
>>>
>>> On Fri, Jul 12, 2024 at 6:42 PM Hyukjin Kwon <gurwls...@apache.org> wrote:
>>>
>>>> I think we should not have mentioned a specific vendor there. The change also shouldn't repartition; we should write a single partition instead.
>>>>
>>>> But in general, leveraging the Catalyst optimizer and SQL engine there is a good idea, as we can take advantage of all the optimizations there. For example, it will use UTF-8 encoding instead of plain-string ser/de. We made similar changes in JSON and CSV schema inference (it was RDD-based before).
>>>>
>>>> On Sat, Jul 13, 2024 at 10:33 AM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>
>>>>> My bad, I meant to say I believe the provided justification is inappropriate.
>>>>>
>>>>> On Fri, Jul 12, 2024 at 5:14 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>
>>>>>> Looking at the PR, it does not appear to remove any RDD APIs, but the justification provided for changing the ML backend to use the DataFrame APIs is indeed concerning.
>>>>>>
>>>>>> This PR appears to have been merged without proper review (or providing an opportunity for review).
>>>>>>
>>>>>> I'd like to remind people of the expectations we decided on together: https://spark.apache.org/committers.html
>>>>>>
>>>>>> I believe the provided justification for the change [is inappropriate] and would ask that we revert this PR so that a proper discussion can take place.
>>>>>>
>>>>>> "In databricks runtime, RDD read / write API has some issue for certain storage types that requires the account key, but Dataframe read / write API works."
>>>>>>
>>>>>> On Fri, Jul 12, 2024 at 1:02 PM Martin Grund <mar...@databricks.com.invalid> wrote:
>>>>>>
>>>>>>> I took a quick look at the PR and would like to understand your concern better regarding:
>>>>>>>
>>>>>>> > SparkSession is heavier than SparkContext
>>>>>>>
>>>>>>> It looks like the PR is using the active SparkSession, not creating a new one. I would highly appreciate it if you could help me understand this situation better.
>>>>>>>
>>>>>>> Thanks a lot!
>>>>>>>
>>>>>>> On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, All.
>>>>>>>>
>>>>>>>> Apache Spark's RDD API has played an essential and invaluable role from the beginning, and it will continue to do so even if it's not supported by Spark Connect.
>>>>>>>>
>>>>>>>> I have a concern about recent activity that replaces RDD usage with SparkSession blindly.
>>>>>>>>
>>>>>>>> For instance:
>>>>>>>>
>>>>>>>> https://github.com/apache/spark/pull/47328
>>>>>>>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with Dataframe read / write API
>>>>>>>>
>>>>>>>> This PR doesn't look proper to me in two ways:
>>>>>>>> - SparkSession is heavier than SparkContext.
>>>>>>>> - According to the following PR description, the background is also hidden from the community.
>>>>>>>>
>>>>>>>> > # Why are the changes needed?
>>>>>>>> > In databricks runtime, RDD read / write API has some issue for certain storage types that requires the account key, but Dataframe read / write API works.
>>>>>>>>
>>>>>>>> In addition, we don't know whether this PR actually fixes the mentioned unknown storage's issue, because it's not testable in the community test coverage.
>>>>>>>>
>>>>>>>> I'm wondering if the Apache Spark community aims to move away from RDD usage in favor of `Spark Connect`. Isn't it too early, given that `Spark Connect` is not even GA in the community?
>>>>>>>>
>>>>>>>> Dongjoon.