Hi Wenchen, thanks for the insight. Agreed: the previous fix for repartition
works for deterministic data. For non-deterministic data, I couldn't find an
API to pass DeterministicLevel to the underlying RDD.
Do you plan to continue working on the integration with SQL operators? If
not, I'm available to take a stab at it.
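
To make the failure mode concrete, here is a minimal pure-Python sketch. It is
not Spark code: round_robin and reducers_after_retry are hypothetical helpers,
and the sort on (noise, id) tuples is only a stand-in for the local sort added
by the earlier fix. It assumes two reducers, where reducer 0 finished against
the first map attempt and reducer 1 reads the map output regenerated after a
fetch failure.

```python
def round_robin(rows, num_partitions):
    """Assign rows to partitions purely by their position in the input."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts


def reducers_after_retry(first_attempt, retry_attempt, sort_first):
    """Combine reducer 0's view of the first attempt with reducer 1's view of the retry."""
    if sort_first:
        # Stand-in for sorting the data before round-robin partitioning.
        first_attempt, retry_attempt = sorted(first_attempt), sorted(retry_attempt)
    first = round_robin(first_attempt, 2)
    retry = round_robin(retry_attempt, 2)
    return sorted(first[0] + retry[1])


# Case 1: same rows on both attempts, only the order differs.
rows = ["a", "b", "c", "d"]
reordered = ["d", "c", "b", "a"]
print(reducers_after_retry(rows, reordered, sort_first=False))  # ['a', 'a', 'c', 'c']
print(reducers_after_retry(rows, reordered, sort_first=True))   # ['a', 'b', 'c', 'd']

# Case 2: each row carries a value that changes between attempts (e.g. rand()),
# so sorting on the whole row no longer yields the same order on retry.
first = [(0.1, "a"), (0.2, "b"), (0.3, "c"), (0.4, "d")]
retry = [(0.4, "a"), (0.3, "b"), (0.2, "c"), (0.1, "d")]
ids = sorted(row_id for _, row_id in reducers_after_retry(first, retry, sort_first=True))
print(ids)  # ['a', 'a', 'c', 'c']
```

In case 1 the sort restores a stable assignment, so every row reaches exactly
one reducer. In case 2 the sort itself depends on the non-deterministic value,
so rows "b" and "d" are lost and "a" and "c" are duplicated even with the sort.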

On Mon, Mar 14, 2022 at 7:00 PM Wenchen Fan <cloud0...@gmail.com> wrote:

> We fixed the repartition correctness bug before, by sorting the data
> before doing round-robin partitioning. But the issue is that we need to
> propagate the isDeterministic property through SQL operators.
>
> On Tue, Mar 15, 2022 at 1:50 AM Jason Xu <jasonxu.sp...@gmail.com> wrote:
>
>> Hi Reynold, are you suggesting that we remove RoundRobinPartitioning from
>> the implementation of the repartition(numPartitions: Int) API? If that's
>> the direction we're considering, should we advise users to avoid the
>> repartition(numPartitions: Int) API until a new implementation is in place?
>>
>> On Sat, Mar 12, 2022 at 1:47 PM Reynold Xin <r...@databricks.com> wrote:
>>
>>> This is why RoundRobinPartitioning shouldn't be used ...
>>>
>>>
>>> On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu <jasonxu.sp...@gmail.com>
>>> wrote:
>>>
>>>> Hi Spark community,
>>>>
>>>> I reported a data correctness issue in
>>>> https://issues.apache.org/jira/browse/SPARK-38388. In short,
>>>> non-deterministic data + Repartition + FetchFailure can result in
>>>> incorrect data. This is an issue we run into in production pipelines,
>>>> and the ticket includes an example that reproduces the bug.
>>>>
>>>> I'm reporting it here to bring more attention to it. Could you help
>>>> confirm that it's a bug and worth the effort to investigate further and
>>>> fix? Thank you in advance for your help!
>>>>
>>>> Thanks,
>>>> Jason Xu
>>>>
>>>
>>>
