Re: When Spark job shows FetchFailedException it creates few duplicate data and we see few data also missing , please explain why

2024-03-04 Thread Jason Xu
Hi Prem,

From the symptoms of a shuffle fetch failure together with some duplicate and
some missing data, I think you might have run into this correctness bug:
https://issues.apache.org/jira/browse/SPARK-38388.

Node and shuffle failures are hard to avoid, so I wonder whether your code
combines non-deterministic logic with a repartition() call (round-robin
partitioning). If you can avoid either of those, you can keep the issue from
happening for now. A proper root fix requires non-trivial effort, and I don't
think a solution is available yet.

I have heard that there are community efforts to solve this issue, but I
lack detailed information. Hopefully, someone with more knowledge can
provide further insight.

Best,
Jason

On Mon, Mar 4, 2024 at 9:41 AM Prem Sahoo  wrote:

> super :(
>
> On Mon, Mar 4, 2024 at 6:19 AM Mich Talebzadeh 
> wrote:
>
>> "... in a nutshell  if fetchFailedException occurs due to data node
>> reboot then it  can create duplicate / missing data  .   so this is more of
>> hardware(env issue ) rather than spark issue ."
>>
>> As an overall conclusion, your point is correct, but again the answer is
>> not binary.
>>
>> Spark core relies on a distributed file system to store data across data
>> nodes. When Spark needs to process data, it fetches the required blocks
>> from the data nodes. *FetchFailedException* means that Spark encountered
>> an error while fetching data blocks from a data node. If a data node
>> reboots unexpectedly, it becomes unavailable to Spark for a period. During
>> this time, Spark might attempt to fetch data blocks from the unavailable
>> node, resulting in the FetchFailedException. Depending on the timing and
>> nature of the reboot and data access, this exception can lead to the
>> following:
>>
>>- Duplicate Data: If Spark retries the fetch operation successfully
>>after the reboot, it might end up processing the same data twice, leading
>>to duplicates.
>>- Missing Data: If Spark cannot fetch all required data blocks due to
>>the unavailable data node, some data might be missing from the processing
>>results.
>>
>> The root cause of this issue lies in the data node reboot itself. So we
>> can conclude that it is not a problem with Spark core functionality but
>> rather an environmental issue within the distributed storage system. You
>> need to ensure that your nodes are stable and minimise unexpected reboots
>> for whatever reason. Look at the host logs or run /usr/bin/dmesg to see
>> what happened.
>>
>> Good luck
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions" (Werner Von Braun).
>>
>>
>> On Mon, 4 Mar 2024 at 01:30, Prem Sahoo  wrote:
>>
>>> thanks Mich, in a nutshell: if FetchFailedException occurs due to a data
>>> node reboot then it can create duplicate/missing data, so this is more of a
>>> hardware (environment) issue rather than a Spark issue.
>>>
>>>
>>>
>>> On Sat, Mar 2, 2024 at 7:45 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 It seems to me that there are issues related to the points below

 *I think when a task failed in between and a retry task started and
 completed, it may create duplicates, as the failed task has some data and
 the retry task has the full data. But my question is: why does Spark keep
 the delta data? Or, as you say, if a speculative and the original task both
 complete, Spark generally kills one of the tasks to get rid of duplicate
 data. When a data node is rebooted, Spark's fault tolerance should go to
 other nodes, shouldn't it? Then why is there missing data?*

 Spark is designed to be fault-tolerant through lineage and
 recomputation. However, there are scenarios where speculative execution or
 task retries might lead to duplicated or missing data. So what are these?

 - Task Failure and Retry: You are correct that a failed task might have
 processed some data before encountering the FetchFailedException. If a
 retry succeeds, it would process the entire data partition again, leading
 to duplicates. When a task fails, Spark may recompute the lost data by
 re-running the lost task on another node. The output of the retried task
 is typically combined with the output of the original task during the final
 stage of the computation. This combination is done to handle scenarios
 where the original task partially completed and generated some output
 before failing. Spark does not in

Re: Spark on Yarn with Java 17

2023-12-10 Thread Jason Xu
Dongjoon and Luca, it's great to learn that there is a way to run the Spark
and Hadoop binaries on different JVM versions. I had concerns about Java
compatibility issues without this solution. Thank you!

Luca, thank you for providing a how-to guide for this. It's really helpful!
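To summarize my understanding of the approach (the JDK path and app name
below are placeholder assumptions; Luca's guide linked in his reply below is
the authoritative reference):

  import org.apache.spark.sql.SparkSession

  // Assumption: JDK 17 is installed at the same path on every node.
  val jdk17Home = "/usr/lib/jvm/jdk-17"

  val spark = SparkSession.builder()
    .appName("spark-on-java17")
    // JVM used by the YARN ApplicationMaster container
    .config("spark.yarn.appMasterEnv.JAVA_HOME", jdk17Home)
    // JVM used by the executor containers
    .config("spark.executorEnv.JAVA_HOME", jdk17Home)
    .getOrCreate()

In cluster mode these properties are usually passed as --conf flags to
spark-submit so they take effect before the ApplicationMaster is launched;
in client mode the host running spark-submit also needs its own JDK 17 for
the driver.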

On Sat, Dec 9, 2023 at 1:39 AM Luca Canali  wrote:

> Jason, In case you need a pointer on how to run Spark with a version of
> Java different than the version used by the Hadoop processes, as indicated
> by Dongjoon, this is an example of what we do on our Hadoop clusters:
> https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_Set_Java_Home_Howto.md
>
>
>
> Best,
>
> Luca
>
>
>
> *From:* Dongjoon Hyun 
> *Sent:* Saturday, December 9, 2023 09:39
> *To:* Jason Xu 
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Spark on Yarn with Java 17
>
>
>
> Please simply try Apache Spark 3.3+ (SPARK-33772) with Java 17 on your
> cluster, Jason.
>
> I believe you can set up your Spark 3.3+ jobs to run with Java 17 while
> your cluster (DataNode/NameNode/ResourceManager/NodeManager) is still
> sitting on Java 8.
>
> Dongjoon.
>
>
>
> On Fri, Dec 8, 2023 at 11:12 PM Jason Xu  wrote:
>
> Dongjoon, thank you for the fast response!
>
>
>
> Apache Spark 4.0.0 depends on only Apache Hadoop client library.
>
> To better understand your answer, does that mean a Spark application built
> with Java 17 can successfully run on a Hadoop 3.3 cluster with a Java 8
> runtime?
>
>
>
> On Fri, Dec 8, 2023 at 4:33 PM Dongjoon Hyun  wrote:
>
> Hi, Jason.
>
> Apache Spark 4.0.0 depends on only Apache Hadoop client library.
>
> You can track all `Apache Spark 4` activities including Hadoop dependency
> here.
>
> https://issues.apache.org/jira/browse/SPARK-44111
> (Prepare Apache Spark 4.0.0)
>
> According to the release history, the original suggested timeline was
> June, 2024.
> - Spark 1: 2014.05 (1.0.0) ~ 2016.11 (1.6.3)
> - Spark 2: 2016.07 (2.0.0) ~ 2021.05 (2.4.8)
> - Spark 3: 2020.06 (3.0.0) ~ 2026.xx (3.5.x)
> - Spark 4: 2024.06 (4.0.0, NEW)
>
> Thanks,
> Dongjoon.
>
> On 2023/12/08 23:50:15 Jason Xu wrote:
> > Hi Spark devs,
> >
> > According to the Spark 3.5 release notes, Spark 4 will no longer support
> > Java 8 and 11 (link
> > <
> https://spark.apache.org/releases/spark-release-3-5-0.html#upcoming-removal
> >
> > ).
> >
> > My company is using Spark on Yarn with Java 8 now. When considering a
> > future upgrade to Spark 4, one issue we face is that the latest version
> of
> > Hadoop (3.3) does not yet support Java 17. There is an open ticket (
> > HADOOP-17177 <https://issues.apache.org/jira/browse/HADOOP-17177>) for
> this
> > issue, which has been open for over two years.
> >
> > My question is: Does the release of Spark 4 depend on the availability of
> > Java 17 support in Hadoop? Additionally, do we have a rough estimate for
> > the release of Spark 4? Thanks!
> >
> >
> > Cheers,
> >
> > Jason Xu
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Spark on Yarn with Java 17

2023-12-08 Thread Jason Xu
Dongjoon, thank you for the fast response!

Apache Spark 4.0.0 depends on only Apache Hadoop client library.

To better understand your answer, does that mean a Spark application built
with Java 17 can successfully run on a Hadoop 3.3 cluster with a Java 8
runtime?

On Fri, Dec 8, 2023 at 4:33 PM Dongjoon Hyun  wrote:

> Hi, Jason.
>
> Apache Spark 4.0.0 depends on only Apache Hadoop client library.
>
> You can track all `Apache Spark 4` activities including Hadoop dependency
> here.
>
> https://issues.apache.org/jira/browse/SPARK-44111
> (Prepare Apache Spark 4.0.0)
>
> According to the release history, the original suggested timeline was
> June, 2024.
> - Spark 1: 2014.05 (1.0.0) ~ 2016.11 (1.6.3)
> - Spark 2: 2016.07 (2.0.0) ~ 2021.05 (2.4.8)
> - Spark 3: 2020.06 (3.0.0) ~ 2026.xx (3.5.x)
> - Spark 4: 2024.06 (4.0.0, NEW)
>
> Thanks,
> Dongjoon.
>
> On 2023/12/08 23:50:15 Jason Xu wrote:
> > Hi Spark devs,
> >
> > According to the Spark 3.5 release notes, Spark 4 will no longer support
> > Java 8 and 11 (link
> > <
> https://spark.apache.org/releases/spark-release-3-5-0.html#upcoming-removal
> >
> > ).
> >
> > My company is using Spark on Yarn with Java 8 now. When considering a
> > future upgrade to Spark 4, one issue we face is that the latest version
> of
> > Hadoop (3.3) does not yet support Java 17. There is an open ticket (
> > HADOOP-17177 <https://issues.apache.org/jira/browse/HADOOP-17177>) for
> this
> > issue, which has been open for over two years.
> >
> > My question is: Does the release of Spark 4 depend on the availability of
> > Java 17 support in Hadoop? Additionally, do we have a rough estimate for
> > the release of Spark 4? Thanks!
> >
> >
> > Cheers,
> >
> > Jason Xu
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Spark on Yarn with Java 17

2023-12-08 Thread Jason Xu
Hi Spark devs,

According to the Spark 3.5 release notes, Spark 4 will no longer support
Java 8 and 11 (link
<https://spark.apache.org/releases/spark-release-3-5-0.html#upcoming-removal>
).

My company is using Spark on Yarn with Java 8 now. When considering a
future upgrade to Spark 4, one issue we face is that the latest version of
Hadoop (3.3) does not yet support Java 17. There is an open ticket (
HADOOP-17177 <https://issues.apache.org/jira/browse/HADOOP-17177>) for this
issue, which has been open for over two years.

My question is: Does the release of Spark 4 depend on the availability of
Java 17 support in Hadoop? Additionally, do we have a rough estimate for
the release of Spark 4? Thanks!


Cheers,

Jason Xu


Re: Data correctness issue with Repartition + FetchFailure

2022-03-15 Thread Jason Xu
Hi Wenchen, thanks for the insight. Agreed, the previous fix for repartition
works for deterministic data. With non-deterministic data, I didn't find an
API to pass the DeterministicLevel to the underlying RDD.
Do you plan to continue the work on integration with SQL operators? If not,
I'm available to take a stab.

On Mon, Mar 14, 2022 at 7:00 PM Wenchen Fan  wrote:

> We fixed the repartition correctness bug before, by sorting the data
> before doing round-robin partitioning. But the issue is that we need to
> propagate the isDeterministic property through SQL operators.
>
> On Tue, Mar 15, 2022 at 1:50 AM Jason Xu  wrote:
>
>> Hi Reynold, do you suggest removing RoundRobinPartitioning from the
>> repartition(numPartitions: Int) API implementation? If that's the direction
>> we're considering, should we suggest that users avoid the
>> repartition(numPartitions: Int) API until a new implementation is available?
>>
>> On Sat, Mar 12, 2022 at 1:47 PM Reynold Xin  wrote:
>>
>>> This is why RoundRobinPartitioning shouldn't be used ...
>>>
>>>
>>> On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu 
>>> wrote:
>>>
>>>> Hi Spark community,
>>>>
>>>> I reported a data correctness issue in
>>>> https://issues.apache.org/jira/browse/SPARK-38388. In short,
>>>> non-deterministic data + Repartition + FetchFailure can result in
>>>> incorrect data. This is an issue we ran into in production pipelines, and
>>>> I have an example that reproduces the bug in the ticket.
>>>>
>>>> I'm reporting it here to bring more attention to it. Could you help
>>>> confirm that it's a bug worth further investigation and a fix? Thank you
>>>> in advance for your help!
>>>>
>>>> Thanks,
>>>> Jason Xu
>>>>
>>>
>>>


Re: Data correctness issue with Repartition + FetchFailure

2022-03-14 Thread Jason Xu
Hi Reynold, do you suggest removing RoundRobinPartitioning from the
repartition(numPartitions: Int) API implementation? If that's the direction
we're considering, should we suggest that users avoid the
repartition(numPartitions: Int) API until a new implementation is available?

On Sat, Mar 12, 2022 at 1:47 PM Reynold Xin  wrote:

> This is why RoundRobinPartitioning shouldn't be used ...
>
>
> On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu 
> wrote:
>
>> Hi Spark community,
>>
>> I reported a data correctness issue in
>> https://issues.apache.org/jira/browse/SPARK-38388. In short,
>> non-deterministic data + Repartition + FetchFailure can result in
>> incorrect data. This is an issue we ran into in production pipelines, and
>> I have an example that reproduces the bug in the ticket.
>>
>> I'm reporting it here to bring more attention to it. Could you help confirm
>> that it's a bug worth further investigation and a fix? Thank you in advance
>> for your help!
>>
>> Thanks,
>> Jason Xu
>>
>
>


Data correctness issue with Repartition + FetchFailure

2022-03-12 Thread Jason Xu
Hi Spark community,

I reported a data correctness issue in
https://issues.apache.org/jira/browse/SPARK-38388. In short,
non-deterministic data + Repartition + FetchFailure can result in incorrect
data. This is an issue we ran into in production pipelines, and I have an
example that reproduces the bug in the ticket.
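To give a flavor of the pattern without pasting the full reproduction, here
is a minimal sketch; the app name, row count, partition count, and output
path are illustrative only:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.rand

  val spark = SparkSession.builder().appName("spark-38388-sketch").getOrCreate()

  // Non-deterministic column + round-robin repartition: if a FetchFailedException
  // forces the upstream stage to be recomputed, rand() yields different values on
  // the retry, rows can land in different shuffle partitions than before, and the
  // job can emit duplicated and/or missing rows.
  val risky = spark.range(0L, 10000000L)
    .withColumn("score", rand())
    .repartition(200)

  risky.write.mode("overwrite").parquet("/tmp/spark-38388-sketch")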

I'm reporting it here to bring more attention to it. Could you help confirm
that it's a bug worth further investigation and a fix? Thank you in advance
for your help!

Thanks,
Jason Xu