[dev list to bcc] This is a question for the user list <https://spark.apache.org/community.html> or for Stack Overflow <https://stackoverflow.com/questions/tagged/apache-spark>. The dev list is for discussions related to the development of Spark itself.
Nick

> On May 21, 2024, at 6:58 AM, Prem Sahoo <prem.re...@gmail.com> wrote:
>
> Hello Vibhor,
> Thanks for the suggestion.
> I am looking for other alternatives where the same dataframe can be written
> to two destinations without re-execution and without cache() or persist().
>
> Can someone help me with scenario 2?
> How can we make Spark write to MinIO faster?
> Sent from my iPhone
>
>> On May 21, 2024, at 1:18 AM, Vibhor Gupta <vibhor.gu...@walmart.com> wrote:
>>
>> Hi Prem,
>>
>> You can try writing to HDFS, then reading from HDFS and writing to MinIO.
>>
>> This will prevent duplicate transformation.
>>
>> You can also try persisting the dataframe using the DISK_ONLY storage level.
>>
>> Regards,
>> Vibhor
>>
>> From: Prem Sahoo <prem.re...@gmail.com>
>> Date: Tuesday, 21 May 2024 at 8:16 AM
>> To: Spark dev list <dev@spark.apache.org>
>> Subject: EXT: Dual Write to HDFS and MinIO in faster way
>>
>> EXTERNAL: Report suspicious emails to Email Abuse.
>>
>> Hello Team,
>> I am planning to write to two data sources at the same time.
>>
>> Scenario 1:
>>
>> Write the same dataframe to HDFS and MinIO without re-executing the
>> transformations and without cache(). How can we make this faster?
>>
>> Read a parquet file, apply a few transformations, and write the result to
>> both HDFS and MinIO.
>>
>> For both writes, Spark needs to execute the transformations again. Is there
>> a way to avoid re-executing the transformations without cache()/persist()?
>>
>> Scenario 2:
>>
>> I am writing 3.2 GB of data to HDFS and MinIO, which takes ~6 minutes.
>> Is there any way to make this write faster?
>>
>> I don't want to repartition before writing, since repartitioning has the
>> overhead of shuffling.
>>
>> Please provide some inputs.