Could you take three thread dumps from one of the executors while Spark is
performing the conversion? You can use the Spark UI for that.
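
(If the UI route — Executors tab, "Thread Dump" link — is inconvenient, here is
a rough alternative: a minimal Scala sketch that prints a dump from whichever
executor picks up the task. It assumes an active `spark` session and that you
run it from a second thread or job while the conversion is in flight; the
output lands in the executor's stderr log.)

    import java.lang.management.ManagementFactory

    // No-op job whose only purpose is to print a thread dump from the
    // executor that runs the task. Repeat two or three times while the
    // CSV-to-Parquet job is running.
    spark.sparkContext.parallelize(Seq(1), 1).foreachPartition { _ =>
      val bean = ManagementFactory.getThreadMXBean
      bean.dumpAllThreads(true, true).foreach { info =>
        System.err.println(info.getThreadName + " state=" + info.getThreadState)
        info.getStackTrace.foreach(f => System.err.println("    at " + f))
      }
    }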

On Sun, Mar 23, 2025 at 3:20, Ángel Álvarez Pascua (<angel.alvarez.pas...@gmail.com>) wrote:

> Without the data, it's difficult to analyze. Could you provide some
> synthetic data so I can investigate this further? The schema and a few
> sample fake rows should be sufficient.
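>
> If it helps, something like this minimal Scala sketch is all I need; the
> column names and types here are made up, just shaped like the real schema:
>
>     import org.apache.spark.sql.functions._
>
>     // Generate fake rows matching the real schema's shape, then dump to CSV.
>     val fake = spark.range(1000000).select(
>       col("id").cast("string").as("c1"),
>       rand().as("c2"),
>       concat(lit("name_"), col("id").cast("string")).as("c3")
>       // ... extend to all 100 columns
>     )
>     fake.write.option("header", "true").mode("overwrite").csv("/tmp/fake_csv")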
>
> On Sun, Mar 23, 2025 at 3:17, Prem Sahoo (<prem.re...@gmail.com>) wrote:
>
>> I am providing the schema, and the schema is correct: it has all the
>> columns present in the CSV. So we can rule that out as the cause of the
>> slowness. Maybe there are other contributing factors.
>> Sent from my iPhone
>>
>> On Mar 22, 2025, at 10:05 PM, Ángel Álvarez Pascua <
>> angel.alvarez.pas...@gmail.com> wrote:
>>
>>
>> Hey, just this week I found some issues with the Univocity library that
>> Spark internally uses to read CSV files.
>>
>> *Spark CSV Read Low Performance: EOFExceptions in Univocity Parser*
>> https://issues.apache.org/jira/projects/SPARK/issues/SPARK-51579
>>
>> I initially assumed this issue had existed since Spark started using this
>> library, but perhaps something changed in the versions you mentioned.
>>
>> Are you providing a schema, or are you letting Spark infer it? I've also
>> noticed that when the schema doesn't match the columns in the CSV files
>> (for example, a different number of columns), exceptions are thrown
>> internally.
>>
>> Given all this, my initial hypothesis is that thousands upon thousands of
>> exceptions are being thrown and handled internally by the Univocity
>> parser, so the user isn't even aware of what's happening.
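>>
>> A quick way to test this hypothesis is to make the bad rows visible. A
>> sketch, assuming a `spark` session and placeholder paths and columns:
>>
>>     import org.apache.spark.sql.functions.col
>>
>>     val df = spark.read
>>       .schema("c1 STRING, c2 DOUBLE, _corrupt_record STRING") // explicit schema
>>       .option("header", "true")
>>       .option("mode", "PERMISSIVE") // or FAILFAST to fail on the first bad row
>>       .option("columnNameOfCorruptRecord", "_corrupt_record")
>>       .csv("s3a://bucket/input/*.csv")
>>
>>     // Rows that didn't parse cleanly land here instead of failing silently.
>>     // (Some Spark versions require caching before querying this column.)
>>     df.cache().filter(col("_corrupt_record").isNotNull).show(5, truncate = false)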
>>
>>
>> On Sun, Mar 23, 2025 at 2:40, Prem Sahoo (<prem.re...@gmail.com>) wrote:
>>
>>> Hello,
>>> I read a CSV file of 2.7 GB with 100 columns. Converting it to Parquet
>>> with Spark 3.2 and Hadoop 2.7.6 takes 28 seconds, but with Spark 3.5.2
>>> and Hadoop 3.4.1 it takes 34 seconds. That is a regression.
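>>>
>>> Roughly, the conversion is just the following (the schema variable and
>>> the paths are placeholders), timed around the write:
>>>
>>>     // Read CSV with an explicit schema, write Parquet, and time it.
>>>     val start = System.nanoTime()
>>>     spark.read
>>>       .schema(mySchema) // the real 100-column schema
>>>       .option("header", "true")
>>>       .csv("s3a://bucket/input.csv")
>>>       .write
>>>       .mode("overwrite")
>>>       .parquet("s3a://bucket/output/")
>>>     println(s"took ${(System.nanoTime() - start) / 1e9} s")
>>>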
>>> Sent from my iPhone
>>>
>>> On Mar 22, 2025, at 9:21 PM, Ángel Álvarez Pascua <
>>> angel.alvarez.pas...@gmail.com> wrote:
>>>
>>> Sure. I love performance challenges and mysteries!
>>>
>>> Please, could you provide an example project or the steps to build one?
>>>
>>> Thanks.
>>>
>>> On Sun, Mar 23, 2025, 2:17, Prem Sahoo <prem.re...@gmail.com> wrote:
>>>
>>>> Hello Team,
>>>> I was working with Spark 3.2 and Hadoop 2.7.6, writing to MinIO object
>>>> storage. It was slower than writing to MapR FS with the same stack. I
>>>> then upgraded to Spark 3.5.2 and Hadoop 3.4.1 and started writing to
>>>> MinIO with the V2 FileOutputCommitter, and the performance was worse
>>>> than on the old stack. I then tried the magic committer, and it came
>>>> out slower than V2. So with the latest stack, performance is degraded.
>>>> Could someone please assist?
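>>>>
>>>> For reference, this is roughly how the two committers were switched on
>>>> (a sketch; paths are omitted, and the magic committer also needs the
>>>> spark-hadoop-cloud module on the classpath):
>>>>
>>>>     import org.apache.spark.sql.SparkSession
>>>>
>>>>     // Run A: classic FileOutputCommitter, algorithm version 2.
>>>>     val spark = SparkSession.builder()
>>>>       .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
>>>>       .getOrCreate()
>>>>
>>>>     // Run B (a separate run): the S3A magic committer instead.
>>>>     //   .config("spark.hadoop.fs.s3a.committer.name", "magic")
>>>>     //   .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
>>>>     //   .config("spark.sql.sources.commitProtocolClass",
>>>>     //     "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
>>>>     //   .config("spark.sql.parquet.output.committer.class",
>>>>     //     "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
>>>>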
>>>> Sent from my iPhone
