Could you take three thread dumps from one of the executors while Spark is
performing the conversion? You can use the Spark UI for that.
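
(If the UI route — Executors tab, "Thread Dump" link — is inconvenient, here is
a rough alternative: a minimal Scala sketch that prints a dump from whichever
executor picks up the task. It assumes an active `spark` session and that you
run it from a second thread or job while the conversion is in flight; the
output lands in the executor's stderr log.)

    import java.lang.management.ManagementFactory

    // No-op job whose only purpose is to print a thread dump from the
    // executor that runs the task. Repeat two or three times while the
    // CSV-to-Parquet job is running.
    spark.sparkContext.parallelize(Seq(1), 1).foreachPartition { _ =>
      val bean = ManagementFactory.getThreadMXBean
      bean.dumpAllThreads(true, true).foreach { info =>
        System.err.println(info.getThreadName + " state=" + info.getThreadState)
        info.getStackTrace.foreach(f => System.err.println("    at " + f))
      }
    }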

On Sun, Mar 23, 2025 at 3:20, Ángel Álvarez Pascua (<angel.alvarez.pas...@gmail.com>) wrote:

> Without the data, it's difficult to analyze. Could you provide some
> synthetic data so I can investigate this further? The schema and a few
> sample fake rows should be sufficient.
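>
> If it helps, something like this minimal Scala sketch is all I need; the
> column names and types here are made up, just shaped like the real schema:
>
>     import org.apache.spark.sql.functions._
>
>     // Generate fake rows matching the real schema's shape, then dump to CSV.
>     val fake = spark.range(1000000).select(
>       col("id").cast("string").as("c1"),
>       rand().as("c2"),
>       concat(lit("name_"), col("id").cast("string")).as("c3")
>       // ... extend to all 100 columns
>     )
>     fake.write.option("header", "true").mode("overwrite").csv("/tmp/fake_csv")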
>
> On Sun, Mar 23, 2025 at 3:17, Prem Sahoo (<prem.re...@gmail.com>) wrote:
>
>> I am providing the schema, and the schema is correct: it has all the
>> columns present in the CSV. So we can rule that out as the cause of the
>> slowness. Maybe there are other contributing factors.
>> Sent from my iPhone
>>
>> On Mar 22, 2025, at 10:05 PM, Ángel Álvarez Pascua <
>> angel.alvarez.pas...@gmail.com> wrote:
>>
>>
>> Hey, just this week I found some issues with the Univocity library that
>> Spark internally uses to read CSV files.
>>
>> *Spark CSV Read Low Performance: EOFExceptions in Univocity Parser*
>> https://issues.apache.org/jira/projects/SPARK/issues/SPARK-51579
>>
>> I initially assumed this issue had existed since Spark started using this
>> library, but perhaps something changed in the versions you mentioned.
>>
>> Are you providing a schema, or are you letting Spark infer it? I've also
>> noticed that when the schema doesn't match the columns in the CSV files
>> (for example, a different number of columns), exceptions are thrown
>> internally.
>>
>> Given all this, my initial hypothesis is that thousands upon thousands of
>> exceptions are being thrown and handled internally by the Univocity
>> parser, so the user isn't even aware of what's happening.
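>>
>> A quick way to test this hypothesis is to make the bad rows visible. A
>> sketch, assuming a `spark` session and placeholder paths and columns:
>>
>>     import org.apache.spark.sql.functions.col
>>
>>     val df = spark.read
>>       .schema("c1 STRING, c2 DOUBLE, _corrupt_record STRING") // explicit schema
>>       .option("header", "true")
>>       .option("mode", "PERMISSIVE") // or FAILFAST to fail on the first bad row
>>       .option("columnNameOfCorruptRecord", "_corrupt_record")
>>       .csv("s3a://bucket/input/*.csv")
>>
>>     // Rows that didn't parse cleanly land here instead of failing silently.
>>     // (Some Spark versions require caching before querying this column.)
>>     df.cache().filter(col("_corrupt_record").isNotNull).show(5, truncate = false)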
>>
>>
>> On Sun, Mar 23, 2025 at 2:40, Prem Sahoo (<prem.re...@gmail.com>) wrote:
>>
>>> Hello,
>>> I read a CSV file of 2.7 GB with 100 columns. Converting it to Parquet
>>> with Spark 3.2 and Hadoop 2.7.6 takes 28 seconds, but with Spark 3.5.2
>>> and Hadoop 3.4.1 it takes 34 seconds. That is a regression.
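>>>
>>> Roughly, the conversion is just the following (the schema variable and
>>> the paths are placeholders), timed around the write:
>>>
>>>     // Read CSV with an explicit schema, write Parquet, and time it.
>>>     val start = System.nanoTime()
>>>     spark.read
>>>       .schema(mySchema) // the real 100-column schema
>>>       .option("header", "true")
>>>       .csv("s3a://bucket/input.csv")
>>>       .write
>>>       .mode("overwrite")
>>>       .parquet("s3a://bucket/output/")
>>>     println(s"took ${(System.nanoTime() - start) / 1e9} s")
>>>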
>>> Sent from my iPhone
>>>
>>> On Mar 22, 2025, at 9:21 PM, Ángel Álvarez Pascua <
>>> angel.alvarez.pas...@gmail.com> wrote:
>>>
>>> Sure. I love performance challenges and mysteries!
>>>
>>> Please, could you provide an example project or the steps to build one?
>>>
>>> Thanks.
>>>
>>> On Sun, Mar 23, 2025, 2:17, Prem Sahoo <prem.re...@gmail.com> wrote:
>>>
>>>> Hello Team,
>>>> I was working with Spark 3.2 and Hadoop 2.7.6, writing to MinIO object
>>>> storage. It was slower than writing to MapR FS with the same stack. I
>>>> then upgraded to Spark 3.5.2 and Hadoop 3.4.1 and started writing to
>>>> MinIO with the V2 FileOutputCommitter, and the performance was worse
>>>> than on the old stack. I then tried the magic committer, and it came
>>>> out slower than V2. So with the latest stack, performance is degraded.
>>>> Could someone please assist?
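>>>>
>>>> For reference, this is roughly how the two committers were switched on
>>>> (a sketch; paths are omitted, and the magic committer also needs the
>>>> spark-hadoop-cloud module on the classpath):
>>>>
>>>>     import org.apache.spark.sql.SparkSession
>>>>
>>>>     // Run A: classic FileOutputCommitter, algorithm version 2.
>>>>     val spark = SparkSession.builder()
>>>>       .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
>>>>       .getOrCreate()
>>>>
>>>>     // Run B (a separate run): the S3A magic committer instead.
>>>>     //   .config("spark.hadoop.fs.s3a.committer.name", "magic")
>>>>     //   .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
>>>>     //   .config("spark.sql.sources.commitProtocolClass",
>>>>     //     "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
>>>>     //   .config("spark.sql.parquet.output.committer.class",
>>>>     //     "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
>>>>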
>>>> Sent from my iPhone
