Didn't know about that. I'll have a look at it and check whether it fixes
the issue or not. Thanks

On Thu, Oct 10, 2024, 13:29 Wenchen Fan <cloud0...@gmail.com> wrote:

> There is a `try_to_timestamp` function but not `try_to_date`; we should
> probably add it for users who don't want to get runtime errors when
> processing big datasets.
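>
> In the meantime, something along these lines might serve as a stopgap (an
> untested sketch, assuming Spark 3.5+, where `try_to_timestamp` is
> available in the Scala API; the DataFrame and column names are made up):
>
>   import org.apache.spark.sql.functions.{col, lit, try_to_timestamp}
>
>   // try_to_timestamp returns NULL on malformed input instead of throwing,
>   // and casting a valid timestamp to date cannot fail, so this behaves
>   // roughly like a hypothetical try_to_date would.
>   val withDates = df.withColumn(
>     "d",
>     try_to_timestamp(col("s"), lit("yyyy-MM-dd")).cast("date")
>   )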
>
> On Thu, Oct 10, 2024 at 11:05 AM Ángel <angel.alvarez.pas...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I opened a Jira ticket back in August, but it seems to have been
>> overlooked. While it may not be a critical issue, I would appreciate it if
>> you could take a moment to consider it before deciding whether to close it.
>>
>> Here is the ticket for reference:
>> SPARK-49288 <https://issues.apache.org/jira/browse/SPARK-49288>
>>
>> I've also written an article related to the issue, which you can find
>> here:
>> Apache Spark: WTF? Stranded on Dates Rows
>> <https://medium.com/@angel.alvarez.pascua/apache-spark-wtf-stranded-on-dates-rows-74f0d9788b8b>
>>
>> In short, the problem occurs when the to_date built-in function
>> encounters invalid date strings. Each time this happens, a new
>> ParseException is thrown. While this isn't a big deal with small
>> datasets, when you're processing millions of rows, the sheer volume of
>> exceptions can become a significant performance issue. I understand that
>> validating date strings is expensive, but checking for empty strings
>> shouldn't be.
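>>
>> For illustration, a user-side guard like the following (a rough sketch;
>> the column name is made up) skips the parse attempt entirely for empty
>> strings, since when/otherwise only evaluates to_date for rows that reach
>> the otherwise branch:
>>
>>   import org.apache.spark.sql.functions.{col, lit, to_date, when}
>>
>>   // Empty strings short-circuit to NULL, so to_date (and the
>>   // ParseException it throws internally) only runs on non-empty values.
>>   val parsed = df.withColumn(
>>     "d",
>>     when(col("s") === "", lit(null))
>>       .otherwise(to_date(col("s"), "yyyy-MM-dd"))
>>   )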
>>
>> I’m only asking for either an optimization for the empty-string check or,
>> at the very least, a warning in the documentation about the potential
>> performance impact.
>>
>> Thanks for taking the time to consider this.
>>
>
