Oh sorry, I misunderstood the issue. It's not about the user-facing error, but the performance issue when the `to_date` function deals with invalid date strings. This is unfortunately not easy to fix: Spark relies on the JDK library to parse datetime strings, and the only way to know whether a string is valid is to catch the exception thrown by the JDK parser.
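To illustrate the exception-driven approach in plain Python (a sketch analogous to what Spark does via the JDK, not Spark's actual code), with a cheap pre-check bolted on:

```python
from datetime import datetime, date

def to_date(s, fmt="%Y-%m-%d"):
    """Exception-driven validation: like Spark delegating to the JDK
    parser, we only find out a string is invalid by catching the parse
    exception, which is costly when most rows are invalid.
    (Illustrative sketch, not Spark's implementation.)"""
    # Cheap pre-check: reject empty/blank strings without ever
    # invoking the parser or constructing an exception.
    if not s or not s.strip():
        return None
    try:
        return datetime.strptime(s, fmt).date()
    except ValueError:
        return None
```

With millions of rows where many values are empty, the pre-check alone skips the expensive raise/catch path for those rows.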
We can probably do some easy validation first, like checking for empty strings, to avoid invoking the JDK API on obviously invalid input. BTW, a small tip: when you call `to_date` without a date format pattern, Spark actually uses a hand-written parser that does not create and swallow an exception for invalid strings, but simply returns null. This will be more efficient for your case.

On Thu, Oct 10, 2024 at 7:29 PM Wenchen Fan <cloud0...@gmail.com> wrote:

> There is a `try_to_timestamp` function but not `try_to_date`; we should
> probably add it for users who don't want to get runtime errors when
> processing big datasets.
>
> On Thu, Oct 10, 2024 at 11:05 AM Ángel <angel.alvarez.pas...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I opened a Jira ticket back in August, but it seems to have been
>> overlooked. While it may not be a critical issue, I would appreciate it if
>> you could take a moment to consider it before deciding whether to close it.
>>
>> Here is the ticket for reference:
>> SPARK-49288 <https://issues.apache.org/jira/browse/SPARK-49288>
>>
>> I've also written an article related to the issue, which you can find
>> here:
>> Apache Spark: WTF? Stranded on Dates Rows
>> <https://medium.com/@angel.alvarez.pascua/apache-spark-wtf-stranded-on-dates-rows-74f0d9788b8b>
>>
>> In short, the problem occurs when the to_date built-in function
>> encounters invalid date strings. Each time this happens, a new
>> ParseException is thrown. While this isn't a big deal with small
>> datasets, when you're processing millions of rows, the sheer volume of
>> exceptions can become a significant performance issue. I understand that
>> validating date strings is expensive, but checking for empty strings
>> shouldn't be.
>>
>> I'm only asking for either an optimization for empty-string checks or, at
>> the very least, a warning in the documentation about the potential
>> performance impact.
>>
>> Thanks for taking the time to consider this.
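P.S. The hand-written-parser tip above can be sketched in plain Python. This is only an illustration of the approach (it is not Spark's actual parser): it checks the default `yyyy-MM-dd` shape directly and returns null/None for malformed input without constructing an exception on the hot path.

```python
from datetime import date

def fast_to_date(s):
    """Hand-written parser for the default 'yyyy-MM-dd' shape.
    Malformed input (wrong length, wrong separators, non-digits) is
    rejected by simple checks, with no exception created or swallowed.
    (Illustrative sketch, not Spark's implementation.)"""
    if s is None or len(s) != 10 or s[4] != "-" or s[7] != "-":
        return None
    y, m, d = s[0:4], s[5:7], s[8:10]
    if not (y.isdigit() and m.isdigit() and d.isdigit()):
        return None
    try:
        # date() still validates calendar ranges (e.g. month 13, Feb 30);
        # only these rarer well-shaped-but-out-of-range values pay the
        # exception cost here.
        return date(int(y), int(m), int(d))
    except ValueError:
        return None
```

The point is that the common invalid shapes (empty strings, wrong lengths, junk text) are rejected with a handful of comparisons instead of a thrown-and-caught exception per row.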