Hi,

I opened a Jira ticket back in August, but it seems to have been
overlooked. While it may not be a critical issue, I would appreciate it
if you could take a moment to consider it before deciding whether to
close it.

Here is the ticket for reference:
SPARK-49288 <https://issues.apache.org/jira/browse/SPARK-49288>

I've also written an article related to the issue, which you can find here:
Apache Spark: WTF? Stranded on Dates Rows
<https://medium.com/@angel.alvarez.pascua/apache-spark-wtf-stranded-on-dates-rows-74f0d9788b8b>

In short, the problem occurs when the built-in to_date function
encounters invalid date strings: each time this happens, a new
ParseException is thrown. That isn't a big deal for small datasets, but
when you're processing millions of rows, the sheer volume of exceptions
becomes a significant performance issue. I understand that validating
date strings is expensive, but checking for empty strings shouldn't be.
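For reference, one caller-side workaround is simply to guard the call so
that empty strings never reach the parser. A minimal sketch (the column
name, sample data and date format below are just placeholders for
illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date, when}

object ToDateEmptyStringGuard {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("to_date empty-string guard")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample: valid dates mixed with lots of empty strings.
    val df = Seq("2024-08-01", "", "", "2024-08-02", "").toDF("date_str")

    // Naive call: to_date still runs the parser on every empty string.
    val naive = df.select(to_date(col("date_str"), "yyyy-MM-dd").as("d"))

    // Guarded call: empty strings never reach the parser and simply
    // become null, so no exception is created for them.
    val guarded = df.select(
      when(col("date_str") =!= "", to_date(col("date_str"), "yyyy-MM-dd")).as("d")
    )

    naive.show()
    guarded.show()
    spark.stop()
  }
}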

I’m only asking for either an optimization that short-circuits empty
strings before parsing or, at the very least, a warning in the
documentation about the potential performance impact.

Thanks for taking the time to consider this.
