Ángel Álvarez Pascua created SPARK-49288:
--------------------------------------------
Summary: to_date ... too slow
Key: SPARK-49288
URL: https://issues.apache.org/jira/browse/SPARK-49288
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.5.2, 2.4.7
Environment: Because this issue stems from how exceptions are handled in
Java, it is easy to reproduce and does not depend on the environment used.
Nevertheless, the issue has been reproduced in the following environments:
* Linux - Java 8 - Spark 2.4.7 (Cloudera 7.1.7)
* Windows 11 - Java 8 - Scala 2.11.12 - Spark 2.4.7
* Windows 11 - Java 11 - Scala 2.11.12 - Spark 2.4.7
* Windows 11 - Java 8 - Scala 2.12.17 - Spark 3.5.2
* Windows 11 - Java 11 - Scala 2.12.17 - Spark 3.5.2
Reporter: Ángel Álvarez Pascua
The *to_date* built-in function creates a new _ParseException_ instance every
time a string value can't be parsed. _ParseException_ extends {_}Exception{_},
which in turn extends {_}Throwable{_}, whose constructor calls the
*fillInStackTrace* method. This method is not only one of the most expensive
methods in Java, but it is also synchronized. That means it could introduce
contention and negatively impact other threads (cores) that are also trying to
create their own exceptions, as you can see in the following stack trace:
_java.lang.Throwable.fillInStackTrace(Native Method)_
*_java.lang.Throwable.fillInStackTrace(Throwable.java:783) => holding Monitor(java.text.ParseException@176695786})_*
_java.lang.Throwable.<init>(Throwable.java:265)_
_java.lang.Exception.<init>(Exception.java:66)_
_java.text.ParseException.<init>(ParseException.java:63)_
_java.text.DateFormat.parse(DateFormat.java:366)_
_org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.agg_doAggregateWithKeys_0$(Unknown Source)_
_org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.processNext(Unknown Source)_
_org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)_
_org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:645)_
_scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)_
_org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)_
_org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)_
_org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)_
_org.apache.spark.scheduler.Task.run(Task.scala:123)_
_org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:413)_
_org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1334)_
_org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:419)_
_java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)_
_java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)_
_java.lang.Thread.run(Thread.java:748)_
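To make the cost visible outside Spark, here is a minimal Java sketch. It is not Spark's code: it assumes a plain java.text.SimpleDateFormat with the pattern yyyy-MM-dd (the trace above only shows that java.text.DateFormat.parse is involved), and the class and method names are hypothetical. It shows that each failed parse produces a _ParseException_ whose stack trace was already captured when the exception was constructed:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

class ParseFailureDemo {

    // Returns how many stack frames the ParseException captured,
    // or -1 when the value parsed successfully (no exception created).
    static int framesCapturedOnFailure(String value) {
        // Assumed pattern, for illustration only.
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        try {
            format.parse(value);
            return -1;
        } catch (ParseException e) {
            // The trace was filled in by the Throwable constructor,
            // before the exception was thrown.
            return e.getStackTrace().length;
        }
    }

    public static void main(String[] args) {
        System.out.println(framesCapturedOnFailure("2024-08-19")); // valid date
        System.out.println(framesCapturedOnFailure("not-a-date")); // invalid date
        System.out.println(framesCapturedOnFailure(""));           // empty string
    }
}
```

The deeper the call stack at the point of failure, the more frames have to be walked for every invalid value, which is why a deep executor stack like the one above makes each failure comparatively expensive.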
Because empty strings also throw _ParseException_ errors, parsing several date
fields from a large dataframe can make a Spark task take much longer than
expected or needed.
The following is strongly suggested:
# Add a warning to the function's documentation page, clearly stating that
large numbers of invalid date strings will cause serious performance issues.
# Add a check on the string values before parsing, to avoid parsing the string
and creating many unnecessary exceptions. At a minimum, do this for empty
values, if validating date strings is considered too costly.
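
The second suggestion could look roughly like the following Java sketch. It is an illustration under assumptions, not a proposed patch: the class GuardedDateParser, its method names, and the yyyy-MM-dd pattern are all hypothetical, and returning null for unparsable input is assumed here to mirror what callers of to_date see for invalid values. The first method skips null and empty values before parsing. The second uses the existing DateFormat.parse(String, ParsePosition) overload, which returns null on failure instead of throwing:

```java
import java.text.ParseException;
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;

class GuardedDateParser {
    // SimpleDateFormat is not thread-safe: use one parser per thread.
    private final SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
    private int exceptionsCreated = 0;

    // Cheap guard: null and empty values never reach the parser,
    // so no ParseException is created for them.
    Date parseWithGuard(String value) {
        if (value == null || value.isEmpty()) {
            return null;
        }
        try {
            return format.parse(value);
        } catch (ParseException e) {
            exceptionsCreated++; // still paid for non-empty invalid values
            return null;
        }
    }

    // Exception-free variant: this overload reports failure by
    // returning null, so no exception is constructed at all.
    Date parseWithoutExceptions(String value) {
        if (value == null) {
            return null;
        }
        return format.parse(value, new ParsePosition(0));
    }

    int exceptionsCreated() {
        return exceptionsCreated;
    }
}
```

The guard only removes the cost for empty values, while the ParsePosition variant also avoids it for non-empty invalid strings. Note that the ParsePosition overload accepts a valid prefix, so a stricter check would also compare the final position against the string length.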