[
https://issues.apache.org/jira/browse/SPARK-49288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ángel Álvarez Pascua updated SPARK-49288:
-----------------------------------------
Description:
The *to_date* built-in udf creates a new _ParseException_ instance every
time a string value can't be parsed. _ParseException_ extends _Exception_,
which in turn extends _Throwable_, whose constructor calls the *fillInStackTrace*
method. This method is not only one of the most expensive methods in Java,
it is also synchronized, which means it can introduce some overhead.
Here's an example stacktrace:
_java.lang.Throwable.fillInStackTrace(Native Method)_
*_java.lang.Throwable.fillInStackTrace(Throwable.java:783) => holding
Monitor(java.text.ParseException@176695786)_*
_java.lang.Throwable.<init>(Throwable.java:265)_
_java.lang.Exception.<init>(Exception.java:66)_
_java.text.ParseException.<init>(ParseException.java:63)_
_java.text.DateFormat.parse(DateFormat.java:366)_
_org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.agg_doAggregateWithKeys_0$(Unknown
Source)_
_org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.processNext(Unknown
Source)_
_org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)_
_org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:645)_
_scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)_
_org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)_
_org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)_
_org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)_
_org.apache.spark.scheduler.Task.run(Task.scala:123)_
_org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:413)_
_org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1334)_
_org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:419)_
_java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)_
_java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)_
_java.lang.Thread.run(Thread.java:748)_
Because empty strings also throw _ParseException_ errors, parsing several
date fields with many empty values from a large dataframe can make the
Spark task take much more time than expected (and needed).
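The cost described above can be sketched with a small, standalone micro-benchmark (illustrative only, not Spark code; `ParseCost` and `parseOrNull` are hypothetical names). Parsing one million empty strings through `SimpleDateFormat.parse` triggers one `ParseException`, and therefore one synchronized `fillInStackTrace` call, per value, while a cheap emptiness pre-check skips all of it:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class ParseCost {
    // Hypothetical helper: cheap guard that skips parsing for empty values,
    // avoiding ParseException construction (and its fillInStackTrace call).
    static Date parseOrNull(SimpleDateFormat fmt, String s) {
        if (s == null || s.isEmpty()) return null;   // cheap pre-check
        try {
            return fmt.parse(s);
        } catch (ParseException e) {                 // still possible for bad data
            return null;
        }
    }

    public static void main(String[] args) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        int n = 1_000_000;

        // Exception-per-value path: one ParseException per empty string.
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            try { fmt.parse(""); } catch (ParseException e) { /* swallowed */ }
        }
        long withExceptions = System.nanoTime() - t0;

        // Pre-check path: no exceptions constructed for empty strings.
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            parseOrNull(fmt, "");
        }
        long withPreCheck = System.nanoTime() - t1;

        System.out.println("exceptions: " + withExceptions / 1_000_000 + " ms");
        System.out.println("pre-check:  " + withPreCheck / 1_000_000 + " ms");
    }
}
```

Exact timings depend on the JVM and hardware, but the gap between the two loops illustrates why a large dataframe full of empty date strings slows a task down.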
The following is strongly suggested:
# Add a warning to the udf's documentation page, clearly stating that large
numbers of invalid string date values will introduce serious performance
issues.
# Add a check on the string values before parsing, to avoid parsing the
string and unnecessarily creating large numbers of exceptions. At least for
empty values, if validating string dates in general is considered a costly
operation.
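Besides pre-checking values, a complementary mitigation (a common JVM trick, not something Spark currently does for this path; `FastParseException` is a hypothetical name) is a "stackless" exception: overriding *fillInStackTrace* to a no-op makes construction cheap when the exception only signals an expected condition such as an unparsable date:

```java
// A "stackless" exception: construction skips the expensive, synchronized
// native stack walk, so throwing it on a hot path is far cheaper.
public class FastParseException extends Exception {
    public FastParseException(String message) {
        super(message);
    }

    @Override
    public synchronized Throwable fillInStackTrace() {
        return this;  // no-op: leave the stack trace empty
    }
}
```

On Java 7+ the same effect is available via the protected `Throwable(String, Throwable, boolean, boolean)` constructor with `writableStackTrace = false`. The trade-off is that such exceptions carry no stack trace, so they should only be used where the throw site is unambiguous.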
was:
The *to_date* built-in udf creates a new _ParseException_ instance every
time a string value can't be parsed. _ParseException_ extends _Exception_,
which in turn extends _Throwable_, whose constructor calls the *fillInStackTrace*
method. This method is not only one of the most expensive methods in Java,
it is also synchronized. That means it can introduce contention and negatively
impact other threads (cores) trying to create their own exceptions, as
you can see in the following stacktrace:
_java.lang.Throwable.fillInStackTrace(Native Method)_
*_java.lang.Throwable.fillInStackTrace(Throwable.java:783) => holding
Monitor(java.text.ParseException@176695786)_*
_java.lang.Throwable.<init>(Throwable.java:265)_
_java.lang.Exception.<init>(Exception.java:66)_
_java.text.ParseException.<init>(ParseException.java:63)_
_java.text.DateFormat.parse(DateFormat.java:366)_
_org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.agg_doAggregateWithKeys_0$(Unknown
Source)_
_org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.processNext(Unknown
Source)_
_org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)_
_org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:645)_
_scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)_
_org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)_
_org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)_
_org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)_
_org.apache.spark.scheduler.Task.run(Task.scala:123)_
_org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:413)_
_org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1334)_
_org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:419)_
_java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)_
_java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)_
_java.lang.Thread.run(Thread.java:748)_
Because empty strings also throw _ParseException_ errors, parsing several
date fields with many empty values from a large dataframe can make the
Spark task take much more time than expected (and needed).
The following is strongly suggested:
# Add a warning to the udf's documentation page, clearly stating that large
numbers of invalid string date values will introduce serious performance
issues.
# Add a check on the string values before parsing, to avoid parsing the
string and unnecessarily creating large numbers of exceptions. At least for
empty values, if validating string dates in general is considered a costly
operation.
> to_date ... too slow
> --------------------
>
> Key: SPARK-49288
> URL: https://issues.apache.org/jira/browse/SPARK-49288
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.7, 3.5.2
> Environment: Because this issue has to do with how exceptions are
> handled in Java, it's not only easy to reproduce, but the environment used
> also doesn't really matter. Nevertheless, the issue has been reproduced in
> the following environments:
> * Linux - Java 8 - Spark 2.4.7 (Cloudera 7.1.7)
> * Windows 11 - Java 8 - Scala 2.11.12 - Spark 2.4.7
> * Windows 11 - Java 11 - Scala 2.11.12 - Spark 2.4.7
> * Windows 11 - Java 8 - Scala 2.12.17 - Spark 3.5.2
> * Windows 11 - Java 11 - Scala 2.12.17 - Spark 3.5.2
> Reporter: Ángel Álvarez Pascua
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]