GitHub user rajeshbalamohan opened a pull request:
https://github.com/apache/spark/pull/12105
SPARK-14321. [SQL] Reduce date format cost and string-to-date cost i…
## What changes were proposed in this pull request?
Here is the code currently generated when executing date functions. It
constructs a new SimpleDateFormat on every evaluation. SimpleDateFormat is
fairly expensive to create and can show up as a bottleneck when processing
millions of records. It would be better to instantiate it once.
```
/* 066 */ UTF8String primitive5 = null;
/* 067 */ if (!isNull4) {
/* 068 */   try {
/* 069 */     primitive5 = UTF8String.fromString(new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(
/* 070 */       new java.util.Date(primitive7 * 1000L)));
/* 071 */   } catch (java.lang.Throwable e) {
/* 072 */     isNull4 = true;
/* 073 */   }
/* 074 */ }
```
With the modified code, the generated code looks like this:
```
/* 010 */ private java.text.SimpleDateFormat sdf2;
/* 011 */ private UnsafeRow result13;
/* 012 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder bufferHolder14;
/* 013 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter15;
/* 014 */
...
...
/* 065 */ boolean isNull0 = isNull3;
/* 066 */ UTF8String primitive1 = null;
/* 067 */ if (!isNull0) {
/* 068 */   try {
/* 069 */     if (sdf2 == null) {
/* 070 */       sdf2 = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
/* 071 */     }
/* 072 */     primitive1 = UTF8String.fromString(sdf2.format(
/* 073 */       new java.util.Date(primitive4 * 1000L)));
/* 074 */   } catch (java.lang.Throwable e) {
/* 075 */     isNull0 = true;
/* 076 */   }
/* 077 */ }
```
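For readers who want to try the pattern outside of generated code, here is a minimal standalone sketch of the same lazy-initialization idea. The class name `CachedDateFormatter`, its constructor parameters, and the explicit UTC time zone and `Locale.ROOT` are assumptions made for illustration and reproducibility; they are not part of the patch or of Spark's codegen output.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical stand-in for a generated class; not Spark's actual output. */
public class CachedDateFormatter {
    // Created on first use, then reused for every later call.
    // SimpleDateFormat is not thread-safe, so this sketch assumes each
    // instance of this class is used by a single thread.
    private SimpleDateFormat sdf;
    private final String pattern;
    private final TimeZone zone;

    public CachedDateFormatter(String pattern, TimeZone zone) {
        this.pattern = pattern;
        this.zone = zone;
    }

    /** Formats a timestamp given in seconds since the epoch. */
    public String format(long epochSeconds) {
        if (sdf == null) {
            // Pay the construction cost once instead of once per record.
            sdf = new SimpleDateFormat(pattern, Locale.ROOT);
            sdf.setTimeZone(zone);
        }
        return sdf.format(new Date(epochSeconds * 1000L));
    }

    public static void main(String[] args) {
        CachedDateFormatter f = new CachedDateFormatter(
            "yyyy-MM-dd HH:mm:ss", TimeZone.getTimeZone("UTC"));
        System.out.println(f.format(0L));      // 1970-01-01 00:00:00
        System.out.println(f.format(86400L));  // 1970-01-02 00:00:00
    }
}
```

The null check mirrors the `sdf2 == null` guard in the generated code above: the formatter is a field rather than a per-call temporary, so repeated calls reuse one object.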
Similarly, Calendar.getInstance was called in DateTimeUtils; that instance
can be lazily initialized and reused.
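One way such reuse could look is sketched below. This is not the code from the patch: the class `LazyCalendarHolder`, the method `yearOf`, the use of `ThreadLocal`, and the fixed UTC zone are all assumptions chosen for illustration. It only shows the general idea of creating a Calendar on first use and reusing it afterwards.

```java
import java.util.Calendar;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helper; not the actual DateTimeUtils implementation. */
public class LazyCalendarHolder {
    // The Calendar is created lazily, the first time a thread asks for it,
    // and then reused by that thread. Calendar is mutable and not
    // thread-safe, hence one instance per thread.
    private static final ThreadLocal<Calendar> CALENDAR =
        ThreadLocal.withInitial(() ->
            Calendar.getInstance(TimeZone.getTimeZone("UTC"), Locale.ROOT));

    /** Returns the calendar year (UTC) of a timestamp in epoch milliseconds. */
    public static int yearOf(long epochMillis) {
        Calendar c = CALENDAR.get();
        // Reset the shared instance to the requested instant before reading.
        c.setTimeInMillis(epochMillis);
        return c.get(Calendar.YEAR);
    }

    public static void main(String[] args) {
        System.out.println(yearOf(0L));                  // 1970
        System.out.println(yearOf(365L * 86_400_000L));  // 1971
    }
}
```

Because the instance is shared across calls, every caller has to set the time before reading any field, as `yearOf` does.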
## How was this patch tested?
org.apache.spark.sql.catalyst.expressions.DateExpressionsSuite, org.apache.spark.sql.catalyst.util.DateTimeUtilsSuite
Also tried a couple of sample SQL queries with a single executor (6GB),
which showed good improvement with the fix.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rajeshbalamohan/spark SPARK-14321
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12105.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12105
----
commit 6fd07db11b5c9eed795dde11177f1c245a6fef16
Author: Rajesh Balamohan <[email protected]>
Date: 2016-04-01T02:41:07Z
SPARK-14321. [SQL] Reduce date format cost and string-to-date cost in date
functions
----