GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/13522
[SPARK-14321][SQL] Reduce date format cost and string-to-date cost in…

## What changes were proposed in this pull request?

Here is the code currently generated when executing date functions. It creates a new SimpleDateFormat for every row. SimpleDateFormat is fairly expensive to construct and can show up as a bottleneck when processing millions of records. It would be better to instantiate it once.

```
UTF8String primitive5 = null;
if (!isNull4) {
  try {
    primitive5 = UTF8String.fromString(new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(
      new java.util.Date(primitive7 * 1000L)));
  } catch (java.lang.Throwable e) {
    isNull4 = true;
  }
}
```

With the modified code, the generated code keeps the formatter in a field and creates it only on first use:

```
private java.text.SimpleDateFormat sdf2;
private UnsafeRow result13;
private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder bufferHolder14;
private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter15;
...
...
boolean isNull0 = isNull3;
UTF8String primitive1 = null;
if (!isNull0) {
  try {
    if (sdf2 == null) {
      sdf2 = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    }
    primitive1 = UTF8String.fromString(sdf2.format(
      new java.util.Date(primitive4 * 1000L)));
  } catch (java.lang.Throwable e) {
    isNull0 = true;
  }
}
```

Similarly, Calendar.getInstance was used in DateTimeUtils; it can be lazily initialized as well.

## How was this patch tested?

org.apache.spark.sql.catalyst.expressions.DateExpressionsSuite, org.apache.spark.sql.catalyst.util.DateTimeUtilsSuite

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rajeshbalamohan/spark SPARK-14321-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13522.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13522

----
commit 602d4a70ba845df3160a07c2c9afe2d5c3c574c4
Author: Rajesh Balamohan <rbalamo...@apache.org>
Date:   2016-06-06T12:54:02Z

    [SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date functions

----
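
For readers who want to see the lazy-initialization idea from the description outside of generated code, here is a minimal standalone sketch. The class name `LazyDateFormatter`, its `instantiations` counter, and the explicit `Locale.US` and time zone arguments are assumptions made for illustration only; they are not part of this patch or of Spark. It also assumes the instance is used by a single thread, since SimpleDateFormat is not thread-safe.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not Spark code: formats epoch seconds with a formatter
// that is created on first use and reused for every later call.
class LazyDateFormatter {
    private final String pattern;
    private final TimeZone zone;

    // Lazily created; must stay confined to one thread because
    // SimpleDateFormat keeps mutable internal state.
    private SimpleDateFormat sdf;

    // Counts constructions so the reuse can be observed.
    private int instantiations = 0;

    LazyDateFormatter(String pattern, TimeZone zone) {
        this.pattern = pattern;
        this.zone = zone;
    }

    String formatSeconds(long epochSeconds) {
        if (sdf == null) {
            // Only the first call pays the construction cost.
            sdf = new SimpleDateFormat(pattern, Locale.US);
            sdf.setTimeZone(zone);
            instantiations++;
        }
        return sdf.format(new Date(epochSeconds * 1000L));
    }

    int instantiations() {
        return instantiations;
    }

    public static void main(String[] args) {
        LazyDateFormatter f =
            new LazyDateFormatter("yyyy-MM-dd HH:mm:ss", TimeZone.getTimeZone("UTC"));
        for (long s = 0; s < 3; s++) {
            System.out.println(f.formatSeconds(s));
        }
        // The formatter was constructed once for all three rows.
        System.out.println("instantiations = " + f.instantiations());
    }
}
```

The null check on every call is cheap compared with constructing a formatter per record, which is the trade-off the description relies on.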