GitHub user rajeshbalamohan opened a pull request:

    https://github.com/apache/spark/pull/12105

    SPARK-14321. [SQL]  Reduce date format cost and string-to-date cost i…

    ## What changes were proposed in this pull request?
    
    Here is the generated code snippet when executing date functions.
    SimpleDateFormat is fairly expensive to instantiate and can show up as a
    bottleneck when processing millions of records. It would be better to
    instantiate it once.
    
    ```
    /* 066 */     UTF8String primitive5 = null;
    /* 067 */     if (!isNull4) {
    /* 068 */       try {
    /* 069 */         primitive5 = UTF8String.fromString(new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(
    /* 070 */             new java.util.Date(primitive7 * 1000L)));
    /* 071 */       } catch (java.lang.Throwable e) {
    /* 072 */         isNull4 = true;
    /* 073 */       }
    /* 074 */     }
    ```
    
    With the modified code, the generated code becomes:
    ```
    /* 010 */   private java.text.SimpleDateFormat sdf2;
    /* 011 */   private UnsafeRow result13;
    /* 012 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder bufferHolder14;
    /* 013 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter15;
    /* 014 */
    ...
    ...
    /* 065 */     boolean isNull0 = isNull3;
    /* 066 */     UTF8String primitive1 = null;
    /* 067 */     if (!isNull0) {
    /* 068 */       try {
    /* 069 */         if (sdf2 == null) {
    /* 070 */           sdf2 = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    /* 071 */         }
    /* 072 */         primitive1 = UTF8String.fromString(sdf2.format(
    /* 073 */             new java.util.Date(primitive4 * 1000L)));
    /* 074 */       } catch (java.lang.Throwable e) {
    /* 075 */         isNull0 = true;
    /* 076 */       }
    /* 077 */     }
    ```
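
The lazy-initialization pattern in the generated code above can be sketched as a standalone class. This is only an illustration: `LazyDateFormatter` and its `format` method are hypothetical names, not part of Spark or of this patch. Note that `SimpleDateFormat` is not thread-safe, so the cached instance is assumed to be used by one thread at a time.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical helper (not Spark code): mirrors the "create once, reuse per row"
// pattern shown in the modified generated code.
class LazyDateFormatter {
    // Created on first use, then reused for every later call.
    // SimpleDateFormat is not thread-safe; keep each instance on a single thread.
    private SimpleDateFormat sdf;

    String format(long epochSeconds) {
        if (sdf == null) {
            sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        }
        return sdf.format(new Date(epochSeconds * 1000L));
    }

    public static void main(String[] args) {
        LazyDateFormatter formatter = new LazyDateFormatter();
        // Output depends on the JVM's default time zone.
        for (long ts : new long[] {0L, 1459478467L}) {
            System.out.println(formatter.format(ts));
        }
    }
}
```

The formatter is built only on the first call, so the construction cost is paid once per instance rather than once per record.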
    
    Similarly, DateTimeUtils called Calendar.getInstance, which can instead be
    initialized lazily.
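
One possible way to defer and reuse a `Calendar` is sketched below. This is an assumption for illustration, not the actual change made to DateTimeUtils: `ReusableCalendar`, `yearOf`, and the choice of a `ThreadLocal` with a fixed UTC zone are all hypothetical. `Calendar` is not thread-safe, which is why the sketch keeps one instance per thread.

```java
import java.util.Calendar;
import java.util.TimeZone;

// Hypothetical sketch (not the DateTimeUtils code): one Calendar per thread,
// created on first access instead of on every call.
class ReusableCalendar {
    // withInitial defers Calendar.getInstance until a thread first calls get().
    private static final ThreadLocal<Calendar> CAL =
        ThreadLocal.withInitial(() -> Calendar.getInstance(TimeZone.getTimeZone("UTC")));

    static int yearOf(long epochMillis) {
        Calendar c = CAL.get();
        // Resetting the time recomputes all fields, so reuse across calls is safe.
        c.setTimeInMillis(epochMillis);
        return c.get(Calendar.YEAR);
    }

    public static void main(String[] args) {
        System.out.println(yearOf(0L));
        System.out.println(yearOf(1459478467000L));
    }
}
```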
    
    
    ## How was this patch tested?
    
    
    Ran org.apache.spark.sql.catalyst.expressions.DateExpressionsSuite and
    org.apache.spark.sql.catalyst.util.DateTimeUtilsSuite.
    Also tried a couple of sample SQL queries with a single executor (6GB),
    which showed good improvement with the fix.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rajeshbalamohan/spark SPARK-14321

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12105.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12105
    
----
commit 6fd07db11b5c9eed795dde11177f1c245a6fef16
Author: Rajesh Balamohan <[email protected]>
Date:   2016-04-01T02:41:07Z

    SPARK-14321. [SQL]  Reduce date format cost and string-to-date cost in date functions

----

