Github user carsonwang commented on the pull request:

    https://github.com/apache/spark/pull/11071#issuecomment-180207538
  
    I have a subquery like this: `SELECT a, b, c FROM table UV WHERE (datediff(UV.visitDate, '1997-01-01') >= 0 AND datediff(UV.visitDate, '2015-01-01') <= 0)`
    When profiling this stage with Spark 1.6, I noticed that a lot of time was spent in `DateTimeUtils.stringToDate`; in particular, `TimeZone.getTimeZone` and `Calendar.getInstance` are extremely slow. The table stores `visitDate` as a `String` and has 3 billion records, which means 3 billion `Calendar` and `TimeZone` objects are created.
    
    `TimeZone.getTimeZone` is a synchronized method and will block other threads calling it concurrently. #10994 fixed this for `DateTimeUtils.stringToDate`, but `DateTimeUtils.stringToTimestamp` has the same issue, so I tried caching the `TimeZone` objects in a map. The total number of distinct `TimeZone` ids should be small enough to cache.
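    The caching idea can be sketched roughly as below. This is not the actual patch, just a minimal illustration (class and method names are hypothetical): a concurrent map keyed by the time zone id, so the synchronized `TimeZone.getTimeZone` is only hit on a cache miss.

    ```java
    import java.util.TimeZone;
    import java.util.concurrent.ConcurrentHashMap;

    public class TimeZoneCache {
        // Cache TimeZone instances so the synchronized
        // TimeZone.getTimeZone is called at most once per id.
        private static final ConcurrentHashMap<String, TimeZone> CACHE =
            new ConcurrentHashMap<>();

        public static TimeZone get(String id) {
            // computeIfAbsent only invokes getTimeZone on a miss, so
            // contention is bounded by the number of distinct ids.
            return CACHE.computeIfAbsent(id, TimeZone::getTimeZone);
        }
    }
    ```

    Since the number of valid time zone ids is small and fixed, the cache stays bounded without any eviction policy.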
    
    By reusing a `Calendar` object instead of creating a new one on each call, I see a further performance improvement. Creating 20 million `Calendar` objects takes more than 20 seconds on my machine, so we benefit from reusing it.
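    One way to reuse the object safely, since `Calendar` is not thread-safe, is a per-thread instance. Again a hypothetical sketch, not the patch itself (the `toMillis` helper and the fixed GMT zone are assumptions for illustration):

    ```java
    import java.util.Calendar;
    import java.util.TimeZone;

    public class CalendarReuse {
        // Calendar is not thread-safe, so keep one instance per thread
        // instead of allocating a new one for every parsed date.
        private static final ThreadLocal<Calendar> CAL =
            ThreadLocal.withInitial(
                () -> Calendar.getInstance(TimeZone.getTimeZone("GMT")));

        public static long toMillis(int year, int month, int day) {
            Calendar c = CAL.get();
            c.clear();                    // reset fields from the previous use
            c.set(year, month - 1, day);  // Calendar months are zero-based
            return c.getTimeInMillis();
        }
    }
    ```

    The `clear()` call is important: a reused `Calendar` keeps fields from the previous invocation, which would otherwise leak into the next conversion.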
