Github user carsonwang commented on the pull request: https://github.com/apache/spark/pull/11071#issuecomment-180207538

I have a sub-query like this:

`SELECT a, b, c FROM table UV WHERE (datediff(UV.visitDate, '1997-01-01')>=0 AND datediff(UV.visitDate, '2015-01-01')<=0)`

When profiling this stage with Spark 1.6, I noticed that a lot of time was consumed by `DateTimeUtils.stringToDate`. In particular, `TimeZone.getTimeZone` and `Calendar.getInstance` are extremely slow. The table stores `visitDate` as a `String` and has 3 billion records, which means 3 billion `Calendar` and `TimeZone` objects get created. `TimeZone.getTimeZone` is a synchronized method and blocks other threads calling it concurrently.

#10994 fixed this for `DateTimeUtils.stringToDate`, but `DateTimeUtils.stringToTimestamp` has the same issue, so I tried caching the `TimeZone` objects in a map; the total number of distinct `TimeZone`s is limited, so the cache stays bounded. Reusing the `Calendar` object instead of creating a new one on each call gives a further performance improvement: creating 20 million `Calendar` objects takes more than 20 seconds on my machine, so we benefit from reusing it.
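The two optimizations described above (a `TimeZone` cache keyed by zone ID, plus a per-thread reusable `Calendar`) could be sketched roughly like this. This is a minimal standalone illustration, not the actual Spark patch; the class name `DateTimeCache` and the UTC default are assumptions for the example:

```java
import java.util.Calendar;
import java.util.TimeZone;
import java.util.concurrent.ConcurrentHashMap;

public class DateTimeCache {

    // Cache TimeZone objects: TimeZone.getTimeZone is synchronized, and the
    // number of distinct zone IDs is limited, so this map stays small.
    private static final ConcurrentHashMap<String, TimeZone> TIMEZONE_CACHE =
        new ConcurrentHashMap<>();

    public static TimeZone getTimeZone(String id) {
        return TIMEZONE_CACHE.computeIfAbsent(id, TimeZone::getTimeZone);
    }

    // Reuse one Calendar per thread instead of calling Calendar.getInstance()
    // once per row; Calendar is mutable and not thread-safe, so a ThreadLocal
    // keeps reuse safe under concurrent task threads.
    private static final ThreadLocal<Calendar> CALENDAR =
        ThreadLocal.withInitial(() -> Calendar.getInstance(getTimeZone("UTC")));

    public static Calendar threadLocalCalendar() {
        Calendar c = CALENDAR.get();
        c.clear(); // reset all fields before each reuse
        return c;
    }
}
```

With this shape, a parsing loop over billions of rows touches `TimeZone.getTimeZone` only once per distinct zone ID and allocates one `Calendar` per thread rather than one per record.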