Github user carsonwang commented on the pull request:
https://github.com/apache/spark/pull/11071#issuecomment-180207538
I have a subquery like this: `SELECT a, b, c FROM table UV WHERE
(datediff(UV.visitDate, '1997-01-01')>=0 AND datediff(UV.visitDate,
'2015-01-01')<=0)`
When profiling this stage with Spark 1.6, I noticed a lot of time was consumed
by `DateTimeUtils.stringToDate`. In particular, `TimeZone.getTimeZone` and
`Calendar.getInstance` are extremely slow. The table stores `visitDate` as
`String` type and has 3 billion records, which means 3 billion
`Calendar` and `TimeZone` objects are created.
`TimeZone.getTimeZone` is a synchronized method and will block other
threads calling the same method. #10994 fixed this for
`DateTimeUtils.stringToDate`, but `DateTimeUtils.stringToTimestamp` has the
same issue, so I tried caching the `TimeZone` objects in a map. The total
number of available `TimeZone` objects is limited, so the cache stays small.
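The map-based cache could be sketched roughly like this (the class and method names below are illustrative, not the actual Spark code; a `ConcurrentHashMap` is one way to keep the lookup thread-safe without the lock contention of the synchronized `TimeZone.getTimeZone`):

```java
import java.util.TimeZone;
import java.util.concurrent.ConcurrentHashMap;

public class Main {
    // Hypothetical cache keyed by time zone ID. computeIfAbsent calls the
    // slow, synchronized TimeZone.getTimeZone at most once per distinct ID.
    private static final ConcurrentHashMap<String, TimeZone> CACHE =
        new ConcurrentHashMap<>();

    static TimeZone getTimeZone(String id) {
        return CACHE.computeIfAbsent(id, TimeZone::getTimeZone);
    }

    public static void main(String[] args) {
        TimeZone first = getTimeZone("GMT");
        TimeZone second = getTimeZone("GMT");
        // The second lookup returns the same cached instance.
        System.out.println(first == second);
    }
}
```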
By reusing the `Calendar` object instead of creating a new one on each call,
I can see a further performance improvement. Creating 20 million `Calendar`
objects takes more than 20 seconds on my machine, so reusing one is a clear
win.
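Since `Calendar` is mutable and not thread-safe, one way to reuse it safely across tasks is a per-thread instance that is reset before each use. This is only a sketch under that assumption, not the approach the patch necessarily takes:

```java
import java.util.Calendar;
import java.util.TimeZone;

public class Main {
    // Hypothetical per-thread Calendar; Calendar.getInstance runs once per
    // thread instead of once per record.
    private static final ThreadLocal<Calendar> CAL = ThreadLocal.withInitial(
        () -> Calendar.getInstance(TimeZone.getTimeZone("GMT")));

    static long toMillis(int year, int month, int day) {
        Calendar cal = CAL.get();
        cal.clear();                   // wipe all fields left from the previous use
        cal.set(year, month - 1, day); // Calendar months are 0-based
        return cal.getTimeInMillis();
    }

    public static void main(String[] args) {
        // 1970-01-01 in GMT is the epoch, i.e. 0 milliseconds.
        System.out.println(toMillis(1970, 1, 1));
    }
}
```

The `clear()` call matters: without it, time-of-day fields from a previous record would leak into the next conversion.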