Hey Seth,

I should probably add a little documentation on this, but you're basically
going to want to take the same approach as the vectorized DateTime parsing
function (see here: http://goo.gl/0z6jI8).

Basically, you can create a `DateFormat` object once and pass it to the
`DateTime` constructor, so it reuses the same `DateFormat` instead of
creating a new one for each call:

    f = Dates.DateFormat("uuu dd HH:MM:SS")

    mkdt(dts::AbstractString, df::Dates.DateFormat) = DateTime(dts, df) + Dates.Year(2014)
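Put together, the whole thing might look something like this (just a sketch,
using the format string from your mail, assuming a recent Julia with the
Dates standard library, and with a couple of made-up log lines standing in
for the real file):

```julia
using Dates  # on current Julia; in 0.3/0.4-era releases Dates lived in Base

# Build the DateFormat once, outside the loop.
df = Dates.DateFormat("uuu dd HH:MM:SS")

mkdt(dts::AbstractString, df::Dates.DateFormat) = DateTime(dts, df) + Dates.Year(2014)

# Hypothetical stand-ins for lines of the real logfile.
lines = ["Jan 23 14:15:16 myhost sshd: example message",
         "Feb 04 09:01:02 myhost sshd: another message"]

for l in lines
    words = split(l)
    dt = mkdt(join(words[1:3], " "), df)  # reuses the cached DateFormat
end
```

The point is just that `df` is constructed once, before the loop, rather
than being re-derived from the format string on every line.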

There may also be other optimizations we can make in the actual parsing, but
this should give a significant bump. Maybe someday I'll get around to
bugging Jake Bolewski, with all his parsing chops, about how to do it more
efficiently and robustly.

-Jacob

On Wed, Feb 4, 2015 at 5:07 PM, Seth <[email protected]> wrote:

> I have a 5.9-million line logfile that starts with dates of the format
> "Jan  23 14:15:16". I am converting these to DateTime via
>
> mkdt(dts::AbstractString) = DateTime(dts, "uuu dd HH:MM:SS") + Dates.Year(2014)
>
> and calling mkdt via
>
>     words = split(l)
>     dt = mkdt(join(words[1:3]," "))
>
>
>
> Processing the file using DateTime takes an exceedingly long time (15
> minutes) vs storing the dates as a string (just keeping words[1:3] - 2.5
> minutes). A @profile on a 100k line sample file shows most of the time
> (9480 / 10251 samples) in the mkdt call above.
>
> Is there anything I can do to speed this up or is it just a given that
> creating DateTime types will be the slowest part of this processing?
>
