Hi everyone, In my free time I translate scripts from open source projects or write my own, to train my python skills. ATM I convert the aplogmerge.pl from awstats. It merges multiple apache logfiles and sort the output by the timestamps of each line. My first version of this script hasn't a good performance, so I started profiling. It turned out that the script spend much time for converting the timestamps of the line into a struct_time object. Here a code example (Python 2.6.2 on Ubuntu 7.10):
Rec = re.compile(r"^\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\s-\s\d+\s\[(\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2})\s\+\d{4}\].*") Line = '1.2.3.4 - 4459 [02/Jul/2009:01:50:26 +0200] "GET /foo HTTP/1.0" 200 - "-" "www.example.org" "-" "-" "-"' def strptime(): m = Rec.match(Line) if m: date_string = m.group(1) # date_string example: '02/Jul/2009:01:50:26' return time.strptime(date_string, "%d/%b/%Y:%H:%M:%S") with timeit this functions takes approximate 125 sec and 29.004.081 function calls (I've configured timeit with 1.000.000 runs). A look at the output of cProfile told me that more than the half time is spent in locale.py: 1000002 11.712 0.000 19.592 0.000 locale.py:316(normalize) 1000002 3.639 0.000 23.231 0.000 locale.py:382(_parse_localename) 1000002 5.162 0.000 30.271 0.000 locale.py:481(getlocale) I studied the time documentation and thought that I had to set TZ in os environ: os.environ['TZ'] = 'Europe/Berlin' time.set() but that had no effect. I don't know why time.strptime() looks every time for my locale. Maybe it's a problem with my OS locale... However. I've introduced a work around, which works perfectly for my problem. For time comparison I could use any sort of time representation and so I convert to epoch: # needed to translate month str to dig repr Shortmonth = {'Jan' : '01', 'Feb' : '02', 'Mar' : '03', 'Apr' : '04', 'May' : '05', 'Jun' : '06', 'Jul' : '07', 'Aug' : '08', 'Sep' : '09', 'Oct' : '10', 'Nov' : '11', 'Dec' : '12'} Rec = re.compile(r"^\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\s-\s\d+\s\[(?P<day>\d{2})/(?P<month>\w+)/(?P<year>\d{4}):(?P<hour>\d{2}):(?P<min>\d{2}):(?P<sec>\d{2})\s\+\d{4}\].*") Line = '1.2.3.4 - 4459 [02/Jul/2009:01:50:26 +0200] "GET /foo HTTP/1.0" 200 - "-" "www.example.org" "-" "-" "-"' def epoch(): m = Rec.match(Line) if m: result = m.groupdict() date_tuple = (result["year"], Shortmonth[result["month"]], result["day"], result["hour"], result["min"], result["sec"], -1, -1, -1) date_tuple = map(int,date_tuple) return time.mktime(date_tuple) with this workaround I had a speed up to 4 times; it tooks only 31 sec with only 5.000.009 function calls. Maybe this helps some of you, who had the similar problems with time conversion.... ...But one big question remains: Why time.strptime() checks everytime the locale? had I missed something or could I have a problem with my OS locale? With python 3.1 there is no difference, unless that time.strptime() took approx 12 sec longer... :( regards, Nils Here a complete test script: #!/opt/python/2.6.2/bin/python import time import timeit import cProfile import re # needed to tranlate month str to dig repr Shortmonth = {'Jan' : '01', 'Feb' : '02', 'Mar' : '03', 'Apr' : '04', 'May' : '05', 'Jun' : '06', 'Jul' : '07', 'Aug' : '08', 'Sep' : '09', 'Oct' : '10', 'Nov' : '11', 'Dec' : '12'} Rec1 = re.compile(r"^\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\s-\s\d+\s\[(?P<day>\d{2})/(?P<month>\w+)/(?P<year>\d{4}):(?P<hour>\d{2}):(?P<min>\d{2}):(?P<sec>\d{2})\s\+\d{4}\].*") Rec2 = re.compile(r"^\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\s-\s\d+\s\[(\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2})\s\+\d{4}\].*") Line = '1.2.3.4 - 4459 [02/Jul/2009:01:50:26 +0200] "GET /foo HTTP/1.0" 200 - "-" "www.example.org" "-" "-" "-"' def epoch(): m = Rec1.match(Line) if m: result = m.groupdict() date_tuple = (result["year"], Shortmonth[result["month"]], result["day"], result["hour"], result["min"], result["sec"], -1, -1, -1) date_tuple = map(int,date_tuple) return time.mktime(date_tuple) def strptime(): m = Rec2.match(Line) if m: date_string = m.group(1) return time.strptime(date_string, "%d/%b/%Y:%H:%M:%S") if __name__ == "__main__": t1 = timeit.Timer("epoch()","from __main__ import epoch") t2 = timeit.Timer("strptime()", "from __main__ import strptime") cProfile.run("t1.timeit();print") print "" cProfile.run("t2.timeit();print") -- http://mail.python.org/mailman/listinfo/python-list