Hi all,

The task I'm trying to accomplish:

Given a log file containing several lines of whitespace-delimited entries like
this:

[Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist
[Sat Oct 24 08:12:37 2009] [error] GET /url2 HTTP/1.0]: Requested URI does not exist
[Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist
[Sat Oct 24 12:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist

filter the lines that match the string " 08:", extract words 6, 7 and 8
(counting from 0) from each matching line, group the lines that produce the
same resulting string, count each group, sort the results in descending order
of count, and print them out. So for the example above we'd end up with output
like this:

("GET /url1 HTTP/1.1]:", 2)
("GET /url2 HTTP/1.0]:", 1)

Seems pretty straightforward, so I wrote a simple Perl script to achieve this
task (see the bottom of this email).

The input file is 335 MB in size and contains about 2 million log entries. The
Perl script does a pretty decent job and finishes in about 3 seconds.

Now the interesting part. I decided to implement this in Haskell (being my
favorite language and all) and ended up with the following code:

--- begin haskell code ---

import Text.Regex.Posix ( (=~) )
import qualified Data.List as List
import qualified Data.Map as Map
import qualified Data.ByteString.Lazy.Char8 as LB

main = do
  contents <- LB.readFile "log_file"
  putStr . unlines . map (show . (\(x, y) -> (LB.unpack x, y))) .
    -- build a Map that groups and counts matching tokens,
    -- then sort the resulting list in descending order of count
    List.sortBy (\(_, x) (_, y) -> y `compare` x) . Map.toList .
    Map.fromListWith (+) . filtertokens .
    LB.lines $ contents
  where filtertokens = foldr (\x acc -> if f x then g x : acc else acc) []
          -- keep only the lines containing " 08:"
          where f = (=~ " 08:") . LB.unpack
                -- extract tokens 6, 7 & 8 and pair them with an initial
                -- count of 1, e.g. ("GET /url2 HTTP/1.0]:", 1)
                g line = flip (,) 1 . LB.unwords . map (xs !!) $ [6, 7, 8]
                  where xs = LB.words line

--- end haskell code ---
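
To spell out the grouping step: Map.fromListWith (+) collapses the
(token-string, 1) pairs into per-key counts. A tiny self-contained
illustration (the names here are just for the example):

--- begin haskell code ---

import qualified Data.Map as Map

-- duplicate keys are combined with (+), yielding a count per key
counts :: Map.Map String Int
counts = Map.fromListWith (+) [("a", 1), ("b", 1), ("a", 1)]

-- Map.toList counts == [("a", 2), ("b", 1)]

--- end haskell code ---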

This Haskell implementation takes a whopping 27 seconds to complete, about 9
times slower than the Perl version! I'm using GHC 6.10.4 and compiling with
-O2, and I even went to the extent of fusing the adjacent map and filter into
a single foldr (map f (filter g ...) rewritten as one pass; see the sketch
below), fusing adjacent maps, and so on. Still the same result.
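
A minimal illustration of that fusion (f and g here are placeholders, not the
functions in my program):

--- begin haskell code ---

-- before: two traversals, at least conceptually
mapFilter :: (a -> b) -> (a -> Bool) -> [a] -> [b]
mapFilter f g = map f . filter g

-- after: one foldr that filters and maps in a single pass
mapFilterFused :: (a -> b) -> (a -> Bool) -> [a] -> [b]
mapFilterFused f g = foldr (\x acc -> if g x then f x : acc else acc) []

--- end haskell code ---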

I really hope I'm missing some obvious optimization that's making it so slow
compared to the Perl version, hence this email soliciting feedback.
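
To make the question concrete, here is the kind of variant I have in mind (a
sketch only, not benchmarked): drop the POSIX regex, and with it the per-line
LB.unpack to String, in favour of a plain substring search on strict
ByteStrings. This assumes Data.ByteString.Char8.isInfixOf is available in your
bytestring version, and note that B.readFile pulls the whole 335 MB file into
memory:

--- begin haskell code ---

import qualified Data.List as List
import qualified Data.Map as Map
import qualified Data.ByteString.Char8 as B

main :: IO ()
main = do
  contents <- B.readFile "log_file"
  putStr . unlines . map (show . (\(x, y) -> (B.unpack x, y))) .
    List.sortBy (\(_, x) (_, y) -> y `compare` x) . Map.toList .
    Map.fromListWith (+) . map g . filter f .
    B.lines $ contents
  where
    -- substring search instead of a regex match, no unpacking to String
    f = (B.pack " 08:" `B.isInfixOf`)
    -- tokens 6, 7 & 8 paired with an initial count of 1
    g line = (B.unwords (map (xs !!) [6, 7, 8]), 1 :: Int)
      where xs = B.words line

--- end haskell code ---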

Thanks in advance.

P.S. For reference, here's my corresponding Perl implementation:

--- start perl code ---

#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';

my %urls;
open(FILE, '<', $ARGV[0]) or die "can't open $ARGV[0]: $!";
while (<FILE>) {
    if (/ 08:/) {                 # keep only the lines containing " 08:"
        my @words = split;        # split on whitespace
        my $key = join(" ", ($words[6], $words[7], $words[8]));
        if (exists $urls{$key}) { $urls{$key}++ }
        else { $urls{$key} = 1 }
    }
}
# print the keys in descending order of count
for (sort { $urls{$b} <=> $urls{$a} } keys %urls) { print "($_, $urls{$_})\n" }

--- end perl code ---
