2009/11/8 Gokul P. Nair <[email protected]>
>
> Hi all,
>
> The task I'm trying to accomplish:
>
> Given a log file containing several lines of white space delimited entries 
> like this:
>
> [Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does 
> not exist
> [Sat Oct 24 08:12:37 2009] [error] GET /url2 HTTP/1.0]: Requested URI does 
> not exist
> [Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does 
> not exist
> [Sat Oct 24 12:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does 
> not exist
>
> filter lines that match the string " 08:", extract the 6th, 7th and 8th words 
> from that line, group all lines that have the the same resulting string, do a 
> count on them and sort the result in descending order of counts and print it 
> out. So in the example above we'd end up with an output like this:
>
> ("GET /url1 HTTP/1.1]:", 2)
> ("GET /url2 HTTP/1.0]:", 1)
>
> Seems pretty straightforward, so I wrote a simple perl script to achieve this 
> task (see the bottom of this email).
>
> The input file is 335 MB in size and contains about 2 million log line 
> entires in it. The perl script does a pretty decent job and finishes in about 
> 3 seconds.
>
> Now the interesting part. I decided to implememt this in Haskell (being my 
> favorite language and all) and ended up with the following code:
>
> --- begin haskell code ---
>
> import Text.Regex.Posix ( (=~) )

First, you are using Text.Regex.Posix which is dog slow. Use
regex-tdfa or regex-pcre.
Second, you are using it on a String! Third, you are unpacking a lazy
bytestring for that! This alone is more than enough to make things
ridiculously slow.

>
> import qualified Data.List as List
> import qualified Data.Map as Map
> import qualified Data.ByteString.Lazy.Char8 as LB
>
> main = do
>   contents <- LB.readFile "log_file"
>   putStr . unlines . map ( show . (\(x, y) -> ((LB.unpack x), y)) ) .
This piece of code also does unpacking. If the output is small, it's
ok, otherwise assemble output in the form of a bytestring and print it
with B.putStr.

>     -- create a Map grouping & counting matching tokens and sort based on the 
> counts
>     List.sortBy (\(_, x) (_, y) -> y `compare` x) . Map.toList . 
> Map.fromListWith (+) . filtertokens .
The lambda can be replaced by "flip (comparing snd)"

>     LB.lines $ contents
Here, you should not use Map.fromListWith (+) because Map is not
strict in its entries and you end up having big fat thunks there.
You should use Map.fromListWith plus where x `plus` y = x `seq` y `seq` x+y.

>   where filtertokens = foldr (\x acc -> if (f x) then ((g x) : acc) else acc) 
> []
>           -- filter lines starting with " 08:"
>           where f = (=~ " 08:") . LB.unpack
>                 -- extract tokens 6, 7 & 8 and create an association list 
> like so ("GET /url2 HTTP/1.0]:", 1)
>                 g line = flip (,) 1 . LB.unwords . map (xs !!) $ [6, 7, 8] 
> where xs = LB.words line
You are using random access on a list three times in a row.
map (xs!!) [6,7,8] is much faster when implemented as "take 3 (drop 6 xs)".


So.. Well. Try these suggestions, show the resulting performance and
then we'll see.

>
> --- end haskell code ---
>
> This haskell implementation takes a whopping 27 seconds to complete! About 9 
> times slower than the perl version! I'm using ghc 6.10.4, compiling with -O2 
> and even went to the extent of fusing an adjacent map and filter using a 
> foldr like so: map f (filter g) => foldr ( if g x then f x ... ), fusing 
> adjacents maps etc. Still the same result.
>
> I really hope I'm missing some obvious optimization that's making it so slow 
> compared to the perl version, hence this email soliciting feedback.
>
> Thanks in advance.
>
> P.S. For reference, here's my corresponding perl implementation:
>
> --- start perl code ---
>
> #!/usr/bin/perl
> use strict;
> use warnings FATAL => 'all';
>
> my %urls;
> open(FILE, '<', $ARGV[0]);
> while(<FILE>) {
>     if (/ 08:/) {
>         my @words = split;
>         my $key = join(" ", ($words[6], $words[7], $words[8]));
>         if (exists $urls{$key}) { $urls{$key}++ }
>         else { $urls{$key} = 1 }
>     }
> }
> for (sort { $urls{$b} <=> $urls{$a} } keys %urls) { print "($_, $urls{$_})\n" 
> }
>
> --- end perl code ---
>
>
>
> _______________________________________________
> Haskell-Cafe mailing list
> [email protected]
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>



--
Eugene Kirpichov
Web IR developer, market.yandex.ru
_______________________________________________
Haskell-Cafe mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/haskell-cafe

Reply via email to