gwern0:

> Hey everyone. So I've been learning Haskell for a while now, and I've
> found the best way to move from theory to practice is to just write
> something useful for yourself. Now, I'm keen on editing Wikipedia and
> I've long wanted some way to stop links to external websites from
> breaking on me. So I wrote this little program using the TagSoup
> library, which will download Wikipedia articles, parse out external
> links, and then ask WebCite to archive them.
>
> But there's a problem: no matter how I look at it, it's just way too
> slow. Running on a measly 100 articles at a time, it'll eat up to half
> my processor time and RAM (according to top). I converted it over to
> ByteStrings, since that's supposed to be a lot better than regular
> Strings, but that didn't seem to help much. So I'm curious: in what
> way could this code be better? How could it be more idiomatic or
> shorter? In particular, how could it be more efficient in either space
> or time? Any comments are appreciated.
>
> {- Module       : Main.hs
>    License      : public domain
>    Maintainer   : Gwern Branwen <[EMAIL PROTECTED]>
>    Stability    : unstable
>    Portability  : portable
>    Functionality: retrieve specified articles from Wikipedia and request
>                   WebCite to archive all URLs found.
>    TODO: send an equivalent request to the Internet Archive.
>          Not in any way rate-limited.
>    BUGS: Issues redundant archive requests.
>          Currently uses Data.ByteString.Lazy.Char8. If I'm understanding
>          the documentation right, this barfs on the full UTF-8 character
>          set, but Wikipedia definitely exercises the full UTF-8 set.
>    USE:  Feed a succession of Wikipedia article names to stdin (whitespace
>          in names should be escaped as '_').
>          A valid invocation might be, say:
>            $ echo Fujiwara_no_Teika Fujiwara_no_Shunzei | archive-bot
>          All URLs in [[Fujiwara no Teika]] and [[Fujiwara no Shunzei]]
>          would then be backed up.
>          If you wanted to run this on all of Wikipedia, you could take the
>          current 'all-titles-in-ns0' gzipped file from [[WP:DUMP]],
>          gunzip it, and then pipe it into archive-bot. -}
>
> module Main where
>
> import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
> import Text.HTML.Download (openURL)
> import Data.List (isPrefixOf)
> import Control.Monad (liftM)
> import Data.Set (toList, fromList)
> import qualified Data.ByteString.Lazy.Char8 as B
>     (ByteString, getContents, lines, unlines, pack, unpack, words)
>
> main :: IO ()
> main = do mapM_ archiveURL =<< liftM sortNub
>             (mapM fetchArticleText =<< liftM B.words B.getContents)
>        where sortNub :: [[B.ByteString]] -> [B.ByteString]
>              sortNub = toList . fromList . concat
>
> fetchArticleText :: B.ByteString -> IO [B.ByteString]
> fetchArticleText article = liftM (B.lines . extractURLs)
>                                  (openURL (wikipedia ++ B.unpack article))
>     where wikipedia = "http://en.wikipedia.org/wiki/"
>
> extractURLs :: String -> B.ByteString
> extractURLs arg = B.unlines $ map B.pack
>     [x | TagOpen "a" atts <- parseTags arg,
>          (_,x) <- atts,
>          "http://" `isPrefixOf` x]
>
> archiveURL :: B.ByteString -> IO String
> archiveURL url = openURL ("www.webcitation.org/archive?url="
>                           ++ B.unpack url ++ emailAddress)
>     where emailAddress = "&[EMAIL PROTECTED]"
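As an aside, the worry in the header comment about Data.ByteString.Lazy.Char8
and UTF-8 is well founded: the Char8 pack and unpack functions keep only the
low 8 bits of each Char, so any title outside Latin-1 is silently mangled. A
minimal standalone sketch (separate from the bot itself) that shows the lossy
round trip:

    import qualified Data.ByteString.Lazy.Char8 as B

    -- Char8's pack keeps only the low byte of each Char, so a title like
    -- "藤原定家" ([[Fujiwara no Teika]]) does not survive a round trip.
    main :: IO ()
    main = print (B.unpack (B.pack title) == title)   -- prints False
      where title = "藤原定家"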
You don't seem to be using bytestrings for anything important here: you just
pass them in and immediately unpack them back to String anyway, since TagSoup
only downloads String and parses String. Probably, as Neil says, TagSoup just
isn't optimised much yet. Perhaps try the bytestring-based urlcheck?
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/urlcheck-0.1

Neil, perhaps tagsoup should provide a bytestring layer at the bottom, so
there's some hope of efficient downloading, with a String layer on top, rather
than the other way around?

-- Don
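To make that concrete, here is roughly what the same pipeline looks like with
the ByteString layer dropped entirely: openURL hands back a String and
parseTags wants a String, so the program can stay in String end to end. This
is only a sketch against the same TagSoup API the original uses, not something
benchmarked:

    module Main where

    import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
    import Text.HTML.Download (openURL)
    import Data.List (isPrefixOf)
    import Control.Monad (liftM)
    import Data.Set (toList, fromList)

    -- Same pipeline as archive-bot, but in String throughout: the pack and
    -- unpack round trips of the original buy nothing here.
    main :: IO ()
    main = do articles <- liftM words getContents
              links    <- liftM (toList . fromList . concat)
                                (mapM fetchArticleText articles)
              mapM_ archiveURL links

    fetchArticleText :: String -> IO [String]
    fetchArticleText article =
        liftM extractURLs (openURL ("http://en.wikipedia.org/wiki/" ++ article))

    extractURLs :: String -> [String]
    extractURLs arg = [x | TagOpen "a" atts <- parseTags arg,
                           (_,x) <- atts,
                           "http://" `isPrefixOf` x]

    archiveURL :: String -> IO String
    archiveURL url = openURL ("www.webcitation.org/archive?url=" ++ url
                              ++ "&[EMAIL PROTECTED]")

Whether this is faster is an open question, since the download and parse cost
inside TagSoup is unchanged; the point is only that the conversions in the
original add work without changing the representation TagSoup operates on.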