hi folks --

a haskell newbie here, searching for comments and wisdom on my code.

i had a project to try to implement "external sort" in haskell as a
learning exercise.  (external sort is sorting a list that is too large
to fit in main memory, by sorting in chunks, spooling to disk, and
then merging.  more properly there probably should be multiple stages,
but for simplicity i'm doing a one-stage external sort.)

the trick is the number of files can quickly grow very large, so it is
best to use one large file and seek inside it around.  however as one
can imagine the order-of-IO-operations becomes a bit tricky, if you're
seeking file handles around underneath Data.ByteString.Lazy's nose.
but late this night after not thinking about it for a while i had a
brainstorm: rewrite hGetContents to keep the handle position in the
right place!  it's all about judicious use of unsafeInterleaveIO.....

it seems to be rather fast, strangely faster than the usual "sort" at
times.  it also seems to have nice memory characteristics, though not
perfect.  it's hard to test because the normal "sort" function takes
too much RAM on large lists, making my computer swap like mad.

i'd appreciate any testing, comments and suggestions from the haskell
gods out there.  my thanks to the Data.ByteString.Lazy, Data.Binary,
and Data.Edison people, who made this rather easy, once I grokked
unsafeInterleaveIO.

thanks and take care, B

module ExternalSort where

Sort a list of Ords "offline."  We're doing this to be able to sort
things without taking up too much memory (for example sorting lists
too large to fit in RAM.)  Laziness is imperative, as is the
order-of-operations.

import Control.Monad
import Data.List
import qualified Data.Binary as Bin
import qualified Data.ByteString.Lazy as B
import qualified Data.ByteString as P (hGetNonBlocking, null)
import Data.ByteString.Base (LazyByteString(LPS))
import Foreign.Storable (sizeOf)
import System.IO (openFile, hClose, hSeek, hTell, hIsEOF, hWaitForInput,
                  Handle, IOMode(ReadMode, WriteMode),
                  SeekMode(AbsoluteSeek))
import System.IO.Unsafe (unsafeInterleaveIO)

import qualified Data.Edison.Seq.ListSeq as LS
import qualified Data.Edison.Coll.SplayHeap as Splay

Conceptually, we sort a list in blocks, spool blocks to disk, then
merge back.  However for IO performance it is better to read off
chunks of elements off the sorted blocks from disk instead of
elements-at-a-time.

It would be better if these were in KBytes instead of # of elements.

blocksize :: Int
blocksize = 10000

Turn a list into a list of chunks.

slice :: Int -> [a] -> [[a]]
slice _ [] = []
slice size l = (take size l) : (slice size $ drop size l)

Turn a list into a list of blocks, each of which is sorted.

blockify :: (Ord a) => Int -> [a] -> [[a]]
blockify bsize l = map sort $ slice bsize l

Serialize a block, returning the (absolute) position of the start.

dumpBlock :: (Ord a, Bin.Binary a) => Handle -> [a] -> IO Integer
dumpBlock h b = do
  start <- hTell h
  B.hPut h $ Bin.encode b
  return start

The actual sorting function.  We blockify the list, turning it into a
list of sorted blocks, and spool to disk, keeping track of offsets.
We then read back the blocks (lazily!), and merge them.

externalSort [] = do return []
externalSort l = do
  h <- openFile "ExternalSort.bin" WriteMode
  idx <- mapM (\x -> dumpBlock h x) (blockify blocksize l)
  hClose h
  h <- openFile "ExternalSort.bin" ReadMode
  blocks <- mapM (\x -> do {bs <- hGetContentsWithCursor h x;
                            return $ Bin.decode bs}) idx
  return (kMerge $ blocks)

Merging chunks.  K-way merge (and in fact external sort in general) is
detailed in Knuth, where he recommends tournament trees.  The easiest
thing is to probably use one of Okasaki's heaps.  I'll use splay
heaps, because I don't know any better.

It would be better if I changed Ord for blocks to only check the first
element.

kMerge :: (Ord a) => [[a]] -> [a]
kMerge [] = []
kMerge l =
    let h = Splay.fromSeq l in
    kM (Splay.minElem h) (Splay.deleteMin h)
    where
    kM :: (Ord a) => [a] -> Splay.Heap [a] -> [a]
    kM l h
        | h == Splay.empty = l
        | otherwise =
            let next = Splay.minElem h
                (f, b) = span (\x -> x <= head next) l
            in
            f ++ (kM next (if null b then Splay.deleteMin h
                           else (Splay.insert b $ Splay.deleteMin h)))

kMergeSort :: (Ord a) => [a] -> [a]
kMergeSort l = kMerge $ blockify blocksize l

This is a version of hGetContents which resets its handle position
between reads, so is safe to use with interleaved handle seeking.

hGetContentsWithCursor :: Handle -> Integer -> IO B.ByteString
hGetContentsWithCursor = hGetContentsWithCursorN defaultChunkSize

hGetContentsWithCursorN :: Int -> Handle -> Integer -> IO B.ByteString
hGetContentsWithCursorN k h start = (lazyRead start) >>= return . LPS
  where
    lazyRead start = unsafeInterleaveIO $ loop start

    loop start = do
        hSeek h AbsoluteSeek start
        ps <- P.hGetNonBlocking h k
        --TODO: I think this should distinguish EOF from no data available
        -- the otherlying POSIX call makes this distincion, returning either
        -- 0 or EAGAIN
        if P.null ps
          then do eof <- hIsEOF h
                  if eof then return []
                         else hWaitForInput h (-1)
                           >> (loop start)
           else do
              pos <- hTell h
              pss <- lazyRead pos
              return (ps : pss)

defaultChunkSize :: Int
defaultChunkSize = 32 * k - overhead
   where k = 1024
         overhead = 2 * sizeOf (undefined :: Int)
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Reply via email to