Re: [Haskell-cafe] Why is this so inefficient?

2008-02-06 Thread Bertram Felgenhauer
Jefferson Heard wrote:
 I thought this was fairly straightforward, but where the marked line
 finishes in 0.31 seconds on my machine, the actual transpose takes
 more than 5 minutes.  I know it must be possible to read data in
[snip]

 dataFromFile :: String -> IO (M.Map String [S.ByteString])
 dataFromFile filename = do
     f <- S.readFile filename
     print . length . map (S.split ',' $!) . S.lines $ f
     -- finishes in 0.31 seconds

The S.split applications will never be evaluated - the list that you produce
is full of thunks of the form (S.split ',' $! <some bytestring>). The $! will
only take effect if those thunks are forced, and length doesn't force them. Try

print . sum . map (length . S.split ',') . S.lines $ f

instead, to force S.split to produce a result. (In fact, S.split is strict
in its second argument, so the $! shouldn't have any effect on the running
time at all. I didn't measure that though.)
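
A minimal illustration of the point that length never touches the
elements (using undefined as a stand-in for the split thunks):

lengthDemo :: Int
lengthDemo = length [undefined, undefined]
-- returns 2 without an exception: only the spine is walked, so a
-- thunk like (S.split ',' $! line) is never actually run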

 return . transposeCSV . map (S.split ',' $!) . S.lines $ f
 -- this takes 5 minutes and 6 seconds

HTH,

Bertram


[Haskell-cafe] Why is this so inefficient?

2008-02-05 Thread Jefferson Heard
I thought this was fairly straightforward, but where the marked line
finishes in 0.31 seconds on my machine, the actual transpose takes
more than 5 minutes.  I know it must be possible to read data in
Haskell faster than this.  I'm trying to read a 100MB comma-delimited
file.  I've tried both CSV modules, and these take even longer to read
the file.  Is there some general best practice for reading and parsing
large amounts of data that I'm not aware of?

I have tried, by the way, a couple of things, including putting a bang
(!) before row in transposeRow and using foldr instead of foldl', but
that didn't change anything other than force me to increase the stack
size to 100M on the command line.

I'm running in the profiler now, and I'll update this, but I thought I
would check and see if my head was on remotely straight to begin with.

-- Jeff

---
module ColumnMajorCSV where

import qualified Data.ByteString.Char8 as S
import qualified Data.Map as M
import qualified Data.List as L

transposeRow cols row = zipWith (:) row cols

transposeCSV :: [[S.ByteString]] -> M.Map String [S.ByteString]
transposeCSV (header:rows) = M.fromList (zip (map S.unpack header) spreadsheet)
    where spreadsheet = L.foldl' transposeRow emptySheet rows
          emptySheet  = take (length header) $ repeat []

dataFromFile :: String -> IO (M.Map String [S.ByteString])
dataFromFile filename = do
    f <- S.readFile filename
    print . length . map (S.split ',' $!) . S.lines $ f
    -- finishes in 0.31 seconds
    return . transposeCSV . map (S.split ',' $!) . S.lines $ f
    -- this takes 5 minutes and 6 seconds
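
One plausible culprit, given the stack blow-ups reported above: foldl'
forces each intermediate spreadsheet only to weak head normal form, so
the zipWith thunks can still pile up across rows. A sketch of a variant
that forces the whole spine of each accumulator (forceSpine is an
invented helper, and whether it helps here would need measuring):

transposeRow' :: [[S.ByteString]] -> [S.ByteString] -> [[S.ByteString]]
transposeRow' cols row = forceSpine (zipWith (:) row cols)
  where
    -- walk the outer list, forcing every element to WHNF, so that
    -- foldl' cannot accumulate chains of unevaluated thunks
    forceSpine xs = go xs `seq` xs
      where go []     = ()
            go (y:ys) = y `seq` go ys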


Re: [Haskell-cafe] Why is this so inefficient?

2008-02-05 Thread Don Stewart
If the strings are relatively short, there can be a bottleneck
in the current Ord instance for ByteStrings. When lots of them
are in a map, the FFI calls to memcmp dominate.

I've a fix for this (do it all in Haskell for small strings), and
can push it if someone complains some more.
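
The idea would be roughly the following (a sketch of the approach
described, not the actual patch; compareShort is a made-up name, and
S is Data.ByteString.Char8 as in the quoted code):

compareShort :: S.ByteString -> S.ByteString -> Ordering
compareShort a b = go 0
  where
    -- compare byte by byte in Haskell, avoiding one FFI call to
    -- memcmp per Map comparison; ties are broken by length
    n = min (S.length a) (S.length b)
    go i
      | i >= n    = compare (S.length a) (S.length b)
      | otherwise = case compare (S.index a i) (S.index b i) of
                      EQ    -> go (i + 1)
                      other -> other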

jefferson.r.heard:
 I thought this was fairly straightforward, but where the marked line
 finishes in 0.31 seconds on my machine, the actual transpose takes
 more than 5 minutes.  I know it must be possible to read data in
 [snip]


Re: [Haskell-cafe] Why is this so inefficient?

2008-02-05 Thread Jefferson Heard
I've switched to this, which gets rid of the ByteString instances
fairly quickly.  None of them make it into the final map.  I'm still
not getting any faster response out of it, and it also complains that
my stack size is too small for anything of about 128K records or more.

import qualified Data.ByteString.Char8 as S
import qualified Data.Map as M
import qualified Data.List as L

transposeRow cols row = zipWith (:) (map (read . S.unpack) row) cols

transposeCSV :: [[S.ByteString]] -> M.Map String [Float]
transposeCSV (header:rows) = M.fromList (zip (map S.unpack header) spreadsheet)
    where spreadsheet = L.foldl' transposeRow emptySheet rows
          emptySheet  = take (length header) $ repeat []

dataFromFile :: String -> IO (M.Map String [Float])
dataFromFile filename = do
    f <- S.readFile filename
    return . transposeCSV . map (S.split ',' $!) . S.lines $ f
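
Note, too, that read . S.unpack leaves a lazy thunk in every cell, and
each thunk holds on to its source ByteString until the Float is
actually demanded. If the data were integral, S.readInt would parse
directly from the ByteString with no String round-trip (a sketch under
that assumption only, since the fields here are declared Float):

parseIntCell :: S.ByteString -> Int
parseIntCell bs = case S.readInt bs of
    -- matching on the Maybe forces the parse as soon as the cell
    -- is demanded, so bs can be collected promptly
    Just (n, _) -> n
    Nothing     -> error ("unparseable cell: " ++ S.unpack bs)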

On Tue, Feb 5, 2008 at 1:15 PM, Don Stewart [EMAIL PROTECTED] wrote:
 If the strings are relatively short, there can be a bottleneck
  in the current Ord instance for ByteStrings. When lots of them
  are in a map, the FFI calls to memcmp dominate.

  I've a fix for this (do it all in Haskell for small strings), and
  can push it if someone complains some more.

  jefferson.r.heard:
   [snip]




-- 
I try to take things like a crow; war and chaos don't always ruin a
picnic, they just mean you have to be careful what you swallow.

-- Jessica Edwards


Re: [Haskell-cafe] Why is this so inefficient?

2008-02-05 Thread Bryan O'Sullivan
Jefferson Heard wrote:
 I thought this was fairly straightforward, but where the marked line
 finishes in 0.31 seconds on my machine, the actual transpose takes
 more than 5 minutes.  I know it must be possible to read data in
 haskell faster than this.

I took a look into this, writing a similar, but simpler, program.  Half
of the runtime, and 2/3 of the allocation, is spent in ByteString's
split function.  The remaining portions are spent in transposing the list.

COST CENTRE  %time  %alloc  ticks     bytes
split         66.7    65.1     56     12013
xpose         31.0    32.8     26  60618031
read           1.2     2.0      1   3640229
lines          1.2     0.1      1    260002

I've attached two programs to demonstrate the problem.  One creates a
sample spreadsheet; the other transposes it.

I spent a little time trying to find a faster replacement for
ByteString.split, but no luck before I had to return to other things.
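
For what it's worth, one possible direction (an untested sketch using
the C alias from the attached program; it returns [""] rather than []
on an empty string, unlike C.split, and whether it is actually faster
would need benchmarking):

splitComma :: C.ByteString -> [C.ByteString]
splitComma s = case C.break (== ',') s of
    -- break finds the next comma; drop it and recurse on the rest
    (field, rest)
        | C.null rest -> [field]
        | otherwise   -> field : splitComma (C.tail rest)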

b
import Data.List (foldl', transpose)
import qualified Data.ByteString.Char8 as C
import qualified Data.Map as M
import System.Environment (getArgs)

xpose name = do
    sheet <- (transpose
              . {-# SCC "split" #-} map (C.split ',')
              . {-# SCC "lines" #-} C.lines)
             `fmap` {-# SCC "read" #-} C.readFile name
    let m = foldl' go M.empty sheet
    print (M.size m)
  where go m (x:xs) = {-# SCC "go" #-} M.insert x xs m

main = getArgs >>= mapM_ xpose
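
Presumably the cost-centre table above came from building with
profiling enabled and running with time profiling on, something along
the lines of ghc -O2 -prof --make followed by ./a.out dump.csv +RTS -p;
the exact flags are a guess, but the manual SCC annotations would then
show up in the generated .prof file. The second attached program
generates the sample data:
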
import Data.List
import System.IO
import System.Random

rint = show `fmap` (randomRIO (0,100) :: IO Int)

dump cols rows name = do
    h <- openFile name WriteMode
    sequence_ . take rows . repeat $ do
        cs <- sequence . take cols . repeat $ rint
        hPutStrLn h . concat . intersperse "," $ cs
    hClose h

main = dump 1000 1 "dump.csv"