Thanks for the responses, everyone. I'll try them out and see what happens :)

Andrew

On Fri, Jun 8, 2012 at 4:40 PM, Johan Tibell <johan.tib...@gmail.com> wrote:

> Hi Andrew,
>
> On Thu, Jun 7, 2012 at 5:39 PM, Andrew Myers <asm...@gmail.com> wrote:
> > Hi Cafe,
> > I'm working on inspecting some data that I'm trying to represent as
> > records in Haskell and seeing about twice the memory footprint I was
> > expecting. I've got roughly 1.4 million records in a CSV file (400M on
> > disk) that I parse in using bytestring-csv. bytestring-csv returns a
> > [[ByteString]] (wrapped in `type`s) which I then convert into a list of
> > records that have the following structure:
> >
> >> 3 Int
> >> 1 Text Length 3
> >> 1 Text Length 11
> >> 12 Float
> >> 1 UTCTime
> >
> > All fields are marked strict and have {-# UNPACK #-} pragmas (I'm
> > guessing that doesn't do anything for non primitives). (Side note, is
> > there a way to check if things are actually being unpacked?)
>
> GHC used to complain when you use UNPACK with something that can't be
> unpacked, but that warning seems to have been (accidentally) removed
> in 7.4.1.
>
> The rule for unpacking is:
>
> * all product types (i.e. types with only one constructor) can be
>   unpacked. This includes Int, Char, Double, etc. and tuples or records
>   thereof.
> * sum types (i.e. data types with more than one constructor) and
>   polymorphic fields can't be unpacked.
>
> > My back of the napkin memory estimates based on the assumption that
> > nothing is being unpacked (and my very spotty understanding of Haskell
> > data structures):
> >
> > Platform: 64 Bit Linux
> > # Type (Sizeof type (occasionally a guess))
> >
> > 3 * Int (8)
> > 14 * Char (4) -- Text is some kind of bytestring but I'm guessing it
> >               -- can't be worse than the same number of Char?
> > 12 * Float (4)
> > 18 * sizeOf (ptr) (8)
> > UTC: -- From what I can gather through :info in ghci
> > 4 * (ptr) (8)
> > 2 * Integer (16) -- Shouldn't be overly large, times are within 2012
>
> All fields in a constructor are word aligned. This means that all
> primitive types take 8 bytes on a 64-bit platform, including Char and
> Float. You might find the following blog posts by me useful in
> computing the size of data structures:
>
> http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html
> http://blog.johantibell.com/2011/06/computing-size-of-hashmap.html
> http://blog.johantibell.com/2011/11/slides-from-my-guest-lecture-at.html
>
> Here's some more on the topic:
>
> http://stackoverflow.com/questions/3254758/memory-footprint-of-haskell-data-types
> http://stackoverflow.com/questions/6574444/how-to-find-out-ghcs-memory-representations-of-data-types
>
> > I've written a small driver test program that just parses the CSV,
> > finds the minimum value for a couple of the Float fields, and exits.
> > In the process monitor the memory usage is 6.9G before the program
> > exits. I've tried profiling with +RTS -hc but it ran for >3 hours
> > without finishing; it normally finishes within 4 minutes. Anyone have
> > any ideas for me? Things to try?
> > Thanks,
> > Andrew
>
> You could try to use a 32-bit GHC, which would use about half the
> memory. You're at the limit of the size of data that you can
> comfortably fit in memory on a normal desktop machine, so it might be
> time to consider a streaming approach.
>
> -- Johan

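To make the word-alignment rule quoted above concrete, here is a rough sketch of the arithmetic for the 15 primitive fields only (3 Int + 12 Float). It assumes a 64-bit, non-profiled build where every heap object has a one-word header, a boxed Int or Float is a separate two-word object, and a list cons cell is three words. The names `objectWords`, `primRecordWords` and `listBytes` are made up for illustration, and the two Text fields and the UTCTime are deliberately left out because their size depends on library internals.

```haskell
-- Bytes per machine word on a 64-bit platform.
wordBytes :: Int
wordBytes = 8

-- Words for a heap object with n word-sized fields:
-- one header word plus one word per field (assumption: no profiling).
objectWords :: Int -> Int
objectWords n = 1 + n

-- Words for a record whose n fields are all primitives (Int, Float, ...).
-- Unpacked: each payload is stored inline in one word.
-- Boxed: each field is a pointer to its own two-word box.
primRecordWords :: Bool -> Int -> Int
primRecordWords unpacked n
  | unpacked  = objectWords n
  | otherwise = objectWords n + n * objectWords 1

-- Total bytes for a list of `count` such records, including one
-- three-word cons cell (header, head pointer, tail pointer) per element.
listBytes :: Bool -> Int -> Int -> Int
listBytes unpacked n count =
  count * (primRecordWords unpacked n + objectWords 2) * wordBytes
```

Loaded into GHCi, `listBytes True 15 1400000` gives 212800000 (about 213 MB) and `listBytes False 15 1400000` gives 548800000 (about 549 MB). These are lower bounds for the primitive fields alone, not a prediction of the total footprint.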
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe