Hi,

On Tue, Apr 28, 2009 at 10:08 AM, David Henderson <[email protected]> wrote:
>
> My apologies for a tardy reply. I'll address all of the questions in this
> e-mail, rather than reply multiple times.
>
> 1) I used file to determine that the .exe files were 32bit. It is entirely
> possible that file returns 32bit for all .exe, rather than examining the
> file.
> 2) Is there a way to store char/string data as something smaller than
> UTF-16? The data are SNP genotypes, i.e. a single SNP genotype looks like
> A T and there are almost a million of these per individual. I'm thinking
> that what I need to do is record the genotype as bits, i.e. 0 or 1, and
> relate that back to a translation class that returns A or T when that SNP
> is queried. It would be simpler if I could store char/string data as
> something reasonably small.

Use the BitArray class. That's exactly what it's for. If it's possible for
you to store your genotype using bits as opposed to strings, you'll *vastly*
reduce your memory requirements.

Alan.

> 3) What I'm currently doing is:
>    a) read in each line as a single string which is split based upon
>       whitespace
>    b) input each SNP into a class which is stored in an ArrayList, or as a
>       string array in a List<string> (I've implemented it both ways)
>    c) once the whole file is read in, output each collection of SNPs by
>       chromosome to a different file for processing by other software
>
> I've been able to get past my initial problem by re-compiling mono with the
> large heap size GC and when the entire data is read in, it takes up 17GB RAM
> for a 300MB file. I know I'm new to mono/C#, but I've been programming in
> C++ for years and have written many commercial applications for large data
> and nothing I've written to date has been as memory hungry as this. I'm
> hopeful I can get some good suggestions on how to improve performance.
>
> Thanks!!
>
> Dave H
>
> ----- Original Message ----
> From: Jonathan Pryor <[email protected]>
> To: dnadavewa <[email protected]>
> Cc: [email protected]
> Sent: Friday, April 24, 2009 12:14:12 PM
> Subject: Re: [Mono-list] 64bit gmcs/mcs in SLES/openSuSE rpms?
>
> On Thu, 2009-04-23 at 14:20 -0700, dnadavewa wrote:
> > I'm working on a large data problem where I'm reading in data from text
> > files with almost 2 million columns. In doing this, I can read in about
> > 25 rows before Mono bombs out with an out of memory error.
>
> How are you reading in these lines?
>
> > What I found was the mono executable was indeed 64 bit, but gmcs.exe and
> > mcs.exe were 32 bit.
>
> As Chris Howie mentioned, these are actually in platform-neutral IL, and
> will be run using a 64-bit address space when using `mono`.
>
> > One other point, memory usage is horrible. I admit that I'm new to C#
> > and mono, so my coding skills are not as good as others, but a 300MB
> > file should not use 2GB RAM to read in 1/8 of the file.
>
> That depends ~entirely on how you're reading in the file.
>
> Also keep in mind that .NET strings are UTF-16, so if your input text is
> ASCII, you will require twice as much RAM as the size of the file, e.g.
> 600MB of RAM to store the entire file as a string. (Then there are
> various object overhead considerations, but these are likely tiny
> compared to the 300MB you're looking at.)
>
> > I stopped using classes to store the data and went with List<string>
> > and List<string[]> to read in this much data. Any comments on how I
> > might improve this performance would be appreciated.
>
> To provide any comments we'd need to know more about what you're trying
> to do. For example, reading a 300MB XML file using XmlDocument will
> require *lots* of RAM, as in addition to the UTF-16 string issue, each
> element, attribute, etc. will be represented as separate objects, with
> varying amounts of memory required. DOM would be something to avoid
> here, while XmlReader would be much better.
>
> The easiest question, though, is this: do you really need to keep the
> entire file contents in memory all at once?
>
> Or can you instead process each line independently (or while caching
> minimal data from one line to the next, so that the contents of previous
> lines don't need to be maintained)? This would allow you to remove your
> List<string>, and save a ton of memory.
>
> - Jon
>
> _______________________________________________
> Mono-list maillist - [email protected]
> http://lists.ximian.com/mailman/listinfo/mono-list
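
To make the BitArray suggestion a bit more concrete, here is a rough sketch
of the "bits plus translation class" idea from question 2. It assumes every
SNP has exactly two known alleles and no missing calls, so a genotype such
as "A T" fits in two bits. GenotypeStore and its members are names made up
for this example; the only real library type used is
System.Collections.BitArray.

```csharp
using System;
using System.Collections;

// Hypothetical helper: packs biallelic SNP genotypes into 2 bits each.
// Assumes every SNP has exactly two known alleles and no missing calls.
public sealed class GenotypeStore
{
    private readonly char[] firstAllele;   // allele encoded as bit 0, per SNP
    private readonly char[] secondAllele;  // allele encoded as bit 1, per SNP
    private readonly BitArray bits;        // 2 bits per SNP, one per called allele

    // The allele tables are shared by reference, so many individuals can
    // reuse the same two arrays.
    public GenotypeStore(char[] firstAllele, char[] secondAllele)
    {
        if (firstAllele.Length != secondAllele.Length)
            throw new ArgumentException("Allele tables must have equal length.");
        this.firstAllele = firstAllele;
        this.secondAllele = secondAllele;
        bits = new BitArray(2 * firstAllele.Length);
    }

    public int Count { get { return firstAllele.Length; } }

    // Store a genotype such as ('A', 'T') for the SNP at the given index.
    public void Set(int snp, char a, char b)
    {
        bits[2 * snp] = Encode(snp, a);
        bits[2 * snp + 1] = Encode(snp, b);
    }

    // Translate the stored bits back to text, e.g. "A T".
    public string Get(int snp)
    {
        return Decode(snp, bits[2 * snp]) + " " + Decode(snp, bits[2 * snp + 1]);
    }

    private bool Encode(int snp, char allele)
    {
        if (allele == firstAllele[snp]) return false;
        if (allele == secondAllele[snp]) return true;
        throw new ArgumentException("Unexpected allele '" + allele + "' at SNP " + snp);
    }

    private char Decode(int snp, bool bit)
    {
        return bit ? secondAllele[snp] : firstAllele[snp];
    }
}
```

With a layout like this, a million SNPs need two million bits, i.e. roughly
250KB per individual, instead of a million separate string objects. If the
data contains missing calls or more than two alleles per SNP, more bits per
genotype would be needed.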
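
Jon's point about processing each line independently could look roughly
like the sketch below. It assumes each row is whitespace-separated and that
you know which column range belongs to which chromosome; ChromosomeRange
and SnpSplitter are hypothetical names for this example, and the real file
layout may well differ. Only one row is held in memory at a time, and each
chromosome's slice is written out as soon as the row has been split.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical description of which columns belong to one chromosome.
public struct ChromosomeRange
{
    public string Name;  // key of the output writer, e.g. "chr1"
    public int Start;    // first column index (inclusive)
    public int End;      // last column index (exclusive)
}

public static class SnpSplitter
{
    private static readonly char[] Whitespace = { ' ', '\t' };

    // Reads the input one row at a time and appends each chromosome's
    // columns to the matching writer. No row is kept after it is written.
    public static void Split(TextReader input,
                             IList<ChromosomeRange> ranges,
                             IDictionary<string, TextWriter> outputs)
    {
        string line;
        while ((line = input.ReadLine()) != null)
        {
            string[] fields = line.Split(Whitespace,
                                         StringSplitOptions.RemoveEmptyEntries);
            foreach (ChromosomeRange range in ranges)
            {
                // Clamp to the row length so short rows do not throw.
                int count = Math.Min(range.End, fields.Length) - range.Start;
                if (count <= 0) continue;
                outputs[range.Name].WriteLine(
                    string.Join(" ", fields, range.Start, count));
            }
        }
    }
}
```

For real files you would pass a StreamReader for the input and one
StreamWriter per chromosome, and dispose them when done. The split still
creates one short-lived string per field for the current row, but those
become garbage as soon as the next row is read, so nothing accumulates the
way it does with a List<string> holding the whole file.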
