What if we use cStringIO? Instead of heavy disk I/O while generating .idDict, .seqIDdict, and the Collection, we could write everything into a cStringIO buffer and save it to a file once the job is done:

    import cStringIO

    output = cStringIO.StringIO()
    output.write('test\n')            # DO ALL JOBS IN MEMORY RATHER THAN HDD

    outfile = open('msa.idDict', 'wb')
    outfile.write(output.getvalue())  # SAVE ALL AT ONCE
    outfile.close()
    output.close()
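For the .idDict/.seqIDdict case specifically, the same "build it in RAM, write it once" idea could also be done with an ordinary dict that is flushed to a shelve at the end. A rough sketch (the toy data and the 'msa.seqIDdict' file name are made up; this is not pygr's actual code):

    import shelve

    # Stage all of the id mappings in an ordinary in-memory dict first...
    seqIDdict = {}
    for nlmsaID, seqID in enumerate(['chr1', 'chr2', 'read0001']):  # toy data
        seqIDdict[seqID] = (nlmsaID, 0, 0)      # (nlmsaID, nsID, offset)

    # ...then flush everything to the on-disk shelve in one pass at the end.
    db = shelve.open('msa.seqIDdict', 'n')
    db.update(seqIDdict)
    db.close()

This keeps the per-record work purely in memory and leaves one bulk write at the end, which is basically the "store in memory and save later" idea from the quoted message below.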
We could also do some work on classutil.open_shelve, I think.

Yours,
Namshin Kim

On Fri, May 8, 2009 at 11:51 PM, Namshin Kim <n...@rna.kr> wrote:
> I am trying to build a pairwise NLMSA from whole-genome sequencing data
> and found that pygr NLMSA spends most of its time generating .idDict and
> .seqIDdict rather than adding the annotations (for tens of millions of
> records: hours to generate .idDict and .seqIDdict, but only tens of
> minutes to add the annotations). Maybe we could do a little more work on
> this? Maybe screed indexing? Saving annotations into a Collection may
> also be slow...
>
> seqIDdict[tmp] = (nlmsaID, nsID, offset)  # SAVE THIS ENTRY
> IDdict[str(nlmsaID)] = (tmp, nsID)
>
> Isn't this just saving a tuple of C integers? Correct me if I am wrong.
> In the above case, does the shelve serialize the Python tuple object?
> What about saving a C data structure instead... just a thought. Or, if
> .idDict and .seqIDdict only take a few gigabytes of memory, just keep
> them in memory and save them later, because recent Linux servers have
> plenty of RAM...
>
> Namshin Kim
>
> On Sun, Apr 12, 2009 at 10:26 PM, Namshin Kim <n...@rna.kr> wrote:
>
>> One more thing to do is to build visualization tools for both
>> AnnotationDB & NLMSA.
>>
>> On Sun, Apr 12, 2009 at 7:21 PM, Namshin Kim <n...@rna.kr> wrote:
>>
>>> Hi all,
>>> Here are some of my thoughts about what we should do in order to
>>> handle next-generation sequencing data with pygr. I am working on a
>>> data analysis platform for these kinds of data, which are far too
>>> large to handle without proper tools.
>>>
>>> Handling sequences with quality scores.
>>> There are two types of quality scores: Phred scores (0 to 40;
>>> ord(char) - 33 gives you the score, so 'I' is the maximum) and
>>> Illumina/Solexa scores (-40 to 40; ord(char) - 64 gives you the score,
>>> so 'h' is the maximum). We may need to process those scores according
>>> to their type; at the very least we should supply one method to
>>> convert them.
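A minimal sketch of such a conversion helper, following the ASCII offsets quoted above (the function name and the example quality strings are made up, not an existing pygr/screed API):

    def decode_quality(qual_string, offset=33):
        """Decode a FASTQ quality string into a list of integer scores.

        Use offset=33 for the Phred/Sanger-style encoding and offset=64
        for the Illumina/Solexa-style encoding described above.
        """
        return [ord(c) - offset for c in qual_string]

    phred_scores = decode_quality('IIIIHHHH@@', offset=33)   # Sanger-style FASTQ
    solexa_scores = decode_quality('hhhhgggg^^', offset=64)  # Illumina/Solexa-style FASTQ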
>>> Data types.
>>> Mostly, next-generation sequencing data come as FASTQ files with
>>> quality scores. In the case of AB SOLiD, the sequences are represented
>>> in color space by the four digits 0, 1, 2 and 3, preceded by a primer
>>> base such as 'T'; we may need conversion tools for AB SOLiD. Another
>>> important question is how to handle paired-end sequencing data. Is
>>> screed capable of handling paired-end data? I think we should save one
>>> screed database instead of two; in that case the sequencing direction
>>> (RF for short-insert libraries, FR for long-insert libraries) should be
>>> saved as well. Read length does not have to be short, because some
>>> single-molecule sequencers will give long reads. Additional
>>> information, such as adaptors and restriction enzyme binding sites, may
>>> also need to be saved along with the sequences and qualities. We may
>>> need to expand the dictionary keys in screed for these.
>>>
>>> Alignment tools and parsers.
>>> Until 2007, most sequence alignment tools used genome hashing with an
>>> 11 bp word size (BLAST, BLAT) or 24 bp (GMAP). In 2008, some of them
>>> used longer seeds (ELAND2) or split the query reads into several pieces
>>> (SOAP), at the expense of large memory requirements. But there is one
>>> problem when we try to analyze those outputs: is the output sorted or
>>> not? If the alignments are sorted, we can scan the files in order; if
>>> not, we need a very powerful tool - pygr would be one of them - to
>>> extract information (variation, the variome) out of them. In an effort
>>> to produce *sorted* output, many tools instead use a query-hashing
>>> policy (MAQ, RMAP, ZOOM, SHRiMP, SeqMap).
>>>
>>> Actually, if we use sorted output, the pygr NLMSA building procedure
>>> will be very fast. Think about human whole-genome re-sequencing data
>>> available in the public databases (EBI ERA or NCBI SRA), averaging
>>> 20-30X coverage. I have two 8-core servers with 32 GB RAM and 8 TB of
>>> disk; it would take months with one of the tools listed above. In 2009,
>>> some tools revolutionized alignment speed (SOAP2, BOWTIE, BWA) - one
>>> step closer to *personal genomes on a PC*. They use the BWT
>>> (Burrows-Wheeler Transform). We can finish genome mapping within 10
>>> days on a single desktop server, but the output is *not* sorted,
>>> because they index the genome (BWT) rather than the reads. So the
>>> question is how many parsers we need. There is no standard format for
>>> short-read alignments yet; one effort, by BWA, uses the SAM format. In
>>> order to build pygr NLMSAs I used the axtNet format - in fact I
>>> converted the alignments from the tools above into very large axtNet
>>> files, because the only formats pygr currently supports at C-level
>>> speed are MAF and axtNet, and axtNet is pairwise alignment onto the
>>> reference genome. I think we need an NLMSA builder written in C, and it
>>> should be customizable because of the many non-standard output formats;
>>> converting/saving into yet another format is a waste of time and disk
>>> space.
>>>
>>> What information to save.
>>> Let's focus on re-sequencing projects. What we have to see is the
>>> *variation*. That means, if a query read is exactly the same as the
>>> reference genome, all we need from it is *coverage* and *position*.
>>> That can be handled by pygr AnnotationDB, because it stores only
>>> genomic positions. But if the query read differs, we should save that
>>> *variation* information; in this case we can save it as a pygr pairwise
>>> NLMSA (supported only for the axtNet format for now). There is one
>>> little(?) issue to mention, though: if we build a full NLMSA for 30X
>>> re-sequencing data, it takes much, much more space, because each entry
>>> costs 24 bytes to store. What I am proposing is a combination of
>>> AnnotationDB and pairwise NLMSA - maybe it already is. Another question
>>> is how we handle insertions and deletions: NLMSA can handle indels via
>>> axtNet, but AnnotationDB cannot. Please correct me if I am wrong.
>>>
>>> Variation extraction from pygr.
>>> We need convenient methods/tools to extract information from pygr
>>> according to data type. Say we want to see only single-base-pair
>>> mismatches (candidate SNPs) along with their qualities: just one method
>>> like NLMSA.extract_single_basepair_variation(PhredScoreCutoff=30) or
>>> something, and the same for indels. For paired-end data, the distance
>>> between mates is very important; it can be long, or on a different
>>> chromosome due to chromosomal rearrangement. Coverage information is
>>> important for copy-number analysis. I used HDF5 (the h5py Python
>>> module) for this purpose, because I need just one byte per nucleotide,
>>> as opposed to the 24 bytes per nucleotide that pygr uses. We can build
>>> many different kinds of convenient methods to extract information from
>>> next-generation sequencing data according to their types.
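A minimal sketch of that one-byte-per-base coverage idea, using h5py (the file name, dataset name, chromosome length and toy intervals are all made up):

    import numpy
    import h5py

    chrom_length = 1000000                      # use the real chromosome length
    f = h5py.File('coverage.h5', 'w')
    cov = f.create_dataset('chr1', shape=(chrom_length,), dtype='uint8',
                           fillvalue=0)         # one byte per reference base

    # Increment coverage over each aligned interval (toy intervals shown).
    for start, stop in [(100, 136), (120, 156)]:
        block = cov[start:stop].astype(numpy.uint16) + 1  # widen to avoid wrap-around
        cov[start:stop] = numpy.minimum(block, 255)       # saturate at the uint8 maximum

    f.close()

At one byte per base, coverage for the whole human genome is roughly 3 GB on disk, versus the ~24 bytes per position mentioned above for an NLMSA.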
>>> NLMSA join.
>>> This is one of the major things for the 0.9 release. What we need is
>>> to query all Annotation/NLMSA databases at the same time and merge
>>> their outputs.
>>>
>>> Anyway, we need a test dataset everybody can work with. I will add
>>> more things as they occur to me. Please let me know what you think.
>>>
>>> Yours,
>>> Namshin Kim