What if we use cStringIO? Instead of heavy disk I/O while generating .idDict, .seqIDdict, and the Collection, we could write everything into a cStringIO buffer and save it to a file once the job is done:

    import cStringIO

    output = cStringIO.StringIO()
    output.write('test\n')            # DO ALL JOBS IN MEMORY RATHER THAN HDD

    outfile = open('msa.idDict', 'wb')
    outfile.write(output.getvalue())  # SAVE ALL AT ONCE
    outfile.close()
    output.close()
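For the .idDict/.seqIDdict case specifically, the same "build it in RAM, write it once" idea could also be done with an ordinary dict that is flushed to a shelve at the end. A rough sketch (the toy data and the 'msa.seqIDdict' file name are made up; this is not pygr's actual code):

    import shelve

    # Stage all of the id mappings in an ordinary in-memory dict first...
    seqIDdict = {}
    for nlmsaID, seqID in enumerate(['chr1', 'chr2', 'read0001']):  # toy data
        seqIDdict[seqID] = (nlmsaID, 0, 0)      # (nlmsaID, nsID, offset)

    # ...then flush everything to the on-disk shelve in one pass at the end.
    db = shelve.open('msa.seqIDdict', 'n')
    db.update(seqIDdict)
    db.close()

This keeps the per-record work purely in memory and leaves one bulk write at the end, which is basically the "store in memory and save later" idea from the quoted message below.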
We could also do some work on classutil.open_shelve, I think.

Yours,
Namshin Kim

On Fri, May 8, 2009 at 11:51 PM, Namshin Kim <n...@rna.kr> wrote:
> I am trying to build a pairwise NLMSA from whole-genome sequencing data
> and found that pygr NLMSA spends most of its time generating .idDict and
> .seqIDdict rather than adding the annotations (for tens of millions of
> records: hours to generate .idDict and .seqIDdict, but only tens of
> minutes to add the annotations). Maybe we could do a little more work on
> this? Maybe screed indexing? Saving annotations into a Collection may
> also be slow...
>
> seqIDdict[tmp] = (nlmsaID, nsID, offset)  # SAVE THIS ENTRY
> IDdict[str(nlmsaID)] = (tmp, nsID)
>
> Isn't this just saving a tuple of C integers? Correct me if I am wrong.
> In the above case, does the shelve serialize the Python tuple object?
> What about saving a C data structure instead... just a thought. Or, if
> .idDict and .seqIDdict only take a few gigabytes of memory, just keep
> them in memory and save them later, because recent Linux servers have
> plenty of RAM...
>
> Namshin Kim
>
> On Sun, Apr 12, 2009 at 10:26 PM, Namshin Kim <n...@rna.kr> wrote:
>
>> One more thing to do is to build visualization tools for both
>> AnnotationDB & NLMSA.
>>
>> On Sun, Apr 12, 2009 at 7:21 PM, Namshin Kim <n...@rna.kr> wrote:
>>
>>> Hi all,
>>> Here are some of my thoughts about what we should do in order to
>>> handle next-generation sequencing data with pygr. I am working on a
>>> data analysis platform for these kinds of data, which are far too
>>> large to handle without proper tools.
>>>
>>> Handling sequences with quality scores.
>>> There are two types of quality scores: Phred scores (0 to 40;
>>> ord(char) - 33 gives you the score, so 'I' is the maximum) and
>>> Illumina/Solexa scores (-40 to 40; ord(char) - 64 gives you the score,
>>> so 'h' is the maximum). We may need to process those scores according
>>> to their type; at the very least we should supply one method to
>>> convert them.
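A minimal sketch of such a conversion helper, following the ASCII offsets quoted above (the function name and the example quality strings are made up, not an existing pygr/screed API):

    def decode_quality(qual_string, offset=33):
        """Decode a FASTQ quality string into a list of integer scores.

        Use offset=33 for the Phred/Sanger-style encoding and offset=64
        for the Illumina/Solexa-style encoding described above.
        """
        return [ord(c) - offset for c in qual_string]

    phred_scores = decode_quality('IIIIHHHH@@', offset=33)   # Sanger-style FASTQ
    solexa_scores = decode_quality('hhhhgggg^^', offset=64)  # Illumina/Solexa-style FASTQ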
>>> Data types.
>>> Mostly, next-generation sequencing data come as FASTQ files with
>>> quality scores. In the case of AB SOLiD, the sequences are represented
>>> in color space by the four digits 0, 1, 2 and 3, preceded by a primer
>>> base such as 'T'; we may need conversion tools for AB SOLiD. Another
>>> important question is how to handle paired-end sequencing data. Is
>>> screed capable of handling paired-end data? I think we should save one
>>> screed database instead of two; in that case the sequencing direction
>>> (RF for short-insert libraries, FR for long-insert libraries) should be
>>> saved as well. Read length does not have to be short, because some
>>> single-molecule sequencers will give long reads. Additional
>>> information, such as adaptors and restriction enzyme binding sites, may
>>> also need to be saved along with the sequences and qualities. We may
>>> need to expand the dictionary keys in screed for these.
>>>
>>> Alignment tools and parsers.
>>> Until 2007, most sequence alignment tools used genome hashing with an
>>> 11 bp word size (BLAST, BLAT) or 24 bp (GMAP). In 2008, some of them
>>> used longer seeds (ELAND2) or split the query reads into several pieces
>>> (SOAP), at the expense of large memory requirements. But there is one
>>> problem when we try to analyze those outputs: is the output sorted or
>>> not? If the alignments are sorted, we can scan the files in order; if
>>> not, we need a very powerful tool - pygr would be one of them - to
>>> extract information (variation, the variome) out of them. In an effort
>>> to produce *sorted* output, many tools instead use a query-hashing
>>> policy (MAQ, RMAP, ZOOM, SHRiMP, SeqMap).
>>>
>>> Actually, if we use sorted output, the pygr NLMSA building procedure
>>> will be very fast. Think about human whole-genome re-sequencing data
>>> available in the public databases (EBI ERA or NCBI SRA), averaging
>>> 20-30X coverage. I have two 8-core servers with 32 GB RAM and 8 TB of
>>> disk; it would take months with one of the tools listed above. In 2009,
>>> some tools revolutionized alignment speed (SOAP2, BOWTIE, BWA) - one
>>> step closer to *personal genomes on a PC*. They use the BWT
>>> (Burrows-Wheeler Transform). We can finish genome mapping within 10
>>> days on a single desktop server, but the output is *not* sorted,
>>> because they index the genome (BWT) rather than the reads. So the
>>> question is how many parsers we need. There is no standard format for
>>> short-read alignments yet; one effort, by BWA, uses the SAM format. In
>>> order to build pygr NLMSAs I used the axtNet format - in fact I
>>> converted the alignments from the tools above into very large axtNet
>>> files, because the only formats pygr currently supports at C-level
>>> speed are MAF and axtNet, and axtNet is pairwise alignment onto the
>>> reference genome. I think we need an NLMSA builder written in C, and it
>>> should be customizable because of the many non-standard output formats;
>>> converting/saving into yet another format is a waste of time and disk
>>> space.
>>>
>>> What information to save.
>>> Let's focus on re-sequencing projects. What we have to see is the
>>> *variation*. That means, if a query read is exactly the same as the
>>> reference genome, all we need from it is *coverage* and *position*.
>>> That can be handled by pygr AnnotationDB, because it stores only
>>> genomic positions. But if the query read differs, we should save that
>>> *variation* information; in this case we can save it as a pygr pairwise
>>> NLMSA (supported only for the axtNet format for now). There is one
>>> little(?) issue to mention, though: if we build a full NLMSA for 30X
>>> re-sequencing data, it takes much, much more space, because each entry
>>> costs 24 bytes to store. What I am proposing is a combination of
>>> AnnotationDB and pairwise NLMSA - maybe it already is. Another question
>>> is how we handle insertions and deletions: NLMSA can handle indels via
>>> axtNet, but AnnotationDB cannot. Please correct me if I am wrong.
>>>
>>> Variation extraction from pygr.
>>> We need convenient methods/tools to extract information from pygr
>>> according to data type. Say we want to see only single-base-pair
>>> mismatches (candidate SNPs) along with their qualities: just one method
>>> like NLMSA.extract_single_basepair_variation(PhredScoreCutoff=30) or
>>> something, and the same for indels. For paired-end data, the distance
>>> between mates is very important; it can be long, or on a different
>>> chromosome due to chromosomal rearrangement. Coverage information is
>>> important for copy-number analysis. I used HDF5 (the h5py Python
>>> module) for this purpose, because I need just one byte per nucleotide,
>>> as opposed to the 24 bytes per nucleotide that pygr uses. We can build
>>> many different kinds of convenient methods to extract information from
>>> next-generation sequencing data according to their types.
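A minimal sketch of that one-byte-per-base coverage idea, using h5py (the file name, dataset name, chromosome length and toy intervals are all made up):

    import numpy
    import h5py

    chrom_length = 1000000                      # use the real chromosome length
    f = h5py.File('coverage.h5', 'w')
    cov = f.create_dataset('chr1', shape=(chrom_length,), dtype='uint8',
                           fillvalue=0)         # one byte per reference base

    # Increment coverage over each aligned interval (toy intervals shown).
    for start, stop in [(100, 136), (120, 156)]:
        block = cov[start:stop].astype(numpy.uint16) + 1  # widen to avoid wrap-around
        cov[start:stop] = numpy.minimum(block, 255)       # saturate at the uint8 maximum

    f.close()

At one byte per base, coverage for the whole human genome is roughly 3 GB on disk, versus the ~24 bytes per position mentioned above for an NLMSA.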
>>> NLMSA join.
>>> This is one of the major things for the 0.9 release. What we need is
>>> to query all Annotation/NLMSA databases at the same time and merge
>>> their outputs.
>>>
>>> Anyway, we need a test dataset everybody can work with. I will add
>>> more things as they occur to me. Please let me know what you think.
>>>
>>> Yours,
>>> Namshin Kim