I am trying to build a pairwise NLMSA from whole-genome sequencing data, and
I found that pygr NLMSA spends most of its time generating .idDict and
.seqIDdict rather than adding the annotations (for tens of millions of
records, it takes hours to generate .idDict and .seqIDdict, but only tens of
minutes to add the annotations). Maybe we could do a little more work on
this? Maybe screed indexing? Saving annotations into Collection may be the
slow part...

      seqIDdict[tmp]=(nlmsaID,nsID,offset) # SAVE THIS ENTRY
      IDdict[str(nlmsaID)]=(tmp,nsID) # REVERSE MAP: NLMSA ID -> (seq name, nsID)

Aren't these just tuples of C integers? Correct me if I am wrong. In the
above case, does the shelve serialize (pickle) each Python tuple object in
order to save it? What about saving a C data structure instead... just a
thought. Or, if .idDict and .seqIDdict only use a few gigabytes of memory,
we could keep them in memory and save them at the end, because recent Linux
servers have plenty of RAM...
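
A minimal sketch of that in-memory idea (the names mirror the snippet above
but this is not pygr's actual build code): accumulate the two mappings in
plain dicts during the build, then write them to disk in one pass at the
end. Note that shelve does pickle each value on assignment.

    import shelve

    seqIDdict = {}  # seq name -> (nlmsaID, nsID, offset), kept in RAM
    IDdict = {}     # str(nlmsaID) -> (seq name, nsID)

    def flush_to_shelve(mem_dict, path):
        '''Write an in-memory dict to a shelve file in a single pass.'''
        db = shelve.open(path, 'n')  # 'n': create a new, empty database
        try:
            for key, value in mem_dict.iteritems():
                db[key] = value      # shelve pickles the tuple here
        finally:
            db.close()

    # after all alignment records have been added:
    # flush_to_shelve(seqIDdict, 'alignment.seqIDdict')
    # flush_to_shelve(IDdict, 'alignment.idDict')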

Namshin Kim



On Sun, Apr 12, 2009 at 10:26 PM, Namshin Kim <n...@rna.kr> wrote:

> One more thing to do is to make visualization tools for both AnnotationDB &
> NLMSA.
>
>
> On Sun, Apr 12, 2009 at 7:21 PM, Namshin Kim <n...@rna.kr> wrote:
>
>> Hi all,
>> Here are some of my thoughts about what we should do in order to handle
>> next-generation sequencing data with pygr. I am working on a data analysis
>> platform for these kinds of data, which are extremely huge to handle
>> without proper tools.
>>
>> Handling sequences with quality scores.
>> There are two types of quality scores: Phred scores (0 to 40; the score is
>> ord(char) - 33, so ord('I') - 33 gives the top score of 40) and
>> Illumina/Solexa scores (-5 to 40; the score is ord(char) - 64, so
>> ord('h') - 64 gives 40). We may need to process those scores according to
>> their type; at the least, we should supply one method to convert between
>> the two encodings.
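>>
>> A minimal sketch of such conversion methods (the function names are mine,
>> not an existing pygr or screed API; the Solexa-to-Phred formula is the
>> standard log-odds conversion):
>>
>>     import math
>>
>>     def quality_scores(qual, offset=33):
>>         '''Decode a FASTQ quality string into a list of integer scores.
>>         Use offset=33 for Phred/Sanger, offset=64 for Illumina/Solexa.'''
>>         return [ord(c) - offset for c in qual]
>>
>>     def solexa_to_phred(q):
>>         '''Convert a Solexa log-odds score to the Phred scale.'''
>>         return 10 * math.log10(10 ** (q / 10.0) + 1)
>>
>>     quality_scores('II;I')             # Phred/Sanger: [40, 40, 26, 40]
>>     quality_scores('hh;h', offset=64)  # Solexa: [40, 40, -5, 40]
>>     # solexa_to_phred(-5) -> ~1.19 on the Phred scale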
>>
>> Data types.
>> Mostly, next-generation sequencing data come in FASTQ format with quality
>> scores. In the case of AB SOLiD, the sequences are represented in color
>> space: four digits (0, 1, 2, 3) preceded by a primer base such as 'T'. We
>> may need conversion tools for AB SOLiD (see the sketch below). Another
>> important issue is how to handle paired-end sequencing data. Is screed
>> capable of handling paired-end data? I think we should save one screed
>> database instead of two; in that case, the sequencing orientation - FR for
>> short-insert libraries, RF for long-insert libraries - should be saved.
>> Read length does not have to be short, because some single-molecule
>> sequencers will give long reads. Additional information may also need to
>> be saved along with the sequences and qualities, such as adaptors,
>> restriction enzyme binding sites, etc. We may need to expand the
>> dictionary keys in screed for these.
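>>
>> For AB SOLiD, a minimal color-space-to-base-space decoder might look like
>> this (a sketch based on the standard di-base color code, not an existing
>> screed feature):
>>
>>     # color code: 0 = same base; 1 = A<->C, G<->T; 2 = A<->G, C<->T;
>>     # 3 = A<->T, C<->G
>>     _COLOR = {
>>         ('A','0'):'A', ('A','1'):'C', ('A','2'):'G', ('A','3'):'T',
>>         ('C','0'):'C', ('C','1'):'A', ('C','2'):'T', ('C','3'):'G',
>>         ('G','0'):'G', ('G','1'):'T', ('G','2'):'A', ('G','3'):'C',
>>         ('T','0'):'T', ('T','1'):'G', ('T','2'):'C', ('T','3'):'A',
>>     }
>>
>>     def decode_colorspace(read):
>>         '''Decode a csfasta read (primer base + color digits) to bases.'''
>>         base = read[0]             # the primer base itself is not reported
>>         out = []
>>         for color in read[1:]:
>>             base = _COLOR[(base, color)]  # each color encodes a transition
>>             out.append(base)
>>         return ''.join(out)
>>
>>     decode_colorspace('T320010')  # -> 'AGGGTT'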
>>
>> Alignment tools and parsers.
>> Until 2007, most sequence alignment tools used genome hashing with an
>> 11 bp word size (BLAST, BLAT) or a 24 bp one (GMAP). In 2008, some tools
>> used longer seeds (ELAND2) or split the query reads into several pieces
>> (SOAP), at the expense of large memory requirements. But there is one
>> problem when we try to analyze those outputs: is the output sorted or
>> not? If the alignments are sorted, we can scan the files in order; if
>> not, we need a very powerful tool - pygr would be one of them - to
>> extract information (variation, the variome) out of them. In an effort to
>> produce *sorted* output, many tools use a query-hashing policy instead
>> (MAQ, RMAP, ZOOM, SHRiMP, SeqMap). Actually, if we use sorted output, the
>> pygr NLMSA building procedure will be very fast.
>>
>> Let's think about the human whole-genome re-sequencing data available in
>> the public databases (EBI ERA or NCBI SRA), averaging 20-30X coverage. I
>> have two 8-core, 32 GB RAM, 8 TB servers, and it would take months to
>> align such data with one of the tools listed above. In 2009, some tools
>> revolutionized alignment speed (SOAP2, BOWTIE, BWA) - one step toward
>> *personal genomes in a PC*. They use the BWT (Burrows-Wheeler Transform)
>> to index the genome. With them we can finish genome mapping within 10
>> days on a single desktop server, but the output is *not* sorted, because
>> a genome index reports hits in read order rather than in genome order.
>>
>> So, the question is how many parsers we need. There is no standard format
>> for short-read alignments yet; one effort is BWA's SAM format. In order
>> to build a pygr NLMSA, I used the axtNet format: I converted the
>> alignments produced by the programs above into one very big axtNet file,
>> because pygr currently supports only two formats at C-level speed, MAF
>> and axtNet, and the latter represents pairwise alignments onto the
>> reference genome. I think we need an NLMSA builder written in C, and it
>> should be customizable, given the various non-standard output formats;
>> converting and saving into another format is a waste of time and disk
>> space.
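>>
>> For reference, building a pairwise NLMSA by hand currently looks roughly
>> like this (a sketch; parse_alignments() is a hypothetical parser for
>> whatever format the aligner emits, not a pygr function):
>>
>>     from pygr import cnestedlist, seqdb
>>
>>     genome = seqdb.SequenceFileDB('hg18.fa')  # reference genome
>>     reads = seqdb.SequenceFileDB('reads.fa')  # query reads
>>
>>     nlmsa = cnestedlist.NLMSA('readmap', mode='w', pairwiseMode=True)
>>     for chrom, start, stop, rid, rstart, rstop in \
>>             parse_alignments('aligner_output.txt'):  # hypothetical parser
>>         ival = genome[chrom][start:stop]
>>         nlmsa += ival                            # add reference interval
>>         nlmsa[ival] += reads[rid][rstart:rstop]  # align read interval to it
>>     nlmsa.build(saveSeqDict=True)
>>
>> A C-level builder would run this same loop inside the parser itself,
>> without the per-interval Python overhead.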
>>
>> What information to save.
>> Let's focus on re-sequencing projects. What we have to see is the
>> *variation*. That means, if a query read matches the reference genome
>> exactly, all we need from it is its *coverage* and *position*. That can
>> be handled by pygr AnnotationDB, because it saves only the genomic
>> position. But if the query reads differ from the reference, we should
>> save that *variation* information; in that case, we can save it as a pygr
>> pairwise NLMSA (supported only for the axtNet format for now). But there
>> is one little(?) issue to mention: if we build a full NLMSA for 30X
>> re-sequencing data, it will take much, much more space, because it
>> consumes 24 bytes to save each aligned interval. What I am proposing is a
>> combination of AnnotationDB and pairwise NLMSA - maybe it already is one.
>> Another question is how we can handle insertions and deletions. NLMSA can
>> handle indels via axtNet, but AnnotationDB cannot. Please correct me if I
>> am wrong.
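>>
>> For the exactly-matching reads, the AnnotationDB side might look roughly
>> like this (a sketch; the slice data and path names are illustrative, and
>> in newer pygr versions AnnotationDB lives in pygr.annotation rather than
>> pygr.seqdb):
>>
>>     from pygr import seqdb, cnestedlist
>>
>>     genome = seqdb.SequenceFileDB('hg18.fa')
>>     # one (id, start, stop) slice per exactly-matching read
>>     slices = {0: ('chr1', 10000, 10036), 1: ('chr1', 10010, 10046)}
>>     annodb = seqdb.AnnotationDB(slices, genome,
>>                                 sliceAttrDict=dict(id=0, start=1, stop=2))
>>
>>     annmap = cnestedlist.NLMSA('readanno', 'memory', pairwiseMode=True)
>>     for ann in annodb.itervalues():
>>         annmap.addAnnotation(ann)   # map each read annotation to the genome
>>     annmap.build()
>>
>>     # coverage at a base = number of read annotations overlapping it
>>     # (raises KeyError if nothing aligns there)
>>     hits = annmap[genome['chr1'][10020:10021]]
>>     coverage = len(hits.keys())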
>>
>> Variation extraction from pygr.
>> We need convenient methods/tools to extract information from pygr
>> according to the variation type. Say, if we want to see only single
>> base-pair mismatches (candidate SNPs) along with their qualities, it
>> should take just one method call, something like
>> NLMSA.extract_single_basepair_variation(PhredScoreCutoff = 30). The same
>> goes for indels. For paired-end data, the distance between the two mates
>> is very important: it can be long, or the mates can land on different
>> chromosomes, due to chromosomal rearrangement. Coverage information is
>> important for copy-number analysis. I used HDF5 (the h5py Python module)
>> for this purpose, because I need just one byte per nucleotide, as opposed
>> to the 24 bytes per nucleotide that pygr uses. We can provide many
>> different kinds of convenience methods for extracting information from
>> next-generation sequencing data according to their types.
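>>
>> A minimal sketch of that one-byte-per-nucleotide coverage layout (the
>> file and dataset names are illustrative; the length is hg18 chr1):
>>
>>     import h5py
>>
>>     f = h5py.File('coverage.h5', 'w')
>>     # one uint8 per base: coverage 0-255, exactly 1 byte per nucleotide
>>     cov = f.create_dataset('chr1', (247249719,), dtype='uint8')
>>
>>     def add_read(start, stop):
>>         '''Increment coverage over [start, stop), saturating at 255.'''
>>         block = cov[start:stop]   # read the slice into a numpy array
>>         block[block < 255] += 1   # avoid uint8 overflow
>>         cov[start:stop] = block   # write it back to the HDF5 file
>>
>>     add_read(10000, 10036)
>>     add_read(10010, 10046)
>>     f.close()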
>>
>> NLMSA join.
>> This is one of the major things for the 0.9 release. What we need is to
>> query all Annotation/NLMSA databases at the same time and merge their
>> outputs.
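>>
>> Until such a join exists, a minimal sketch of the semantics would be to
>> query each database for the same interval and chain the results (there is
>> no such API yet; this is only to illustrate the idea):
>>
>>     def query_all(ival, databases):
>>         '''Yield (db, aligned interval) pairs from every NLMSA or
>>         annotation map that has hits overlapping ival.'''
>>         for db in databases:
>>             try:
>>                 for hit in db[ival].keys():
>>                     yield db, hit
>>             except KeyError:  # nothing aligned to ival in this database
>>                 pass
>>
>>     # hypothetical usage, given several alignment databases:
>>     # variants = list(query_all(genome['chr1'][10000:10100],
>>     #                           [snp_nlmsa, indel_nlmsa, read_annmap]))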
>>
>> Anyway, we need a test dataset that everybody can work with. I will add
>> more things as they occur to me. Please let me know what you think.
>>
>> Yours,
>> Namshin Kim
>>
>>
>
