[miso-users] sashimi -- proposed code to use all reads

Sol Katzman Thu, 04 Jan 2018 11:04:31 -0800

Dear Sashimi Users,

*Summary*:
I have written some code to upgrade sashimi so that it uses all reads,
instead of filtering out reads with multiple splice junction and/or indels.


*Background*:
It is well documented that sashimi does not use all reads:

"For junction visualization, sashimi_plot currently uses only reads that
 cross a single junction. If a read crosses multiple exon-exon junctions,
 it is currently skipped, although MISO will use such a read in isoform
 estimation if it consistent with the given isoform annotation. Also,
 sashimi_plot currently ignores reads containing insertions or
 deletions and does not visualize sequence mismatches."

This has become more problematic as:
a) read length has increased, so that a higher fraction of  RNAseq reads span 
multiple junctions
b) exons of interest have decreased in length, so that RNAseq reads covering 
such exons are
   likely to span multiple junctions.

This can lead to situations where the sashimi plot does not accurately reflect 
the MISO call
of PSI. In particular, the discarding of reads with multiple junctions leads to 
a systematic
bias against the *included *isoform (in SE events), since the included form has 
an
internal exon flanked by 2 introns, while the excluded form has only one intron.

*Old Code*:
The relevant existing code in the MISO 0.5.4 release, using the Python Virtual 
Environment,
is found  in this file:

$misoVirtualenv/lib/python2.7/site-packages/misopy/sashimi_plot/plot_utils/plot_gene.py

The method "readsToWiggle_pysam()" first filters the reads, then calculates 
both:
a) the junction counts
b) the (mostly) exonic coverage

*New Code:*
Versions of pysam more recent than were apparently available at the time that 
sashimi
was originally written, have a function "find_introns()" for counting 
intron-spanning
reads from a set of aligned reads.
Pysam also has the function "get_reference_positions" for getting the positions 
in the reference covered by
an aligned reads.

Both of these 2 functions apparently are able to parse arbitrary cigar strings 
and give the
correct results. So I have used these functions to calculate the junction 
counts and coverage.

Note: the original code also used "get_reference_positions()", under its original name 
"positions()".
So it did not need to filter the reads for item (b).

Below find the new code. There are some commented out calls to the old code and 
print
statements, which for debugging purposes will compare the new and old results. 
Since the
new code was pretty short, I left it inline instead of using function calls.

I ran this with pysam 0.13.0  I don't know when "find_introns()" was first 
introduced.
I also did not test MISO with this version of pysam.

I hope others find this useful.

/Sol Katzman
--------------------------------------------------------------------------
In plot_gene.py, the beginning of function plot_density_single() is now:

    samfile = pysam.AlignmentFile(bam_filename, 'rb')
    try:
        # need several copies of iterator to be used with different methods
        subset_reads_copy1 = samfile.fetch(reference=chrom, start=tx_start, 
end=tx_end)
        subset_reads_copy2 = samfile.fetch(reference=chrom, start=tx_start, 
end=tx_end)
        subset_reads_copy3 = samfile.fetch(reference=chrom, start=tx_start, 
end=tx_end)
    except ValueError as e:
        print "Error retrieving reads from %s: %s" %(chrom, str(e))
        print "Are you sure %s appears in your BAM file?" %(chrom)
        print "Aborting plot..."
        return axvar

    # find_introns() returns a dictionary {(start, stop): count}
    # need to add 1 to stop for consistency with old code.
    # then join (start,stop) tuple to a string for consistency with old code.
    # use copy of reads iterator to get introns
    jxns = {}
    introns = samfile.find_introns(read for read in subset_reads_copy1 if "N" 
in read.cigarstring)
    for key, count in introns.iteritems():
        leftss  = key[0]
        rightss = key[1]+1
        if leftss > tx_start and leftss < tx_end \
               and rightss > tx_start and rightss < tx_end:
            jxn = ":".join(map(str, [leftss, rightss]))
            jxns[jxn] = count


    # use copy of reads iterator to get wiggle
    wiggle = zeros((tx_end - tx_start + 1), dtype='f')
    for read in subset_reads_copy2:
        # fetch that specifies a region should not return unaligned reads,
        # but legacy code had this check just in case?
        if read.cigartuples is None:
            print "Skipping unaligned read (with no CIGAR string) %s" % 
read.query_name
            continue
        read_align_length = read.query_alignment_length
        for pos in read.get_reference_positions() :
            if pos < tx_start or pos > tx_end:
                continue
            wig_index = pos-tx_start
            wiggle[wig_index] += 1.0/read_align_length

    # # for debug, use copy of reads iterator to get old versions of wiggle,jxns
    # oldwiggle, oldjxns = readsToWiggle_pysam(subset_reads_copy3, tx_start, 
tx_end)

    # print "============= oldjxns ================="
    # for key, value in oldjxns.iteritems():
    #    print "oldjxns: %s: %s" % (key, str(value))
    #
    # print "============= jxns ================="
    # for key, value in jxns.iteritems():
    #     print "jxns: %s: %s" % (key, str(value))

    # print "============= oldwiggle  %s entries (step 200) =================" 
% (len(oldwiggle))
    # for i in range(0, len(oldwiggle),200) :
    #    print "oldwiggle: %s: %s" % (str(i), str(oldwiggle[i]))

    # print "============= wiggle  %s entries (step 200) =================" % 
(len(wiggle))
    # for i in range(0, len(wiggle),200) :
    #    print "wiggle: %s: %s" % (str(i), str(wiggle[i]))

    wiggle = 1e3 * wiggle / coverage

_______________________________________________
miso-users mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/miso-users

[miso-users] sashimi -- proposed code to use all reads

Reply via email to