[med-svn] [rsem] 01/03: Imported Upstream version 1.2.22+dfsg

Michael Crusoe Thu, 27 Aug 2015 13:43:27 -0700

This is an automated email from the git hooks/post-receive script.

misterc-guest pushed a commit to branch master
in repository rsem.


commit 3a9288dbae8176742283fc83a5d058f202641ab6
Author: Michael R. Crusoe <[email protected]>
Date:   Thu Aug 27 13:37:24 2015 -0700

    Imported Upstream version 1.2.22+dfsg
---
 EBSeq/rsem-for-ebseq-find-DE |   7 +-
 PairedEndHit.h               |   4 +-
 PairedEndQModel.h            |   1 +
 README.md                    |  20 +++---
 SamParser.h                  |  23 +++++--
 WHAT_IS_NEW                  |  18 ++++++
 rsem-calculate-expression    | 150 ++++++++++++++++++++++++++++++++++++++++++-
 rsem-prepare-reference       |  72 +++++++++++++++++++--
 8 files changed, 268 insertions(+), 27 deletions(-)

diff --git a/EBSeq/rsem-for-ebseq-find-DE b/EBSeq/rsem-for-ebseq-find-DE
index f228bab..a448843 100755
--- a/EBSeq/rsem-for-ebseq-find-DE
+++ b/EBSeq/rsem-for-ebseq-find-DE
@@ -10,6 +10,7 @@ path <- argv[1]
 ngvector_file <- argv[2]
 data_matrix_file <- argv[3]
 output_file <- argv[4]
+norm_out_file <- paste0("norm_", data_matrix_file)
 
 nc <- length(argv) - 4;
 num_reps <- as.numeric(argv[5:(5+nc-1)])
@@ -23,6 +24,7 @@ if (sum(num_reps) != n) stop("Total number of replicates 
given does not match th
 
 conditions <- as.factor(rep(paste("C", 1:nc, sep=""), times = num_reps))
 Sizes <- MedianNorm(DataMat)
+NormMat <- GetNormalizedMat(DataMat, Sizes)
 ngvector <- NULL
 if (ngvector_file != "#") {
   ngvector <- as.vector(data.matrix(read.table(ngvector_file)))
@@ -37,8 +39,8 @@ if (nc == 2) {
   PP <- as.data.frame(GetPPMat(EBOut))
   fc_res <- PostFC(EBOut)
 
-  results <- cbind(PP, fc_res$PostFC, fc_res$RealFC)
-  colnames(results) <- c("PPEE", "PPDE", "PostFC", "RealFC")
+  results <- cbind(PP, fc_res$PostFC, 
fc_res$RealFC,unlist(EBOut$C1Mean)[rownames(PP)], 
unlist(EBOut$C2Mean)[rownames(PP)])
+  colnames(results) <- c("PPEE", "PPDE", "PostFC", "RealFC","C1Mean","C2Mean")
   results <- results[order(results[,"PPDE"], decreasing = TRUE),]
   write.table(results, file = output_file, sep = "\t")
   
@@ -68,3 +70,4 @@ if (nc == 2) {
   MultiFC <- GetMultiFC(MultiOut)
   write.table(MultiFC$CondMeans[ord,], file = paste(output_file, ".condmeans", 
sep = ""), sep = "\t")
 }
+write.table(NormMat, file = norm_out_file, sep = "\t")
diff --git a/PairedEndHit.h b/PairedEndHit.h
index 1242cf7..be3ea54 100644
--- a/PairedEndHit.h
+++ b/PairedEndHit.h
@@ -15,13 +15,13 @@ public:
                this->insertL = insertL;
        }
 
-       short getInsertL() const { return insertL; }
+       int getInsertL() const { return insertL; }
 
        bool read(std::istream&);
        void write(std::ostream&);
 
 private:
-       short insertL; // insert length
+       int insertL; // insert length
 };
 
 bool PairedEndHit::read(std::istream& in) {
diff --git a/PairedEndQModel.h b/PairedEndQModel.h
index 7aebb49..3864c80 100644
--- a/PairedEndQModel.h
+++ b/PairedEndQModel.h
@@ -126,6 +126,7 @@ public:
                const SingleReadQ& mate2 = read.getMate2();
                int m2pos = totLen - pos - insertLen;
                int m2dir = !dir;
+
                prob *= mld->getAdjustedProb(mate2.getReadLength(), 
hit.getInsertL()) *
                        qpro->getProb(mate2.getReadSeq(), mate2.getQScore(), 
ref, m2pos, m2dir);
 
diff --git a/README.md b/README.md
index dedffcd..4636079 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 README for RSEM
 ===============
 
-[Bo Li](http://pages.cs.wisc.edu/~bli) \(bli at cs dot wisc dot edu\)
+[Bo Li](http://bli25ucb.github.io/) \(bli at cs dot wisc dot edu\)
 
 * * *
 
@@ -89,7 +89,7 @@ To prepare the reference sequences, you should run the
     rsem-prepare-reference --help
 
 to get usage information or visit the [rsem-prepare-reference
-documentation 
page](http://deweylab.biostat.wisc.edu/rsem/rsem-prepare-reference.html).
+documentation page](rsem-prepare-reference.html).
 
 ### II. Calculating Expression Values
 
@@ -99,7 +99,7 @@ To calculate expression values, you should run the
     rsem-calculate-expression --help
 
 to get usage information or visit the [rsem-calculate-expression
-documentation 
page](http://deweylab.biostat.wisc.edu/rsem/rsem-calculate-expression.html).
+documentation page](rsem-calculate-expression.html).
 
 #### Calculating expression values from single-end data
 
@@ -130,7 +130,7 @@ using an alternative aligner, you may also want to provide 
the
 '--no-bowtie' option to 'rsem-prepare-reference' so that the Bowtie
 indices are not built.
 
-RSEM requires all alignments of the same read group together. For
+RSEM requires the alignments of a read to be adjacent. For
 paired-end reads, RSEM also requires the two mates of any alignment be
 adjacent. To check if your SAM/BAM file satisfy the requirements,
 please run
@@ -145,7 +145,7 @@ process. Please run
 
 to get usage information or visit the [convert-sam-for-rsem
 documentation
-page](http://deweylab.biostat.wisc.edu/rsem/convert-sam-for-rsem.html).
+page](convert-sam-for-rsem.html).
 
 However, please note that RSEM does ** not ** support gapped
 alignments. So make sure that your aligner does not produce alignments
@@ -228,7 +228,7 @@ To generate transcript wiggle plots, you should run the
     rsem-plot-transcript-wiggles --help
 
 to get usage information or visit the [rsem-plot-transcript-wiggles
-documentation 
page](http://deweylab.biostat.wisc.edu/rsem/rsem-plot-transcript-wiggles.html).
+documentation page](rsem-plot-transcript-wiggles.html).
 
 #### e) Visualize the model learned by RSEM
 
@@ -397,7 +397,7 @@ transcripts whose lengths are less than k are assigned to 
cluster
 
 to get usage information or visit the [rsem-generate-ngvector
 documentation
-page](http://deweylab.biostat.wisc.edu/rsem/rsem-generate-ngvector.html).
+page](rsem-generate-ngvector.html).
 
 If your reference is a de novo assembled transcript set, you should
 run 'rsem-generate-ngvector' first. Then load the resulting
@@ -430,7 +430,7 @@ for all genes/transcripts. Run
     rsem-run-ebseq --help
 
 to get usage information or visit the [rsem-run-ebseq documentation
-page](http://deweylab.biostat.wisc.edu/rsem/rsem-run-ebseq.html). Second,
+page](rsem-run-ebseq.html). Second,
 'rsem-control-fdr' takes 'rsem-run-ebseq' 's result and reports called
 differentially expressed genes/transcripts by controlling the false
 discovery rate. Run
@@ -438,7 +438,7 @@ discovery rate. Run
     rsem-control-fdr --help
 
 to get usage information or visit the [rsem-control-fdr documentation
-page](http://deweylab.biostat.wisc.edu/rsem/rsem-control-fdr.html). These
+page](rsem-control-fdr.html). These
 two scripts can perform DE analysis on either 2 conditions or multiple
 conditions.
 
@@ -452,7 +452,7 @@ be sent to <a href="mailto:[email protected]";>Ning Leng</a>.
 
 ## <a name="authors"></a> Authors
 
-The RSEM algorithm is developed by Bo Li and Colin Dewey. The RSEM software is 
mainly implemented by Bo Li.
+[Bo Li](http://bli25ucb.github.io/) and [Colin 
Dewey](https://www.biostat.wisc.edu/~cdewey/) designed the RSEM algorithm. [Bo 
Li](http://bli25ucb.github.io/) implemented the RSEM software. [Peng 
Liu](https://www.biostat.wisc.edu/~cdewey/group.html) contributed the STAR 
aligner options.
 
 ## <a name="acknowledgements"></a> Acknowledgements
 
diff --git a/SamParser.h b/SamParser.h
index 6e90765..9fa0342 100644
--- a/SamParser.h
+++ b/SamParser.h
@@ -23,6 +23,8 @@
 
 #include "Transcripts.h"
 
+const char* whitespace = " \t\n\r\f\v";
+
 class SamParser {
 public:
        SamParser(char, const char*, const char*, Transcripts&, const char*);
@@ -62,7 +64,16 @@ private:
        }
 
        std::string getName(const bam1_t* b) {
-               return std::string(bam1_qname(b));
+        const char* raw_query_name = bam1_qname(b);
+        // Retain only the first whitespace-delimited word as the read name
+        // This prevents issues of mismatching names when aligners do not
+        // strip off extra words in read name strings
+        const char* whitespace_pos = std::strpbrk(raw_query_name, whitespace);
+        if (whitespace_pos == NULL) {
+            return std::string(raw_query_name);
+        } else {
+            return std::string(raw_query_name, whitespace_pos - 
raw_query_name);
+        }
        }
 
        std::string getReadSeq(const bam1_t*);
@@ -190,10 +201,11 @@ int SamParser::parseNext(PairedEndRead& read, 
PairedEndHit& hit) {
                mp1 = b2; mp2 = b;
        }
 
-       if (strcmp(bam1_qname(mp1), bam1_qname(mp2))) printf("Warning: Detected 
a read pair whose two mates have different names--%s and %s!\n", 
getName(mp1).c_str(), getName(mp2).c_str()); 
+       std::string name = getName(mp1);
+       std::string name2 = getName(mp2);
+       if (name != name2) printf("Warning: Detected a read pair whose two 
mates have different names--%s and %s!\n", name.c_str(), name2.c_str());
 
        int readType = getReadType(mp1, mp2);
-       std::string name = getName(mp1);
 
        if (readType != 1 || (readType == 1 && read.getName().compare(name) != 
0)) {
                val = readType;
@@ -244,10 +256,11 @@ int SamParser::parseNext(PairedEndReadQ& read, 
PairedEndHit& hit) {
                mp1 = b2; mp2 = b;
        }
 
-       if (strcmp(bam1_qname(mp1), bam1_qname(mp2))) printf("Warning: Detected 
a read pair whose two mates have different names--%s and %s!\n", 
getName(mp1).c_str(), getName(mp2).c_str()); 
+       std::string name = getName(mp1);
+       std::string name2 = getName(mp2);
+       if (name != name2) printf("Warning: Detected a read pair whose two 
mates have different names--%s and %s!\n", name.c_str(), name2.c_str());
 
        int readType = getReadType(mp1, mp2);
-       std::string name = getName(mp1);
 
        if (readType != 1 || (readType == 1 && read.getName().compare(name) != 
0)) {
                val = readType;
diff --git a/WHAT_IS_NEW b/WHAT_IS_NEW
index a508fac..68fdc30 100644
--- a/WHAT_IS_NEW
+++ b/WHAT_IS_NEW
@@ -1,3 +1,21 @@
+RSEM v1.2.22
+
+- Added options to run the STAR aligner
+
+--------------------------------------------------------------------------------------------
+
+RSEM v1.2.21
+
+- Strip read names of extra words to avoid mismatches of paired-end read names
+
+--------------------------------------------------------------------------------------------
+
+RSEM v1.2.20
+
+- Fixed a problem that can lead to assertion error if any paired-end read's 
insert size > 32767 (by changing the type of insertL in PairedEndHit.h from 
short to int)
+
+--------------------------------------------------------------------------------------------
+
 RSEM v1.2.19
 
 - Modified 'rsem-prepare-reference' such that by default it does not add any 
poly(A) tails. To add poly(A) tails, use '--polyA' option
diff --git a/rsem-calculate-expression b/rsem-calculate-expression
index 585732d..ea6279e 100755
--- a/rsem-calculate-expression
+++ b/rsem-calculate-expression
@@ -2,7 +2,7 @@
 
 use Getopt::Long;
 use Pod::Usage;
-
+use File::Basename;
 use FindBin;
 use lib $FindBin::RealBin;
 use rsem_perl_utils qw(runCommand collectResults showVersionInfo);
@@ -88,6 +88,14 @@ my $gap = 32;
 
 my $alleleS = 0;
 
+my $star = 0;
+my $star_path  = '';
+my $gzipped_read_file = 0;
+my $bzipped_read_file = 0;
+my $output_star_genome_bam = 0;
+my $sort_bam_by_read_name = 0;
+my $sort_bam_buffer_size = '60G';
+
 GetOptions("keep-intermediate-files" => \$keep_intermediate_files,
           "temporary-folder=s" => \$temp_dir,
           "no-qualities" => \$no_qual,
@@ -131,6 +139,13 @@ GetOptions("keep-intermediate-files" => 
\$keep_intermediate_files,
           "ci-memory=i" => \$NMB,
           "ci-number-of-samples-per-count-vector=i" => \$NSPC,
           "samtools-sort-mem=s" => \$SortMem,
+          'star' => \$star,
+          'star-path=s' => \$star_path,
+          "gzipped-read-file" => \$gzipped_read_file,
+          "bzipped-read-file" => \$bzipped_read_file,
+          "output-star-genome-bam" => \$output_star_genome_bam,
+          "sort-bam-by-read-name" => \$sort_bam_by_read_name,
+          "sort-bam-buffer-size=s" => \$sort_bam_buffer_size,
           "seed=i" => \$seed,
           "time" => \$mTime,
           "version" => \$version,
@@ -168,6 +183,8 @@ pod2usage(-msg => "--sampling-for-bam cannot be specified 
if --no-bam-output is
 pod2usage(-msg => "--output-genome-bam cannot be specified if --no-bam-output 
is specified!\n", -exitval => 2, -verbose => 2) if ($genGenomeBamF && 
!$genBamF);
 pod2usage(-msg => "The seed for random number generator must be a non-negative 
32bit integer!\n", -exitval => 2, -verbose => 2) if (($seed ne "NULL") && 
($seed < 0 || $seed > 0xffffffff));
 pod2usage(-msg => "The credibility level should be within (0, 1)!\n", -exitval 
=> 2, -verbose => 2) if ($CONFIDENCE <= 0.0 || $CONFIDENCE >= 1.0);
+pod2usage(-msg => "Gzipped read files can only be used for aligner STAR\n", 
-exitval=>2, -verbose =>2) if ( ( $star == 0 ) && ( $gzipped_read_file != 0));
+pod2usage(-msg => "Bzipped read files can only be used for aligner STAR\n", 
-exitval=>2, -verbose =>2) if ( ( $star == 0 ) && ( $bzipped_read_file != 0));
 
 if ($L < 25) { print "Warning: the seed length set is less than 25! This is 
only allowed if the references are not added poly(A) tails.\n"; }
 
@@ -229,11 +246,64 @@ my ($mate_minL, $mate_maxL) = (1, $maxL);
 
 if ($bowtie_path ne "") { $bowtie_path .= "/"; }
 if ($bowtie2_path ne "") { $bowtie2_path .= "/"; }
+if ($star_path ne '') { $star_path .= '/'; }
 
 my $command = "";
 
 if (!$is_sam && !$is_bam) {
-    if (!$bowtie2) {
+       if ( $star ) {
+         ## align reads by STAR
+    my $star_genome_path = dirname($refName);
+         $command = "$star_path/STAR " . 
+                      ## ENCODE3 pipeline parameters
+                      " --genomeDir $star_genome_path " .
+                      ' --outSAMunmapped Within ' .
+                      ' --outFilterType BySJout ' .
+                      ' --outSAMattributes NH HI AS NM MD ' .
+                      ' --outFilterMultimapNmax 20 ' .
+                      ' --outFilterMismatchNmax 999 ' .
+                      ' --outFilterMismatchNoverLmax 0.04 ' .
+                      ' --alignIntronMin 20 ' .
+                      ' --alignIntronMax 1000000 ' .
+                      ' --alignMatesGapMax 1000000 ' .
+                      ' --alignSJoverhangMin 8 ' .
+                      ' --alignSJDBoverhangMin 1 ' .
+                      ' --sjdbScore 1 ' .
+                      " --runThreadN $nThreads " .
+                      ##
+       
+                      ## different than ENCODE3 pipeline 
+                      ## do not allow using shared memory
+                      ' --genomeLoad NoSharedMemory ' .
+                      ##
+       
+                      ## different than ENCODE3 pipeline, which sorts output 
BAM
+                      ## no need to do it here to save time and memory 
+                      ' --outSAMtype BAM Unsorted ' .
+                      ##
+       
+                      ## unlike ENCODE3, we don't output bedGraph files
+       
+                      ' --quantMode TranscriptomeSAM '.
+                      ' --outSAMheaderHD \@HD VN:1.4 SO:unsorted '.
+       
+                      ## define output file prefix
+                      " --outFileNamePrefix $imdName ";
+                      ##
+       
+         if ( $gzipped_read_file ) {
+           $command .= ' --readFilesCommand zcat ';
+         } elsif ( $bzipped_read_file ) {
+           $command .= ' --readFilesCommand bzip2 -c ';
+         }
+       
+         if ( $read_type == 0 || $read_type == 1 ) {
+           $command .= " --readFilesIn $mate1_list ";
+         } else {
+           $command .= " --readFilesIn $mate1_list $mate2_list";
+         }
+       
+       } elsif (!$bowtie2) {
        $command = $bowtie_path."bowtie";
        if ($no_qual) { $command .= " -f"; }
        else { $command .= " -q"; }
@@ -307,6 +377,40 @@ if (!$is_sam && !$is_bam) {
 
     $inpF = "$imdName.bam";
     $is_bam = 1; # alignments are outputed as a BAM file
+
+    if ( $star ) {
+      my $star_tr_bam = $imdName . 'Aligned.toTranscriptome.out.bam';
+      rename $star_tr_bam, $inpF
+        or die "can't rename $star_tr_bam to $inpF: $!\n";
+      rmdir $imdName . "_STARtmp/";
+      my $star_genome_bam = $imdName . "Aligned.out.bam";
+      my $rsem_star_genome_bam = $sampleName.'.STAR.genome.bam';
+      if ( $output_star_genome_bam ) {
+        rename $star_genome_bam, $rsem_star_genome_bam or die 
+          "can't move $star_genome_bam to $rsem_star_genome_bam: $!\n";
+      } else {
+        unlink $star_genome_bam or die "can't remove $star_genome_bam: $!\n";
+      }
+    }
+}
+
+if ( $sort_bam_by_read_name ) {
+  my $sorted_bam = "$imdName.mateSorted.bam";
+  my $sort_workdir = dirname($imdName);
+  $command = "cat <( samtools view -H $inpF ) " .
+                 "<( samtools view -@ $nThreads $inpF | " .
+                 q(awk '{printf "%s ", $0; getline; print}' | ) .
+                 "sort -S $sort_bam_buffer_size -T $sort_workdir | " .
+                 q(tr ' ' '\n' ) .
+                 ") | samtools view -@ $nThreads -bS - > $sorted_bam"; 
+
+  print "$command\n\n";
+
+  ## have to invoke BASH to use the <()<() substitutions
+  ## have to use LIST to invoke BASE
+  system( ('bash', '-c', $command) );
+
+  rename $sorted_bam, $inpF or die "can't rename $sorted_bam to $inpF: $!\n";
 }
 
 if ($mTime) { $time_start = time(); }
@@ -520,6 +624,14 @@ The RNA-Seq protocol used to generate the reads is strand 
specific, i.e., all (u
 
 Use Bowtie 2 instead of Bowtie to align reads. Since currently RSEM does not 
handle indel, local and discordant alignments, the Bowtie2 parameters are set 
in a way to avoid those alignments. In particular, we use options '--sensitive 
--dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1' by default. The 
last parameter of '--score-min', '-0.1', is the negative of maximum mismatch 
rate. This rate can be set by option '--bowtie2-mismatch-rate'. If reads are 
paired-end, we additional [...]
 
+=item B<--star>
+
+Use STAR to align reads. Alignment parameters are from ENCODE3's STAR-RSEM 
pipeline. To save computational time and memory resources, STAR's Output BAM 
file is unsorted. It is stored in RSEM's temporary directory with name as 
'sample_name.bam'. Each STAR job will have its own private copy of the genome 
in memory. (Default: off) 
+
+=item B<--star-path> <path>
+
+The path to STAR's executable. (Default: the path to STAR executable is 
assumed to be in user's PATH environment variable)
+
 =item B<--sam>
 
 Input file is in SAM format. (Default: off)
@@ -634,6 +746,26 @@ Input quality scores are solexa encoded (from GA Pipeline 
ver. < 1.3). (Default:
 
 (Bowtie 2 parameter) Set Bowtie 2's preset options in --end-to-end mode. This 
option controls how hard Bowtie 2 tries to find alignments. <string> must be 
one of "very_fast", "fast", "sensitive" and "very_sensitive". The four 
candidates correspond to Bowtie 2's "--very-fast", "--fast", "--sensitive" and 
"--very-sensitive" options. (Default: "sensitive" - use Bowtie 2's default)
 
+=item B<--gzipped-read-file>
+
+Input read file(s) is compressed by gzip. This option can be only used when 
aligning reads by STAR, i.e. --star-genome-path <path> is defined (Default: off)
+
+=item B<--bzipped-read-file>
+
+Input read file(s) is compressed by bzip2. This option can be only used when 
aligning reads by STAR, i.e. --star-genome-path <path> is defined (Default: off)
+
+=item B<--output-star-genome-bam>
+
+Save the BAM file from STAR alignment under genomic coordinate to 
'sample_name.STAR.genome.bam'. This file is NOT sorted by genomic coordinate. 
In this file, according to STAR's manual, 'paired ends of an alignment are 
always adjacent, and multiple alignments of a read are adjacent as well'. 
(Default: off)
+
+=item B<--sort-bam-by-read-name>
+
+Sort BAM file aligned under transcript coordidate by read name. Setting this 
option on will produce determinstic maximum likelihood estimations from 
independet runs. Note that sorting will take long time and lots of memory. 
(Default: off)
+
+=item B<--sort-bam-buffer-size> <string>
+
+Size for main memeory buffer when sorting BAM file. It can be any string 
acceptable to GNU sort's '-S' option. See "sort --help" for details. (Default: 
'60G')
+
 =item B<--forward-prob> <double>
 
 Probability of generating a read from the forward strand of a transcript. Set 
to 1 for a strand-specific protocol where all (upstream) reads are derived from 
the forward strand, 0 for a strand-specific protocol where all (upstream) read 
are derived from the reverse strand, or 0.5 for a non-strand-specific protocol. 
(Default: 0.5)
@@ -706,7 +838,7 @@ Output time consumed by each step of RSEM to 
'sample_name.time'. (Default: off)
 
 =head1 DESCRIPTION
 
-In its default mode, this program aligns input reads against a reference 
transcriptome with Bowtie and calculates expression values using the 
alignments.  RSEM assumes the data are single-end reads with quality scores, 
unless the '--paired-end' or '--no-qualities' options are specified.  Users may 
use an alternative aligner by specifying one of the --sam and --bam options, 
and providing an alignment file in the specified format. However, users should 
make sure that they align against the [...]
+In its default mode, this program aligns input reads against a reference 
transcriptome with Bowtie and calculates expression values using the 
alignments.  RSEM assumes the data are single-end reads with quality scores, 
unless the '--paired-end' or '--no-qualities' options are specified. 
Alternatively, users can use STAR to align reads using the '--star' option. 
RSEM has provided options in 'rsem-prepare-reference' to prepare STAR's genome 
indices. Users may use an alternative aligner by  [...]
 
 One simple way to make the alignment file satisfying RSEM's requirements 
(assuming the aligner used put mates in a paired-end read adjacent) is to use 
'convert-sam-for-rsem' script. This script only accept SAM format files as 
input. If a BAM format file is obtained, please use samtools to convert it to a 
SAM file first. For example, if '/ref/mouse_125' is the 'reference_name' and 
the SAM file is named 'input.sam', you can run the following command: 
 
@@ -962,4 +1094,16 @@ Assume the path to the bowtie executables is in the 
user's PATH environment vari
                            /ref/mouse_125 \
                            mmliver_paired_end_quals
 
+6) '/data/mmliver_1.fq.gz' and '/data/mmliver_2.fq.gz', paired-end reads with 
quality scores and read files are compressed by gzip. We want to use STAR to 
aligned reads and assume STAR executable is '/sw/STAR'. Suppose we want to use 
8 threads and do not generate a genome BAM file:
+
+ rsem-calculate-expression --star \
+                           --star-path /sw/STAR \
+                           --gzipped-read-file \
+                           -p 8 \
+                           /data/mmliver_1.fq.gz \
+                           /data/mmliver_2.fq.gz \
+                           /ref/mouse_125 \
+                           mmliver_paired_end_quals
+
+
 =cut
diff --git a/rsem-prepare-reference b/rsem-prepare-reference
index 2b33f89..8e5a3d9 100755
--- a/rsem-prepare-reference
+++ b/rsem-prepare-reference
@@ -2,6 +2,7 @@
 
 use Getopt::Long;
 use Pod::Usage;        
+use File::Basename;
 use FindBin;
 use lib $FindBin::RealBin;
 use rsem_perl_utils;
@@ -10,6 +11,7 @@ use Env qw(@PATH);
 @PATH = ($FindBin::RealBin, @PATH);
 
 use strict;
+use warnings;
 
 my $status;
 
@@ -28,6 +30,11 @@ my $help = 0;
 
 my $alleleMappingF = "";
 
+my $star = 0;
+my $star_path = '';
+my $star_nthreads = 1;
+my $star_sjdboverhang = 100;
+
 GetOptions("gtf=s" => \$gtfF,
           "transcript-to-gene-map=s" => \$mappingF,
           "allele-to-gene-map=s" => \$alleleMappingF,
@@ -38,6 +45,10 @@ GetOptions("gtf=s" => \$gtfF,
           "bowtie-path=s" => \$bowtie_path,
           "bowtie2" => \$bowtie2,
           "bowtie2-path=s" => \$bowtie2_path,
+          'star' => \$star,
+          'star-path=s' => \$star_path,
+          'star-sjdboverhang=i' => \$star_sjdboverhang,
+          'p|num-threads=i' => \$star_nthreads,
           "q|quiet" => \$quiet,
           "h|help" => \$help) or pod2usage(-exitval => 2, -verbose => 2);
 
@@ -71,6 +82,7 @@ if ($polyA) {
 
 if ($bowtie_path ne "") { $bowtie_path .= "/"; }
 if ($bowtie2_path ne "") { $bowtie2_path .= "/"; }
+if ($star_path ne '') { $star_path .= '/'; }
 
 my $command = "";
 
@@ -114,6 +126,19 @@ if ($bowtie2) {
     &runCommand($command);
 }
 
+if ($star) {
+    my $out_star_genome_path = dirname($ARGV[1]);
+    $command = "$star_path/STAR " .
+                        " --runThreadN $star_nthreads " .
+                        ' --runMode genomeGenerate ' .
+                        " --genomeDir $out_star_genome_path " .
+                        " --genomeFastaFiles @list " .
+                        " --sjdbGTFfile $gtfF " .
+                        " --sjdbOverhang $star_sjdboverhang " .
+                        " --outFileNamePrefix $ARGV[1]";
+    &runCommand($command);
+}
+
 __END__
 
 =head1 NAME
@@ -206,6 +231,22 @@ Build Bowtie 2 indices. (Default: off)
 
 The path to the Bowtie 2 executables. (Default: the path to Bowtie 2 
executables is assumed to be in the user's PATH environment variable)
 
+=item B<--star>
+
+Build STAR indices. (Default: off)
+
+=item B<--star-path> <path>
+
+The path to STAR's executable. (Default: the path to STAR executable is 
assumed to be in user's PATH environment varaible)
+
+=item B<--star-sjdboverhang> <int>
+
+Length of the genomic sequence around annotated junction. It is only used for 
STAT to build splice junctions database and not needed for Bowtie or Bowtie2. 
It will be passed as the --sjdbOverhang option to STAR. According to STAR's 
manual, its ideal value is max(ReadLength)-1, e.g. for 2x101 paired-end reads, 
the ideal value is 101-1=100. In most cases, the default value of 100 will work 
as well as the ideal value. (Default: 100)
+
+=item B<-p/--num-threads> <int>
+
+Number of threads to use for building STAR's genome indices. (Default: 1)
+
 =item B<-q/--quiet>
 
 Suppress the output of logging information. (Default: off)
@@ -218,18 +259,18 @@ Show help information.
 
 =head1 DESCRIPTION
 
-This program extracts/preprocesses the reference sequences for RSEM. It can 
optionally build Bowtie indices (with '--bowtie' option) and/or Bowtie 2 
indices (with '--bowtie2' option) using their default parameters. If an 
alternative aligner is to be used, indices for that particular aligner can be 
built from either 'reference_name.idx.fa' or 'reference_name.n2g.idx.fa' (see 
OUTPUT for details). This program is used in conjunction with the 
'rsem-calculate-expression' program.
+This program extracts/preprocesses the reference sequences for RSEM. It can 
optionally build Bowtie indices (with '--bowtie' option) and/or Bowtie 2 
indices (with '--bowtie2' option) using their default parameters. It can also 
optionally build STAR indices (with '--star' option) using parameters from 
ENCODE3's STAR-RSEM pipeline. If an alternative aligner is to be used, indices 
for that particular aligner can be built from either 'reference_name.idx.fa' or 
'reference_name.n2g.idx.fa' (se [...]
 
 =head1 OUTPUT
 
-This program will generate 'reference_name.grp', 'reference_name.ti', 
'reference_name.transcripts.fa', 'reference_name.seq', 'reference_name.chrlist' 
(if '--gtf' is on), 'reference_name.idx.fa', 'reference_name.n2g.idx.fa' and 
optional Bowtie/Bowtie 2 index files.
+This program will generate 'reference_name.grp', 'reference_name.ti', 
'reference_name.transcripts.fa', 'reference_name.seq', 'reference_name.chrlist' 
(if '--gtf' is on), 'reference_name.idx.fa', 'reference_name.n2g.idx.fa', 
optional Bowtie/Bowtie 2 index files, and optional STAR index files.
 
 'reference_name.grp', 'reference_name.ti', 'reference_name.seq', and 
'reference_name.chrlist' are used by RSEM internally.
 
 B<'reference_name.transcripts.fa'> contains the extracted reference 
transcripts in Multi-FASTA format. Poly(A) tails are not added and it may 
contain lower case bases in its sequences if the corresponding genomic regions 
are soft-masked.
 
-B<'reference_name.idx.fa' and 'reference_name.n2g.idx.fa'> are used by 
aligners to build their own indices. In these two files, all sequence bases are 
converted into upper case. In addition, poly(A) tails are added if '--polyA' 
option is set. The only difference between 'reference_name.idx.fa' and 
'reference_name.n2g.idx.fa' is that 'reference_name.n2g.idx.fa' in addition 
converts all 'N' characters to 'G' characters. This conversion is in particular 
desired for aligners (e.g. Bowtie) th [...]
- 
+B<'reference_name.idx.fa' and 'reference_name.n2g.idx.fa'> are used by 
aligners to build their own indices. In these two files, all sequence bases are 
converted into upper case. In addition, poly(A) tails are added if '--polyA' 
option is set. The only difference between 'reference_name.idx.fa' and 
'reference_name.n2g.idx.fa' is that 'reference_name.n2g.idx.fa' in addition 
converts all 'N' characters to 'G' characters. This conversion is in particular 
desired for aligners (e.g. Bowtie) th [...]
+
 =head1 EXAMPLES
 
 1) Suppose we have mouse RNA-Seq data and want to use the UCSC mm9 version of 
the mouse genome. We have downloaded the UCSC Genes transcript annotations in 
GTF format (as mm9.gtf) using the Table Browser and the knownIsoforms.txt file 
for mm9 from the UCSC Downloads. We also have all chromosome files for mm9 in 
the directory '/data/mm9'.  We want to put the generated reference files under 
'/ref' with name 'mouse_0'. We do not add any poly(A) tails. Please note that 
GTF files generated fr [...]
@@ -263,11 +304,32 @@ OR
                         /data/mm9 \
                         /ref/mouse_0
 
-3) Suppose we only have transcripts from EST tags stored in 'mm9.fasta' and 
isoform-gene information stored in 'mapping.txt'. We want to add 125bp long 
poly(A) tails to all transcripts. The reference_name is set as 'mouse_125'. In 
addition, we do not want to build Bowtie/Bowtie 2 indices, and will use an 
alternative aligner to align reads against either 'mouse_125.idx.fa' or 
'mouse_125.idx.n2g.fa':
+3) Suppose we want to build STAR indices in the above example and save index 
files under '/ref' with name 'mouse_0'. Assuming STAR executable is '/sw/STAR', 
the command will be:
+
+ rsem-prepare-reference --gtf mm9.gtf \
+                        --star \
+                        --star-path /sw/STAR \
+                        -p 8 \
+                        
/data/mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
+                        /ref/mouse_0
+
+OR
+
+ rsem-prepare-reference --gtf mm9.gtf \
+                        --star \
+                        --star-path /sw/STAR \
+                        -p 8 \
+                        /data/mm9
+                        /ref/mouse_0
+
+STAR genome index files will be saved under '/ref/'. 
+
+4) Suppose we only have transcripts from EST tags stored in 'mm9.fasta' and 
isoform-gene information stored in 'mapping.txt'. We want to add 125bp long 
poly(A) tails to all transcripts. The reference_name is set as 'mouse_125'. In 
addition, we do not want to build Bowtie/Bowtie 2 indices, and will use an 
alternative aligner to align reads against either 'mouse_125.idx.fa' or 
'mouse_125.idx.n2g.fa':
 
  rsem-prepare-reference --transcript-to-gene-map mapping.txt \
                         --polyA
                         mm9.fasta \
                         mouse_125
 
+
 =cut

-- 
Alioth's /usr/local/bin/git-commit-notice on 
/srv/git.debian.org/git/debian-med/rsem.git

_______________________________________________
debian-med-commit mailing list
[email protected]
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/debian-med-commit

[med-svn] [rsem] 01/03: Imported Upstream version 1.2.22+dfsg

Reply via email to