That seems pretty reasonable to me.

Thanks for the friday night response!

-Jim

On Aug 22, 2014, at 9:34 PM, Jim Robinson <jrobi...@broadinstitute.org> wrote:

> Hmm,  despite what that format description says, in practice it looks like 
> the second column is a transcript id in the files downloadable from UCSC.  
> For example,  grepping the "refFlat.txt" file for hg38 as follows yields
> 
> grep EGFR refFlat.txt | cut -f 1-8
> 
> EGFR    NM_005228    chr7    +    55019031    55207338    55019277    55205617
> EGFR    NM_201284    chr7    +    55019031    55171045    55019277    55170544
> EGFR    NM_201283    chr7    +    55019031    55156951    55019277    55156843
> EGFR    NM_201282    chr7    +    55019031    55168635    55019277    55168529
> EGFR-AS1    NR_047551    chr7    -    55179749    55188949    55188949    
> 55188949
> 
> Jim
> 
> 
> 
>> I’ve been using Picard’s RefFlatReader for my own purposes, and have been 
>> home-rolling refFlat files.  The specification I’ve found is listed here:
>> http://genome.ucsc.edu/goldenPath/gbdDescriptionsOld.html#RefFlat
>> 
>> In particular, the first two fields of the file are:
>> string  geneName;           "Name of gene as it appears in Genome Browser."
>>     string  name;               "Name of gene”
>> 
>> This looks like the gene symbol, followed by the UCSC accession ID.
>> 
>> This is backed up by the UCSC FAQ:
>> Question: 
>> "I have the accession number for a gene and would like to link it to the 
>> gene name. Is there a table that shows both pieces of information?"
>> 
>> Response:
>> If you are looking at the RefSeq Genes, the refFlat table contains both the 
>> gene name (usually a HUGO Gene Nomenclature Committee ID) and its accession 
>> number. 
>> 
>> However, in the Picard implementation, the reader defines the columns as:
>> 
>> public enum RefFlatColumns{GENE_NAME, TRANSCRIPT_NAME, CHROMOSOME, STRAND, 
>> TX_START, TX_END, CDS_START, CDS_END,
>>         EXON_COUNT, EXON_STARTS, EXON_ENDS}
>> 
>> So, if the spec I listed is right, Picard interprets gene accession IDs as 
>> transcript names.  The problem here is that when a set of transcripts is 
>> being built for a gene, transcripts are identified by their name, and 
>> transcripts with the same name in the same gene are rejected.
>> 
>> if (transcripts.containsKey(name)) {
>>    throw new AnnotationException("Transcript " + name + " for gene " + 
>> this.getName() + " appears more than once");
>> }
>> 
>> Less amusingly, this exception is caught, and a debug level log is thrown 
>> that seems to be pretty easy to miss.
>> 
>> The following example (if I’ve implemented the spec correctly) is valid:
>> 
>> WASH7P  ENSG00000227232 1 -       14363   29370   14363   29370   10      
>> 14363,14970,15796,16607,16858,17233,17606,17915,24738,29321,    
>> 14829,15038,15947,16765,17055,17368,17742,18061,24891,29370,
>> WASH7P  ENSG00000227232 1 -       14363   29370   14363   29370   12      
>> 14363,14970,15796,15904,16607,16854,17233,17602,17915,18268,24738,29321,     
>>    14829,15038,15901,15947,16765,17055,17364,17742,18061,18379,24891,29370,
>> 
>> However, since these two transcripts (that are clearly different as they use 
>> different exons) share the same name, an exception is thrown, a message is 
>> logged to debug, and the GeneAnnotationReader.loadRefFlat() method returns a 
>> truncated object, as parsing effectively stops when this happens.
>> 
>> What I’d like to know is: 
>> 
>> 1) Did I misinterpret the specification for the refFlat file?  If I did, my 
>> bad and I’ll make the appropriate changes to adhere to the spec
>> 2) If not #1, is this a bug in the Picard implementation?  
>> 
>> Thanks for your help/advice.
>> 
>> -Jim Nemesh
>> 
>> 
>> ------------------------------------------------------------------------------
>> Slashdot TV.  
>> Video for Nerds.  Stuff that matters.
>> http://tv.slashdot.org/
>> 
>> 
>> _______________________________________________
>> Samtools-help mailing list
>> Samtools-help@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/samtools-help
> 
> ------------------------------------------------------------------------------
> Slashdot TV.  
> Video for Nerds.  Stuff that matters.
> http://tv.slashdot.org/_______________________________________________
> Samtools-help mailing list
> Samtools-help@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/samtools-help

------------------------------------------------------------------------------
Slashdot TV.  
Video for Nerds.  Stuff that matters.
http://tv.slashdot.org/
_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

Reply via email to