[Samtools-help] Is this a bug in implementation, or my misunderstanding of the spec?

James Nemesh Fri, 22 Aug 2014 18:18:07 -0700

I’ve been using Picard’s RefFlatReader for my own purposes, and have been 
home-rolling refFlat files.  The specification I’ve found is listed here:
http://genome.ucsc.edu/goldenPath/gbdDescriptionsOld.html#RefFlat


In particular, the first two fields of the file are:
string  geneName;           "Name of gene as it appears in Genome Browser."
    string  name;               "Name of gene”

This looks like the gene symbol, followed by the UCSC accession ID.

This is backed up by the UCSC FAQ:
Question: 
"I have the accession number for a gene and would like to link it to the gene 
name. Is there a table that shows both pieces of information?"

Response:
If you are looking at the RefSeq Genes, the refFlat table contains both the 
gene name (usually a HUGO Gene Nomenclature Committee ID) and its accession 
number. 

However, in the Picard implementation, the reader defines the columns as:

public enum RefFlatColumns{GENE_NAME, TRANSCRIPT_NAME, CHROMOSOME, STRAND, 
TX_START, TX_END, CDS_START, CDS_END,
        EXON_COUNT, EXON_STARTS, EXON_ENDS}

So, if the spec I listed is right, Picard interprets gene accession IDs as 
transcript names.  The problem here is that when a set of transcripts is being 
built for a gene, transcripts are identified by their name, and transcripts 
with the same name in the same gene are rejected.

if (transcripts.containsKey(name)) {
   throw new AnnotationException("Transcript " + name + " for gene " + 
this.getName() + " appears more than once");
}

Less amusingly, this exception is caught, and a debug level log is thrown that 
seems to be pretty easy to miss.

The following example (if I’ve implemented the spec correctly) is valid:

WASH7P  ENSG00000227232 1 -       14363   29370   14363   29370   10      
14363,14970,15796,16607,16858,17233,17606,17915,24738,29321,    
14829,15038,15947,16765,17055,17368,17742,18061,24891,29370,
WASH7P  ENSG00000227232 1 -       14363   29370   14363   29370   12      
14363,14970,15796,15904,16607,16854,17233,17602,17915,18268,24738,29321,        
14829,15038,15901,15947,16765,17055,17364,17742,18061,18379,24891,29370,

However, since these two transcripts (that are clearly different as they use 
different exons) share the same name, an exception is thrown, a message is 
logged to debug, and the GeneAnnotationReader.loadRefFlat() method returns a 
truncated object, as parsing effectively stops when this happens.

What I’d like to know is: 

1) Did I misinterpret the specification for the refFlat file?  If I did, my bad 
and I’ll make the appropriate changes to adhere to the spec
2) If not #1, is this a bug in the Picard implementation?  

Thanks for your help/advice.

-Jim Nemesh

------------------------------------------------------------------------------
Slashdot TV.  
Video for Nerds.  Stuff that matters.
http://tv.slashdot.org/

_______________________________________________
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help

[Samtools-help] Is this a bug in implementation, or my misunderstanding of the spec?

Reply via email to