Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

Morgan, Martin Sat, 09 Jan 2016 08:14:06 -0800

We switched to TwoBitFile with a recent Ensembl release, thinking that it had 
better performance and other characteristics compared to the previous FaFile.


The 'recipe' used to create the FaFiles did not explicitly trim the label; that 
appears to be something done by Rsamtools::indexFa and hence by a (now quite 
dated) version of samtools.

I'm not precisely sure where we stand on correcting this. The original approach 
just takes what we're given and makes a 2bit file. At least provisionally we 
had decided (after Thurs / Fri exchanges) to make the seqlevels sensible on the 
way in to AnnotationHub; this goes against Sean's advice, and I'm not really a 
big fan of it. 

I like the idea of being able to dynamically remap the seqlevels when the 2bit 
file is loaded by AnnotationHub, which would require Herve's suggestion of 
settable seqlevels on TwoBitFile.
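
As a rough illustration of the remapping itself (not of how it would be wired 
into TwoBitFile, which still needs the settable seqlevels discussed above), 
here is a minimal base-R sketch. The helper name short_seqlevels is 
hypothetical and is not part of rtracklayer or AnnotationHub; the labels are 
the ones reported in this thread.

```r
# Hypothetical helper (an assumption for illustration, not an existing API):
# keep only the first whitespace-delimited token of each FASTA-style label.
short_seqlevels <- function(x) {
    short <- sub("[[:space:]].*$", "", x)
    # Renaming is only safe when the trimmed names stay unique.
    if (anyDuplicated(short))
        stop("trimmed seqlevels are not unique")
    # Trimmed names as values, original labels as names.
    setNames(short, x)
}

full <- c(
    "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF",
    "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF",
    "11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF"
)
map <- short_seqlevels(full)
unname(map)                           # "1" "10" "11"

# Translate a user-facing name back to the label stored in the file.
file_label <- names(map)[match("10", map)]
```

Keeping the original labels as names means the file content is never touched; 
only the lookup between user-facing names and stored labels changes.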

Martin
________________________________________
From: Bioc-devel [bioc-devel-boun...@r-project.org] on behalf of Rainer 
Johannes [johannes.rai...@eurac.edu]
Sent: Saturday, January 09, 2016 11:01 AM
To: Hervé Pagès
Cc: Michael Lawrence; Martin Morgan
Subject: Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

Yes, using BSgenome would help in this case.
In the long run I think it might be important to have this fixed, not 
necessarily for human, but for other species/genome builds for which there 
might not be a BSgenome package available; through AnnotationHub all GTF files 
and fasta files would be available. Note also that the FaFiles from Ensembl do 
have the “correct” chromosome names although I assume they were built from the 
same Ensembl fasta files as the TwoBitFiles.

jo

> On 08 Jan 2016, at 22:49, Hervé Pagès <hpa...@fredhutch.org> wrote:
>
> On 01/08/2016 01:09 PM, Michael Lawrence wrote:
>> That is one solution. But everyone using that genome would need to
>> reset the seqlevels to the "standard" ones. In this specific case, is
>> there any reason not to just use the BSgenome for GRCh38?
>
> I agree. Maybe we don't need seqlevels<-,TwoBitFile for that particular
> use case. Just wanted to mention that the ability to rename the
> sequences in a TwoBitFile, FastaFile, or other file-based object that
> supports seqinfo() would be useful in general.
>
> H.
>
>>
>> On Fri, Jan 8, 2016 at 11:04 AM, Hervé Pagès <hpa...@fredhutch.org> wrote:
>>> Hi Jo, Michael,
>>>
>>> What about implementing a seqlevels() setter for TwoBitFile objects? All
>>> you need for this is an extra slot for storing the user-supplied
>>> seqlevels. Note that in general the seqlevels() setter allows more than
>>> renaming the seqlevels. It also allows dropping, adding, and shuffling
>>> them. But you don't need to support all that. Supporting renaming would
>>> already go a long way. See selectMethod("seqlevels<-", "TxDb") in
>>> GenomicFeatures for an example of a restricted "seqlevels<-" method.
>>>
>>> H.
>>>
>>>
>>> On 01/08/2016 09:50 AM, Rainer Johannes wrote:
>>>>
>>>> I agree, I would not modify the file content. At present it is however not
>>>> possible to use e.g. getSeq on these TwoBitFiles, since the chromosome names
>>>> in the submitted GRanges (e.g. 1) do not match the seqnames/seqinfo of the
>>>> TwoBitFile. I don’t know if a seqnames or seqinfo method stripping off all
>>>> but the first name-part would help here...
>>>>
>>>> jo
>>>>
>>>>> On 08 Jan 2016, at 15:18, Sean Davis <seand...@gmail.com> wrote:
>>>>>
>>>>> I will make the small editorial comment to guard against modifying file
>>>>> content on transit into the hub object. On the client side (after getting
>>>>> such an object) I think a “fix” would be to have a quick seqnames method to
>>>>> strip off all but the first whitespace-delimited piece.
>>>>>
>>>>> Sean
>>>>>
>>>>>> On Jan 8, 2016, at 8:40 AM, Michael Lawrence <lawrence.mich...@gene.com>
>>>>>> wrote:
>>>>>>
>>>>>> This is perhaps something that could be handled when populating the
>>>>>> hub, but I'm not sure how rtracklayer could automatically derive the
>>>>>> chromosome names.
>>>>>>
>>>>>> On Fri, Jan 8, 2016 at 2:37 AM, Rainer Johannes
>>>>>> <johannes.rai...@eurac.edu> wrote:
>>>>>>>
>>>>>>> dear all,
>>>>>>>
>>>>>>> I just ran into a problem with a TwoBitFile I fetched from
>>>>>>> AnnotationHub. I was fetching a TwoBitFile with the genomic DNA sequence,
>>>>>>> as provided by Ensembl:
>>>>>>>
>>>>>>>> library(AnnotationHub)
>>>>>>>> ah <- AnnotationHub()
>>>>>>>> tbf <- ah[["AH50068"]]
>>>>>>>
>>>>>>>
>>>>>>>> head(seqnames(seqinfo(tbf)))
>>>>>>>
>>>>>>> [1] "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF"
>>>>>>> [2] "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF"
>>>>>>> [3] "11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF"
>>>>>>> [4] "12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF"
>>>>>>> [5] "13 dna:chromosome chromosome:GRCh38:13:1:114364328:1 REF"
>>>>>>> [6] "14 dna:chromosome chromosome:GRCh38:14:1:107043718:1 REF"
>>>>>>>
>>>>>>> It would be nice if the seqnames were really just the chromosome names
>>>>>>> and not the whole string from the FA file header. Is there a way I could
>>>>>>> fix the file myself, or is this something that should be fixed in the
>>>>>>> rtracklayer or AnnotationHub package when the TwoBitFile is created?
>>>>>>>
>>>>>>> thanks, jo
>>>>>>> _______________________________________________
>>>>>>> Bioc-devel@r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpa...@fredhutch.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>
