Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

Michael Lawrence Sun, 10 Jan 2016 10:29:31 -0800

Yes, thanks for those details, Herve. I changed rtracklayer to take
the first word as the seqlevels.
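
For readers following along, here is a minimal base-R sketch of the "first word" idea, applied to header strings of the kind reported earlier in this thread. It is an illustration only, not the actual rtracklayer change; the helper name first_word is hypothetical.

```r
# Example FASTA header lines, as reported earlier in this thread
headers <- c(
  "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF",
  "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF",
  "11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF"
)

# Hypothetical helper (not part of rtracklayer): drop everything from the
# first whitespace character onwards, keeping only the sequence identifier.
# Headers without whitespace are returned unchanged.
first_word <- function(x) sub("[[:space:]].*$", "", x)

ids <- first_word(headers)

# Truncated names must stay unique to be usable as seqlevels
stopifnot(!anyDuplicated(ids))
print(ids)
```

The uniqueness check matters because two different headers could in principle share the same first word, in which case truncation would make the sequence names ambiguous.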


Michael

On Sun, Jan 10, 2016 at 9:50 AM, Rainer Johannes
<johannes.rai...@eurac.edu> wrote:
> That would be great! I don’t think we would lose any information with that 
> behaviour. All fasta files from Ensembl have that format (id, whitespace and 
> description). Implementing that would render the fasta files from Ensembl 
> provided as TwoBit files via AnnotationHub usable without having to tweak the 
> objects afterwards. I would highly appreciate that (and people working with 
> other species than mouse/human/rat too)!
>
> jo
>
>> On 09 Jan 2016, at 21:19, Hervé Pagès <hpa...@fredhutch.org> wrote:
>>
>> On 01/09/2016 08:42 AM, Michael Lawrence wrote:
>>> I can understand the desire to avoid defining and enforcing our own
>>> standards on third-party data: it's error-prone, potentially
>>> confusing, etc. But the same is even more true of expecting the user
>>> to perform the mapping via some ad hoc approach.
>>>
>>> It's unfortunate that Ensembl does not follow the convention of naming
>>> their FASTA sequences by their seqlevels, but I'm not sure how
>>> wide-spread that convention is in the first place.
>>
>> Actually, and according to https://en.wikipedia.org/wiki/FASTA_format,
>> it seems that:
>>
>>  The word following the ">" symbol is the identifier of the sequence,
>>  and the rest of the line is the description (both are optional).
>>
>> so Rsamtools::indexFa is doing the right thing by trimming the
>> description line. Maybe that's what seqinfo,TwoBitFile should do
>> too?
>>
>> H.
>>
>>>
>>> Why does Bioconductor distribute genomes in two different ways:
>>> BSgenome and via AnnotationHub? Couldn't those two distribution
>>> mechanisms be unified? That might mitigate some of the maintenance
>>> cost and better encapsulate the added complexity.
>>>
>>> Michael
>>>
>>> On Sat, Jan 9, 2016 at 8:12 AM, Morgan, Martin
>>> <martin.mor...@roswellpark.org> wrote:
>>>> We switched to TwoBitFile with a recent ensembl release, thinking that it 
>>>> had better performance and other characteristics compared to the previous 
>>>> FaFile.
>>>>
>>>> The 'recipe' used to create the FaFiles did not explicitly trim the label; 
>>>> that appears to be something done by Rsamtools::indexFa and hence a (now 
>>>> quite dated) version of samtools.
>>>>
>>>> I'm not precisely sure where we stand on correcting this. The original 
>>>> approach just takes what we're given and makes a 2bit file. At least 
>>>> provisionally we had decided (after Thurs / Fri exchanges) to make the 
>>>> seqlevels sensible on the way in to annotation hub; this is against Sean's 
>>>> advice and I'm not really a big fan of this.
>>>>
>>>> I like the idea of being able to dynamically remap the seqlevels when the 
>>>> 2bit file is loaded by AnnotationHub, which would require Herve's 
>>>> suggestion of settable seqlevels on TwoBitFile.
>>>>
>>>> Martin
>>>> ________________________________________
>>>> From: Bioc-devel [bioc-devel-boun...@r-project.org] on behalf of Rainer 
>>>> Johannes [johannes.rai...@eurac.edu]
>>>> Sent: Saturday, January 09, 2016 11:01 AM
>>>> To: Hervé Pagès
>>>> Cc: Michael Lawrence; Martin Morgan
>>>> Subject: Re: [Bioc-devel] Problem with seqnames of TwoBitFile from 
>>>> AnnotationHub
>>>>
>>>> Yes, using BSgenome would help in this case.
>>>> In the long run I think it might be important to have this fixed, not 
>>>> necessarily for human, but for other species/genome builds for which there 
>>>> might not be a BSgenome package available; through AnnotationHub all GTF 
>>>> files and fasta files would be available. Note also that the FaFiles from 
>>>> Ensembl do have the “correct” chromosome names although I assume they were 
>>>> built from the same Ensembl fasta files as the TwoBitFiles.
>>>>
>>>> jo
>>>>
>>>>> On 08 Jan 2016, at 22:49, Hervé Pagès <hpa...@fredhutch.org> wrote:
>>>>>
>>>>> On 01/08/2016 01:09 PM, Michael Lawrence wrote:
>>>>>> That is one solution. But everyone using that genome would need to
>>>>>> reset the seqlevels to the "standard" ones. In this specific case, is
>>>>>> there any reason not to just use the BSgenome for GRCh38?
>>>>>
>>>>> I agree. Maybe we don't need seqlevels<-,TwoBitFile for that particular
>>>>> use case. Just wanted to mention that the ability to rename the
>>>>> sequences in a TwoBitFile, FastaFile, or other file-based object that
>>>>> supports seqinfo() would be useful in general.
>>>>>
>>>>> H.
>>>>>
>>>>>>
>>>>>> On Fri, Jan 8, 2016 at 11:04 AM, Hervé Pagès <hpa...@fredhutch.org> 
>>>>>> wrote:
>>>>>>> Hi Jo, Michael,
>>>>>>>
>>>>>>> What about implementing a seqlevels() setter for TwoBitFile objects? All
>>>>>>> you need for this is an extra slot for storing the user-supplied
>>>>>>> seqlevels. Note that in general the seqlevels() setter allows more than
>>>>>>> renaming the seqlevels. It also allows dropping, adding, and shuffling
>>>>>>> them. But you don't need to support all that. Supporting renaming would
>>>>>>> already go a long way. See selectMethod("seqlevels<-", "TxDb") in
>>>>>>> GenomicFeatures for an example of a restricted "seqlevels<-" method.
>>>>>>>
>>>>>>> H.
>>>>>>>
>>>>>>>
>>>>>>> On 01/08/2016 09:50 AM, Rainer Johannes wrote:
>>>>>>>>
>>>>>>>> I agree, I would not modify the file content. At present, however, it
>>>>>>>> is not possible to use e.g. getSeq on these TwoBitFiles, since the
>>>>>>>> chromosome names in the submitted GRanges (e.g. 1) do not match the
>>>>>>>> seqnames/seqinfo of the TwoBitFile. I don’t know if a seqnames or
>>>>>>>> seqinfo method stripping off all but the first name-part would help
>>>>>>>> here...
>>>>>>>>
>>>>>>>> jo
>>>>>>>>
>>>>>>>>> On 08 Jan 2016, at 15:18, Sean Davis <seand...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I will make the small editorial comment to guard against modifying
>>>>>>>>> file content in transit into the hub object. On the client side
>>>>>>>>> (after getting such an object) I think a “fix” would be to have a
>>>>>>>>> quick seqnames method to strip off all but the first
>>>>>>>>> whitespace-delimited piece.
>>>>>>>>>
>>>>>>>>> Sean
>>>>>>>>>
>>>>>>>>>> On Jan 8, 2016, at 8:40 AM, Michael Lawrence 
>>>>>>>>>> <lawrence.mich...@gene.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> This is perhaps something that could be handled when populating the
>>>>>>>>>> hub, but I'm not sure how rtracklayer could automatically derive the
>>>>>>>>>> chromosome names.
>>>>>>>>>>
>>>>>>>>>> On Fri, Jan 8, 2016 at 2:37 AM, Rainer Johannes
>>>>>>>>>> <johannes.rai...@eurac.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>> dear all,
>>>>>>>>>>>
>>>>>>>>>>> I just ran into a problem with a TwoBitFile I fetched from
>>>>>>>>>>> AnnotationHub. I was fetching a TwoBitFile with the genomic DNA
>>>>>>>>>>> sequence, as provided by Ensembl:
>>>>>>>>>>>
>>>>>>>>>>>> library(AnnotationHub)
>>>>>>>>>>>> ah <- AnnotationHub()
>>>>>>>>>>>> tbf <- ah[["AH50068"]]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> head(seqnames(seqinfo(tbf)))
>>>>>>>>>>>
>>>>>>>>>>> [1] "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF"
>>>>>>>>>>> [2] "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF"
>>>>>>>>>>> [3] "11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF"
>>>>>>>>>>> [4] "12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF"
>>>>>>>>>>> [5] "13 dna:chromosome chromosome:GRCh38:13:1:114364328:1 REF"
>>>>>>>>>>> [6] "14 dna:chromosome chromosome:GRCh38:14:1:107043718:1 REF"
>>>>>>>>>>>
>>>>>>>>>>> It would be nice if the seqnames were just the chromosome names
>>>>>>>>>>> and not the whole string from the FA file header. Is there a way I
>>>>>>>>>>> could fix the file myself, or is this something that should be fixed
>>>>>>>>>>> in the rtracklayer or AnnotationHub package when the TwoBitFile is
>>>>>>>>>>> created?
>>>>>>>>>>>
>>>>>>>>>>> thanks, jo
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Bioc-devel@r-project.org mailing list
>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Hervé Pagès
>>>>>>>
>>>>>>> Program in Computational Biology
>>>>>>> Division of Public Health Sciences
>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>>> P.O. Box 19024
>>>>>>> Seattle, WA 98109-1024
>>>>>>>
>>>>>>> E-mail: hpa...@fredhutch.org
>>>>>>> Phone:  (206) 667-5791
>>>>>>> Fax:    (206) 667-1319
>>>>>
>>>>
>>
>

