Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

2016-01-10 Thread Michael Lawrence
Yes, thanks for those details, Herve. I changed rtracklayer to take
the first word as the seqlevels.

Michael
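
A minimal sketch of the "first word" rule mentioned above, in base R. This is not the actual rtracklayer change. The helper name `first_word` and the example header lines are hypothetical and only illustrate how an Ensembl-style FASTA header (id, whitespace, description) reduces to its identifier.

```r
# Illustrative Ensembl-style FASTA header lines, with the leading ">" removed.
# (Assumed examples, not taken from the thread.)
headers <- c(
  "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF",
  "MT dna:chromosome chromosome:GRCh38:MT:1:16569:1 REF"
)

# Hypothetical helper: drop the first whitespace character and everything
# after it, keeping only the sequence identifier.
first_word <- function(x) sub("\\s.*$", "", x)

ids <- first_word(headers)
print(ids)
```

Headers that contain no whitespace pass through unchanged, so files whose names are already plain identifiers are unaffected by the rule.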

On Sun, Jan 10, 2016 at 9:50 AM, Rainer Johannes wrote:
> That would be great! I don’t think we would lose any information with that
> behaviour. All FASTA files from Ensembl have that format (id, whitespace and
> description). Implementing that would make the FASTA files from Ensembl
> provided as TwoBit files via AnnotationHub usable without having to tweak the
> objects afterwards. I would highly appreciate that (and so would people
> working with species other than mouse/human/rat)!
>
> jo
>
>> On 09 Jan 2016, at 21:19, Hervé Pagès wrote:
>>
>> On 01/09/2016 08:42 AM, Michael Lawrence wrote:
>>> I can understand the desire to avoid defining and enforcing our own
>>> standards on third-party data: it's error-prone, potentially
>>> confusing, etc. But the same is even more true of expecting the user
>>> to perform the mapping via some ad hoc approach.
>>>
>>> It's unfortunate that Ensembl does not follow the convention of naming
>>> their FASTA sequences by their seqlevels, but I'm not sure how
>>> wide-spread that convention is in the first place.
>>
>> Actually, and according to https://en.wikipedia.org/wiki/FASTA_format,
>> it seems that:
>>
>>  The word following the ">" symbol is the identifier of the sequence,
>>  and the rest of the line is the description (both are optional).
>>
>> so Rsamtools::indexFa is doing the right thing by trimming the
>> description line. Maybe that's what seqinfo,TwoBitFile should do
>> too?
>>
>> H.
>>
>>>
>>> Why does Bioconductor distribute genomes in two different ways:
>>> BSgenome and via AnnotationHub? Couldn't those two distribution
>>> mechanisms be unified? That might mitigate some of the maintenance
>>> cost and better encapsulate the added complexity.
>>>
>>> Michael
>>>
>>> On Sat, Jan 9, 2016 at 8:12 AM, Morgan, Martin wrote:
>>>> We switched to TwoBitFile with a recent Ensembl release, thinking that it
>>>> had better performance and other characteristics compared to the previous
>>>> FaFile.
>>>>
>>>> The 'recipe' used to create the FaFiles did not explicitly trim the label;
>>>> that appears to be something done by Rsamtools::indexFa and hence a (now
>>>> quite dated) version of samtools.
>>>>
>>>> I'm not precisely sure where we stand on correcting this. The original
>>>> approach just takes what we're given and makes a 2bit file. At least
>>>> provisionally we had decided (after Thurs / Fri exchanges) to make the
>>>> seqlevels sensible on the way in to AnnotationHub; this is against Sean's
>>>> advice and I'm not really a big fan of this.
>>>>
>>>> I like the idea of being able to dynamically remap the seqlevels when the
>>>> 2bit file is loaded by AnnotationHub, which would require Herve's
>>>> suggestion of settable seqlevels on TwoBitFile.
>>>>
>>>> Martin
>>>>
>>>> From: Bioc-devel [bioc-devel-boun...@r-project.org] on behalf of Rainer
>>>> Johannes [johannes.rai...@eurac.edu]
>>>> Sent: Saturday, January 09, 2016 11:01 AM
>>>> To: Hervé Pagès
>>>> Cc: Michael Lawrence; Martin Morgan
>>>> Subject: Re: [Bioc-devel] Problem with seqnames of TwoBitFile from
>>>> AnnotationHub
>>>>
>>>> Yes, using BSgenome would help in this case.
>>>> In the long run I think it might be important to have this fixed, not
>>>> necessarily for human, but for other species/genome builds for which there
>>>> might not be a BSgenome package available; through AnnotationHub all GTF
>>>> files and FASTA files would be available. Note also that the FaFiles from
>>>> Ensembl do have the “correct” chromosome names although I assume they were
>>>> built from the same Ensembl FASTA files as the TwoBitFiles.
>>>>
>>>> jo
>>>>
>>>>> On 08 Jan 2016, at 22:49, Hervé Pagès wrote:
>>>>>
>>>>> On 01/08/2016 01:09 PM, Michael Lawrence wrote:
>>>>>> That is one solution. But everyone using that genome would need to
>>>>>> reset the seqlevels to the "standard" ones. In this specific case, is
>>>>>> there any reason not to just use the BSgenome for GRCh38?
>>>>>
>>>>> I agree. Maybe we don't need seqlevels<-,TwoBitFile for that particular
>>>>> use case. Just wanted to mention that the ability to rename the
>>>>> sequences in a TwoBitFile, FastaFile, or other file-based object that
>>>>> supports seqinfo() would be useful in general.
>>>>>
>>>>> H.
>>>>>
>>>>>>
>>>>>> On Fri, Jan 8, 2016 at 11:04 AM, Hervé Pagès wrote:
>>>>>>> Hi Jo, Michael,
>>>>>>>
>>>>>>> What about implementing a seqlevels() setter for TwoBitFile objects? All
>>>>>>> you need for this is an extra slot for storing the user-supplied
>>>>>>> seqlevels. Note that in general the seqlevels() setter allows more than
>>>>>>> renaming the seqlevels. It also allows dropping, adding, and shuffling
>>>>>>> them. But you don't need to support all that. Supporting renaming would
>>>>>>> already go a long way. See selectMethod("seqlevels<-", "Tx

Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

2016-01-10 Thread Rainer Johannes
That would be great! I don’t think we would lose any information with that
behaviour. All FASTA files from Ensembl have that format (id, whitespace and
description). Implementing that would make the FASTA files from Ensembl
provided as TwoBit files via AnnotationHub usable without having to tweak the
objects afterwards. I would highly appreciate that (and so would people
working with species other than mouse/human/rat)!

jo

> On 09 Jan 2016, at 21:19, Hervé Pagès wrote:
> 
> On 01/09/2016 08:42 AM, Michael Lawrence wrote:
>> I can understand the desire to avoid defining and enforcing our own
>> standards on third-party data: it's error-prone, potentially
>> confusing, etc. But the same is even more true of expecting the user
>> to perform the mapping via some ad hoc approach.
>> 
>> It's unfortunate that Ensembl does not follow the convention of naming
>> their FASTA sequences by their seqlevels, but I'm not sure how
>> wide-spread that convention is in the first place.
> 
> Actually, and according to https://en.wikipedia.org/wiki/FASTA_format,
> it seems that:
> 
>  The word following the ">" symbol is the identifier of the sequence,
>  and the rest of the line is the description (both are optional).
> 
> so Rsamtools::indexFa is doing the right thing by trimming the
> description line. Maybe that's what seqinfo,TwoBitFile should do
> too?
> 
> H.
> 
>> 
>> Why does Bioconductor distribute genomes in two different ways:
>> BSgenome and via AnnotationHub? Couldn't those two distribution
>> mechanisms be unified? That might mitigate some of the maintenance
>> cost and better encapsulate the added complexity.
>> 
>> Michael
>> 
>> On Sat, Jan 9, 2016 at 8:12 AM, Morgan, Martin wrote:
>>> We switched to TwoBitFile with a recent Ensembl release, thinking that it
>>> had better performance and other characteristics compared to the previous
>>> FaFile.
>>>
>>> The 'recipe' used to create the FaFiles did not explicitly trim the label;
>>> that appears to be something done by Rsamtools::indexFa and hence a (now
>>> quite dated) version of samtools.
>>> 
>>> I'm not precisely sure where we stand on correcting this. The original 
>>> approach just takes what we're given and makes a 2bit file. At least 
>>> provisionally we had decided (after Thurs / Fri exchanges) to make the 
>>> seqlevels sensible on the way in to AnnotationHub; this is against Sean's
>>> advice and I'm not really a big fan of this.
>>> 
>>> I like the idea of being able to dynamically remap the seqlevels when the 
>>> 2bit file is loaded by AnnotationHub, which would require Herve's 
>>> suggestion of settable seqlevels on TwoBitFile.
>>> 
>>> Martin
>>> 
>>> From: Bioc-devel [bioc-devel-boun...@r-project.org] on behalf of Rainer 
>>> Johannes [johannes.rai...@eurac.edu]
>>> Sent: Saturday, January 09, 2016 11:01 AM
>>> To: Hervé Pagès
>>> Cc: Michael Lawrence; Martin Morgan
>>> Subject: Re: [Bioc-devel] Problem with seqnames of TwoBitFile from 
>>> AnnotationHub
>>> 
>>> Yes, using BSgenome would help in this case.
>>> In the long run I think it might be important to have this fixed, not
>>> necessarily for human, but for other species/genome builds for which there
>>> might not be a BSgenome package available; through AnnotationHub all GTF
>>> files and FASTA files would be available. Note also that the FaFiles from
>>> Ensembl do have the “correct” chromosome names although I assume they were
>>> built from the same Ensembl FASTA files as the TwoBitFiles.
>>> 
>>> jo
>>> 
>>>> On 08 Jan 2016, at 22:49, Hervé Pagès wrote:
>>>>
>>>> On 01/08/2016 01:09 PM, Michael Lawrence wrote:
>>>>> That is one solution. But everyone using that genome would need to
>>>>> reset the seqlevels to the "standard" ones. In this specific case, is
>>>>> there any reason not to just use the BSgenome for GRCh38?
>>>>
>>>> I agree. Maybe we don't need seqlevels<-,TwoBitFile for that particular
>>>> use case. Just wanted to mention that the ability to rename the
>>>> sequences in a TwoBitFile, FastaFile, or other file-based object that
>>>> supports seqinfo() would be useful in general.
>>>>
>>>> H.
>>>>
>>>>>
>>>>> On Fri, Jan 8, 2016 at 11:04 AM, Hervé Pagès wrote:
>>>>>> Hi Jo, Michael,
>>>>>>
>>>>>> What about implementing a seqlevels() setter for TwoBitFile objects? All
>>>>>> you need for this is an extra slot for storing the user-supplied
>>>>>> seqlevels. Note that in general the seqlevels() setter allows more than
>>>>>> renaming the seqlevels. It also allows dropping, adding, and shuffling
>>>>>> them. But you don't need to support all that. Supporting renaming would
>>>>>> already go a long way. See selectMethod("seqlevels<-", "TxDb") in
>>>>>> GenomicFeatures for an example of a restricted "seqlevels<-" method.
>>>>>>
>>>>>> H.
>>>>>>
>>>>>>
>>>>>> On 01/08/2016 09:50 AM, Rainer Johannes wrote:
>>>>>>>
>>>>>>> I agree, I would not modify the file content. At present it is however

Re: [Bioc-devel] Use of EnsDb in the AnnotationDbi framework

2016-01-10 Thread Morgan, Martin
Thanks Johannes, this is a valuable contribution and I've added it to the 
'official' vignette.

Martin

From: Bioc-devel [bioc-devel-boun...@r-project.org] on behalf of Rainer 
Johannes [johannes.rai...@eurac.edu]
Sent: Thursday, January 07, 2016 9:10 AM
To: Bioc-devel
Subject: [Bioc-devel] Use of EnsDb in the AnnotationDbi framework

Dear all,

thanks to Vince’s suggestion I have now implemented the central AnnotationDbi
methods “columns”, “keys”, “keytypes” and “select” for EnsDb objects in my
ensembldb package, version 1.3.11 (EnsDb are TxDb-like annotation
packages/objects tailored to Ensembl-based annotations). Thus, EnsDb-based
annotation packages can now be used in the AnnotationDbi framework. The methods
also support the filter framework of the ensembldb package, which allows more
fine-grained querying and data retrieval.
I have forked the AnnotationDbi package and added a section to its
“IntroToAnnotationPackages.Rnw” vignette describing how to use EnsDb objects
along with the above-mentioned methods
(https://github.com/jotsetung/AnnotationDbi). Eventually this could also be
incorporated into the “official” AnnotationDbi package.
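
A short usage sketch of the four methods, assuming that ensembldb (1.3.11 or later) and the EnsDb.Hsapiens.v75 annotation package are installed. The choice of annotation package, keys, and columns is illustrative and not taken from this message; any EnsDb package should behave the same way.

```r
library(ensembldb)
library(EnsDb.Hsapiens.v75)  # assumed example; any EnsDb package would do

edb <- EnsDb.Hsapiens.v75

# Columns and key types exposed through the AnnotationDbi interface.
columns(edb)
keytypes(edb)

# Take a handful of Ensembl gene IDs as keys ...
gene_ids <- head(keys(edb, keytype = "GENEID"))

# ... and retrieve transcript-level columns for them as a data.frame.
res <- select(edb, keys = gene_ids, keytype = "GENEID",
              columns = c("TXID", "TXSEQSTART", "TXBIOTYPE"))
head(res)
```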

best, jo
___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

