Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

2016-01-10 Thread Rainer Johannes
That would be great! I don’t think we would lose any information with that 
behaviour. All fasta files from Ensembl have that format (id, whitespace and 
description). Implementing that would render the fasta files from Ensembl 
provided as TwoBit files via AnnotationHub usable without having to tweak the 
objects afterwards. I would highly appreciate that (and people working with 
other species than mouse/human/rat too)!

jo

> On 09 Jan 2016, at 21:19, Hervé Pagès <hpa...@fredhutch.org> wrote:
> 
> On 01/09/2016 08:42 AM, Michael Lawrence wrote:
>> I can understand the desire to avoid defining and enforcing our own
>> standards on third-party data: it's error-prone, potentially
>> confusing, etc. But the same is even more true of expecting the user
>> to perform the mapping via some adhoc approach.
>> 
>> It's unfortunate that Ensembl does not follow the convention of naming
>> their FASTA sequences by their seqlevels, but I'm not sure how
>> wide-spread that convention is in the first place.
> 
> Actually, and according to https://en.wikipedia.org/wiki/FASTA_format,
> it seems that:
> 
>  The word following the ">" symbol is the identifier of the sequence,
>  and the rest of the line is the description (both are optional).
> 
> so Rsamtools::indexFa is doing the right thing by trimming the
> description line. Maybe that's what seqinfo,TwoBitFile should do
> too?
> 
> H.
> 
>> 
>> Why does Bioconductor distribute genomes in two different ways:
>> BSgenome and via AnnotationHub? Couldn't those two distribution
>> mechanisms be unified? That might mitigate some of the maintenance
>> cost and better encapsulate the added complexity.
>> 
>> Michael
>> 
>> On Sat, Jan 9, 2016 at 8:12 AM, Morgan, Martin
>> <martin.mor...@roswellpark.org> wrote:
>>> We switched to TwoBitFile with a recent ensembl release, thinking that it 
>>> had better performance and other characteristics compared to the previous 
>>> FaFile.
>>> 
>>> The 'recipe' used to create the FaFiles did not explicitly trim the label; 
>>> that appears to be something done by Rsamtools::indexFa and hence (a now 
>>> quite dated) version of samtools.
>>> 
>>> I'm not precisely sure where we stand on correcting this. The original 
>>> approach just takes what we're given and makes a 2bit file. At least 
>>> provisionally we had decided (after Thurs / Fri exchanges) to make the 
>>> seqlevels sensible on the way in to annotation hub; this is against Sean's 
>>> advice and I'm not really a big fan of this.
>>> 
>>> I like the idea of being able to dynamically remap the seqlevels when the 
>>> 2bit file is loaded by AnnotationHub, which would require Herve's 
>>> suggestion of settable seqlevels on TwoBitFile.
>>> 
>>> Martin
>>> 
>>> From: Bioc-devel [bioc-devel-boun...@r-project.org] on behalf of Rainer 
>>> Johannes [johannes.rai...@eurac.edu]
>>> Sent: Saturday, January 09, 2016 11:01 AM
>>> To: Hervé Pagès
>>> Cc: Michael Lawrence; Martin Morgan
>>> Subject: Re: [Bioc-devel] Problem with seqnames of TwoBitFile from 
>>> AnnotationHub
>>> 
>>> Yes, using BSGenome would help in this case.
>>> In the long run I think it might be important to have this fixed, not 
>>> necessarily for human, but for other species/genome builds for which there 
>>> might not be an BSGenome package available; through AnnotationHub all GTF 
>>> files and fasta files would be available. Note also that the FaFiles from 
>>> Ensembl do have the “correct” chromosome names although I assume they were 
>>> built from the same Ensembl fasta files than the TwoBitFiles.
>>> 
>>> jo
>>> 
>>>> On 08 Jan 2016, at 22:49, Hervé Pagès <hpa...@fredhutch.org> wrote:
>>>> 
>>>> On 01/08/2016 01:09 PM, Michael Lawrence wrote:
>>>>> That is one solution. But everyone using that genome would need to
>>>>> reset the seqlevels to the "standard" ones. In this specific case, is
>>>>> there any reason not to just use the BSgenome for GRCh38?
>>>> 
>>>> I agree. Maybe we don't need seqlevels<-,TwoBitFile for that particular
>>>> use case. Just wanted to mention that the ability to rename the
>>>> sequences in a TwoBitFile, FastaFile, or other file-based object that
>>>> supports seqinfo() would be useful in general.
>>>> 
>>>> H.
>>>> 

Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

2016-01-10 Thread Michael Lawrence
Yes, thanks for those details, Herve. I changed rtracklayer to take
the first word as the seqlevels.

Michael
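
The "first word" rule mentioned above can be illustrated with a minimal base R sketch. This is not the rtracklayer change itself; `first_word` is a hypothetical helper, and the headers are the ones reported at the start of the thread.

```r
# Headers as reported by seqnames(seqinfo(tbf)) earlier in the thread.
headers <- c(
  "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF",
  "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF"
)

# Hypothetical helper (not rtracklayer's code): drop everything from the
# first whitespace character onward, keeping only the FASTA identifier.
first_word <- function(x) sub("[[:space:]].*$", "", x)

first_word(headers)
```

A header without any whitespace is returned unchanged, so names that are already bare identifiers are not affected.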

On Sun, Jan 10, 2016 at 9:50 AM, Rainer Johannes
<johannes.rai...@eurac.edu> wrote:
> That would be great! I don’t think we would loose any information with that 
> behaviour. All fasta files from Ensembl have that format (id, whitespace and 
> description). Implementing that would render the fasta files from Ensembl 
> provided as TwoBit files via AnnotationHub usable without having to tweak the 
> objects afterwards. I would highly appreciate that (and people working with 
> other species than mouse/human/rat too)!
>
> jo
>
>> On 09 Jan 2016, at 21:19, Hervé Pagès <hpa...@fredhutch.org> wrote:
>>
>> On 01/09/2016 08:42 AM, Michael Lawrence wrote:
>>> I can understand the desire to avoid defining and enforcing our own
>>> standards on third-party data: it's error-prone, potentially
>>> confusing, etc. But the same is even more true of expecting the user
>>> to perform the mapping via some adhoc approach.
>>>
>>> It's unfortunate that Ensembl does not follow the convention of naming
>>> their FASTA sequences by their seqlevels, but I'm not sure how
>>> wide-spread that convention is in the first place.
>>
>> Actually, and according to https://en.wikipedia.org/wiki/FASTA_format,
>> it seems that:
>>
>>  The word following the ">" symbol is the identifier of the sequence,
>>  and the rest of the line is the description (both are optional).
>>
>> so Rsamtools::indexFa is doing the right thing by trimming the
>> description line. Maybe that's what seqinfo,TwoBitFile should do
>> too?
>>
>> H.
>>
>>>
>>> Why does Bioconductor distribute genomes in two different ways:
>>> BSgenome and via AnnotationHub? Couldn't those two distribution
>>> mechanisms be unified? That might mitigate some of the maintenance
>>> cost and better encapsulate the added complexity.
>>>
>>> Michael
>>>
>>> On Sat, Jan 9, 2016 at 8:12 AM, Morgan, Martin
>>> <martin.mor...@roswellpark.org> wrote:
>>>> We switched to TwoBitFile with a recent ensembl release, thinking that it 
>>>> had better performance and other characteristics compared to the previous 
>>>> FaFile.
>>>>
>>>> The 'recipe' used to create the FaFiles did not explicitly trim the label; 
>>>> that appears to be something done by Rsamtools::indexFa and hence (a now 
>>>> quite dated) version of samtools.
>>>>
>>>> I'm not precisely sure where we stand on correcting this. The original 
>>>> approach just takes what we're given and makes a 2bit file. At least 
>>>> provisionally we had decided (after Thurs / Fri exchanges) to make the 
>>>> seqlevels sensible on the way in to annotation hub; this is against Sean's 
>>>> advice and I'm not really a big fan of this.
>>>>
>>>> I like the idea of being able to dynamically remap the seqlevels when the 
>>>> 2bit file is loaded by AnnotationHub, which would require Herve's 
>>>> suggestion of settable seqlevels on TwoBitFile.
>>>>
>>>> Martin
>>>> 
>>>> From: Bioc-devel [bioc-devel-boun...@r-project.org] on behalf of Rainer 
>>>> Johannes [johannes.rai...@eurac.edu]
>>>> Sent: Saturday, January 09, 2016 11:01 AM
>>>> To: Hervé Pagès
>>>> Cc: Michael Lawrence; Martin Morgan
>>>> Subject: Re: [Bioc-devel] Problem with seqnames of TwoBitFile from 
>>>> AnnotationHub
>>>>
>>>> Yes, using BSGenome would help in this case.
>>>> In the long run I think it might be important to have this fixed, not 
>>>> necessarily for human, but for other species/genome builds for which there 
>>>> might not be an BSGenome package available; through AnnotationHub all GTF 
>>>> files and fasta files would be available. Note also that the FaFiles from 
>>>> Ensembl do have the “correct” chromosome names although I assume they were 
>>>> built from the same Ensembl fasta files than the TwoBitFiles.
>>>>
>>>> jo
>>>>
>>>>> On 08 Jan 2016, at 22:49, Hervé Pagès <hpa...@fredhutch.org> wrote:
>>>>>
>>>>> On 01/08/2016 01:09 PM, Michael Lawrence wrote:
>>>>>> That is one solution. But everyone using that genome would need to
>>>>>> reset the seqlevels to the "standard"

Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

2016-01-09 Thread Hervé Pagès

On 01/09/2016 08:42 AM, Michael Lawrence wrote:

I can understand the desire to avoid defining and enforcing our own
standards on third-party data: it's error-prone, potentially
confusing, etc. But the same is even more true of expecting the user
to perform the mapping via some adhoc approach.

It's unfortunate that Ensembl does not follow the convention of naming
their FASTA sequences by their seqlevels, but I'm not sure how
wide-spread that convention is in the first place.


Actually, and according to https://en.wikipedia.org/wiki/FASTA_format,
it seems that:

  The word following the ">" symbol is the identifier of the sequence,
  and the rest of the line is the description (both are optional).

so Rsamtools::indexFa is doing the right thing by trimming the
description line. Maybe that's what seqinfo,TwoBitFile should do
too?

H.



Why does Bioconductor distribute genomes in two different ways:
BSgenome and via AnnotationHub? Couldn't those two distribution
mechanisms be unified? That might mitigate some of the maintenance
cost and better encapsulate the added complexity.

Michael

On Sat, Jan 9, 2016 at 8:12 AM, Morgan, Martin
<martin.mor...@roswellpark.org> wrote:

We switched to TwoBitFile with a recent ensembl release, thinking that it had 
better performance and other characteristics compared to the previous FaFile.

The 'recipe' used to create the FaFiles did not explicitly trim the label; that 
appears to be something done by Rsamtools::indexFa and hence (a now quite 
dated) version of samtools.

I'm not precisely sure where we stand on correcting this. The original approach 
just takes what we're given and makes a 2bit file. At least provisionally we 
had decided (after Thurs / Fri exchanges) to make the seqlevels sensible on the 
way in to annotation hub; this is against Sean's advice and I'm not really a 
big fan of this.

I like the idea of being able to dynamically remap the seqlevels when the 2bit 
file is loaded by AnnotationHub, which would require Herve's suggestion of 
settable seqlevels on TwoBitFile.

Martin

From: Bioc-devel [bioc-devel-boun...@r-project.org] on behalf of Rainer 
Johannes [johannes.rai...@eurac.edu]
Sent: Saturday, January 09, 2016 11:01 AM
To: Hervé Pagès
Cc: Michael Lawrence; Martin Morgan
Subject: Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

Yes, using BSGenome would help in this case.
In the long run I think it might be important to have this fixed, not 
necessarily for human, but for other species/genome builds for which there 
might not be an BSGenome package available; through AnnotationHub all GTF files 
and fasta files would be available. Note also that the FaFiles from Ensembl do 
have the “correct” chromosome names although I assume they were built from the 
same Ensembl fasta files than the TwoBitFiles.

jo


On 08 Jan 2016, at 22:49, Hervé Pagès <hpa...@fredhutch.org> wrote:

On 01/08/2016 01:09 PM, Michael Lawrence wrote:

That is one solution. But everyone using that genome would need to
reset the seqlevels to the "standard" ones. In this specific case, is
there any reason not to just use the BSgenome for GRCh38?


I agree. Maybe we don't need seqlevels<-,TwoBitFile for that particular
use case. Just wanted to mention that the ability to rename the
sequences in a TwoBitFile, FastaFile, or other file-based object that
supports seqinfo() would be useful in general.

H.



On Fri, Jan 8, 2016 at 11:04 AM, Hervé Pagès <hpa...@fredhutch.org> wrote:

Hi Jo, Michael,

What about implementing a seqlevels() setter for TwoBitFile objects? All
you need for this is an extra slot for storing the user-supplied
seqlevels. Note that in general the seqlevels() setter allows more than
renaming the seqlevels. It also allows dropping, adding, and shuffling
them. But you don't need to support all that. Supporting renaming would
already go a long way. See selectMethod("seqlevels<-", "TxDb") in
GenomicFeatures for an example of a restricted "seqlevels<-" method.

H.
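
A rough sketch of the "extra slot" idea, assuming a toy S4 class rather than the real TwoBitFile class: the names stored in the file are left untouched, the user-supplied names live in a second slot, and the setter accepts renaming only. All names here (`ToyTwoBit`, `toyLevels`, `toFileLevel`) are hypothetical and are not part of rtracklayer or GenomeInfoDb.

```r
library(methods)

# Toy stand-in for a file-backed object; NOT the rtracklayer TwoBitFile class.
setClass("ToyTwoBit",
         slots = c(fileLevels = "character",   # names as stored in the file
                   userLevels = "character"))  # user-supplied names

ToyTwoBit <- function(fileLevels)
    new("ToyTwoBit", fileLevels = fileLevels, userLevels = fileLevels)

setGeneric("toyLevels", function(x) standardGeneric("toyLevels"))
setMethod("toyLevels", "ToyTwoBit", function(x) x@userLevels)

setGeneric("toyLevels<-", function(x, value) standardGeneric("toyLevels<-"))
setReplaceMethod("toyLevels", "ToyTwoBit", function(x, value) {
    value <- as.character(value)
    # Restricted setter: one-to-one renaming only, no dropping or adding.
    if (length(value) != length(x@fileLevels) ||
        anyNA(value) || anyDuplicated(value) > 0)
        stop("only one-to-one renaming of seqlevels is supported")
    x@userLevels <- value
    x
})

# Translate a user-facing name back to the name stored in the file.
toFileLevel <- function(x, name) x@fileLevels[match(name, x@userLevels)]

tb <- ToyTwoBit(c("1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF",
                  "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF"))
toyLevels(tb) <- c("1", "10")
toFileLevel(tb, "10")
```

Keeping the file names in their own slot means any read from the file can still use the original names, while the user only ever sees the renamed ones.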


On 01/08/2016 09:50 AM, Rainer Johannes wrote:


I agree, I would not modify the file content. At present it is however not
possible to use e.g. getSeq on these TwoBitFiles, since the chromosome names
in the submitted GRanges (e.g. 1) do not match the seqnames/seqinfo of the
TwoBitFile. I don’t know if a seqnames or seqinfo method stripping of all
but the first name-part would help here...

jo


On 08 Jan 2016, at 15:18, Sean Davis <seand...@gmail.com> wrote:

I will make the small editorial comment to guard against modifying file
content on transit into the hub object. On the client side (after getting
such an object) I think a “fix” would be to have a quick seqnames method to
strip off all but the first whitespace delimited piece.

Sean


On Jan 8, 2016, at 8:40 AM, Michael Lawrence <lawrence.mich...@gene.com>
wrote:

This is perhaps som

Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

2016-01-09 Thread Morgan, Martin
We switched to TwoBitFile with a recent ensembl release, thinking that it had 
better performance and other characteristics compared to the previous FaFile.

The 'recipe' used to create the FaFiles did not explicitly trim the label; that 
appears to be something done by Rsamtools::indexFa and hence (a now quite 
dated) version of samtools.

I'm not precisely sure where we stand on correcting this. The original approach 
just takes what we're given and makes a 2bit file. At least provisionally we 
had decided (after Thurs / Fri exchanges) to make the seqlevels sensible on the 
way in to annotation hub; this is against Sean's advice and I'm not really a 
big fan of this. 

I like the idea of being able to dynamically remap the seqlevels when the 2bit 
file is loaded by AnnotationHub, which would require Herve's suggestion of 
settable seqlevels on TwoBitFile.

Martin

From: Bioc-devel [bioc-devel-boun...@r-project.org] on behalf of Rainer 
Johannes [johannes.rai...@eurac.edu]
Sent: Saturday, January 09, 2016 11:01 AM
To: Hervé Pagès
Cc: Michael Lawrence; Martin Morgan
Subject: Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

Yes, using BSGenome would help in this case.
In the long run I think it might be important to have this fixed, not 
necessarily for human, but for other species/genome builds for which there 
might not be an BSGenome package available; through AnnotationHub all GTF files 
and fasta files would be available. Note also that the FaFiles from Ensembl do 
have the “correct” chromosome names although I assume they were built from the 
same Ensembl fasta files than the TwoBitFiles.

jo

> On 08 Jan 2016, at 22:49, Hervé Pagès <hpa...@fredhutch.org> wrote:
>
> On 01/08/2016 01:09 PM, Michael Lawrence wrote:
>> That is one solution. But everyone using that genome would need to
>> reset the seqlevels to the "standard" ones. In this specific case, is
>> there any reason not to just use the BSgenome for GRCh38?
>
> I agree. Maybe we don't need seqlevels<-,TwoBitFile for that particular
> use case. Just wanted to mention that the ability to rename the
> sequences in a TwoBitFile, FastaFile, or other file-based object that
> supports seqinfo() would be useful in general.
>
> H.
>
>>
>> On Fri, Jan 8, 2016 at 11:04 AM, Hervé Pagès <hpa...@fredhutch.org> wrote:
>>> Hi Jo, Michael,
>>>
>>> What about implementing a seqlevels() setter for TwoBitFile objects? All
>>> you need for this is an extra slot for storing the user-supplied
>>> seqlevels. Note that in general the seqlevels() setter allows more than
>>> renaming the seqlevels. It also allows dropping, adding, and shuffling
>>> them. But you don't need to support all that. Supporting renaming would
>>> already go a long way. See selectMethod("seqlevels<-", "TxDb") in
>>> GenomicFeatures for an example of a restricted "seqlevels<-" method.
>>>
>>> H.
>>>
>>>
>>> On 01/08/2016 09:50 AM, Rainer Johannes wrote:
>>>>
>>>> I agree, I would not modify the file content. At present it is however not
>>>> possible to use e.g. getSeq on these TwoBitFiles, since the chromosome 
>>>> names
>>>> in the submitted GRanges (e.g. 1) do not match the seqnames/seqinfo of the
>>>> TwoBitFile. I don’t know if a seqnames or seqinfo method stripping of all
>>>> but the first name-part would help here...
>>>>
>>>> jo
>>>>
>>>>> On 08 Jan 2016, at 15:18, Sean Davis <seand...@gmail.com> wrote:
>>>>>
>>>>> I will make the small editorial comment to guard against modifying file
>>>>> content on transit into the hub object. On the client side (after getting
>>>>> such an object) I think a “fix” would be to have a quick seqnames method 
>>>>> to
>>>>> strip off all but the first whitespace delimited piece.
>>>>>
>>>>> Sean
>>>>>
>>>>>> On Jan 8, 2016, at 8:40 AM, Michael Lawrence <lawrence.mich...@gene.com>
>>>>>> wrote:
>>>>>>
>>>>>> This is perhaps something that could be handled when population the
>>>>>> hub, but I'm not sure how rtracklayer could automatically derive the
>>>>>> chromosome names.
>>>>>>
>>>>>> On Fri, Jan 8, 2016 at 2:37 AM, Rainer Johannes
>>>>>> <johannes.rai...@eurac.edu> wrote:
>>>>>>>
>>>>>>> dear all,
>>>>>>>

Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

2016-01-09 Thread Rainer Johannes
Yes, using BSgenome would help in this case. 
In the long run I think it might be important to have this fixed, not 
necessarily for human, but for other species/genome builds for which there 
might not be a BSgenome package available; through AnnotationHub all GTF files 
and fasta files would be available. Note also that the FaFiles from Ensembl do 
have the “correct” chromosome names although I assume they were built from the 
same Ensembl fasta files as the TwoBitFiles.

jo

> On 08 Jan 2016, at 22:49, Hervé Pagès  wrote:
> 
> On 01/08/2016 01:09 PM, Michael Lawrence wrote:
>> That is one solution. But everyone using that genome would need to
>> reset the seqlevels to the "standard" ones. In this specific case, is
>> there any reason not to just use the BSgenome for GRCh38?
> 
> I agree. Maybe we don't need seqlevels<-,TwoBitFile for that particular
> use case. Just wanted to mention that the ability to rename the
> sequences in a TwoBitFile, FastaFile, or other file-based object that
> supports seqinfo() would be useful in general.
> 
> H.
> 
>> 
>> On Fri, Jan 8, 2016 at 11:04 AM, Hervé Pagès  wrote:
>>> Hi Jo, Michael,
>>> 
>>> What about implementing a seqlevels() setter for TwoBitFile objects? All
>>> you need for this is an extra slot for storing the user-supplied
>>> seqlevels. Note that in general the seqlevels() setter allows more than
>>> renaming the seqlevels. It also allows dropping, adding, and shuffling
>>> them. But you don't need to support all that. Supporting renaming would
>>> already go a long way. See selectMethod("seqlevels<-", "TxDb") in
>>> GenomicFeatures for an example of a restricted "seqlevels<-" method.
>>> 
>>> H.
>>> 
>>> 
>>> On 01/08/2016 09:50 AM, Rainer Johannes wrote:
 
>>>> I agree, I would not modify the file content. At present it is however not
>>>> possible to use e.g. getSeq on these TwoBitFiles, since the chromosome
>>>> names
>>>> in the submitted GRanges (e.g. 1) do not match the seqnames/seqinfo of the
>>>> TwoBitFile. I don’t know if a seqnames or seqinfo method stripping of all
>>>> but the first name-part would help here...
>>>>
>>>> jo
>>>>
>>>>> On 08 Jan 2016, at 15:18, Sean Davis wrote:
>>>>>
>>>>> I will make the small editorial comment to guard against modifying file
>>>>> content on transit into the hub object. On the client side (after getting
>>>>> such an object) I think a “fix” would be to have a quick seqnames method
>>>>> to
>>>>> strip off all but the first whitespace delimited piece.
>>>>>
>>>>> Sean
>>>>>
>>>>>> On Jan 8, 2016, at 8:40 AM, Michael Lawrence
>>>>>> wrote:
>>>>>>
>>>>>> This is perhaps something that could be handled when population the
>>>>>> hub, but I'm not sure how rtracklayer could automatically derive the
>>>>>> chromosome names.
>>>>>>
>>>>>> On Fri, Jan 8, 2016 at 2:37 AM, Rainer Johannes
>>>>>> wrote:
>>>>>>>
>>>>>>> dear all,
>>>>>>>
>>>>>>> I just run into a problem with a TwoBitFile I fetched from
>>>>>>> AnnotationHub. I was fetching a TwoBitFile with the genomic DNA
>>>>>>> sequence, as
>>>>>>> provided by Ensembl:
>>>>>>>
>>>>>>>> library(AnnotationHub)
>>>>>>>> ah <- AnnotationHub()
>>>>>>>> tbf <- ah[["AH50068"]]
>>>>>>>
>>>>>>>
>>>>>>>> head(seqnames(seqinfo(tbf)))
>>>>>>>
>>>>>>> [1] "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF"
>>>>>>> [2] "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF"
>>>>>>> [3] "11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF"
>>>>>>> [4] "12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF"
>>>>>>> [5] "13 dna:chromosome chromosome:GRCh38:13:1:114364328:1 REF"
>>>>>>> [6] "14 dna:chromosome chromosome:GRCh38:14:1:107043718:1 REF"
>>>>>>>
>>>>>>> Would be nice, if the seqnames would be really just the chromsome names
>>>>>>> and not the whole string from the FA file header. Is there a way I
>>>>>>> could fix
>>>>>>> the file myself or is this something that should be fixed in the
>>>>>>> rtracklayer
>>>>>>> or AnnotationHub package when the TwoBitFile is created?
>>>>>>>
>>>>>>> thanks, jo
>>>>>>> ___
>>>>>>> Bioc-devel@r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>
>>>>>>
>>>>>> ___
>>>>>> Bioc-devel@r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
>>>>>
>>>>
>>>> ___
>>>> Bioc-devel@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>> 
>>> --
>>> Hervé Pagès
>>> 
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>> 
>>> E-mail: hpa...@fredhutch.org
>>> Phone:  (206) 667-5791
>>> Fax:(206) 

Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

2016-01-09 Thread Tim Triche, Jr.
Also things like OrganismDbi don't seem to exist for organisms other than 
human, mouse, rat.  So if you want to use that infrastructure for fly or worms, 
you're SOL at the moment. 

This is a highly topical discussion since many/most microarray probes can be 
profitably (in terms of knowledge, not money) remapped to more contemporary or 
richer transcriptomes and thus used to explore the generality of findings.  The 
OrganismDb/BSgenome infrastructure doesn't accommodate this use case well yet, 
but Zhilong's recent remarks suggest that a unified approach could be broadly 
useful for many investigators. 

Being a lazy bum, I tried to dump the task back on him (no good deed goes 
unpunished) but since Jo is also a glutton for punishment and an author of fine 
Ensembl support packages...

:-)

In all seriousness the generosity of the BioC community cannot be overstated. 
You guys are great.

--t

> On Jan 9, 2016, at 8:01 AM, Rainer Johannes  wrote:
> 
> Yes, using BSGenome would help in this case. 
> In the long run I think it might be important to have this fixed, not 
> necessarily for human, but for other species/genome builds for which there 
> might not be an BSGenome package available; through AnnotationHub all GTF 
> files and fasta files would be available. Note also that the FaFiles from 
> Ensembl do have the “correct” chromosome names although I assume they were 
> built from the same Ensembl fasta files than the TwoBitFiles.
> 
> jo
> 
>> On 08 Jan 2016, at 22:49, Hervé Pagès  wrote:
>> 
>> On 01/08/2016 01:09 PM, Michael Lawrence wrote:
>>> That is one solution. But everyone using that genome would need to
>>> reset the seqlevels to the "standard" ones. In this specific case, is
>>> there any reason not to just use the BSgenome for GRCh38?
>> 
>> I agree. Maybe we don't need seqlevels<-,TwoBitFile for that particular
>> use case. Just wanted to mention that the ability to rename the
>> sequences in a TwoBitFile, FastaFile, or other file-based object that
>> supports seqinfo() would be useful in general.
>> 
>> H.
>> 
>>> 
>>>> On Fri, Jan 8, 2016 at 11:04 AM, Hervé Pagès wrote:
>>>> Hi Jo, Michael,
>>>>
>>>> What about implementing a seqlevels() setter for TwoBitFile objects? All
>>>> you need for this is an extra slot for storing the user-supplied
>>>> seqlevels. Note that in general the seqlevels() setter allows more than
>>>> renaming the seqlevels. It also allows dropping, adding, and shuffling
>>>> them. But you don't need to support all that. Supporting renaming would
>>>> already go a long way. See selectMethod("seqlevels<-", "TxDb") in
>>>> GenomicFeatures for an example of a restricted "seqlevels<-" method.
>>>>
>>>> H.
>>>>
>>>>
>>>>> On 01/08/2016 09:50 AM, Rainer Johannes wrote:
>>>>>
>>>>> I agree, I would not modify the file content. At present it is however not
>>>>> possible to use e.g. getSeq on these TwoBitFiles, since the chromosome
>>>>> names
>>>>> in the submitted GRanges (e.g. 1) do not match the seqnames/seqinfo of the
>>>>> TwoBitFile. I don’t know if a seqnames or seqinfo method stripping of all
>>>>> but the first name-part would help here...
>>>>>
>>>>> jo
>>>>>
>>>>>> On 08 Jan 2016, at 15:18, Sean Davis wrote:
>>>>>>
>>>>>> I will make the small editorial comment to guard against modifying file
>>>>>> content on transit into the hub object. On the client side (after getting
>>>>>> such an object) I think a “fix” would be to have a quick seqnames method
>>>>>> to
>>>>>> strip off all but the first whitespace delimited piece.
>>>>>>
>>>>>> Sean
>>>>>>
>>>>>>> On Jan 8, 2016, at 8:40 AM, Michael Lawrence
>>>>>>> wrote:
>>>>>>>
>>>>>>> This is perhaps something that could be handled when population the
>>>>>>> hub, but I'm not sure how rtracklayer could automatically derive the
>>>>>>> chromosome names.
>>>>>>>
>>>>>>> On Fri, Jan 8, 2016 at 2:37 AM, Rainer Johannes
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> dear all,
>>>>>>>>
>>>>>>>> I just run into a problem with a TwoBitFile I fetched from
>>>>>>>> AnnotationHub. I was fetching a TwoBitFile with the genomic DNA
>>>>>>>> sequence, as
>>>>>>>> provided by Ensembl:
>>>>>>>>
>>>>>>>>> library(AnnotationHub)
>>>>>>>>> ah <- AnnotationHub()
>>>>>>>>> tbf <- ah[["AH50068"]]
>>>>>>>>
>>>>>>>>
>>>>>>>>> head(seqnames(seqinfo(tbf)))
>>>>>>>>
>>>>>>>> [1] "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF"
>>>>>>>> [2] "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF"
>>>>>>>> [3] "11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF"
>>>>>>>> [4] "12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF"
>>>>>>>> [5] "13 dna:chromosome chromosome:GRCh38:13:1:114364328:1 REF"
>>>>>>>> [6] "14 dna:chromosome chromosome:GRCh38:14:1:107043718:1 REF"
>>>>>>>>
>>>>>>>> Would be nice, if the seqnames would be really just the chromsome names
>>>>>>>> and

Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

2016-01-08 Thread Rainer Johannes
I agree, I would not modify the file content. At present it is however not 
possible to use e.g. getSeq on these TwoBitFiles, since the chromosome names in 
the submitted GRanges (e.g. 1) do not match the seqnames/seqinfo of the 
TwoBitFile. I don’t know if a seqnames or seqinfo method stripping off all but 
the first name-part would help here...

jo
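
To make the mismatch concrete, here is a small base R sketch of the client-side mapping a user would need today. It assumes the short chromosome name is the first whitespace-delimited word of each file name; `query` is a made-up set of chromosome names standing in for the seqnames of a user's GRanges.

```r
# Full names, as returned by seqnames(seqinfo(tbf)) in the original report.
file_names <- c(
  "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF",
  "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF"
)

# Lookup table: short chromosome name -> full name stored in the 2bit file.
short_names <- sub("[[:space:]].*$", "", file_names)
lookup <- setNames(file_names, short_names)

# Chromosome names as they might appear in a user's ranges (hypothetical).
query <- c("10", "1", "10")

# Fail early on names the file does not provide.
unknown <- setdiff(query, names(lookup))
if (length(unknown) > 0)
    stop("no matching sequence for: ", paste(unknown, collapse = ", "))

# Full names to use when asking the file for sequence.
file_query <- unname(lookup[query])
file_query
```

With a real GRanges, the equivalent step would presumably be to replace its seqlevels with the mapped full names before calling getSeq; that part is not shown here because it depends on the installed Bioconductor packages.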

> On 08 Jan 2016, at 15:18, Sean Davis  wrote:
> 
> I will make the small editorial comment to guard against modifying file 
> content on transit into the hub object. On the client side (after getting 
> such an object) I think a “fix” would be to have a quick seqnames method to 
> strip off all but the first whitespace delimited piece.
> 
> Sean
> 
>> On Jan 8, 2016, at 8:40 AM, Michael Lawrence  
>> wrote:
>> 
>> This is perhaps something that could be handled when population the
>> hub, but I'm not sure how rtracklayer could automatically derive the
>> chromosome names.
>> 
>> On Fri, Jan 8, 2016 at 2:37 AM, Rainer Johannes
>>  wrote:
>>> dear all,
>>> 
>>> I just run into a problem with a TwoBitFile I fetched from AnnotationHub. I 
>>> was fetching a TwoBitFile with the genomic DNA sequence, as provided by 
>>> Ensembl:
>>> 
>>>> library(AnnotationHub)
>>>> ah <- AnnotationHub()
>>>> tbf <- ah[["AH50068"]]
>>>
>>>> head(seqnames(seqinfo(tbf)))
>>> [1] "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF"
>>> [2] "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF"
>>> [3] "11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF"
>>> [4] "12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF"
>>> [5] "13 dna:chromosome chromosome:GRCh38:13:1:114364328:1 REF"
>>> [6] "14 dna:chromosome chromosome:GRCh38:14:1:107043718:1 REF"
>>> 
>>> Would be nice, if the seqnames would be really just the chromsome names and 
>>> not the whole string from the FA file header. Is there a way I could fix 
>>> the file myself or is this something that should be fixed in the 
>>> rtracklayer or AnnotationHub package when the TwoBitFile is created?
>>> 
>>> thanks, jo
>>> ___
>>> Bioc-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> 
>> ___
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> 

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

2016-01-08 Thread Michael Lawrence
That is one solution. But everyone using that genome would need to
reset the seqlevels to the "standard" ones. In this specific case, is
there any reason not to just use the BSgenome for GRCh38?

On Fri, Jan 8, 2016 at 11:04 AM, Hervé Pagès  wrote:
> Hi Jo, Michael,
>
> What about implementing a seqlevels() setter for TwoBitFile objects? All
> you need for this is an extra slot for storing the user-supplied
> seqlevels. Note that in general the seqlevels() setter allows more than
> renaming the seqlevels. It also allows dropping, adding, and shuffling
> them. But you don't need to support all that. Supporting renaming would
> already go a long way. See selectMethod("seqlevels<-", "TxDb") in
> GenomicFeatures for an example of a restricted "seqlevels<-" method.
>
> H.
>
>
> On 01/08/2016 09:50 AM, Rainer Johannes wrote:
>>
>> I agree, I would not modify the file content. At present it is however not
>> possible to use e.g. getSeq on these TwoBitFiles, since the chromosome names
>> in the submitted GRanges (e.g. 1) do not match the seqnames/seqinfo of the
>> TwoBitFile. I don’t know if a seqnames or seqinfo method stripping of all
>> but the first name-part would help here...
>>
>> jo
>>
>>> On 08 Jan 2016, at 15:18, Sean Davis  wrote:
>>>
>>> I will make the small editorial comment to guard against modifying file
>>> content on transit into the hub object. On the client side (after getting
>>> such an object) I think a “fix” would be to have a quick seqnames method to
>>> strip off all but the first whitespace delimited piece.
>>>
>>> Sean
>>>
>>>> On Jan 8, 2016, at 8:40 AM, Michael Lawrence
>>>> wrote:
>>>>
>>>> This is perhaps something that could be handled when population the
>>>> hub, but I'm not sure how rtracklayer could automatically derive the
>>>> chromosome names.
>>>>
>>>> On Fri, Jan 8, 2016 at 2:37 AM, Rainer Johannes
>>>> wrote:
>>>>>
>>>>> dear all,
>>>>>
>>>>> I just run into a problem with a TwoBitFile I fetched from
>>>>> AnnotationHub. I was fetching a TwoBitFile with the genomic DNA sequence,
>>>>> as
>>>>> provided by Ensembl:
>>>>>
>>>>>> library(AnnotationHub)
>>>>>> ah <- AnnotationHub()
>>>>>> tbf <- ah[["AH50068"]]
>>>>>
>>>>>
>>>>>> head(seqnames(seqinfo(tbf)))
>>>>>
>>>>> [1] "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF"
>>>>> [2] "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF"
>>>>> [3] "11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF"
>>>>> [4] "12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF"
>>>>> [5] "13 dna:chromosome chromosome:GRCh38:13:1:114364328:1 REF"
>>>>> [6] "14 dna:chromosome chromosome:GRCh38:14:1:107043718:1 REF"
>>>>>
>>>>> Would be nice, if the seqnames would be really just the chromsome names
>>>>> and not the whole string from the FA file header. Is there a way I could
>>>>> fix
>>>>> the file myself or is this something that should be fixed in the
>>>>> rtracklayer
>>>>> or AnnotationHub package when the TwoBitFile is created?
>>>>>
>>>>> thanks, jo
>>>>> ___
>>>>> Bioc-devel@r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>>
>>>> ___
>>>> Bioc-devel@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

2016-01-08 Thread Hervé Pagès

On 01/08/2016 01:09 PM, Michael Lawrence wrote:
> That is one solution. But everyone using that genome would need to
> reset the seqlevels to the "standard" ones. In this specific case, is
> there any reason not to just use the BSgenome for GRCh38?

I agree. Maybe we don't need seqlevels<-,TwoBitFile for that particular
use case. Just wanted to mention that the ability to rename the
sequences in a TwoBitFile, FastaFile, or other file-based object that
supports seqinfo() would be useful in general.

H.
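
Until such a setter exists, the name mismatch described in this thread could be worked around on the client side by translating short query names back to the names stored in the file. The following is a minimal base-R sketch under that assumption; `toFileNames` is a hypothetical helper, not a function from rtracklayer or GenomeInfoDb, and the example names are the ones quoted in this thread.

```r
# Hypothetical helper: translate short query names (e.g. "1") to the full
# names stored in the file, so that lookups by file name can succeed.
toFileNames <- function(query, fileNames) {
    # Identifier part only: everything before the first whitespace.
    ids <- sub("[[:space:]].*$", "", fileNames)
    idx <- match(query, ids)
    if (anyNA(idx))
        stop("no sequence in the file for: ",
             paste(unique(query[is.na(idx)]), collapse = ", "))
    fileNames[idx]
}

fileNames <- c(
    "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF",
    "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF"
)

# Query order is preserved; unknown names raise an error.
toFileNames(c("10", "1"), fileNames)
```

The translated names could then be assigned to the query ranges before calling getSeq, at the cost that every user has to repeat this step, which is the drawback Michael points out above.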





--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

2016-01-08 Thread Hervé Pagès

Hi Jo, Michael,

What about implementing a seqlevels() setter for TwoBitFile objects? All
you need for this is an extra slot for storing the user-supplied
seqlevels. Note that in general the seqlevels() setter allows more than
renaming the seqlevels. It also allows dropping, adding, and shuffling
them. But you don't need to support all that. Supporting renaming would
already go a long way. See selectMethod("seqlevels<-", "TxDb") in
GenomicFeatures for an example of a restricted "seqlevels<-" method.

H.
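
A restricted, rename-only setter along these lines might look like the sketch below. It uses only the base methods package. The class `RenamableFile`, its slots, and the `sketchSeqlevels` generics are hypothetical stand-ins chosen for illustration; a real implementation would instead define a "seqlevels<-" method on TwoBitFile and read the original names from the file.

```r
library(methods)

# Hypothetical stand-in for a file-based object. The extra slot
# 'userSeqlevels' holds the user-supplied names, as suggested above.
setClass("RenamableFile",
         representation(fileSeqnames = "character",
                        userSeqlevels = "character"))

setGeneric("sketchSeqlevels", function(x) standardGeneric("sketchSeqlevels"))
setGeneric("sketchSeqlevels<-",
           function(x, value) standardGeneric("sketchSeqlevels<-"))

setMethod("sketchSeqlevels", "RenamableFile", function(x) {
    # Fall back to the file's own names until the user renames them.
    if (length(x@userSeqlevels)) x@userSeqlevels else x@fileSeqnames
})

setReplaceMethod("sketchSeqlevels", "RenamableFile", function(x, value) {
    # Restricted setter: one new name per sequence, all unique.
    # Changing the number of sequences (dropping or adding) is rejected.
    value <- as.character(value)
    if (length(value) != length(x@fileSeqnames))
        stop("renaming only: supply one new name per sequence in the file")
    if (anyDuplicated(value))
        stop("new seqlevels must be unique")
    x@userSeqlevels <- value
    x
})

f <- new("RenamableFile",
         fileSeqnames = c(
             "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF",
             "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF"))
sketchSeqlevels(f) <- c("1", "10")
sketchSeqlevels(f)  # returns c("1", "10")
```

Keeping the file's names in their own slot means the on-disk content is never modified, and methods that read sequences can map the user-facing names back to the stored ones by position.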

On 01/08/2016 09:50 AM, Rainer Johannes wrote:
> I agree, I would not modify the file content. At present it is however not
> possible to use e.g. getSeq on these TwoBitFiles, since the chromosome names
> in the submitted GRanges (e.g. 1) do not match the seqnames/seqinfo of the
> TwoBitFile. I don’t know if a seqnames or seqinfo method stripping off all
> but the first name-part would help here...
>
> jo
>
>> On 08 Jan 2016, at 15:18, Sean Davis wrote:
>>
>> I will make the small editorial comment to guard against modifying file
>> content in transit into the hub object. On the client side (after getting
>> such an object) I think a “fix” would be to have a quick seqnames method to
>> strip off all but the first whitespace-delimited piece.
>>
>> Sean
>>
>>> On Jan 8, 2016, at 8:40 AM, Michael Lawrence wrote:
>>>
>>> This is perhaps something that could be handled when populating the
>>> hub, but I'm not sure how rtracklayer could automatically derive the
>>> chromosome names.
>>>
>>> On Fri, Jan 8, 2016 at 2:37 AM, Rainer Johannes wrote:
>>>>
>>>> dear all,
>>>>
>>>> I just ran into a problem with a TwoBitFile I fetched from
>>>> AnnotationHub. I was fetching a TwoBitFile with the genomic DNA sequence,
>>>> as provided by Ensembl:
>>>>
>>>>> library(AnnotationHub)
>>>>> ah <- AnnotationHub()
>>>>> tbf <- ah[["AH50068"]]
>>>>
>>>>> head(seqnames(seqinfo(tbf)))
>>>>
>>>> [1] "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF"
>>>> [2] "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF"
>>>> [3] "11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF"
>>>> [4] "12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF"
>>>> [5] "13 dna:chromosome chromosome:GRCh38:13:1:114364328:1 REF"
>>>> [6] "14 dna:chromosome chromosome:GRCh38:14:1:107043718:1 REF"
>>>>
>>>> It would be nice if the seqnames were really just the chromosome names
>>>> and not the whole string from the FA file header. Is there a way I could
>>>> fix the file myself, or is this something that should be fixed in the
>>>> rtracklayer or AnnotationHub package when the TwoBitFile is created?
>>>>
>>>> thanks, jo



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub

2016-01-08 Thread Michael Lawrence
This is perhaps something that could be handled when populating the
hub, but I'm not sure how rtracklayer could automatically derive the
chromosome names.
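
One way the names might be derived, assuming the FASTA convention that the identifier is the text before the first whitespace and the rest of the header is a description, is sketched below in base R. `firstWord` is a hypothetical helper, not an rtracklayer function, and the headers are the seqnames quoted in this thread.

```r
# Example headers copied from the seqnames shown in this thread.
headers <- c(
    "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF",
    "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF",
    "11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF"
)

# Hypothetical helper: keep the identifier (text before the first
# whitespace) and drop the description that follows it.
firstWord <- function(x) sub("[[:space:]].*$", "", x)

ids <- firstWord(headers)

# Trimming is only safe if the shortened names are still unique.
stopifnot(!anyDuplicated(ids))
ids  # returns c("1", "10", "11")
```

The uniqueness check matters because two headers that differ only in their description would collapse to the same name after trimming.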

On Fri, Jan 8, 2016 at 2:37 AM, Rainer Johannes wrote:
> dear all,
>
> I just ran into a problem with a TwoBitFile I fetched from AnnotationHub. I
> was fetching a TwoBitFile with the genomic DNA sequence, as provided by
> Ensembl:
>
>> library(AnnotationHub)
>> ah <- AnnotationHub()
>> tbf <- ah[["AH50068"]]
>
>> head(seqnames(seqinfo(tbf)))
> [1] "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF"
> [2] "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF"
> [3] "11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF"
> [4] "12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF"
> [5] "13 dna:chromosome chromosome:GRCh38:13:1:114364328:1 REF"
> [6] "14 dna:chromosome chromosome:GRCh38:14:1:107043718:1 REF"
>
> It would be nice if the seqnames were really just the chromosome names and
> not the whole string from the FA file header. Is there a way I could fix the
> file myself, or is this something that should be fixed in the rtracklayer or
> AnnotationHub package when the TwoBitFile is created?
>
> thanks, jo
