On Tue, Aug 16, 2011 at 1:03 AM, Jeremy Goecks <jeremy.goe...@emory.edu> wrote:
>>> However, the FASTA output uses different names because it embeds
>>> the start/end co-ordinates as is. Thus using GFF3 features, the
>>> sequence name includes _883_2691_ while using BED features the
>>> same sequence has instead _882_2691_ for its name.
>>>
>>> I propose this be harmonised by always using one-based counting
>>> in the FASTA names (as done in GFF files but also GenBank, EMBL,
>>> etc) rather than the convention used in BED files (and Python) which
>>> is confusing to most non-programmers.
>>
>> i.e. I suggest this change (with new tests to enforce it),
>>
>> https://bitbucket.org/peterjc/galaxy-central/changeset/e7393df0fbc1
>
> Peter,
>
> I have concerns about this change.
>
> IMO, the goal of embedding the start/end coords in the fasta
> name is to (a) embed important information from the input file
> into the fasta name and (b) make it simple for users to connect
> a fasta sequence to an entry in the interval file. These goals
> are achieved with the current code _relative to the input file_.

It's awkward that two mainstream tabular annotation formats
(BED and the GFF family) use different co-ordinates.
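
To make the off-by-one concrete, here is a minimal sketch of converting
between the two conventions. It is not Galaxy's actual code, and the
function names are made up for illustration. BED intervals are zero-based
and half-open while GFF intervals are one-based and inclusive, so only the
start value differs:

```python
def bed_to_gff(start, end):
    """Zero-based, half-open (BED) -> one-based, inclusive (GFF)."""
    return start + 1, end


def gff_to_bed(start, end):
    """One-based, inclusive (GFF) -> zero-based, half-open (BED)."""
    return start - 1, end


# The feature quoted above: _883_2691_ from GFF3, _882_2691_ from BED.
assert gff_to_bed(883, 2691) == (882, 2691)
assert bed_to_gff(882, 2691) == (883, 2691)
```

The feature length is end - start in BED terms and end - start + 1 in
GFF terms, which is the same number either way.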

> This connection between the input and output files is key.
> However, in the case of a user using a mix of BED and GFF
> files for a single genome, your concern becomes an issue.
> In practice, I don't think we've seen users encounter this
> issue yet, which leads me to think that the current code is
> fine.
>
> One idea to address both of these issues is to embed the
> original format in the fasta name so that it's clear whether
> the coords are BED or GFF (e.g. >
> hg17_BED_chr1_147962192_147962580).

Or hg17_gtf_chr1_147962192_147962580 etc.

That certainly seems better than the current situation.
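
A rough sketch of that naming scheme, assuming the name is just the
build, source format, chromosome, start and end joined with underscores
as in the example above (fasta_name is a hypothetical helper, not an
existing Galaxy function):

```python
def fasta_name(build, fmt, chrom, start, end):
    """Build a FASTA name that records which format the coords came from.

    The coordinates are embedded exactly as they appear in the input
    file; the fmt label tells the reader how to interpret them.
    """
    return "_".join([build, fmt, chrom, str(start), str(end)])


name = fasta_name("hg17", "BED", "chr1", 147962192, 147962580)
assert name == "hg17_BED_chr1_147962192_147962580"
```

This keeps the direct link back to the input file, at the cost of two
different names for the same sequence when BED and GFF are mixed.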

However, my preferred solution is to take the FASTA ID from
the annotation file. In GFF3 this would be the ID tag in column
nine (if present), perhaps with an option to use another
custom tag like locus_tag or transcript_id if preferred.
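
Something along these lines, as a sketch only. It assumes GFF3-style
key=value pairs separated by semicolons in column nine (GTF attributes
are written differently and are not handled), and both function names
and the fallback argument are hypothetical:

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Parse GFF3 column nine ("ID=x;Name=y") into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" not in field:
            continue  # skip empty or malformed fields
        key, value = field.split("=", 1)
        # GFF3 percent-encodes reserved characters in attributes.
        attrs[unquote(key.strip())] = unquote(value.strip())
    return attrs


def choose_fasta_id(column9, preferred_tags=("ID",), fallback=None):
    """Return the first preferred tag that is present, else the fallback."""
    attrs = parse_gff3_attributes(column9)
    for tag in preferred_tags:
        if attrs.get(tag):
            return attrs[tag]
    return fallback


assert choose_fasta_id("ID=gene0001;Name=thrA") == "gene0001"
assert choose_fasta_id("Name=thrA;locus_tag=b0002",
                       preferred_tags=("ID", "locus_tag")) == "b0002"
assert choose_fasta_id("Name=thrA", fallback="chr1_883_2691") == "chr1_883_2691"
```

The fallback would be a co-ordinate based name for features with no
usable tag.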

For BED I had initially thought this would be the optional
column 4, name. This made me wonder what Galaxy
is doing in converting GFF3 to BED, since column 4 was
populated with generic feature types (gene, CDS, etc
from GFF3 column 3). Shouldn't this be using the feature's
ID tag (if present)? I see code which looks for the
transcript_id tag, which resembles how I'd handle the GFF3
ID (for batching multi-location features together).

Peter
