>> However, the FASTA output uses different names because it embeds
>> the start/end co-ordinates as is. Thus using GFF3 features, the
>> sequence name includes _883_2691_ while using BED features the
>> same sequence has instead _882_2691_ for its name.
>> 
>> I propose this be harmonised by always using one-based counting
>> in the FASTA names (as done in GFF files but also GenBank, EMBL,
>> etc) rather than the convention used in BED files (and Python) which
>> is confusing to most non-programmers.
> 
> i.e. I suggest this change (with new tests to enforce it),
> 
> https://bitbucket.org/peterjc/galaxy-central/changeset/e7393df0fbc1

Peter,

I have concerns about this change.

IMO, the goal of embedding the start/end coords in the fasta name is to (a) 
embed important information from the input file into the fasta name and (b) 
make it simple for users to connect a fasta sequence to an entry in the 
interval file. These goals are achieved with the current code _relative to the 
input file_. 

This connection between the input and output files is key. However, in the case 
of a user using a mix of BED and GFF files for a single genome, your concern 
becomes an issue. In practice, I don't think we've seen users encounter this 
issue yet, which leads me to think that the current code is fine.

One idea to address both of these issues is to embed the original format in the 
fasta name so that it's clear whether the coords are BED or GFF (e.g. 
>hg17_BED_chr1_147962192_147962580).
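
As a rough sketch of what I mean (the function name and its arguments are 
hypothetical and not taken from the current tool code), the coords would still 
be copied as-is from the input file and only a format tag would be added:

```python
def fasta_name(build, fmt, chrom, start, end):
    """Build a FASTA header that records the source interval format.

    Hypothetical helper for illustration only. start and end are copied
    unchanged from the input file, so BED values stay zero-based and
    GFF values stay one-based.
    """
    fmt = fmt.upper()
    if fmt not in ("BED", "GFF"):
        raise ValueError("unsupported interval format: %s" % fmt)
    return ">%s_%s_%s_%d_%d" % (build, fmt, chrom, start, end)


# The example name from above.
print(fasta_name("hg17", "BED", "chr1", 147962192, 147962580))

# One region described in both formats (build and chrom are made up here):
# the start values differ by one, but the tag says how to read them.
print(fasta_name("hg17", "BED", "chr1", 882, 2691))
print(fasta_name("hg17", "GFF", "chr1", 883, 2691))
```

With the tag in place the two names no longer look like two different 
regions, and each one still matches its own input file exactly.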

Best,
J.