On Mon, Aug 15, 2011 at 3:01 PM, Peter Cock <p.j.a.c...@googlemail.com> wrote:
> On Mon, Aug 15, 2011 at 11:43 AM, Peter Cock <p.j.a.c...@googlemail.com> wrote:
>> Thanks. If I manually set the BED strand to column 5, then the extract tool
>> can be used with both the original NCBI GFF3 file and the BED conversion.
>> I have filtered these on gene features, and noticed a discrepancy.
>> GFF uses one-based numbering, e.g. the gene NEQ003 is 883 to 2691.
>> For BED the start coordinate is zero-indexed and the end coordinate is
>> one-indexed (just like Python slicing), thus the gene NEQ003 is 882 to
>> 2691 (and Galaxy converts this correctly).
>> Using the extract tool with the gene features correctly gets the nucleotide
>> sequence of NEQ003 running from ATG...TAA regardless of whether I use the
>> genes in GFF3 format or in BED format (good).
>> However, the FASTA output uses different names because it embeds
>> the start/end coordinates as is. Thus using GFF3 features, the
>> sequence name includes _883_2691_ while using BED features the
>> same sequence has instead _882_2691_ for its name.
>> I propose this be harmonised by always using one-based counting
>> in the FASTA names (as done in GFF files but also GenBank, EMBL,
>> etc) rather than the convention used in BED files (and Python) which
>> is confusing to most non-programmers.
> i.e. I suggest this change (with new tests to enforce it),
> https://bitbucket.org/peterjc/galaxy-central/changeset/e7393df0fbc1
> This is currently the one and only commit on this new branch,
> https://bitbucket.org/peterjc/galaxy-central/src/extract_region2
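
To illustrate the off-by-one discussed above, here is a rough Python sketch.
It is not the code from the changeset: the helper names, the "seq1"
identifier, the "+" strand and the exact layout of the FASTA name are all
assumptions made up for this example.

```python
def bed_to_one_based(start, end):
    """Convert a BED interval (zero-based start, exclusive end) into
    one-based inclusive coordinates, as used by GFF, GenBank and EMBL."""
    return start + 1, end


def one_based_to_bed(start, end):
    """Convert one-based inclusive coordinates into a BED interval."""
    return start - 1, end


def fasta_name(seq_id, start, end, strand, zero_based_start=False):
    """Build a FASTA name that always embeds one-based coordinates.

    Hypothetical helper: the seq_id_start_end_strand layout is an
    assumption for illustration, not necessarily the tool's real format.
    """
    if zero_based_start:
        start, end = bed_to_one_based(start, end)
    return "%s_%d_%d_%s" % (seq_id, start, end, strand)


# The NEQ003 gene: 883..2691 in GFF3, 882..2691 in BED.
gff_name = fasta_name("seq1", 883, 2691, "+")
bed_name = fasta_name("seq1", 882, 2691, "+", zero_based_start=True)

# Both representations describe the same number of bases, and the BED
# pair can be used directly as a Python slice, e.g. genome[882:2691].
gff_length = 2691 - 883 + 1
bed_length = 2691 - 882
```

With this normalisation both inputs give the same name, so the extracted
sequence is labelled identically whichever feature format was used.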

Second commit to use the newly added BED file in the converter's
tests as well:
