On Mon, Aug 15, 2011 at 11:43 AM, Peter Cock <p.j.a.c...@googlemail.com> wrote:
> Hi Jeremy,
>
> Things do indeed look much better after your commit last night, thanks:
> https://bitbucket.org/galaxy/galaxy-central/changeset/3c7416baa157
>
> On Mon, Aug 15, 2011 at 12:52 AM, Jeremy Goecks <jeremy.goe...@emory.edu>
> wrote:
>>> Well, sort of. After converting that GFF3 file into BED, the strand column
>>> isn't set in the metadata. That seems important!
>>
>> We'll look into this.
>
> Thanks. If I manually set the BED strand to column 5, then the extract tool
> can be used with both the original NCBI GFF3 file and the BED conversion.
> I have filtered these on gene features, and noticed a discrepancy.
>
> GFF uses one-based numbering, e.g. the gene NEQ003 is 883 to 2691.
>
> For BED the start coordinate is zero-indexed and the end coordinate is
> one-indexed (just like Python slicing), thus the gene NEQ003 is 882 to
> 2691 (and Galaxy converts this correctly).
>
> Using the extract tool with the gene features correctly gets the nucleotide
> sequence of NEQ003 running from ATG...TAA regardless of whether I use the
> genes in GFF3 format or in BED format (good).
>
> However, the FASTA output uses different names because it embeds
> the start/end co-ordinates as is. Thus using GFF3 features, the
> sequence name includes _883_2691_ while using BED features the
> same sequence instead has _882_2691_ in its name.
>
> I propose this be harmonised by always using one-based counting
> in the FASTA names (as done in GFF files but also GenBank, EMBL,
> etc) rather than the convention used in BED files (and Python), which
> is confusing to most non-programmers.
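To make the proposal concrete, here is a minimal Python sketch of the coordinate arithmetic involved. The function names and the name layout are hypothetical and are not taken from extract_genomic_dna.py; only the off-by-one handling is the point.

```python
def gff_to_bed(start, end):
    """GFF (one-based, inclusive) -> BED (zero-based start, exclusive end)."""
    return start - 1, end


def bed_to_one_based(start, end):
    """BED (zero-based start, exclusive end) -> one-based, inclusive."""
    return start + 1, end


def fasta_name(seq_id, start, end, strand, zero_based):
    """Build a FASTA name that always embeds one-based coordinates.

    The "id_start_end_strand" layout is an assumption for illustration,
    not necessarily what the Galaxy tool writes.
    """
    if zero_based:
        # Input came from BED, so shift the start before naming.
        start, end = bed_to_one_based(start, end)
    return "%s_%d_%d_%s" % (seq_id, start, end, strand)
```

With this, the NEQ003 gene gets a name containing _883_2691_ whether the interval comes in as 883..2691 from GFF3 or as 882..2691 from BED.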
Jeremy,

Sorry, more questions. Could you explain what the "Interpret features
when possible" setting in the "Extract Genomic DNA" tool is meant to do?
The tool's help text doesn't say anything (other than that it is only
for GTF/GFF files).

In the NC_005213.1 example, turning "Interpret features when possible"
on seems to massively collapse down the number of features. I'm not
sure what is happening, but I suspect this is partly due to the NCBI
GFF3 file being broken with regard to the lack of any ID tags. See also:
http://blastedbio.blogspot.com/2011/08/why-are-ncbi-gff3-files-still-broken.html

On another issue, looking at the code for extract_genomic_dna.py, there
appears to be no attempt to support circular genomes with features
wrapping the origin. The gene NEQ001 would be a great example of this,
except the NCBI's GFF3 file doesn't do this properly (as noted in my
blog post). I'm not sure if the BED format attempts to cover features
wrapping the origin of a circular reference genome:
http://genome.ucsc.edu/FAQ/FAQformat#format1

Regards,

Peter
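For reference, handling a feature that wraps the origin could look something like the sketch below. It assumes a convention in which a wrapping feature is given with start greater than end (zero-based start, exclusive end); that convention and the function name are my assumptions, not something the GFF3 file or extract_genomic_dna.py currently provides, and reverse-strand handling is left out.

```python
def extract_circular(sequence, start, end):
    """Slice a circular sequence using a zero-based, end-exclusive interval.

    Assumed convention: start > end means the feature runs through the
    end of the sequence and continues from position zero.
    """
    if start <= end:
        # Ordinary feature, plain slice.
        return sequence[start:end]
    # Wrapping feature: tail of the sequence followed by its head.
    return sequence[start:] + sequence[:end]
```

For example, on a ten-letter circular sequence "ABCDEFGHIJ", the interval 8..3 would give "IJABC".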