Re: [galaxy-user] extracting a subset of sequences from a very large fasta file(1.5 million)

Jennifer Jackson Mon, 03 Dec 2012 06:18:11 -0800

Hi Perumal,

There isn't a simple fasta extraction tool on the public Main Galaxy server, but the extraction is possible and could be grouped into a workflow for re-use once completed. This is simpler than it first looks, really just 4 steps:


1. Convert the fasta file to tabular:

    'FASTA manipulation' -> FASTA-to-Tabular

Settings: For the option "How many columns to divide title string into?:" use "2" if there is "identifier" and "description" text. See the next step for more details.
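For anyone who wants to see what this conversion amounts to outside of Galaxy, here is a minimal Python sketch. It is not the Galaxy tool's code; the names `fasta_to_tabular` and `title_columns` are made up for illustration, and it assumes the title line is split on whitespace.

```python
def _row(title, seq_parts, title_columns):
    # Split the title on whitespace at most (title_columns - 1) times,
    # so with 2 columns we get [identifier, description].
    fields = title.split(None, title_columns - 1)
    # Pad with empty strings when the title has no description.
    fields += [""] * (title_columns - len(fields))
    return fields + ["".join(seq_parts)]


def fasta_to_tabular(lines, title_columns=2):
    """Yield one row per FASTA record: title fields, then the sequence."""
    title, seq_parts = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if title is not None:
                yield _row(title, seq_parts, title_columns)
            title, seq_parts = line[1:], []
        elif line.strip():
            # Sequence may be wrapped over several lines; collect the pieces.
            seq_parts.append(line.strip())
    if title is not None:
        yield _row(title, seq_parts, title_columns)


if __name__ == "__main__":
    example = [">c1 first contig", "ACGT", "GG", ">c2", "TTAA"]
    for row in fasta_to_tabular(example):
        print("\t".join(row))
```

Because it is a generator, it can be fed an open file handle and will not hold all of the contigs in memory at once.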


2. Load your list of identifiers as tabular

This means "tabular" text format. Adjust the datatype to be "tabular" as needed, and fix any other formatting so that the "identifiers" are exactly the same in both files. I am not sure if this is what you meant by "fasta headers". To be clear, in the fasta file (#1) any characters after the leading ">" but before the first whitespace (tab, space, etc.) are considered the "identifier" and everything else on the line is considered the "description". This file (#2) should only contain the "identifier", not the "description". Here is a link to FASTA format in case you run into problems here (the IDs not being exact will almost certainly be the root cause of any issues):

http://wiki.galaxyproject.org/Learn/Datatypes#Fasta
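If your list currently holds full header lines rather than bare identifiers, a small script along these lines could trim it before upload. This is only a sketch of the identifier rule described above (text after the leading ">" and before the first whitespace); `header_to_identifier` is a hypothetical helper name, not part of Galaxy.

```python
import re


def header_to_identifier(header):
    """Return the identifier part of a FASTA header or ID-list line."""
    # Drop the line ending and any indentation before the header.
    text = header.rstrip("\r\n").lstrip()
    # The leading ">" is not part of the identifier.
    if text.startswith(">"):
        text = text[1:]
    # Keep everything up to (not including) the first whitespace character.
    return re.match(r"\S*", text).group(0)


if __name__ == "__main__":
    for line in [">contig_1 length=500", "contig_2\tsome description", "contig_3"]:
        print(header_to_identifier(line))
```

Running every line of the list through this gives one identifier per line, which is the form file #2 needs.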

3. Compare the two files together, subsetting out the entries in #1 that are present in #2.


    'Join, Subtract and Group' -> Compare two Datasets

Settings: Compare file #1, column 1 (c1), against file #2, column 1 (c1), 'To find' = Matching rows of 1st dataset.
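In case it helps to picture what this comparison does, here is a rough Python equivalent. It is an illustration only, not the tool's implementation; `matching_rows` and its parameters are hypothetical names, and columns are counted from 0 here (so Galaxy's c1 is index 0).

```python
def matching_rows(fasta_rows, id_rows, fasta_col=0, id_col=0):
    """Keep rows of the first dataset whose key appears in the second."""
    # A set gives fast membership tests, which matters with millions of rows.
    wanted = {row[id_col] for row in id_rows if row}
    return [row for row in fasta_rows if row[fasta_col] in wanted]


if __name__ == "__main__":
    table = [
        ["c1", "first contig", "ACGTGG"],
        ["c2", "", "TTAA"],
        ["c3", "third contig", "CCCC"],
    ]
    ids = [["c3"], ["c1"]]
    for row in matching_rows(table, ids):
        print("\t".join(row))
```

Note that the match is exact: "c1" and "c1 " (with a trailing space) or "C1" would not match, which is why the identifiers need to be identical in both files.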



4. Transform the results back to FASTA format.

     'FASTA manipulation' -> Tabular-to-FASTA

Settings: Be sure to account for any description fields, if they are included in your data. At this point you can either put them into the final fasta output or omit the row/data altogether and just pull out identifiers/sequence.
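The choice between keeping and dropping the description can be sketched in Python as follows. Again this is a hedged illustration, not the Galaxy tool itself: `tabular_to_fasta`, `keep_description`, and `width` are assumed names, and the row layout (identifier first, sequence last, description in between) is an assumption matching the 2-column title split from step 1.

```python
def tabular_to_fasta(rows, keep_description=True, width=60):
    """Yield FASTA lines from rows of [identifier, ..., sequence]."""
    for row in rows:
        identifier, sequence = row[0], row[-1]
        # Anything between the identifier and the sequence is description text.
        description = " ".join(row[1:-1]).strip()
        header = ">" + identifier
        if keep_description and description:
            header += " " + description
        yield header
        # Wrap the sequence to a fixed line width.
        for start in range(0, len(sequence), width):
            yield sequence[start:start + width]


if __name__ == "__main__":
    rows = [["c1", "first contig", "ACGTGG"], ["c2", "", "TTAA"]]
    print("\n".join(tabular_to_fasta(rows, keep_description=False, width=4)))
```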



Hopefully this helps -

Jen
Galaxy team

On 11/30/12 7:55 AM, Perumal Vijayan wrote:

I have successfully uploaded a large fasta file (2.5 million genomic sequence contigs) onto Galaxy server. I wish to extract a subset of sequences from this file. I have a list of the fasta headers. Is there a way I can accomplish this on Galaxy?

--
Perumal Vijayan
Saskatoon
Canada


___________________________________________________________
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

   http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

   http://lists.bx.psu.edu/


--
Jennifer Jackson
http://galaxyproject.org
