[galaxy-user] How to filter the sequences containing not[ATCG] character?

2013-12-08 Thread
Hi Jen,

As the title says, I have a [fasta] file that was obtained from a [gtf] file:

cuff102.1
atcgtaaagggcgat
cuff103.1
gtcgttgactgtc

I want to filter out the sequences that contain any non-[ATCG] character, to
get output like this:

cuff102.1
atcgtaaagggcgat

I have a large number of sequences to filter. My idea is to first convert the
file to an [interval] file, and then use SELECT to keep the lines not matching
the pattern /\t[ATCGatcg]*[^ATCGatcg]/. Am I right? Or is there a one-step way?


  ___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using reply all in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:

  http://galaxyproject.org/search/mailinglists/

Re: [galaxy-user] How to filter the sequences containing not[ATCG] character?

2013-12-09 Thread
Hi,
It does indeed help. Your regular expression is briefer and more useful. By the
way, a caret (^) placed first inside square brackets does not mean start of
line; it negates the set. For example, [^ATCGatcg] means any character that is
not in [ATCGatcg]. That form may not work in the SELECT tool, though.
Thank you for your help!
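
The two meanings of the caret can be checked quickly outside Galaxy. This is only a minimal sketch using Python's `re` module, on the assumption that it treats these constructs the same way as the Select tool (the thread does not confirm which regex engine Select uses); the test strings are made up for illustration.

```python
import re

# Outside brackets, ^ anchors the match to the start of the line.
anchored_hit = bool(re.search(r"^atcg", "atcgn"))   # starts with atcg
anchored_miss = bool(re.search(r"^atcg", "natcg"))  # atcg is not at the start

# As the first character inside brackets, ^ negates the set:
# [^ATCGatcg] matches any single character that is NOT A, T, C or G.
NON_ATCG = re.compile(r"[^ATCGatcg]")
hit = NON_ATCG.search("atcgnatcg")         # finds the "n"
miss = NON_ATCG.search("atcgtaaagggcgat")  # pure ATCG, nothing to find

print(anchored_hit, anchored_miss)
print(hit.group() if hit else None, miss)
```
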

Date: Mon, 9 Dec 2013 06:34:28 -0800
From: j...@bx.psu.edu
To: zhus...@msn.cn; galaxy-user@lists.bx.psu.edu
Subject: Re: [galaxy-user] How to filter the sequences containing not[ATCG] 
character?

Hello,



If the data was in .fastqsanger format, you could use the tool
Manipulate FASTQ, but with .fasta, this is a good way.



But watch your regular expression - test it out on a smaller set to
make sure it is doing what you want. I see a start of the line
character in the middle of your expression (^). I see why it could
be working, with the prior expression being zero or more (*), but
knowing what each character does is generally a good idea. The help
on the tool is good as are many web sites, but this is simple. Also,
you don't need the // slashes, just enter the expression. 



To get you started: I would use something like this, with the Select
tool and Matching:



^..*\t[ATCGatcg]+$



(Only one dot is really required, this is just how I always do it.
Adds a bit of a format sanity check into the filter).
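
For anyone who wants to try the pattern on a small set before running it in Galaxy, here is a minimal Python sketch. It assumes the FASTA file has already been converted to two tab-separated columns (ID, sequence) and that Python's `re` module interprets this pattern the way the Select tool does, which this thread does not confirm. The `keep_clean` helper and the rows other than cuff102.1 are hypothetical.

```python
import re

# The pattern suggested above: at least one character, a tab, then a
# sequence made only of A/T/C/G (either case) up to the end of the line.
CLEAN = re.compile(r"^..*\t[ATCGatcg]+$")

def keep_clean(lines):
    """Yield only the lines whose sequence column is pure ATCG."""
    for line in lines:
        line = line.rstrip("\n")  # drop the line ending before matching
        if CLEAN.search(line):
            yield line

# Sample rows (ID <tab> sequence); the last two are made up.
rows = [
    "cuff102.1\tatcgtaaagggcgat",
    "cuffX.1\tatcgnnatcg",   # contains n, should be dropped
    "cuffY.1\tATCGRYATCG",   # contains R and Y, should be dropped
]

kept = list(keep_clean(rows))
print(kept)
```

Because `keep_clean` is a generator, it can be fed an open file object and will go through a large file one line at a time.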



Hope this helps!



Jen

Galaxy team








-- 
Jennifer Hillman-Jackson
http://galaxyproject.org

Re: [galaxy-user] How to filter the sequences containing not[ATCG] character?

2013-12-10 Thread
Hello,
Yes, it's interesting! I have tried it, and a pattern like [^ATCGatcg] is
useful. I have a large file to deal with, so I will look for an efficient
regular expression.
Thank you.

Date: Mon, 9 Dec 2013 07:24:46 -0800
From: j...@bx.psu.edu
To: zhus...@msn.cn
CC: galaxy-user@lists.bx.psu.edu
Subject: Re: [galaxy-user] How to filter the sequences containing not[ATCG] 
character?

Hello,



You are right! I forgot about that. Aren't regular expressions fun?
And please test it out if you prefer your method or are just
curious; I didn't try it that way. There are usually a few ways to
do the same thing when using a regex.
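
As a rough check that the two approaches in this thread agree, the sketch below runs both on a few rows: keeping lines that match `^..*\t[ATCGatcg]+$` (Select with Matching) versus keeping lines that do not match `\t[ATCGatcg]*[^ATCGatcg]` (Select with NOT Matching). It uses Python's `re` module as a stand-in for the Select tool, which is an assumption, and the last two rows are hypothetical.

```python
import re

# Approach 1: keep lines that MATCH a "pure ATCG sequence" pattern.
MATCHING = re.compile(r"^..*\t[ATCGatcg]+$")
# Approach 2: keep lines that do NOT match a "has a non-ATCG character" pattern.
NOT_MATCHING = re.compile(r"\t[ATCGatcg]*[^ATCGatcg]")

# Rows are ID <tab> sequence, with line endings already stripped.
rows = [
    "cuff102.1\tatcgtaaagggcgat",
    "cuff103.1\tgtcgttgactgtc",
    "cuffX.1\tatcgnnatcg",  # hypothetical row containing n
    "cuffY.1\tNATCG",       # hypothetical row starting with N
]

kept_a = [r for r in rows if MATCHING.search(r)]
kept_b = [r for r in rows if not NOT_MATCHING.search(r)]

print(kept_a == kept_b)
print(kept_a)
```

The two are not identical in general. A row with an empty sequence column, for instance, fails the first pattern but is not caught by the second, so it is worth testing on a small set either way.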



But, I am glad that this helped a bit and good luck with the query,



Jen

Galaxy team


