Jim Johnson and I have developed a pair of tools at MSI that do just
this, and they have proven useful in a broad range of scenarios (they
are implemented in Python instead of Perl). I have attached them in
case you are interested in using or extending them instead of building
new ones. I think there might be some other variations on this idea in
the tool shed.
-John
On Fri, Oct 25, 2013 at 7:43 PM, Jun Fan <[email protected]> wrote:
> Hi all,
>
>
>
> I am trying to develop a tool which allows the user to use regular
> expression to only keep the lines matching the given pattern in the output.
> There are many special characters, e.g. \ in \d. In the perl script the
> wrapper invokes, the regular expression is printed as Xd+ for the regular
> expression \d+. My question is how to pass these special characters
> including \, (, ), ^ etc. correctly into the perl script.
>
>
>
> Best regards!
>
> Jun
>
>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client. To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
> http://lists.bx.psu.edu/
>
> To search Galaxy mailing lists use the unified search at:
> http://galaxyproject.org/search/mailinglists/
<tool id="regex1" name="Regex Find And Replace" version="0.1.0">
<description></description>
<command interpreter="python">regex.py --input $input --output $out_file1
#for $check in $checks:
--pattern='$check.pattern' --replacement='$check.replacement'
#end for
</command>
<inputs>
<param format="txt" name="input" type="data" label="Select lines from"/>
<repeat name="checks" title="Check">
<param name="pattern" size="40" type="text" value="chr([0-9A-Za-z])+" label="Find Regex" help="here you can enter text or regular expression (for syntax check lower part of this frame)">
<sanitizer>
<valid>
<add preset="string.printable"/>
<remove value="\" />
<remove value="'" />
</valid>
<mapping initial="none">
<add source="\" target="__backslash__" />
<add source="'" target="__sq__"/>
</mapping>
</sanitizer>
</param>
<param name="replacement" size="40" type="text" value="newchr\1" label="Replacement">
<sanitizer>
<valid>
<add preset="string.printable"/>
<remove value="\" />
<remove value="'" />
</valid>
<mapping initial="none">
<add source="\" target="__backslash__" />
<add source="'" target="__sq__"/>
</mapping>
</sanitizer>
</param>
</repeat>
</inputs>
<outputs>
<data format="input" name="out_file1" metadata_source="input"/>
</outputs>
<tests>
<test>
<param name="input" value="find1.txt"/>
<param name="pattern" value="(T\w+)"/>
<param name="replacement" value="\1 \1" />
<output name="out_file1" file="replace1.txt"/>
</test>
<test>
<param name="input" value="find1.txt"/>
<param name="pattern" value="f"/>
<param name="replacement" value="'&quot;" />
<output name="out_file1" file="replace2.txt"/>
</test>
</tests>
<help>
This tool goes line by line through the specified input file and
replaces text which matches the specified regular expression patterns
with its corresponding specified replacement.
This tool uses Python regular expressions. More information about
Python regular expressions can be found here:
http://docs.python.org/library/re.html.
To convert an Illumina FASTQ sequence id from the CASAVA 8 format::
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
To the CASAVA 7 format::
@EAS139_FC706VJ:2:2104:15343:197393#0/1
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+EAS139_FC706VJ:2:2104:15343:197393#0/1
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
Use Settings::
Find Regex: ^([@+][A-Z0-9]+):\d+:(\S+)\s(\d).*$
Replacement: \1_\2#0/\3
Note that the parentheses **()** capture patterns in the text that can be used in the replacement text by using a backslash-number reference: **\\1**
The regex **^([@+][A-Z0-9]+):\d+:(\S+)\s(\d).*$** means::
^ - start the match at the beginning of the line of text
( - start a group (1), that is a string of matched text, that can be back-referenced in the replacement as \1
[@+] - matches either a @ or + character
[A-Z0-9]+ - matches an uppercase letter or a digit, the plus sign means to match 1 or more such characters
) - end a group (1), that is a string of matched text, that can be back-referenced in the replacement as \1
:\d+: - matches a colon followed by one or more digits followed by a colon character
(\S+) - matches one or more non-whitespace characters, the enclosing parentheses make this a group (2) that can be back-referenced in the replacement text as \2
\s - matches a whitespace character
(\d) - matches a single digit character, the enclosing parentheses make this a group (3) that can be back-referenced in the replacement text as \3
.* - dot means match any character, asterisk means zero or more matches
$ - the regex must match to the end of the line of text
Galaxy aggressively escapes input supplied to tools, so if something
is not working please let us know and we can look into whether this is
the cause. Also if you would like help constructing regular
expressions for your inputs, please let us know at [email protected].
</help>
</tool>
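As a quick check of the CASAVA settings described in the help text of the tool above, here is a minimal Python sketch that applies the same Find Regex and Replacement with re.sub(). The function name convert_header is only for illustration and is not part of the attached tools.

```python
import re

# Settings copied from the "Use Settings" section of the help text.
PATTERN = r"^([@+][A-Z0-9]+):\d+:(\S+)\s(\d).*$"
REPLACEMENT = r"\1_\2#0/\3"


def convert_header(line):
    """Rewrite a matching FASTQ id line; non-matching lines are returned as-is."""
    return re.sub(PATTERN, REPLACEMENT, line)


header = "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
sequence = "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC"

print(convert_header(header))    # @EAS139_FC706VJ:2:2104:15343:197393#0/1
print(convert_header(sequence))  # unchanged, the pattern does not match
```

In this example the sequence and quality lines do not match the pattern, so only the two id lines are rewritten.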
<tool id="regexColumn1" name="Column Regex Find And Replace" version="0.1.0">
<description></description>
<command interpreter="python">regex.py --input $input --output $out_file1 --column $field
#for $check in $checks:
--pattern='$check.pattern' --replacement='$check.replacement'
#end for
</command>
<inputs>
<param format="tabular" name="input" type="data" label="Select cells from"/>
<param name="field" label="using column" type="data_column" data_ref="input" />
<repeat name="checks" title="Check">
<param name="pattern" size="40" type="text" value="chr([0-9A-Za-z])+" label="Find Regex" help="here you can enter text or regular expression (for syntax check lower part of this frame)">
<sanitizer>
<valid>
<add preset="string.printable"/>
<remove value="\" />
<remove value="'" />
</valid>
<mapping initial="none">
<add source="\" target="__backslash__" />
<add source="'" target="__sq__"/>
</mapping>
</sanitizer>
</param>
<param name="replacement" size="40" type="text" value="newchr\1" label="Replacement">
<sanitizer>
<valid>
<add preset="string.printable"/>
<remove value="\" />
<remove value="'" />
</valid>
<mapping initial="none">
<add source="\" target="__backslash__" />
<add source="'" target="__sq__"/>
</mapping>
</sanitizer>
</param>
</repeat>
</inputs>
<outputs>
<data format="input" name="out_file1" metadata_source="input" />
</outputs>
<tests>
<test>
<param name="input" value="find_tabular_1.txt" ftype="tabular" />
<param name="field" value="1" />
<param name="pattern" value="moo"/>
<param name="replacement" value="cow" />
<output name="out_file1" file="replace_tabular_1.txt"/>
</test>
</tests>
<help>
.. class:: warningmark
**This tool will attempt to reuse the metadata from your first input.** To change metadata assignments click on the "edit attributes" link of the history item generated by this tool.
.. class:: infomark
**TIP:** If your data is not TAB delimited, use *Text Manipulation->Convert*
-----
This tool goes line by line through the specified input file and
if the text in the selected column matches a specified regular expression pattern
replaces the text with the corresponding specified replacement.
This tool can be used to change between the chromosome naming conventions of UCSC and Ensembl.
For example to remove the **chr** part of the reference sequence name in the first column of this GFF file::
##gff-version 2
##Date: Thu Mar 23 11:21:17 2006
##bed2gff.pl $Rev: 601 $
##Input file: ./database/files/61c6c604e0ef50b280e2fd9f1aa7da61.dat
chr1 bed2gff CCDS1000.1_cds_0_0_chr1_148325916_f 148325916 148325975 . + . score "0";
chr21 bed2gff CCDS13614.1_cds_0_0_chr21_32707033_f 32707033 32707192 . + . score "0";
chrX bed2gff CCDS14606.1_cds_0_0_chrX_122745048_f 122745048 122745924 . + . score "0";
Setting::
using column: c1
Find Regex: chr([0-9]+|X|Y|M[Tt]?)
Replacement: \1
produces::
##gff-version 2
##Date: Thu Mar 23 11:21:17 2006
##bed2gff.pl $Rev: 601 $
##Input file: ./database/files/61c6c604e0ef50b280e2fd9f1aa7da61.dat
1 bed2gff CCDS1000.1_cds_0_0_chr1_148325916_f 148325916 148325975 . + . score "0";
21 bed2gff CCDS13614.1_cds_0_0_chr21_32707033_f 32707033 32707192 . + . score "0";
X bed2gff CCDS14606.1_cds_0_0_chrX_122745048_f 122745048 122745924 . + . score "0";
This tool uses Python regular expressions with the **re.sub()** function.
More information about Python regular expressions can be found here:
http://docs.python.org/library/re.html.
The regex **chr([0-9]+|X|Y|M[Tt]?)** means start with text **chr** followed by either: one or more digits, or the letter X, or the letter Y, or the letter M (optionally followed by a single letter T or t).
Note that the parentheses **()** capture patterns in the text that can be used in the replacement text by using a backslash-number reference: **\\1**
Galaxy aggressively escapes input supplied to tools, so if something
is not working please let us know and we can look into whether this is
the cause. Also if you would like help constructing regular
expressions for your inputs, please let us know at [email protected].
</help>
</tool>
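The column variant can be sketched the same way. The example below assumes the GFF rows are tab-separated (the tabs do not survive in the help text as shown) and uses a hypothetical helper, replace_in_column, that only touches the selected cell.

```python
import re

# Settings copied from the help text of the column tool.
PATTERN = r"chr([0-9]+|X|Y|M[Tt]?)"
REPLACEMENT = r"\1"


def replace_in_column(line, column, pattern=PATTERN, replacement=REPLACEMENT):
    """Apply re.sub to a single tab-separated cell; column is 1-based as in Galaxy."""
    cells = line.split("\t")
    index = column - 1
    if 0 <= index < len(cells):
        cells[index] = re.sub(pattern, replacement, cells[index])
    return "\t".join(cells)


row = "chr21\tbed2gff\tCCDS13614.1_cds_0_0_chr21_32707033_f\t32707033"
print(replace_in_column(row, 1))
```

Only the first column loses its chr prefix; the chr21 embedded in the feature name in the third column is left alone because that cell is never passed to re.sub().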
import re
from optparse import OptionParser

# Placeholder tokens that the tool XML sanitizers substitute for characters
# Galaxy would otherwise escape; they are mapped back before use.
MAPPED_CHARS = {"'": "__sq__", "\\": "__backslash__"}


def restore_chars(text):
    """Replace sanitizer placeholder tokens with the original characters."""
    for char, token in MAPPED_CHARS.items():
        text = text.replace(token, char)
    return text


def main():
    parser = OptionParser()
    parser.add_option("--input", dest="input")
    parser.add_option("--output", dest="output")
    parser.add_option("--pattern", dest="patterns", action="append",
                      help="regex pattern for replacement")
    parser.add_option("--replacement", dest="replacements", action="append",
                      help="replacement for regex match")
    parser.add_option("--column", dest="column", default=None)
    (options, args) = parser.parse_args()

    column = None
    if options.column is not None:
        # Galaxy tabular columns are 1-based; Python lists are zero-based.
        column = int(options.column) - 1

    # Undo the sanitizer mapping once, rather than once per input line.
    checks = [(restore_chars(p), restore_chars(r))
              for p, r in zip(options.patterns or [], options.replacements or [])]

    with open(options.input, 'r') as input_file:
        with open(options.output, 'w') as output_file:
            for line in input_file:
                for pattern, replacement in checks:
                    if column is None:
                        line = re.sub(pattern, replacement, line)
                    else:
                        cells = line.split("\t")
                        if len(cells) > column:
                            cells[column] = re.sub(pattern, replacement, cells[column])
                            line = "\t".join(cells)
                output_file.write(line)


if __name__ == "__main__":
    main()
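To connect this back to the original question: the sanitizer blocks in the tool XML map a backslash to __backslash__ and a single quote to __sq__, so the script never sees the raw characters on its command line and has to restore them itself. Below is a minimal sketch of that round trip. The helper name restore_tokens is hypothetical, and the idea that the default sanitizer is what turned \d+ into Xd+ is my assumption rather than something I have verified against your tool.

```python
import re

# Placeholder tokens declared in the <mapping> elements of the tool XML.
MAPPED_CHARS = {"'": "__sq__", "\\": "__backslash__"}


def restore_tokens(text):
    """Turn the placeholder tokens back into the characters the user typed."""
    for char, token in MAPPED_CHARS.items():
        text = text.replace(token, char)
    return text


# What the script would receive on its command line for the pattern \d+
received = "__backslash__d+"
pattern = restore_tokens(received)

print(pattern)                                 # \d+
print(re.sub(pattern, "N", "chr12 and chr7"))  # chrN and chrN
```

A Perl wrapper could presumably use the same approach: declare the mapping in the tool XML and substitute the tokens back before compiling the pattern.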