Re: [galaxy-dev] pass regular expression

John Chilton Sat, 26 Oct 2013 13:54:14 -0700

Jim Johnson and I have developed a pair of tools at MSI that do just
this, and they have proven useful in a broad range of scenarios (they
are implemented in Python instead of Perl).  I have attached them in
case you are interested in using or extending them instead of building
new ones. I think there might be some other variations on this idea in
the tool shed.


-John

On Fri, Oct 25, 2013 at 7:43 PM, Jun Fan <[email protected]> wrote:
> Hi all,
>
>
>
>      I am trying to develop a tool which allows the user to use regular
> expression to only keep the lines matching the given pattern in the output.
> There are many special characters, e.g. \ in \d. In the perl script the
> wrapper invokes, the regular expression is printed as Xd+ for the regular
> expression \d+. My question is how to pass these special characters
> including \, (, ), ^ etc. correctly into the perl script.
>
>
>
> Best regards!
>
> Jun
>
>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   http://lists.bx.psu.edu/
>
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/

<tool id="regex1" name="Regex Find And Replace" version="0.1.0">
  <description></description>
  <command interpreter="python">regex.py --input $input --output $out_file1
    #for $check in $checks:
    --pattern='$check.pattern' --replacement='$check.replacement'
    #end for
  </command>
  <inputs>
    <param format="txt" name="input" type="data" label="Select lines from"/>
    <repeat name="checks" title="Check">
      <param name="pattern" size="40" type="text" value="chr([0-9A-Za-z])+" label="Find Regex" help="Enter text or a regular expression (for syntax, see the help section below)">
        <sanitizer>
          <valid>
            <add preset="string.printable"/>
            <remove value="&#92;" />
            <remove value="&apos;" />
          </valid>
          <mapping initial="none">
            <add source="&#92;" target="__backslash__" />
            <add source="&apos;" target="__sq__"/>
          </mapping>
        </sanitizer>
      </param>
      <param name="replacement" size="40" type="text" value="newchr\1" label="Replacement">
        <sanitizer>
          <valid>
            <add preset="string.printable"/>
            <remove value="&#92;" />
            <remove value="&apos;" />
          </valid>
          <mapping initial="none">
            <add source="&#92;" target="__backslash__" />
            <add source="&apos;" target="__sq__"/>
          </mapping>
        </sanitizer>      
      </param>
    </repeat>
  </inputs>
  <outputs>
    <data format="input" name="out_file1" metadata_source="input"/>
  </outputs>
  <tests>
    <test>
      <param name="input" value="find1.txt"/>
      <param name="pattern" value="(T\w+)"/>
      <param name="replacement" value="\1 \1" />
      <output name="out_file1" file="replace1.txt"/>
    </test>
    <test>
      <param name="input" value="find1.txt"/>
      <param name="pattern" value="f"/>
      <param name="replacement" value="'&quot;" />
      <output name="out_file1" file="replace2.txt"/>
    </test>
  </tests>
  <help>
This tool goes line by line through the specified input file and
replaces text which matches the specified regular expression patterns
with its corresponding specified replacement.

This tool uses Python regular expressions. More information about
Python regular expressions can be found here:
http://docs.python.org/library/re.html.

To convert an Illumina FASTQ sequence ID from the CASAVA 8 format::

 @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
 GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
 +EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
 IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

To the CASAVA 7 format::

 @EAS139_FC706VJ:2:2104:15343:197393#0/1
 GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
 +EAS139_FC706VJ:2:2104:15343:197393#0/1
 IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

Use Settings::

 Find Regex: ^([@+][A-Z0-9]+):\d+:(\S+)\s(\d).*$
 Replacement: \1_\2#0/\3

Note that the parentheses **()** capture patterns in the text that can be used in the replacement text by using a backslash-number reference:  **\\1**

The regex **^([@+][A-Z0-9]+):\\d+:(\\S+)\\s(\\d).*$** means::

  ^  - start the match at the beginning of the line of text
  (  - start a group (1), that is a string of matched text, that can be back-referenced in the replacement as \1
  [@+]  - matches either a @ or + character
  [A-Z0-9]+  - matches an uppercase letter or a digit, the plus sign means to match 1 or more such characters
  )  - end a group (1), that is a string of matched text, that can be back-referenced in the replacement as \1
  :\d+:   - matches a colon followed by one or more digits followed by a colon character
  (\S+)  - matches one or more non-whitespace characters, the enclosing parentheses make this a group (2) that can be back-referenced in the replacement text as \2
  \s  - matches a whitespace character
  (\d)  - matches a single digit character, the enclosing parentheses make this a group (3) that can be back-referenced in the replacement text as \3
  .*  - dot means match any character, asterisk means zero or more matches
  $  - the regex must match to the end of the line of text



Galaxy aggressively escapes input supplied to tools, so if something
is not working please let us know and we can look into whether this is
the cause. Also if you would like help constructing regular
expressions for your inputs, please let us know at [email protected].
</help>
</tool>
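
As a quick check of the help example in the tool above, here is a minimal sketch that applies the same Find Regex / Replacement pair with Python's re.sub, outside of Galaxy. It is only an illustration and not part of the attached tool; the variable names are made up for the example.

```python
import re

# Pattern and replacement copied from the "Use Settings" part of the help text.
pattern = r"^([@+][A-Z0-9]+):\d+:(\S+)\s(\d).*$"
replacement = r"\1_\2#0/\3"

# CASAVA 8 style header line from the help text.
header = "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
sequence = "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC"

converted = re.sub(pattern, replacement, header)
untouched = re.sub(pattern, replacement, sequence)

print(converted)  # @EAS139_FC706VJ:2:2104:15343:197393#0/1
print(untouched)  # the sequence line does not match, so it is written unchanged
```

Lines that do not match the pattern (the sequence and quality lines here) pass through unchanged, which is why the same settings can be run over the whole FASTQ file.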

<tool id="regexColumn1" name="Column Regex Find And Replace" version="0.1.0">
  <description></description>
  <command interpreter="python">regex.py --input $input --output $out_file1 --column $field
    #for $check in $checks:
    --pattern='$check.pattern' --replacement='$check.replacement'
    #end for
  </command>
  <inputs>
    <param format="tabular" name="input" type="data" label="Select cells from"/>
    <param name="field" label="using column" type="data_column" data_ref="input" />
    <repeat name="checks" title="Check">
      <param name="pattern" size="40" type="text" value="chr([0-9A-Za-z])+" label="Find Regex" help="Enter text or a regular expression (for syntax, see the help section below)">
        <sanitizer>
          <valid>
            <add preset="string.printable"/>
            <remove value="&#92;" />
            <remove value="&apos;" />
          </valid>
          <mapping initial="none">
            <add source="&#92;" target="__backslash__" />
            <add source="&apos;" target="__sq__"/>
          </mapping>
        </sanitizer>
      </param>
      <param name="replacement" size="40" type="text" value="newchr\1" label="Replacement">
        <sanitizer>
          <valid>
            <add preset="string.printable"/>
            <remove value="&#92;" />
            <remove value="&apos;" />
          </valid>
          <mapping initial="none">
            <add source="&#92;" target="__backslash__" />
            <add source="&apos;" target="__sq__"/>
          </mapping>
        </sanitizer>      
      </param>
    </repeat>
  </inputs>
  <outputs>
    <data format="input" name="out_file1" metadata_source="input" />
  </outputs>
  <tests>
    <test>
      <param name="input" value="find_tabular_1.txt" ftype="tabular" />
      <param name="field" value="1" />
      <param name="pattern" value="moo"/>
      <param name="replacement" value="cow" />
      <output name="out_file1" file="replace_tabular_1.txt"/>
    </test>
  </tests>
  <help>

.. class:: warningmark

**This tool will attempt to reuse the metadata from your first input.** To change metadata assignments click on the "edit attributes" link of the history item generated by this tool.

.. class:: infomark

**TIP:** If your data is not TAB delimited, use *Text Manipulation-&gt;Convert*

-----

This tool goes line by line through the specified input file and,
if the text in the selected column matches a specified regular expression pattern,
replaces the matched text with the corresponding specified replacement.

This tool can be used to change between the chromosome naming conventions of UCSC and Ensembl.  

For example, to remove the **chr** part of the reference sequence name in the first column of this GFF file::

 ##gff-version 2
 ##Date: Thu Mar 23 11:21:17 2006
 ##bed2gff.pl $Rev: 601 $
 ##Input file: ./database/files/61c6c604e0ef50b280e2fd9f1aa7da61.dat
 chr1	bed2gff	CCDS1000.1_cds_0_0_chr1_148325916_f	148325916	148325975	.	+	.	score "0";
 chr21	bed2gff	CCDS13614.1_cds_0_0_chr21_32707033_f	32707033	32707192	.	+	.	score "0";
 chrX	bed2gff	CCDS14606.1_cds_0_0_chrX_122745048_f	122745048	122745924	.	+	.	score "0";

Setting::

 using column: c1 
 Find Regex: chr([0-9]+|X|Y|M[Tt]?) 
 Replacement: \1 

produces::

 ##gff-version 2
 ##Date: Thu Mar 23 11:21:17 2006
 ##bed2gff.pl $Rev: 601 $
 ##Input file: ./database/files/61c6c604e0ef50b280e2fd9f1aa7da61.dat
 1	bed2gff	CCDS1000.1_cds_0_0_chr1_148325916_f	148325916	148325975	.	+	.	score "0";
 21	bed2gff	CCDS13614.1_cds_0_0_chr21_32707033_f	32707033	32707192	.	+	.	score "0";
 X	bed2gff	CCDS14606.1_cds_0_0_chrX_122745048_f	122745048	122745924	.	+	.	score "0";


This tool uses Python regular expressions with the **re.sub()** function. 
More information about Python regular expressions can be found here:
http://docs.python.org/library/re.html.

The regex **chr([0-9]+|X|Y|M[Tt]?)** means start with the text **chr** followed by either: one or more digits, or the letter X, or the letter Y, or the letter M (optionally followed by a single letter T or t).
Note that the parentheses **()** capture patterns in the text that can be used in the replacement text by using a backslash-number reference:  **\\1**



Galaxy aggressively escapes input supplied to tools, so if something
is not working please let us know and we can look into whether this is
the cause. Also if you would like help constructing regular
expressions for your inputs, please let us know at [email protected].

</help>
</tool>
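
To make the column behaviour concrete, here is a small sketch of what happens to one line of the GFF example: only the selected cell is passed to re.sub, so the **chr** inside the third column is left alone. This is an illustration written for this message, not a copy of the attached script, and the shortened sample line is an assumption.

```python
import re

# Settings from the help text; "c1" in Galaxy corresponds to index 0 in Python.
pattern = r"chr([0-9]+|X|Y|M[Tt]?)"
replacement = r"\1"
column = 0

# Shortened, tab-delimited version of one line from the GFF example.
line = "chr21\tbed2gff\tCCDS13614.1_cds_0_0_chr21_32707033_f\t32707033\n"

cells = line.split("\t")
cells[column] = re.sub(pattern, replacement, cells[column])
result = "\t".join(cells)

print(result, end="")
```

Running the same substitution on the whole line instead would also rewrite "chr21" in the feature name, which is the reason for having a separate column-aware tool.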

import re
from optparse import OptionParser

def main():
  parser = OptionParser()
  parser.add_option("--input", dest="input")
  parser.add_option("--output", dest="output")
  parser.add_option("--pattern", dest="patterns", action="append",
                    help="regex pattern for replacement")
  parser.add_option("--replacement", dest="replacements", action="append",
                    help="replacement for regex match")
  parser.add_option("--column", dest="column", default=None)
  (options, args) = parser.parse_args()

  # Galaxy's sanitizer swaps these characters for placeholder tokens (see the
  # <mapping> elements in the tool XML); they are mapped back before use.
  mapped_chars = { '\'' :'__sq__', '\\' : '__backslash__' }

  column = None
  if options.column is not None:
    column = int(options.column) - 1 # Galaxy tabular columns are 1-based, Python lists are zero-based

  # optparse leaves these as None when the option is never given
  patterns = options.patterns or []
  replacements = options.replacements or []

  with open(options.input, 'r') as input:
    with open(options.output, 'w') as output:
      for line in input:
        # apply each pattern/replacement pair in order to the current line
        for (pattern, replacement) in zip(patterns, replacements):
          for key, value in mapped_chars.items():
            pattern = pattern.replace(value, key)
            replacement = replacement.replace(value, key)
          if column is None:
            line = re.sub(pattern, replacement, line)
          else:
            cells = line.split("\t")
            if cells and len(cells) > column:
              cell = cells[column]
              cell = re.sub(pattern, replacement, cell)
              cells[column] = cell
              line = "\t".join(cells)
        output.write(line)

if __name__ == "__main__":
    main()
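
On the original question of \d+ arriving as Xd+: the sanitizer mapping in the tool XML and the mapped_chars table in the script work as a pair. The sketch below shows the round trip as I understand it. The encode function is a hypothetical stand-in for what Galaxy does with the mapping entries, not Galaxy's actual code; decode mirrors the loop in the script.

```python
import re

# Token table copied from the script: keys are the real characters,
# values are the placeholder tokens named in the tool XML mapping.
mapped_chars = {"'": "__sq__", "\\": "__backslash__"}

def encode(text):
    """Hypothetical stand-in for the sanitizer mapping applied by Galaxy."""
    for char, token in mapped_chars.items():
        text = text.replace(char, token)
    return text

def decode(text):
    """Reverse the mapping, as the script does before calling re.sub."""
    for char, token in mapped_chars.items():
        text = text.replace(token, char)
    return text

user_pattern = r"\d+"
on_command_line = encode(user_pattern)  # what the wrapper would receive
restored = decode(on_command_line)      # what is finally handed to re.sub

print(on_command_line)
print(restored)
print(re.sub(restored, "#", "lane 12"))
```

Because the backslash and the single quote never appear literally on the command line, the single-quoted --pattern='...' arguments in the command block stay intact, and the script sees the regular expression the user typed.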
