Thanks hwidong,

 I agree sed and many other tools are capable. DoMY's scripting 
 environment is our CorpusFiltergraph. Its goal is to create a filter 
 graph (tool chain) that processes parallel "streams" of input from one 
 source language and multiple target languages through a series of 
 language-specific filters *AND* ensures alignment of all data across all 
 graphs. sed can't do that.
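
 To illustrate the idea of per-language filters that keep every stream 
 aligned, here is a minimal Python sketch. It is not CorpusFiltergraph's 
 code: the function name run_graphs, the row layout, and the rule of 
 dropping a segment from all languages when a filter empties it in any 
 one of them are assumptions made only for this illustration.

```python
def run_graphs(rows, chains):
    """Apply per-language filter chains to aligned rows.

    rows:   list of dicts, one per segment, mapping language code -> text
    chains: dict mapping language code -> list of str -> str functions
    Assumption: a row is dropped from every language at once if any
    filter leaves one of its segments empty, so the streams stay aligned.
    """
    kept = []
    for row in rows:
        out = {}
        for lang, text in row.items():
            # Run this language's filters in their configured order.
            for apply_filter in chains.get(lang, []):
                text = apply_filter(text)
            out[lang] = text
        # Keep the row only if every language still has content.
        if all(out.values()):
            kept.append(out)
    return kept


rows = [
    {"nl": "Hallo Wereld", "en": "Hello World"},
    {"nl": "   ", "en": "Orphaned line"},  # source side is empty after strip
]
chains = {"nl": [str.strip, str.lower], "en": [str.strip, str.lower]}
aligned = run_graphs(rows, chains)
print(aligned)
```

 A line-oriented tool such as sed works on one file at a time, so it 
 has no way to know that a segment removed from one file must also be 
 removed from its counterparts.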

 The documentation for CorpusFiltergraph is sparse. The installation 
 process adds three shortcut links to your Desktop or $HOME folder 
 depending on whether you're in a GUI environment. The installation also 
 adds a folder, ~/domy-graphs, to the user's home folder. Its sub-folders 
 are the graphs. Each graph includes a config.ini file. For an example, 
 see ~/domy-graphs/build-tm/config.ini.

 The [output], renderlanguages= option defines a comma-separated list of 
 languages for which to build graphs of filters. Each language has its 
 own section with filterX= options. The X is the sequence order of the 
 filter. The value of the filterX= option is a comma-separated 
 definition. The first element is a code for the location of the plugin 
 (python module).

 0= /usr/lib/corpusfg/plugins        (graph agnostic/language agnostic)
 1= /usr/lib/corpusfg/plugins/<lang> (graph agnostic/language specific)
 2= ~/domy-graphs/<graph>            (graph specific/language agnostic)
 3= ~/domy-graphs/<graph>/<lang>     (graph specific/language specific)
 4= search the environment path
 5= use relative path from /usr/lib/corpusfg as home
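
 The location codes above can be read as a simple lookup table. The 
 Python sketch below is only an illustration of that table: the 
 function name plugin_dir is hypothetical, returning None for code 4 
 (to mean "search the environment path") is an assumption, and so is 
 returning the bare /usr/lib/corpusfg root for code 5.

```python
import posixpath

PLUGIN_ROOT = "/usr/lib/corpusfg"
GRAPH_ROOT = "~/domy-graphs"  # left unexpanded; os.path.expanduser would expand it


def plugin_dir(code, graph, lang):
    """Map a location code (0-5) to the directory the plugin is loaded from.

    Returns None for code 4, where the plugin is searched for on the
    environment path instead of in a fixed directory (assumption).
    """
    table = {
        0: posixpath.join(PLUGIN_ROOT, "plugins"),        # graph/language agnostic
        1: posixpath.join(PLUGIN_ROOT, "plugins", lang),  # language specific
        2: posixpath.join(GRAPH_ROOT, graph),             # graph specific
        3: posixpath.join(GRAPH_ROOT, graph, lang),       # graph and language specific
        4: None,                                          # search the environment path
        5: PLUGIN_ROOT,                                   # relative paths start here
    }
    if code not in table:
        raise ValueError("unknown location code: %r" % (code,))
    return table[code]


print(plugin_dir(0, "build-tm", "en"))
print(plugin_dir(3, "build-tm", "en"))
```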

 The filter graph manager first processes the source language graph 
 (the language listed in [input], sourcelanguages=), and then it 
 processes the target language graphs (those NOT listed there). Finally, 
 the filter graph manager processes the aligners by target language. The 
 aligner is responsible for aligning the target to the source language. 
 Here are the relevant sections from ~/domy-graphs/build-tm/config.ini:

 [input]
 SourceLanguages=nl                  # nl is the source language

 [output]
 renderlanguages=en,nl               # build filter graphs for en and nl
                                     # (nl is the source language from above)

 [en] # English language graph definition
 filter0=0,subprocess-tokenizer.perl # filter plugin #0 (uses
                                     # subprocess.Popen to call the Moses
                                     # tokenizer.perl script for English)
 filter1=0,replace-escape-control    # filter plugin #1 (subject of this
                                     # moses-support thread)
 filter2=0,convert-lowercase         # filter plugin #2 (self-explanatory)
 aligner0=0,aligner-build-tm         # aligner plugin #0 (see section below)
                                     # [en,0,0,aligner-build-tm]
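
 For anyone who wants to inspect a graph definition programmatically, 
 the sketch below reads the filterX= and alignerX= options with Python's 
 standard configparser module (Python 3 spelling) and orders them by X. 
 It is not the graph manager's own parser: the helper name read_chain is 
 hypothetical, and the sample text repeats the options above without 
 their comments.

```python
import configparser
import re

SAMPLE = """\
[input]
SourceLanguages=nl

[output]
renderlanguages=en,nl

[en]
filter0=0,subprocess-tokenizer.perl
filter1=0,replace-escape-control
filter2=0,convert-lowercase
aligner0=0,aligner-build-tm
"""


def read_chain(cfg, lang, kind="filter"):
    """Return [(location_code, plugin_name), ...] ordered by the X in <kind>X=."""
    pattern = re.compile(r"^%s(\d+)$" % re.escape(kind))
    found = []
    for key, value in cfg.items(lang):
        match = pattern.match(key)
        if match:
            # First element is the location code; the rest names the plugin.
            code, name = value.split(",", 1)
            found.append((int(match.group(1)), int(code), name.strip()))
    return [(code, name) for _, code, name in sorted(found)]


cfg = configparser.ConfigParser()  # option names are matched case-insensitively
cfg.read_string(SAMPLE)

languages = [s.strip() for s in cfg.get("output", "renderlanguages").split(",")]
filters = read_chain(cfg, "en")
aligners = read_chain(cfg, "en", kind="aligner")
print(languages, filters, aligners)
```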

 If a filter or aligner needs configuration, it looks in a section keyed 
 to the graph. The section [en,0,0,aligner-build-tm] tells the plugin 
 what sub-folder names to use (it's pretty cryptic for now, sorry) and 
 how many random pairs to pull for the MERT tuning set (the same number 
 gets pulled for the evaluation set). It also says where to find our 
 makemteval python module, which generates the .sgm files for 
 mteval-v12.pl (note: we erroneously call them .xml files and will change 
 this in a future version).

 [en,0,0,aligner-build-tm]
 extract2=builds
 extract3=tm
 extract4=~project~_sample_corpus
 extractRoot=bitext
 extractTypes=nl,en
 makemteval=4,makemteval
 setid=SetID
 refid=RefID
 sysid=SysID
 docid=DocID
 genre=Genre
 mertset=500

 By the way, all of these graphs have been tested on both Linux and MS 
 Windows with Python 2.6.x. We use mteval-v12.pl because mteval-v13.pl 
 has a dependency that wouldn't run on Windows the last time I checked. 
 We test with and use Strawberry Perl 5.10.1.

 Finally, when it comes time to translate, you can re-create the graph in 
 your translation sequence to ensure that preprocessing before 
 translation is identical. See the ~/domy-graphs/build-tm/config.ini 
 graph for an example. It takes our technicians about 5 minutes to create 
 a new graph for specific data processing. Creating new plugins is easy 
 because the technician doesn't usually need to worry about where to find 
 the file or where to put it. The config.ini and graph manager worry 
 about that, not the plugin (most of the time).

 Sorry if this is too long. I hope it's useful. If there are any python 
 volunteers out there, we'd love your input/participation to improve 
 CorpusFiltergraph.

 Regards,
 Tom



 On Thu, 17 Feb 2011 22:35:48 +0900, Hwidong Na <[email protected]> wrote:
> Hi Tom,
>
> Although I think command line tools such as sed are able to convert the
> problematic characters, I've installed DoMy CE and tried it. But I wonder
> how I can use the two plugin modules. Do you have any documentation for
> them?
>
> Best regards,
> --
> Hwidong Na <[email protected]>
> KLE lab, POSTECH, KOREA
>
> 2011-02-14 (월), 14:21 +0700, Tom Hoar:
>> Hi hwidong,
>>
>>  This link lists a table of problematic characters and character
>>  sequences that must be removed or escaped before training a
>>  translation model, and before translating your new work.
>>
>>  DoMY includes two plugin modules, replace-escape-control.py and
>>  replace-unescape-control.py, that escape and un-escape these
>>  characters.
>>
>>  http://www.precisiontranslationtools.com/index.php?option=com_content&view=article&id=94:are-there-characters-that-cause-problems-in-moses&catid=30:key-concepts&Itemid=57
>>
>>  Regards,
>>  Tom
>>
>>
>>  On Mon, 14 Feb 2011 13:29:54 +0800, Hieu Hoang <[email protected]> wrote:
>> > Hi hwidong
>> >
>> > You probably have to preprocess the corpus to get rid of < and >
>> > symbols, as well as [ and ] symbols
>> >
>> > Hieu
>> > Sent from my flying horse
>> >
>> > On 14 Feb 2011, at 11:30 AM, Hwidong Na <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> When I extract hierarchical phrases using the EMS, the extraction
>> >> step crashed, and it seems to identify XML tags during the
>> >> extraction.
>> >> For example, one of the error messages is
>> >>
>> >> ERROR: malformed XML: It was kept in the ice bath for 30 min , at
>> >> ambient temperature for 2 h and at < 0 " C for 18 h . It was then
>> >> diluted with CH2Cl2 , washed with water and brine , dried ( MgSO4 )
>> >> and concentrated .
>> >> no target (0) or source (43) words << end insentence 993688
>> >> T: It was kept in the ice bath for 30 min , at ambient temperature
>> >> for 2 h and at < 0 " C for 18 h . It was then diluted with CH2Cl2 ,
>> >> washed with water and brine , dried ( MgSO4 ) and concentrated .
>> >> S: 将 其 在 冰浴 中 放置 30 分钟 , 室温 放置 2 小时 , 然后 在 < 0 ℃
>> >> 下 放置 18 小时 。 将 其 用 CH2Cl2 稀释 , 用水 和 盐 水 洗涤 , 干燥
>> >> ( MgSO4 ) 并 浓缩 。
>> >>
>> >> The revision number is 3729. Should I update to the newest
>> >> revision?
>> >>
>> >> Best regards,
>> >> --
>> >> Hwidong Na <[email protected]>
>> >> KLE lab, POSTECH, KOREA
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> Moses-support mailing list
>> >> [email protected]
>> >> http://mailman.mit.edu/mailman/listinfo/moses-support
