Thanks hwidong,
I agree sed and many other tools are capable. DoMY's scripting
environment is our CorpusFiltergraph. Its goal is to create a filter
graph (tool chain) that processes parallel "streams" of input from one
source language and multiple target languages through a series of
language-specific filters *AND* ensures alignment of all data across all
graphs. sed can't do that.
The documentation for CorpusFiltergraph is sparse. The installation
process adds three shortcut links to your Desktop or $HOME folder,
depending on whether you're in a GUI environment. The installation also
adds a folder, ~/domy-graphs, to the user's home folder. Its sub-folders
are the graphs. Each graph includes a config.ini file. Check out the
graph ~/domy-graphs/build-tm/config.ini.
The [output], renderlanguages= option defines a comma-separated list of
languages for which to build graphs of filters. Each language has its
own section with filterX= options. The X is the sequence order of the
filter. The value of the filterX= option is a comma-separated
definition. The first element is a code for the location of the plugin
(Python module), and the second is the plugin name. The location codes
are:
0= /usr/lib/corpusfg/plugins (graph agnostic/language agnostic)
1= /usr/lib/corpusfg/plugins/<lang> (graph agnostic/language specific)
2= ~/domy-graphs/<graph> (graph specific/language agnostic)
3= ~/domy-graphs/<graph>/<lang> (graph specific/language specific)
4= search the environment path
5= use relative path from /usr/lib/corpusfg as home
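To make the location codes concrete, here is a minimal Python sketch of
how a filterX= value could be split and mapped to a folder. It is not
taken from the CorpusFiltergraph source: the helper names
parse_filter_value and resolve_plugin_dir are hypothetical, and code 4
simply returns None because the search rules are not spelled out here.

```python
import posixpath

PLUGIN_HOME = "/usr/lib/corpusfg"   # install folder named in the list above
GRAPH_HOME = "~/domy-graphs"        # per-user graphs folder named above


def parse_filter_value(value):
    """Split a value such as '0,replace-escape-control' into (code, name)."""
    code, name = value.split(",", 1)
    return int(code), name.strip()


def resolve_plugin_dir(code, graph, lang, name):
    """Return the folder implied by a location code (hypothetical helper).

    '~' is left unexpanded; a caller could apply os.path.expanduser().
    """
    if code == 0:   # graph agnostic / language agnostic
        return posixpath.join(PLUGIN_HOME, "plugins")
    if code == 1:   # graph agnostic / language specific
        return posixpath.join(PLUGIN_HOME, "plugins", lang)
    if code == 2:   # graph specific / language agnostic
        return posixpath.join(GRAPH_HOME, graph)
    if code == 3:   # graph specific / language specific
        return posixpath.join(GRAPH_HOME, graph, lang)
    if code == 4:   # search the environment path: no fixed folder
        return None
    if code == 5:   # name is a path relative to PLUGIN_HOME
        return posixpath.normpath(
            posixpath.join(PLUGIN_HOME, posixpath.dirname(name)))
    raise ValueError("unknown location code: %r" % code)


code, name = parse_filter_value("0,replace-escape-control")
print(resolve_plugin_dir(code, "build-tm", "en", name))
```

The point of the codes is that a plugin never hard-codes its own
location; only the config.ini entry changes when a plugin moves.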
The filter graph manager first processes the source language graph
(i.e. the language listed in [input], SourceLanguages=), and then it
processes the target language graphs. Finally, the filter graph manager
processes the aligners by target language. The aligner is responsible
for aligning the target to the source language. Here are the relevant
sections from ~/domy-graphs/build-tm/config.ini:
[input]
SourceLanguages=nl              # nl is the source language

[output]
renderlanguages=en,nl           # build filter graphs for en and nl
                                # (nl is the source language from above)

[en]                            # English language graph definition
filter0=0,subprocess-tokenizer.perl
                                # filter plugin #0 (uses subprocess.Popen
                                # to call the Moses tokenizer.perl script
                                # for English)
filter1=0,replace-escape-control
                                # filter plugin #1 (subject of this
                                # moses-support thread)
filter2=0,convert-lowercase     # filter plugin #2 (self-explanatory)
aligner0=0,aligner-build-tm     # aligner plugin #0 (see section below)
[en,0,0,aligner-build-tm]
If a filter or aligner needs configuration, it looks in a section keyed
to the graph. The section [en,0,0,aligner-build-tm] tells the plugin
what sub-folder names to use (it's pretty cryptic now, sorry) and how
many random pairs to pull for the MERT tuning set (the same number gets
pulled for the evaluation set). It also says where to find our
makemteval Python module, which generates the .sgm files for
mteval-v12.pl (note, we erroneously call them .xml files and will change
this in a future version).
[en,0,0,aligner-build-tm]
extract2=builds
extract3=tm
extract4=~project~_sample_corpus
extractRoot=bitext
extractTypes=nl,en
makemteval=4,makemteval
setid=SetID
refid=RefID
sysid=SysID
docid=DocID
genre=Genre
mertset=500
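For anyone who wants to poke at a config.ini outside of DoMY, the
sections above can be read with Python's standard configparser. This is
only a sketch, assuming Python 3 (DoMY itself was tested with Python
2.6) and treating nl as the source language as the config comments say.
The helper names numbered and processing_order are hypothetical, and
the config text is an abbreviated copy of the excerpt above.

```python
import configparser

CONFIG_TEXT = """
[input]
SourceLanguages=nl              # nl is the source language
[output]
renderlanguages=en,nl           # build filter graphs for en and nl
[en]
filter0=0,subprocess-tokenizer.perl
filter1=0,replace-escape-control
filter2=0,convert-lowercase
aligner0=0,aligner-build-tm
[en,0,0,aligner-build-tm]
extractRoot=bitext
extractTypes=nl,en
mertset=500
"""


def numbered(section, prefix):
    """Return the values of prefix0, prefix1, ... in numeric order."""
    found = []
    for key, value in section.items():
        suffix = key[len(prefix):]
        if key.startswith(prefix) and suffix.isdigit():
            found.append((int(suffix), value))
    return [value for _, value in sorted(found)]


def processing_order(cfg):
    """Source language graphs first, then targets; aligners run per target."""
    sources = [s.strip() for s in cfg["input"]["sourcelanguages"].split(",")]
    rendered = [s.strip() for s in cfg["output"]["renderlanguages"].split(",")]
    targets = [lang for lang in rendered if lang not in sources]
    return [lang for lang in rendered if lang in sources] + targets, targets


# inline_comment_prefixes lets the trailing "# ..." comments be ignored.
cfg = configparser.ConfigParser(inline_comment_prefixes=("#",))
cfg.read_string(CONFIG_TEXT)

graph_order, aligner_order = processing_order(cfg)
en_filters = numbered(cfg["en"], "filter")
mertset = cfg["en,0,0,aligner-build-tm"].getint("mertset")

print(graph_order)    # ['nl', 'en']
print(en_filters)
print(mertset)        # 500
```

Note that configparser lower-cases option names by default, so
SourceLanguages and sourcelanguages refer to the same option.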
By the way, all of these graphs have been tested on both Linux and MS
Windows with Python 2.6.x. We use mteval-v12.pl because mteval-v13.pl has
a dependency that would not run on Windows the last time I checked. We
test with and use Strawberry Perl 5.10.1.
Finally, when it comes time to translate, you can re-create the graph in
your translation sequence to ensure pre-processing before translation is
identical. See the ~/domy-graphs/build-tm/config.ini graph for an
example. It takes our technicians about 5 minutes to create a new graph
for specific data processing. Creating new plugins is easy because the
technician doesn't usually need to worry about where to find the file or
where to put it. The config.ini and graph manager worry about that, not
the plugin (most of the time).
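The idea of re-creating the graph at translation time can be pictured
as one ordered chain of filters that is defined once and applied to
both the training corpus and the text to be translated. The sketch
below is plain Python, not the CorpusFiltergraph plugin API; the
functions are hypothetical stand-ins, and the real character mapping
used by replace-escape-control is not shown in this thread.

```python
def escape_angle_brackets(line):
    """Hypothetical stand-in for an escaping filter (mapping is assumed)."""
    return line.replace("<", "&lt;").replace(">", "&gt;")


def convert_lowercase(line):
    """Stand-in for a lowercasing filter."""
    return line.lower()


# One ordered chain, defined once and reused for both jobs.
EN_CHAIN = [escape_angle_brackets, convert_lowercase]


def apply_chain(chain, line):
    """Run a line through every filter in the chain, in order."""
    for filter_func in chain:
        line = filter_func(line)
    return line


# The same chain handles training data and translation input alike.
training_line = apply_chain(EN_CHAIN, "At < 0 C for 18 h .")
translation_input = apply_chain(EN_CHAIN, "Kept at < 0 C overnight .")

print(training_line)
print(translation_input)
```

Because both calls share EN_CHAIN, the translation input cannot drift
from the pre-processing the model was trained on.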
Sorry if this is too long. I hope it's useful. If there are any Python
volunteers out there, we'd love your input/participation to improve
CorpusFiltergraph.
Regards,
Tom
On Thu, 17 Feb 2011 22:35:48 +0900, Hwidong Na <[email protected]>
wrote:
> Hi Tom,
>
> Although I think command line tools such as sed are able to convert
> the problematic characters, I've installed DoMY CE and tried it. But I
> wonder how I can use the two plugin modules. Do you have any
> documentation for them?
>
> Best regards,
> --
> Hwidong Na <[email protected]>
> KLE lab, POSTECH, KOREA
>
> 2011-02-14 (Mon), 14:21 +0700, Tom Hoar:
>> Hi hwidong,
>>
>> This link lists a table of problematic characters and character
>> sequences that must be removed or escaped before training a
>> translation model, and before translating your new work.
>>
>> DoMY includes two plugin modules, replace-escape-control.py and
>> replace-unescape-control.py, that escape and un-escape these
>> characters.
>>
>>
>> http://www.precisiontranslationtools.com/index.php?option=com_content&view=article&id=94:are-there-characters-that-cause-problems-in-moses&catid=30:key-concepts&Itemid=57
>>
>> Regards,
>> Tom
>>
>>
>> On Mon, 14 Feb 2011 13:29:54 +0800, Hieu Hoang
>> <[email protected]>
>> wrote:
>> > Hi hwidong
>> >
>> > You probably have to preprocess the corpus to get rid of < and >
>> > symbols, as well as [ and ] symbols
>> >
>> > Hieu
>> > Sent from my flying horse
>> >
>> > On 14 Feb 2011, at 11:30 AM, Hwidong Na <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> When I extracted hierarchical phrases using the EMS, the
>> >> extraction step crashed. It seems to identify XML tags during the
>> >> extraction. For example, one of the error messages is:
>> >>
>> >> ERROR: malformed XML: It was kept in the ice bath for 30 min , at
>> >> ambient temperature for 2 h and at < 0 " C for 18 h . It was then
>> >> diluted with CH2Cl2 , washed with water and brine , dried ( MgSO4 )
>> >> and concentrated .
>> >> no target (0) or source (43) words << end insentence 993688
>> >> T: It was kept in the ice bath for 30 min , at ambient temperature
>> >> for 2 h and at < 0 " C for 18 h . It was then diluted with CH2Cl2 ,
>> >> washed with water and brine , dried ( MgSO4 ) and concentrated .
>> >> S: 将 其 在 冰浴 中 放置 30 分钟 , 室温 放置 2 小时 , 然后 在 < 0 ℃
>> >> 下 放置 18 小时 。 将 其 用 CH2Cl2 稀释 , 用水 和 盐 水 洗涤 , 干燥
>> >> ( MgSO4 ) 并 浓缩 。
>> >>
>> >> The revision number is 3729. Should I update to the newest
>> >> revision?
>> >>
>> >> Best regards,
>> >> --
>> >> Hwidong Na <[email protected]>
>> >> KLE lab, POSTECH, KOREA
>> >>
>> >> _______________________________________________
>> >> Moses-support mailing list
>> >> [email protected]
>> >> http://mailman.mit.edu/mailman/listinfo/moses-support