The first cross-platform versions of train-model.perl, mert-moses.pl,
and their sub-scripts are ready for community testing. In total, we
touched 12 Perl and Python files, including one from MGIZA++. We renamed
the files so the two versions can reside side by side in the same
scripts tree; please note the respective -x or _x suffix before the file
extension. These are the files:
1. lib/mgizapp/scripts/merge_alignment_x.py
2. lib/mosesdecoder/scripts/training/train-model-x.perl
3. lib/mosesdecoder/scripts/training/reduce_combine_x.pl
4. lib/mosesdecoder/scripts/training/mert-moses-x.pl
5. lib/mosesdecoder/scripts/training/LexicalTranslationModelX.pm
6. lib/mosesdecoder/scripts/training/giza2bal-x.pl
7. lib/mosesdecoder/scripts/training/flexibility_score_x.py
8. lib/mosesdecoder/scripts/training/filter-rule-table-x.py
9. lib/mosesdecoder/scripts/training/filter-model-given-input-x.pl
10. lib/mosesdecoder/scripts/generic/score-parallel-x.perl
11. lib/mosesdecoder/scripts/generic/moses_sim_pe_x.py
12. lib/mosesdecoder/scripts/generic/extract-parallel-x.perl
Later this week, we will add these files to trunk. I invite the
community's scrutiny on both the Posix and Windows runtimes. These
updates required some heavy-duty clean-up, and the results are a
significant departure from the previous scripts, so we took the liberty
of adding some new features as well. Overall, however, the scripts are
intended to run identically to the existing Posix-only versions. I'm
happy to discuss our strategy/design with anyone who is interested.
UPDATES:
1. Gracefully exit from forked processes. The script now catches a
subprocess failure and shuts down ASAP. Typically, this means the
other forks are terminated as well. A failure in step 2 (GIZA) now
terminates the run during step 2, so users won't have to trace
problems that reveal themselves in step 5 back to step 2.
2. Temp file space. The --temp-dir argument is now fully operational in
all steps that create temp files. By default, train-model.perl uses
Perl to create a subfolder under the system temp folder. Perl cleans
up that folder when the context terminates, unless there's a crash,
in which case it leaves the folder in place (a nice touch for
troubleshooting). The user can instead define a --temp-dir on the
command line. These user-defined folders are not deleted, because
users who define their own will probably want to inspect them when
troubleshooting. So, when update (1) above reports out-of-drive-space
errors, users can simply mount a new drive and point to it with
--temp-dir.
3. All forked processing happens in child processes. The script spins
off child processes and the parent process simply sits idle and
waits. Code management is easier this way than with all-children
processing in some places and parent/children processing in others.
4. The scripts carry their own copies of many housekeeping functions
(subs), such as new_path() for path concatenation, makedirs(),
removedirs(), normalize_path(), and others.
5. Those familiar with Windows environments will notice that we
append ".exe" to each Windows binary name even though Windows does
not require this. The objective is to minimize tech support requests.
If we do not rely on the .exe extension, it is possible for the
Windows environment to launch a .bat, .cmd, or other file instead of
the binary. Granted, it's probably an edge case, but you never know in
the Windows world.
6. Full support for relative data paths given at the command prompt.
Path normalization applies Perl's rel2abs() and Python's
os.path.abspath() to every file path on the disk system, whether
typed in by a user or read from a config file. Relative paths are
always resolved against the user's current working directory (CWD).
7. Path concatenation always normalizes path strings to use the
current host's OS separator. Although Perl and Python (maybe C?) can
tolerate forward slashes in Windows paths, we normalized them. This
is necessary for command-line arguments passed when launching new
subprocesses, and once we do it there, it's just as easy to do it
everywhere.
8. One interesting update: the train-model.perl script never checked
whether the corpus.f and corpus.e files were present on the file
system. We added that check, along with checks for many other input
files.
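
For readers who want a concrete picture of updates 1 through 3, here is
a minimal Python sketch. It is not the code from the new scripts: the
names run_children() and working_temp_dir() are hypothetical, and the
polling loop is only one simple way for an idle parent to notice a
failed child. It assumes each piece of work can be started as its own
command line.

```python
import contextlib
import os
import shutil
import subprocess
import tempfile
import time


def run_children(commands, poll_interval=0.05):
    """Start every command as a child process, then wait in the parent.

    Returns 0 if all children exit with code 0. On the first non-zero
    exit code, the children still running are terminated and that code
    is returned, so a failure surfaces in the step where it happened.
    """
    running = [subprocess.Popen(cmd) for cmd in commands]
    while running:
        for child in list(running):
            code = child.poll()
            if code is None:
                continue  # this child is still working
            running.remove(child)
            if code != 0:
                for other in running:
                    other.terminate()
                for other in running:
                    other.wait()
                return code
        time.sleep(poll_interval)
    return 0


@contextlib.contextmanager
def working_temp_dir(user_dir=None):
    """Yield a temp folder path.

    A user-defined folder is created if needed and never deleted. A
    default folder is created under the system temp folder and deleted
    only on a clean exit; if the body raises, the folder is left behind
    for troubleshooting (there is deliberately no try/finally here).
    """
    if user_dir is not None:
        os.makedirs(user_dir, exist_ok=True)
        yield os.path.abspath(user_dir)
        return
    path = tempfile.mkdtemp(prefix="train-model-")  # prefix is illustrative
    yield path
    shutil.rmtree(path, ignore_errors=True)
```

A caller would wrap a training step in "with working_temp_dir(args_temp_dir)
as tmp:" and stop the whole run as soon as run_children() returns a
non-zero code.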
WHAT'S NEXT:
In a parallel effort, we have been working to update the binaries. Like
the Perl scripts, the first step is to remove all "posix" shell calls
except for the 8 dependencies listed below. This means native C/C++ code
will replace calls to 'mkdir', 'rm', etc. Concurrently, we are also
removing hard-coded path separators and concatenating all paths with
native OS separators. Once the Posix dependencies are gone, we'll
address specific import library support.
Again, I invite list members to test the new version.
Verifying/validating parity performance on Linux is the first QC step.
Thanks all!
Tom
These scripts require that the following dependencies (sorry, not listed
by step) be present and executable, or they will fail to run.
A) Infrastructure support. We tested with the native Windows binaries
from Gow and with the Cygwin versions. There are some bugs, but we need
to have full binaries before we can fully debug their use. The system
checks that these files are present in the path by executing them and
reading their return codes.
1. gsplit[.exe] or split[.exe]
2. gsort[.exe] or sort[.exe]
3. pigz[.exe] or gzip[.exe]
4. unpigz[.exe] or gunzip[.exe]
5. bzcat[.exe]
6. uniq[.exe]
7. zcat[.exe] (gzip[.exe])
8. cat or type
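
As a rough illustration of the "execute it and read the return code"
check for list A, here is a Python sketch. It is not taken from the
scripts. The function names are hypothetical, and probing with
"--version" is an assumption that holds for the GNU tools but not
necessarily for every build of these utilities. "cat or type" is left
out because "type" is a shell built-in on Windows, not an executable.

```python
import shutil
import subprocess

# Alternatives per dependency, in the order given in list A above.
REQUIRED_TOOLS = [
    ("gsplit", "split"),
    ("gsort", "sort"),
    ("pigz", "gzip"),
    ("unpigz", "gunzip"),
    ("bzcat",),
    ("uniq",),
    ("zcat", "gzip"),
]


def find_tool(alternatives, probe_args=("--version",)):
    """Return the path of the first alternative that runs and exits 0.

    probe_args is an assumption: a harmless argument list that makes
    the tool exit quickly with return code 0.
    """
    for name in alternatives:
        # On Windows, shutil.which() also tries the PATHEXT extensions.
        path = shutil.which(name)
        if path is None:
            continue
        try:
            result = subprocess.run(
                [path, *probe_args],
                stdin=subprocess.DEVNULL,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=10,
            )
        except (OSError, subprocess.TimeoutExpired):
            continue
        if result.returncode == 0:
            return path
    return None


def missing_tools(required=REQUIRED_TOOLS):
    """Return the groups for which no alternative could be executed."""
    return [group for group in required if find_tool(group) is None]
```

Calling missing_tools() before training starts gives one place to report
every absent dependency at once instead of failing at the first step
that needs it.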
B) External binaries & scripts, i.e. --external-bin-dir for (M)GIZA++.
The system checks that these files are present by executing them from
their predicted installed locations and reading their return codes.
1. GIZA++[.exe] or mgiza[.exe] or mgizapp[.exe]
2. snt2cooc.out[.exe] or snt2cooc[.exe] (or snt2cooc.pl if configured)
3. mkcls[.exe]
4. merge_alignment_x.py
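
To make "predicted installed locations" concrete, here is a Python
sketch that combines the ".exe" rule and the path normalization
described under UPDATES. All three function names are hypothetical and
this is not the scripts' own code. The sketch only locates the file; the
execution and return-code step is omitted.

```python
import os
import sys


def binary_name(name, windows=None):
    """Append ".exe" on Windows so a .bat or .cmd file cannot be picked up."""
    if windows is None:
        windows = sys.platform.startswith("win")
    if windows and not name.lower().endswith(".exe"):
        return name + ".exe"
    return name


def predicted_paths(bin_dir, names, windows=None):
    """Build one absolute, normalized path per candidate binary name.

    A relative bin_dir is resolved against the current working
    directory, and os.path.normpath() rewrites the separators in the
    host OS style.
    """
    base = os.path.abspath(bin_dir)
    return [
        os.path.normpath(os.path.join(base, binary_name(name, windows)))
        for name in names
    ]


def first_existing(paths):
    """Return the first path that is a regular file, or None."""
    for path in paths:
        if os.path.isfile(path):
            return path
    return None
```

For example, first_existing(predicted_paths(external_bin_dir,
["GIZA++", "mgiza", "mgizapp"])) would pick whichever aligner binary is
installed, where external_bin_dir stands for the value given to
--external-bin-dir.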
C) Moses binaries & scripts
1. giza2bal-x.pl
2. symal[.exe]
3. extract[.exe]
4. eppex[.exe]
5. extract-rules[.exe]
6. extract-ghkm[.exe]
7. extract-parallel-x.perl
8. score[.exe]
9. memscore[.exe]
10. consolidate[.exe]
11. score-parallel-x.perl
12. flexibility_score_x.py
13. mert[.exe]
14. extractor[.exe]
15. pro[.exe]
16. kbmira[.exe]
17. evaluator[.exe]
18. filter-model-given-input-x.pl
19. remove-segmentation-markup.perl
20. promix/main.py (not complete)
21. moses[.exe] or moses-cmd.exe
22. moses_sim_pe_x.py
23. NOTE: this does NOT cover qsub-wrapper-x.pl or other moses-parallel work
_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support