The first cross-platform versions of train-model.perl, mert-moses.pl and their sub-scripts are ready for community testing.

In total, we touched 12 Perl and Python files, including one from MGIZA++; they are listed below. We changed the file names so the two versions can reside side by side in the same scripts tree. Please note the respective -x or _x suffix before the file extension.

1. lib/mgizapp/scripts/merge_alignment_x.py
2. lib/mosesdecoder/scripts/training/train-model-x.perl
3. lib/mosesdecoder/scripts/training/reduce_combine_x.pl
4. lib/mosesdecoder/scripts/training/mert-moses-x.pl
5. lib/mosesdecoder/scripts/training/LexicalTranslationModelX.pm
6. lib/mosesdecoder/scripts/training/giza2bal-x.pl
7. lib/mosesdecoder/scripts/training/flexibility_score_x.py
8. lib/mosesdecoder/scripts/training/filter-rule-table-x.py
9. lib/mosesdecoder/scripts/training/filter-model-given-input-x.pl
10. lib/mosesdecoder/scripts/generic/score-parallel-x.perl
11. lib/mosesdecoder/scripts/generic/moses_sim_pe_x.py
12. lib/mosesdecoder/scripts/generic/extract-parallel-x.perl

Later this week, we will add these files to trunk. I invite the community's scrutiny on both the Posix and Windows runtimes. These updates required some heavy-duty clean-up and are a significant departure from the previous scripts, so we took the liberty of adding some new features as well. Overall, however, the scripts are intended to run identically to the existing Posix-only versions. I'm happy to discuss our strategy/design with anyone who is interested.

UPDATES:

1. Gracefully exit from forked processes. The script now catches a
   subprocess failure and shuts down ASAP. Typically, this means the
   other forks are terminated as well. A failure in step 2 (GIZA)
   therefore terminates the run during step 2, and users won't have to
   trace problems that reveal themselves in step 5 back to step 2.
2. Temp file space. The --temp-dir argument is now fully operational in
   all steps that create temp files. By default, train-model.perl uses
   Perl to create a subfolder under the system temp folder. Perl then
   cleans up that folder when the context terminates, unless there's a
   crash, in which case it leaves the folder there (nice touch for
   troubleshooting). The user can manually define a --temp-dir from the
   command line. These manually-defined folders are not deleted, since
   users who define their own will probably want to troubleshoot
   problems. So, when update (1) above reports out-of-drive-space
   errors, users can simply mount a new drive and point to it with
   --temp-dir.
3. All forked processing happens in child processes. The script spins
   off child processes and the parent process simply sits idle and
   waits. This keeps the code easier to manage than doing the work in
   children in some places and in a parent/children mix in others.
4. The scripts have copies of many housekeeping functions (subs), such
   as new_path() for path concatenation, makedirs(), removedirs(),
   normalize_path(), and others.
5. Those familiar with Windows environments will notice that we
   append ".exe" to each Windows binary name even though Windows does
   not require this. The objective is to minimize tech support
   requests. If we do not rely on the .exe extension, it is possible
   for the Windows environment to launch a .bat, .cmd, or other file
   instead of the binary. Granted, it's probably an edge case, but you
   never know in the Windows world.
6. Full support for relative data paths from the command prompt. The
   path normalization uses Perl's rel2abs() and Python's
   os.path.abspath() for every file path on the disk system, whether
   typed in by a user or read from a config file. Relative paths are
   always resolved against the user's current working directory (CWD).
7. The path concatenation always normalizes path strings to use the
   current host's OS separator. Although Perl and Python (maybe C?) can
   tolerate forward slashes in Windows paths, we normalized them. This
   is necessary for command-line arguments passed to launch new
   subprocesses. If we do it there, it's just as easy to do it
   everywhere.
8. One interesting update. The train-model.perl script never checked if
   the corpus.f and corpus.e files were present on the file system. We
   added that and checks for many other input files, too.
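
As an illustration of updates (1) and (3), here is a minimal sketch of the "parent waits, first failure stops the siblings" pattern. It is written in Python with subprocess rather than Perl's fork, and the function name run_all_or_abort and the polling interval are my own assumptions, not code taken from the new scripts.

```python
import subprocess
import sys
import time


def run_all_or_abort(commands, poll_interval=0.05):
    """Run commands concurrently; on the first failure, stop the rest.

    Hypothetical helper for illustration only.
    Returns 0 if every child succeeded, else the first non-zero exit code seen.
    """
    children = [subprocess.Popen(cmd) for cmd in commands]
    try:
        while children:
            for child in list(children):
                code = child.poll()
                if code is None:
                    continue              # still running
                children.remove(child)
                if code != 0:
                    return code           # the finally block stops the siblings
            time.sleep(poll_interval)     # the parent only waits, it does no work
        return 0
    finally:
        # Anything still in the list is a sibling of a failed child.
        for child in children:
            if child.poll() is None:
                child.terminate()
            child.wait()
```

A caller would pass one command list per parallel job and abort the whole training step if the result is non-zero, so the failure surfaces in the step where it happened.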

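The temp folder policy in update (2) can be sketched roughly as follows. This is a Python approximation under my own assumptions: the names make_work_dir and finish and the "train-model-" prefix are hypothetical, and the real scripts rely on Perl's own temp-folder cleanup rather than an explicit call.

```python
import os
import shutil
import tempfile


def make_work_dir(user_temp_dir=None):
    """Return (path, auto_created) for the working temp folder.

    Hypothetical helper: a user-supplied folder is created if needed and
    flagged so that it is never deleted automatically.
    """
    if user_temp_dir:
        path = os.path.abspath(user_temp_dir)
        os.makedirs(path, exist_ok=True)
        return path, False
    # No --temp-dir given: make a fresh subfolder under the system temp folder.
    return tempfile.mkdtemp(prefix="train-model-"), True


def finish(path, auto_created, succeeded):
    """Delete only folders we created ourselves, and only after a clean run."""
    if auto_created and succeeded:
        shutil.rmtree(path, ignore_errors=True)
```

After a crash, or whenever the user supplied the folder, the files stay on disk for troubleshooting.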

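For updates (5) through (7), a rough Python sketch of the path handling looks like this. The helper names to_absolute, join_native and with_exe are hypothetical stand-ins chosen for this example; they are not the actual subs in the scripts, whose signatures are not shown here.

```python
import os


def to_absolute(path):
    """Resolve a path against the current working directory.

    os.path.abspath() also normalizes the result, so on Windows any
    forward slashes come back as backslashes.
    """
    return os.path.abspath(path)


def join_native(*parts):
    """Concatenate path parts and normalize to the host OS separator."""
    return os.path.normpath(os.path.join(*parts))


def with_exe(name, platform=os.name):
    """Append ".exe" on Windows ("nt") unless it is already present."""
    if platform == "nt" and not name.lower().endswith(".exe"):
        return name + ".exe"
    return name
```

Naming the binary with an explicit extension means a same-named .bat or .cmd file earlier in the search path cannot be launched by mistake.
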
WHAT'S NEXT:

In a parallel effort, we have been working to update the binaries. Like the Perl scripts, the first step is to remove all "posix" shell calls except for the 8 dependencies listed below. This means native C/C++ code will replace calls to 'mkdir', 'rm', etc. Concurrently, we are also removing hard-coded path separators and concatenating all paths with native OS separators. Once the Posix dependencies are gone, we'll address specific import library support.

Again, I invite list members to test the new version. Verifying/validating parity performance on Linux is the first QC step. Thanks all!

Tom




These scripts require that the following dependencies (sorry, not listed by step) be present and executable, or they will fail to run.

A) Infrastructure support. We tested with the native Windows binaries from Gow and with the Cygwin versions. There are some bugs, but we need to have full binaries before we can fully debug their use. The system checks that these files are present in the path by executing them and reading their return codes.

1. gsplit[.exe] or split[.exe]
2. gsort[.exe] or sort[.exe]
3. pigz[.exe]  or gzip[.exe]
4. unpigz[.exe] or gunzip[.exe]
5. bzcat[.exe]
6. uniq[.exe]
7. zcat[.exe] (gzip[.exe])
8. cat or type
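
A simplified sketch of the "first alternative that exists wins" lookup for the list above. Note the swap: this Python version uses shutil.which() to search the path instead of executing each tool and reading its return code as the scripts do, and the names first_available and missing_tools are hypothetical.

```python
import shutil


def first_available(candidates):
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        found = shutil.which(name)
        if found:
            return found
    return None


def missing_tools(requirements):
    """Return {label: candidates} for every requirement with no match on PATH."""
    return {
        label: names
        for label, names in requirements.items()
        if first_available(names) is None
    }


# Example requirement table mirroring the first few entries above.
REQUIRED = {
    "split": ["gsplit", "split"],
    "sort": ["gsort", "sort"],
    "compress": ["pigz", "gzip"],
    "decompress": ["unpigz", "gunzip"],
}
```

Calling missing_tools(REQUIRED) before training starts gives one report of everything that is absent. A lookup like this cannot see shell built-ins such as the Windows "type" command, so that entry would need separate handling.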


B) External binaries & scripts, i.e. --external-bin-dir for (M)GIZA++. The system checks that these files are present by executing them at their expected installed locations and reading their return codes.

1. GIZA++[.exe] or mgiza[.exe] or mgizapp[.exe]
2. snt2cooc.out[.exe] or snt2cooc[.exe] (or snt2cooc.pl if configured)
3. mkcls[.exe]
4. merge_alignment_x.py

C) Moses binaries & scripts

1. giza2bal-x.pl
2. symal[.exe]
3. extract[.exe]
4. eppex[.exe]
5. extract-rules[.exe]
6. extract-ghkm[.exe]
7. extract-parallel-x.perl
8. score[.exe]
9. memscore[.exe]
10. consolidate[.exe]
11. score-parallel-x.perl
12. flexibility_score_x.py
13. mert[.exe]
14. extractor[.exe]
15. pro[.exe]
16. kbmira[.exe]
17. evaluator[.exe]
18. filter-model-given-input-x.pl
19. remove-segmentation-markup.perl
20. promix/main.py (not complete)
21. moses[.exe] or moses-cmd.exe
22. moses_sim_pe_x.py
23. NOTE: qsub-wrapper-x.pl and the other moses-parallel scripts are NOT covered


_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
