Thanks, Hieu.

Here are the two scripts. Add them to the contrib (or other) folder as you see fit. They have a -h switch. The moses2-to-ini.py script copies everything from moses.ini v2 format into a traditional INI file format. The ini-2-moses2.py script restores the moses.ini v2 format.

They change the line order in the [feature] and [weight] sections. The overall order of section blocks can change. They preserve the order of lines/values inside the old v1 moses.ini format sections, like [input-factors], [mapping], etc. Please report any errors.

There are two limits. First, the FF name= attribute is not optional because relying on sequential order can be risky. Users should declare the name to match the weight value key. Second, they do not support use cases where a non-tuneable FF has no weights. If you place dummy weights, like you did with UnknownWordPenalty, everything is ok.
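
A minimal sketch of the first limit, assuming feature lines are whitespace-separated key=value attributes after the FF type. The helper name pair_features_with_weights and the sample path are hypothetical and are not taken from the attached scripts.

```python
def pair_features_with_weights(feature_lines, weight_lines):
    """Group each feature line's attributes under its name= value
    and attach the matching [weight] entry."""
    weights = {}
    for line in weight_lines:
        key, value = line.split('=', 1)
        weights[key.strip()] = value.strip()

    sections = {}
    for line in feature_lines:
        parts = line.split()
        attrs = dict(part.split('=', 1) for part in parts[1:])
        name = attrs.pop('name', None)
        # Without an explicit name there is no safe way to find the weights.
        if name is None or name not in weights:
            raise ValueError('feature needs a name= matching a [weight] key: %s' % line)
        attrs['feature'] = parts[0]
        attrs['weight'] = weights[name]
        sections[name] = attrs
    return sections


# Hypothetical sample values
sections = pair_features_with_weights(
    ['KENLM name=LM0 factor=0 path=/models/file1.lm'],
    ['LM0= 0.5'],
)
print(sections)
```

A line such as "KENLM path=/models/file1.lm" without name= raises ValueError in this sketch, which is the case the first limit rules out.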

There's one additional feature. Users can use the --escape-prefix command-line option to escape a prefix path in FF "path=" attribute values. This will be helpful in moving models. I noticed that the clone_moses_model.pl script has not been updated to support the new moses.ini file format. Maybe someone would like to use this as a starting point.
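
A rough sketch of the prefix escaping idea, assuming the placeholder string is the same '%(escape-prefix)s' used in the scripts. The function names escape_path and unescape_path and the directory names are hypothetical.

```python
ROOT_ESCAPE = '%(escape-prefix)s'  # placeholder text written in place of the prefix


def escape_path(value, prefix):
    """Replace a leading model directory with the placeholder."""
    if prefix and value.startswith(prefix):
        return ROOT_ESCAPE + value[len(prefix):]
    return value


def unescape_path(value, prefix):
    """Put a (possibly different) model directory back in place of the placeholder."""
    if prefix and value.startswith(ROOT_ESCAPE):
        return prefix + value[len(ROOT_ESCAPE):]
    return value


# Hypothetical paths: escape under the old location, restore under the new one
escaped = escape_path('/old/models/en-fr/phrase-table.gz', '/old/models')
restored = unescape_path(escaped, '/new/models')
print(escaped)
print(restored)
```

Escaping with the old location and unescaping with the new one is what makes this useful for moving a model.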

I find it easier/faster to manipulate values in a traditional INI file structure. Users can also import classes in the scripts into other modules. They have a pretty simple API. I thought others might find these useful.
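
As an illustration of editing the INI form, here is a small sketch that uses Python 3's standard configparser module rather than the classes in the scripts. The INI text is a hypothetical example in the shape described above, with one section per named feature.

```python
import configparser

# Hypothetical INI text: one section per named feature/function
INI_TEXT = """
[LM0]
feature=KENLM
weight=0.5
factor=0

[TranslationModel0]
feature=PhraseDictionaryMemory
weight=0.2 0.2 0.2 0.2
num-features=4
"""

# RawConfigParser does no '%' interpolation, so an escaped path
# placeholder in a value would be left untouched.
parser = configparser.RawConfigParser()
parser.read_string(INI_TEXT)

# Change the first translation model weight and store the list back.
weights = [float(w) for w in parser.get('TranslationModel0', 'weight').split()]
weights[0] = 0.3
parser.set('TranslationModel0', 'weight', ' '.join(str(w) for w in weights))

new_weight = parser.get('TranslationModel0', 'weight')
print(new_weight)
```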

Tom


On 02/02/2015 08:56 PM, Hieu Hoang wrote:
Hi Tom

On 02/02/15 01:33, Tom Hoar wrote:
Much of the v2 moses.ini looks self-explanatory, but I'd like to confirm my understanding.

The website (http://www.statmt.org/moses/?n=Moses.FeatureFunctions) defines three feature/functions without arguments. In the moses.ini files made by train-model.perl's step 9, there also appears to be a 4th that requires no argument. Can someone confirm this is the case? Are there others that could appear without arguments?
Yes, PhrasePenalty is a standard feature function (FF); it doesn't need any arguments. It used to be the constant 2.718 in the last score in the phrase-table, but I thought that was silly, so I moved it to its own feature function.

    [feature]
    UnknownWordPenalty
    WordPenalty
    Distortion
    PhrasePenalty * - not listed on the website (are there more)

Feature/functions in the [feature] section and items in the [weight] sections appear to be linked. The feature/functions without arguments have corresponding entries linked by the same option name with an appended zero in the [weight] section. Since these feature/functions have no arguments, is it safe to say that they can appear only once in both the [feature] and [weight] sections?
You can have multiple feature functions of the same type. The most obvious one would be having multiple LMs, e.g.
   [feature]
   KENLM path=file1.lm
   KENLM path=file2.lm
Each instance of an FF must have a unique name. If you don't give it a name, the decoder will name it for you: KENLM0, KENLM1, .... Each instance must have the corresponding weights (unless it's non-tuneable).
   [weight]
   KENLM0= 0.4
   KENLM1= 0.6
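
The default naming can be sketched roughly as below. This is only an illustration of the TYPE0, TYPE1, ... pattern described above, not the decoder's actual code. The function name default_names is hypothetical, and counting only the unnamed instances is an assumption.

```python
from collections import defaultdict


def default_names(feature_lines):
    """Return one name per feature line, generating TYPE<n> when name= is absent."""
    counters = defaultdict(int)
    names = []
    for line in feature_lines:
        parts = line.split()
        ff_type = parts[0]
        attrs = dict(part.split('=', 1) for part in parts[1:])
        if 'name' in attrs:
            names.append(attrs['name'])
        else:
            # Assumption: only unnamed instances consume an index.
            names.append('%s%d' % (ff_type, counters[ff_type]))
            counters[ff_type] += 1
    return names


names = default_names(['KENLM path=file1.lm', 'KENLM path=file2.lm', 'WordPenalty'])
print(names)
```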

    [weight]
    UnknownWordPenalty0= 1
    WordPenalty0= -1
    Distortion0= 0.3
    PhrasePenalty0= 0.2

The feature/functions with arguments have corresponding entries linked by the "name=" argument as the option name in the [weight] section. Are there cases where there will be entries in the [feature] section without corresponding entries in the [weight] section or vice-versa?
Yes, if the FF is not tuneable, you don't need to give it weight(s). The hardcoded weights will be used. UnknownWordPenalty is a non-tuneable FF and its hardcoded weight is 1, so the line
   UnknownWordPenalty0= 1
isn't strictly necessary.

Whether a FF is non-tuneable or not by default is determined in the code. You can also set the non-tuneable property, eg
    [feature]
    UnknownWordPenalty tuneable=true


    [feature]
    PhraseDictionaryMemory name=*TranslationModel0* num-features=4 ...
    KENLM name=*LM0* factor=0 ...

    [weight]
*TranslationModel0*= 0.2 0.2 0.2 0.2
*LM0*= 0.5

The sections other than [feature] and [weight], such as [input-factors] and [mapping], appear to preserve the v1 moses.ini format. Is this true?
yep

The order of lines in the [feature] and [weight] sections is irrelevant (as many examples have them in different orders). Also, the order of the arguments on a feature/function line is irrelevant (examples show them in different orders).
yep. However, the relative order of the phrase-tables is relevant. The [mapping] section refers to the index of the phrase-table, eg
   [mapping]
   0 T 0
   1 T 1
so if you swap the order of the phrase-tables in the [feature] section, they will refer to different phrase-tables
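
A small sketch of why the order matters, assuming the last field of a "T" mapping line is an index into the phrase-tables in the order they appear in [feature]. The function names, feature names and paths are hypothetical.

```python
def phrase_table_names(feature_lines):
    """Collect phrase-table names in the order they appear in [feature]."""
    names = []
    for line in feature_lines:
        parts = line.split()
        # Assumption: phrase-table FF types start with 'PhraseDictionary'.
        if parts[0].startswith('PhraseDictionary'):
            attrs = dict(part.split('=', 1) for part in parts[1:])
            names.append(attrs['name'])
    return names


def resolve_mapping(mapping_lines, tables):
    """Look up the phrase-table each translation step refers to by index."""
    resolved = []
    for line in mapping_lines:
        fields = line.split()
        if fields[1] == 'T':
            resolved.append(tables[int(fields[2])])
    return resolved


features = [
    'PhraseDictionaryMemory name=tm-a path=/models/a.gz',
    'KENLM name=LM0 path=/models/file1.lm',
    'PhraseDictionaryMemory name=tm-b path=/models/b.gz',
]
mapping = ['0 T 0', '1 T 1']

original = resolve_mapping(mapping, phrase_table_names(features))
swapped = resolve_mapping(mapping, phrase_table_names(list(reversed(features))))
print(original)
print(swapped)
```

With the same [mapping] lines, reversing the feature lines makes each step resolve to the other phrase-table.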


Finally, is there a connection between the [input-factors] section's value and the input-factor argument value for PhraseDictionaryMemory and LexicalReordering feature/functions? Or, are the similar names and corresponding values only coincidental?
Your input sentence can contain multiple factors, e.g. surface|POS|lemma, in which case [input-factors] should be
  [input-factors]
  0
  1
  2
However, the phrase-table may choose to only use the surface form, in which case
  [feature]
  PhraseDictionaryMemory input-factor=0
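
To illustrate the factor indices, here is a minimal sketch that assumes tokens carry their factors separated by "|" as in the surface|POS|lemma example. The function name select_factors and the sample sentence are hypothetical.

```python
def select_factors(sentence, factors):
    """Keep only the requested factor positions of each token."""
    selected = []
    for token in sentence.split():
        parts = token.split('|')
        selected.append('|'.join(parts[i] for i in factors))
    return selected


# Tokens in surface|POS|lemma form
sentence = 'the|DT|the houses|NNS|house'
surface_only = select_factors(sentence, [0])
surface_and_lemma = select_factors(sentence, [0, 2])
print(surface_only)
print(surface_and_lemma)
```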

My intention is to build two scripts and contribute these scripts to the Moses project. One will convert the v2 moses.ini file to a standard form (not associated with the command line syntax) so people can easily edit the values. The other will convert the interim form back to the native v2 moses.ini format.


_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support




#! /usr/bin/env python
# -*- coding: utf8 -*-


from __future__ import (
	absolute_import,
	print_function,
	unicode_literals,
	)

__version__ = '1.0'
__license__ = 'LGPL3'
__source__ = 'Precision Translation Tools Pte Lte'

import errno
from sys import stdout
from copy import deepcopy
from os.path import (
	dirname,
	basename,
	exists,
	realpath,
	)
from os import (
	sep,
	makedirs,
	)

root_escape = '%(escape-prefix)s'


class moses2_to_ini(object):


	def __init__(self, inp, out, escape_prefix):
		self.inp = inp
		self.out = out
		self.escape_prefix = escape_prefix
		self._config = {}


	def parse(self):

		content = ''
		key = ''
		section = None
		self._config = {}
		counter = 0

		with open(self.inp, 'rb' ) as f:
			contents = f.read().decode('utf8')

		lines = contents.splitlines()

		# retrieve all values except feature/functions with attributes
		for i, line in [(i, line.strip()) for i, line in enumerate(lines)
						if line.strip() and not line.strip().startswith('#')]:

			if line.startswith('[') and line.endswith(']'):

				section = line.strip('] [')

				if section not in list(self._config.keys()) + ['feature', 'weight']:
					# new section not in config and not a reserved section
					counter = 0
					key = section
					self._config[key] = {}

			elif section == 'feature' and line in ['UnknownWordPenalty',
								'WordPenalty', 'PhrasePenalty', 'Distortion']:
				# known feature/functions without attributes
				key = '%s0' % line
				if key not in self._config:
					self._config[key] = {}
				self._config[key]['feature'] = line

			elif section == 'feature':
				# skip feature/functions with arguments (handled in the second pass)
				continue

			elif section == 'weight':
				# add weight value to feature sections
				for key, value in [(key.strip(), value.strip())
									for key, value in [line.split('=', 1)]]:
					if key not in self._config:
						self._config[key] = {}
					self._config[key]['weight'] = value

			else:
				self._config[key][counter] = line
				counter += 1

			lines[i] = ''

		# second, match feature/functions attributes to [weight] section values
		for i, line in [(i, line.strip()) for i, line in enumerate(lines)
						if line.strip() and not line.strip().startswith('#')]: 

			# add "feature" to assist creating tmpdict for feature/functions
			line = 'feature=%s' % line
			tmpdict = dict([key.split('=',1) for key in line.split()])

			# feature/functions 'name' attribute must match an entry in [weight]
			if tmpdict.get('name') not in self._config:
				raise RuntimeError('malformed moses.ini v2 file')

			for key, value in [(key.strip(), value.strip()) for key, value 
								in tmpdict.items() if key.strip() != 'name']:

				self._config[tmpdict['name']][key] = value

		return deepcopy(self._config)


	def render(self, config):

		self._config = deepcopy(config)

		_config = deepcopy(config)

		lines = _tolines(_config, self.escape_prefix)

		if self.out == '-':

			stdout.write('\n'.join(lines))

		else:

			contents = '\r\n'.join(lines)

			makedir(dirname(self.out))

			with open(self.out, 'wb') as f:
				f.write(contents.encode('utf8'))


	def __str__(self):
		return '\n'.join(_tolines(self._config, self.escape_prefix))


	@property
	def config(self):
		return deepcopy(self._config)


def _tolines(config, escape_prefix):

	lines = []

	# group feature/functions first
	for sectionname in [sectionname for sectionname in sorted(config)
									if sectionname[-1] in '0123456789']:

		section = config[sectionname]

		lines.append('[%s]' % sectionname)

		for option, value in section.items():

			if option == 'path' \
					and escape_prefix is not None \
					and value.startswith(escape_prefix):

				value = value.replace(escape_prefix, root_escape, 1)

			lines.append('%s=%s' % (option, value))

		lines.append('')

	for sectionname in [sectionname for sectionname in sorted(config)
									if sectionname[-1] not in '0123456789']:

		section = config[sectionname]

		lines.append('[%s]' % sectionname)

		for option, value in section.items():

			lines.append('%s=%s' % (option, value))

		lines.append('')

	return deepcopy(lines)


def makedir(path, mode=0o777):
	try:
		makedirs(path, mode)
	except OSError as e:
		if e.errno not in [errno.EEXIST,
							errno.EPERM, errno.EACCES, errno.ENOENT]:
			raise


def get_args():
	'''Parse command-line arguments

	Uses the API compatibility between the legacy
	optparse.OptionParser and its replacement argparse.ArgumentParser
	for functional equivalency and a nearly identical help prompt.
	'''

	description = 'Convert Moses.ini v2 file to standard INI format'
	usage = '%s [arguments]' % basename(__file__)

	try:
		from argparse import ArgumentParser
	except ImportError:
		from optparse import OptionParser
		argparser = False
		escape_help = ('Optional. Path of SMT model. If provided, '
							'escapes \"escape-prefix\" with \"%(escape-prefix)s\"')
		parser = OptionParser(usage=usage, description=description)
		add_argument = parser.add_option
	else:
		argparser = True
		escape_help = ('Optional. Path of SMT model. If provided, '
							'escapes \"escape-prefix\" with \"%%(escape-prefix)s\"')
		parser = ArgumentParser(usage=usage, description=description)
		add_argument = parser.add_argument

	add_argument('-i','--inp', action='store',
			help='moses.ini v2 file to convert (required)')

	add_argument('-o','--out', action='store', default='-',
			help='standard INI file (default: "-" outputs to stdout)')

	add_argument('-r','--escape-prefix', action='store',
			help=escape_help)

	if argparser:

		args = vars(parser.parse_args())

	else:

		opts = parser.parse_args()
		args = vars(opts[0])

	if args['inp'] is None:
		parser.error('argument -i/--inp required')

	args['inp'] = realpath(args['inp'])

	if not exists(args['inp']):
		parser.error('argument -i/--inp invalid.\n'
										'reference: %s' % args['inp'])

	if args['out'] != '-':
		args['out'] = realpath(args['out'])

	return args


if __name__ == '__main__':

	args = get_args()

	converter = moses2_to_ini(**args)

	config = converter.parse()

	converter.render(config)
#! /usr/bin/env python
# -*- coding: utf8 -*-


from __future__ import (
	absolute_import,
	print_function,
	unicode_literals,
	)

__version__ = '1.0'
__license__ = 'LGPL3'
__source__ = 'Precision Translation Tools Pte Lte'

import errno
from sys import stdout
from copy import deepcopy
from os.path import (
	dirname,
	basename,
	exists,
	realpath,
	)
from os import (
	sep,
	makedirs,
	)

root_escape = '%(escape-prefix)s'


class ini_to_moses2(object):


	def __init__(self, inp, out, escape_prefix):
		self.inp = inp
		self.out = out
		self.escape_prefix = escape_prefix
		self._config = {}


	def parse(self):

		content = ''
		lines = []
		section = None
		self._config = {}

		with open(self.inp, 'rb' ) as f:
			contents = f.read().decode('utf8')

		lines = contents.splitlines()

		for line in [[line.strip(), ''] 
						if line.strip().startswith('[') else 
					[line.strip() for line in line.split('=', 1)] 
								for line in lines if '=' in line 
										or (line.strip().startswith('[') 
											and line.strip().endswith(']'))]:

			if line[0].startswith('[') and line[0].endswith(']') and line[1] == '':

				section = line[0].strip('] [')

				if self._config.get(section, None) is None:

					self._config[section] = {}

				continue

			elif section is None:
				# skips any junk values before the first section
				continue

			self._config[section][line[0].lower()] = line[1]

		return deepcopy(self._config)


	def render(self, config):

		self._config = deepcopy(config)

		_config = deepcopy(config)

		lines = _tolines(_config, self.escape_prefix)

		if self.out == '-':

			stdout.write('\n'.join(lines))

		else:

			contents = '\r\n'.join(lines)

			makedir(dirname(self.out))

			with open(self.out, 'wb') as f:
				f.write(contents.encode('utf8'))


	def __str__(self):
		return '\n'.join(_tolines(self._config, self.escape_prefix))


	@property
	def config(self):
		return deepcopy(self._config)


def _tolines(config, escape_prefix):

	lines = []
	features = ['[feature]']
	weights = ['[weight]']

	args = ('input-factors', 'mapping', 'distortion-limit',)
	features_noattrib = frozenset(['UnknownWordPenalty',
						'WordPenalty', 'PhrasePenalty', 'Distortion',])

	# emulate order of original moses.ini v2
	for arg in args:

		# a missing section is skipped instead of raising KeyError
		tmpdict = config.pop(arg, {})

		if tmpdict:

			lines.append('[%s]' % arg)

			for key in sorted(tmpdict.keys()):

				lines.append(tmpdict[key])

			lines.append('')

	for key in [key for key in sorted(config.keys())
					if config[key].get('feature') in features_noattrib]:

		tmpdict = config.pop(key, {})

		weight = tmpdict.pop('weight', None)

		if weight:
			weights.append('%s=%s' % (key, weight))

		feature = tmpdict.pop('feature', None)

		if feature:
			features.append(feature)

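	# remaining feature/functions with attributes; sorted() keeps the output
	# deterministic (e.g. TranslationModel0 before TranslationModel1)
	for key in [key for key in sorted(config.keys())
						if config[key].get('feature') is not None ]: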

		tmpdict = config.pop(key, {})

		weight = tmpdict.pop('weight', None)

		if weight:
			weights.append('%s=%s' % (key, weight))

		feature = tmpdict.pop('feature', None)

		if feature:

			feature = [feature, '='.join(['name', key])]

			for attrib in sorted(tmpdict.keys()):

				if attrib == 'path' \
						and escape_prefix is not None \
						and tmpdict[attrib].startswith(root_escape):

					tmpdict[attrib] = tmpdict[attrib].replace(
									root_escape, escape_prefix, 1)

				feature.append('='.join([attrib, tmpdict[attrib]]))

			features.append(' '.join(feature))

	lines.extend(features)
	lines.append('')

	lines.extend(weights)

	if config:

		lines.append('')

		for arg in sorted(config.keys()):

			tmpdict = config[arg]

			if tmpdict:

				lines.append('[%s]' % arg)

				for key in sorted(tmpdict.keys()):

					lines.append(tmpdict[key])

				lines.append('')

	return deepcopy(lines)


def makedir(path, mode=0o777):
	try:
		makedirs(path, mode)
	except OSError as e:
		if e.errno not in [errno.EEXIST,
							errno.EPERM, errno.EACCES, errno.ENOENT]:
			raise


def get_args():
	'''Parse command-line arguments

	Uses the API compatibility between the legacy
	optparse.OptionParser and its replacement argparse.ArgumentParser
	for functional equivalency and a nearly identical help prompt.
	'''

	description = 'Convert standard INI file to Moses.ini v2 format'
	usage = '%s [arguments]' % basename(__file__)

	try:
		from argparse import ArgumentParser
	except ImportError:
		from optparse import OptionParser
		argparser = False
		escape_help = ('Optional. Path of SMT model. If provided, '
					'unescapes \"%(escape-prefix)s\" with \"escape-prefix\"')
		parser = OptionParser(usage=usage, description=description)
		add_argument = parser.add_option
	else:
		argparser = True
		escape_help = ('Optional. Path of SMT model. If provided, '
					'unescapes \"%%(escape-prefix)s\" with \"escape-prefix\"')
		parser = ArgumentParser(usage=usage, description=description)
		add_argument = parser.add_argument

	add_argument('-o','--out', action='store', default='-',
			help='moses.ini v2 file (default: "-" outputs to stdout)')

	add_argument('-i','--inp', action='store',
			help='standard INI file to convert (required)')

	add_argument('-r','--escape-prefix', action='store',
			help=escape_help)

	if argparser:

		args = vars(parser.parse_args())

	else:

		opts = parser.parse_args()
		args = vars(opts[0])

	if args['inp'] is None:
		parser.error('argument -i/--inp required')

	args['inp'] = realpath(args['inp'])

	if not exists(args['inp']):
		parser.error('argument -i/--inp invalid.\n'
										'reference: %s' % args['inp'])

	if args['out'] != '-':
		args['out'] = realpath(args['out'])

	return args


if __name__ == '__main__':

	args = get_args()

	converter = ini_to_moses2(**args)

	config = converter.parse()

	converter.render(config)