Hi folks (especially Felix, Kellen, Tobi) —
I made two moderate improvements to Joshua on the way home. The first was to
get rid of all the specialized phrase handling; the packer now works as we
discussed, packing everything into Hiero format, and the stack-based decoder
uses this directly now. Everything should be backwards compatible for hiero,
but it's not for phrase-based. I added a "version = 3" line to the packer
config to distinguish this, along with a check, so the decoder will throw a
runtime exception if you try to load something incompatible. If an existing
Hiero grammar fails that check, you don't need to repack it; just add the line
"version = 3" to its packer config. The changes only affect packing for
phrase-based models, so I don't think it will matter to you. This is pushed up
to master.
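For anyone curious what that kind of check amounts to, here is a rough, self-contained sketch. It is not the code that is in master: the class and method names are made up for illustration, and it assumes the config is a list of "key = value" lines and that anything other than version 3 counts as incompatible.

```java
import java.util.List;

/** Hypothetical sketch of a packed-grammar version check; not Joshua's actual code. */
final class PackedGrammarVersionCheck {
    /** Version assumed to be required, taken from the "version = 3" line. */
    static final int SUPPORTED_VERSION = 3;

    /** Returns the value of the "version = N" line, or -1 if no such line exists. */
    static int readVersion(List<String> configLines) {
        for (String line : configLines) {
            String[] parts = line.split("=", 2);
            if (parts.length == 2 && parts[0].trim().equals("version")) {
                return Integer.parseInt(parts[1].trim());
            }
        }
        return -1;
    }

    /** Throws a RuntimeException when the version is missing or differs (assumed policy). */
    static void requireCompatible(List<String> configLines) {
        int version = readVersion(configLines);
        if (version != SUPPORTED_VERSION) {
            throw new RuntimeException("Incompatible packed grammar: found version "
                + version + ", expected " + SUPPORTED_VERSION);
        }
    }
}
```

A config without the line yields -1 from readVersion, so requireCompatible rejects it until the line is added.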
The bigger one is on a JOSHUA-273 branch. I just pushed up a refactoring of the
KBestExtraction / structured translation interface, per our discussions this
week. However, I wasn't actually sure how to use the API. What is the entry
point? Are you calling translate() directly and managing your own thread pools?
It doesn't seem like you would be using Decoder.decode() or decodeAll(), since
they're not very API-ish.
If you want to take a look at the changes, I'd welcome feedback, direct
changes, etc. Here is a description of the major changes:
- Large refactor of the Translation output interface
- Instead of returning Translation objects, the calls to Decoder.translate()
now return HyperGraph objects. As before, a HyperGraph represents the complete
(pruned) search space the decoder explored. A HyperGraph can then be operated
on by KBestExtractors and by the new TranslationFactory object, after which
the HyperGraph itself can be thrown away.
- KBestExtractor is now an iterator that takes a HyperGraph object and returns
DerivationState objects, each representing a single derivation tree.
- Translation and StructuredTranslation are now combined. Translation is
effectively a dummy object with a number of fields of interest that get
populated by TranslationFactory, per explicit requests. Each request returns
the TranslationFactory object, so you can easily chain calls, and then retrieve
the Translation object at the end. e.g.,
    KBestExtractor extractor = new KBestExtractor(hg, ...);
    for (DerivationState derivation : extractor) {
      TranslationFactory factory = new TranslationFactory(derivation, ...);
      Translation translation = factory.alignments()
          .formattedTranslation(config.outputFormat)
          .features()
          .translation();
    }
- Neither KBestExtractors nor Translation objects do any printing. This
encapsulation is a big improvement over the old code. Once built, your
Translation objects contain only small objects such as strings, feature
vectors, and alignments, which can be safely passed downstream while the
HyperGraph gets destroyed. Also, the code for processing and formatting now
lives in one place, the TranslationFactory.
- Also, I removed the forest rescoring and OracleExtraction classes. These are
useful in principle, but they are currently unused and hard to read, so they
should be rewritten. I will do this at some point.
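To make the iterate-then-chain pattern from the list above concrete, here is a minimal, self-contained sketch. Everything in it is a hypothetical stand-in (the Sketch* names, the constructor arguments, the "%s" format handling); it does not reproduce the classes on the branch. It only illustrates the shape: an iterable extractor, a factory whose request methods return itself, and a plain result object that holds no reference back to the search space.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Stand-in for one derivation (hypothetical; not the real DerivationState). */
final class SketchDerivation {
    final String words;
    final double score;
    SketchDerivation(String words, double score) { this.words = words; this.score = score; }
}

/** Plain data holder: a field stays null until it is requested from the factory. */
final class SketchTranslation {
    String text;
    Map<String, Double> features;
}

/** Fluent factory: each request returns this; translation() hands back the result. */
final class SketchTranslationFactory {
    private final SketchDerivation derivation;
    private final SketchTranslation result = new SketchTranslation();

    SketchTranslationFactory(SketchDerivation derivation) { this.derivation = derivation; }

    SketchTranslationFactory formattedTranslation(String format) {
        // Assumed format handling: the literal "%s" is replaced by the output string.
        result.text = format.replace("%s", derivation.words);
        return this;
    }

    SketchTranslationFactory features() {
        result.features = new LinkedHashMap<>();
        result.features.put("score", derivation.score);
        return this;
    }

    SketchTranslation translation() { return result; }
}

/** Iterable k-best list; sorts a given list by score instead of doing lazy extraction. */
final class SketchKBestExtractor implements Iterable<SketchDerivation> {
    private final List<SketchDerivation> derivations;
    private final int k;

    SketchKBestExtractor(List<SketchDerivation> searchSpace, int k) {
        this.derivations = new ArrayList<>(searchSpace);
        this.derivations.sort((a, b) -> Double.compare(b.score, a.score)); // best first
        this.k = k;
    }

    @Override
    public Iterator<SketchDerivation> iterator() {
        return derivations.subList(0, Math.min(k, derivations.size())).iterator();
    }
}

final class SketchDemo {
    /** Builds small result objects; the search space can be dropped once this returns. */
    static List<SketchTranslation> extract(List<SketchDerivation> searchSpace, int k) {
        List<SketchTranslation> out = new ArrayList<>();
        for (SketchDerivation derivation : new SketchKBestExtractor(searchSpace, k)) {
            out.add(new SketchTranslationFactory(derivation)
                .formattedTranslation("%s")
                .features()
                .translation());
        }
        return out;
    }
}
```

Since nothing here prints, the caller decides what to do with the returned objects, and only the fields that were explicitly requested get populated (alignments are left out of this sketch entirely).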
There are still a few things broken on the branch, but they are small and I am
working to fix them. If you have a minute to poke around on the branch, please
do, so that the end result is what you imagined when we were chatting the other
day.
matt