joshua API changes

Matt Post Wed, 25 May 2016 11:09:03 -0700

Hi folks (especially Felix, Kellen, Tobi) — 

I made two moderate improvements to Joshua on the way home. The first was to 
get rid of all the specialized phrase handling; the packer now works as we 
discussed, packing everything into Hiero format, and the stack-based decoder 
uses this directly now. Everything should be backwards compatible for hiero, 
but it's not for phrase-based. I added a "version = 3" line to the packer 
config to distinguish this, along with a check, so the decoder will throw a 
runtime exception if you try to load something incompatible. If anything fails, 
instead of repacking your grammar, just add the line "version = 3" to the 
packer config. The changes only affect packing for phrase-based models, so I 
don't think it will matter to you. This is pushed up into master.
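
For illustration only, here is a minimal sketch of what a version check of 
this kind could look like. It is not the code on master: the class and method 
names (PackedGrammarVersionCheck, readVersion, checkVersion), the "key = value" 
parsing, and the choice to reject every version other than 3 are all 
assumptions.

```java
public class PackedGrammarVersionCheck {

    // Assumption: the decoder accepts exactly this packer version.
    static final int SUPPORTED_VERSION = 3;

    /** Returns the value of the "version" key, or -1 if the key is absent. */
    static int readVersion(String configText) {
        for (String line : configText.split("\n")) {
            String[] parts = line.split("=", 2);
            if (parts.length == 2 && parts[0].trim().equals("version")) {
                return Integer.parseInt(parts[1].trim());
            }
        }
        return -1;
    }

    /** Throws a RuntimeException when the config does not declare the supported version. */
    static void checkVersion(String configText) {
        int version = readVersion(configText);
        if (version != SUPPORTED_VERSION) {
            throw new RuntimeException(
                "Incompatible packed grammar: found version " + version
                + ", expected " + SUPPORTED_VERSION);
        }
    }

    public static void main(String[] args) {
        // A config that declares the version loads without complaint.
        checkVersion("version = 3\n");

        // A config without the line is rejected ("some-key" is a made-up key).
        try {
            checkVersion("some-key = 1\n");
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under these assumptions, adding the single line "version = 3" to an older 
config is enough to pass the check, which matches the workaround described 
above.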


The bigger one is on a JOSHUA-273 branch. I just pushed up a refactoring of the 
KBestExtraction / structured translation interface, per our discussions this 
week. However, I wasn't actually sure how to use the API. What is the entry 
point? Are you calling translate() directly and managing your own thread pools? 
It doesn't seem like you would be using Decoder.decode() or decodeAll(), since 
they're not very API-ish.

If you want to take a look at the changes, I'd welcome feedback, direct 
changes, etc. Here is a description of the major changes:

- Large refactor of the Translation output interface

- Instead of returning Translation objects, the calls to Decoder.translate() 
now return HyperGraph objects. As before, a HyperGraph represents the complete 
(pruned) search space the decoder explored. A HyperGraph can then be operated 
on by KBestExtractors and by the new TranslationFactory object, after which 
the HyperGraph itself can be thrown away.

- KBestExtractor is now an iterator that takes a HyperGraph object and returns 
DerivationState objects, each representing a single derivation tree.

- Translation and StructuredTranslation are now combined. Translation is 
effectively a dummy object with a number of fields of interest that get 
populated by TranslationFactory, per explicit requests. Each request returns 
the TranslationFactory object, so you can easily chain calls, and then retrieve 
the Translation object at the end. e.g.,

        KBestExtractor extractor = new KBestExtractor(hg, ...);
        for (DerivationState derivation: extractor) {
                TranslationFactory factory = new TranslationFactory(derivation, ...);
                Translation translation = factory.alignments()
                                                 .formattedTranslation(config.outputFormat)
                                                 .features()
                                                 .translation();
        }

- Neither KBestExtractors nor Translation objects do any printing, which is 
much better encapsulation than before. Once built, your Translation objects 
contain only small objects such as strings, feature vectors, and alignments, 
which can be safely passed downstream while the HyperGraph gets destroyed. 
Also, the code for processing and formatting now lives in one place, the 
TranslationFactory.

- I also removed the forest rescoring and OracleExtraction classes. They are 
potentially useful but currently unused and hard to read, so they should be 
rewritten. I will do this at some point.
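
To make the iterator-plus-chained-factory pattern described in the list above 
concrete, here is a self-contained toy sketch. None of these classes are 
Joshua's: ToyDerivation, ToyExtractor, ToyTranslation, and 
ToyTranslationFactory are hypothetical stand-ins, and the "extractor" simply 
sorts a list by score rather than walking a hypergraph lazily. The point is 
only the shape of the API: iterate over derivations, request the fields you 
want through chained calls, and keep a small result object.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class FluentFactorySketch {

    /** Hypothetical stand-in for a derivation: an output string and a score. */
    static class ToyDerivation {
        final String words;
        final double score;
        ToyDerivation(String words, double score) {
            this.words = words;
            this.score = score;
        }
    }

    /** Hypothetical stand-in for a k-best extractor: yields the k highest-scoring derivations. */
    static class ToyExtractor implements Iterable<ToyDerivation> {
        private final List<ToyDerivation> best;
        ToyExtractor(List<ToyDerivation> candidates, int k) {
            List<ToyDerivation> copy = new ArrayList<>(candidates);
            copy.sort((a, b) -> Double.compare(b.score, a.score)); // best first
            this.best = copy.subList(0, Math.min(k, copy.size()));
        }
        @Override
        public Iterator<ToyDerivation> iterator() {
            return best.iterator();
        }
    }

    /** Plain result holder; a field stays null unless it was explicitly requested. */
    static class ToyTranslation {
        String text;
        Double score;
    }

    /** Each request fills one field and returns the factory, so calls can be chained. */
    static class ToyTranslationFactory {
        private final ToyDerivation derivation;
        private final ToyTranslation result = new ToyTranslation();
        ToyTranslationFactory(ToyDerivation derivation) {
            this.derivation = derivation;
        }
        ToyTranslationFactory text() {
            result.text = derivation.words;
            return this;
        }
        ToyTranslationFactory score() {
            result.score = derivation.score;
            return this;
        }
        ToyTranslation translation() {
            return result;
        }
    }

    /** Builds small result objects; the derivations are not referenced afterwards. */
    static List<ToyTranslation> topK(List<ToyDerivation> candidates, int k) {
        List<ToyTranslation> out = new ArrayList<>();
        for (ToyDerivation derivation : new ToyExtractor(candidates, k)) {
            out.add(new ToyTranslationFactory(derivation).text().translation());
        }
        return out;
    }

    public static void main(String[] args) {
        List<ToyDerivation> candidates = Arrays.asList(
            new ToyDerivation("a b", -2.0),
            new ToyDerivation("b a", -1.0),
            new ToyDerivation("a a", -3.0));
        for (ToyTranslation t : topK(candidates, 2)) {
            System.out.println(t.text); // prints "b a" then "a b"
        }
    }
}
```

Because topK() only requests text(), the score field of each result stays 
null. That mirrors the "per explicit requests" idea: callers pay only for the 
fields they ask for, and nothing in the returned list refers back to the 
search structure.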

There are still a few things broken on the branch, but they are small and I am 
working to fix them. If you have a minute to poke around on the branch, please 
do, so that the end result is what you imagined when we were chatting the other 
day.

matt
