Matt Post created JOSHUA-272:
--------------------------------
Summary: Simplify the packing and usage of phrase-based grammars
Key: JOSHUA-272
URL: https://issues.apache.org/jira/browse/JOSHUA-272
Project: Joshua
Issue Type: Improvement
Reporter: Matt Post
Assignee: Matt Post
Fix For: 6.1
For historical reasons, phrase-based grammars add some complexity to decoding.
The complete tree under each top-level trie node in a packed grammar has to fit
within a single packed grammar slice, which is limited to 2 GB because Java
byte[] arrays are indexed by int. We used to sort on just the first source-side
symbol in the trie, which was a problem for phrase-based decoding: phrase-based
rules are implemented as left-branching hierarchical rules, so every rule starts
with the same nonterminal, [X,1], and the whole grammar lands under a single
top-level node. In order to pack large grammars, we packed them without the
leading [X,1] and then added it back when loading the grammars, for both the
packed and memory-based grammars. This was a real mess.
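To make the old workaround concrete, here is a minimal sketch of re-adding the leading [X,1] at load time. It is not Joshua's actual loader code: the class and method names are hypothetical, and the "[LHS] ||| source ||| target ||| features" text format, as well as the assumption that the nonterminal is added to both the source and target sides, are illustrative.

```java
public class LeadingNonterminalSketch {

    /**
     * Hypothetical helper: turns a rule stored without the leading
     * nonterminal into a left-branching hierarchical rule.
     * Assumed format: "[LHS] ||| source ||| target ||| features".
     */
    static String addLeadingNonterminal(String rule) {
        // Split into at most four fields so the features stay in one piece.
        String[] fields = rule.split("\\s*\\|\\|\\|\\s*", 4);
        fields[1] = "[X,1] " + fields[1]; // source side
        fields[2] = "[X,1] " + fields[2]; // target side (assumed)
        return String.join(" ||| ", fields);
    }

    public static void main(String[] args) {
        String stored = "[X] ||| le chat ||| the cat ||| 0.5";
        // prints "[X] ||| [X,1] le chat ||| [X,1] the cat ||| 0.5"
        System.out.println(addLeadingNonterminal(stored));
    }
}
```

Every code path that reads rules has to remember to apply a step like this, which is the kind of special-casing this issue proposes to remove.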
This was all fixed with a commit a while ago that packs and reads packed
grammars based on the first two symbols on the source side. So we should remove
all the phrase-specific complexity and treat phrase-based rules as regular rules.
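The effect of keying on two source-side symbols instead of one can be sketched as follows. This is an illustration only, not the packer's real code: the class and method names are hypothetical, and the rule text format is assumed to be "[LHS] ||| source ||| target ||| features".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SliceKeySketch {

    /** Returns the first `depth` source-side symbols of a rule, space-joined. */
    static String sliceKey(String rule, int depth) {
        String[] fields = rule.split("\\s*\\|\\|\\|\\s*");
        String[] source = fields[1].trim().split("\\s+");
        int n = Math.min(depth, source.length);
        return String.join(" ", Arrays.copyOfRange(source, 0, n));
    }

    /** Groups rules by their slice key; each group must fit in one slice. */
    static Map<String, List<String>> group(List<String> rules, int depth) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String rule : rules) {
            groups.computeIfAbsent(sliceKey(rule, depth), k -> new ArrayList<>())
                  .add(rule);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> rules = Arrays.asList(
            "[X] ||| [X,1] le chat ||| [X,1] the cat ||| 0.5",
            "[X] ||| [X,1] le chien ||| [X,1] the dog ||| 0.4",
            "[X] ||| [X,1] maison ||| [X,1] house ||| 0.7");
        // One symbol: every phrase rule shares the key "[X,1]".
        System.out.println(group(rules, 1).size()); // prints 1
        // Two symbols: rules spread across "[X,1] le" and "[X,1] maison".
        System.out.println(group(rules, 2).size()); // prints 2
    }
}
```

With a one-symbol key, all phrase rules collapse into a single group that would have to fit within one slice; with a two-symbol key, the leading [X,1] can stay in the packed rules without forcing the whole grammar into one slice.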
There is also a lot of redundancy across the codebase in parsing rules,
converting them to different formats, and so on.