The t1x.bin files produced by apertium-preprocess-transfer can easily get very large; see https://sourceforge.net/p/apertium/tickets/125/
As a simple example,

    <def-cat n="det_mod">
      <cat-item tags="prn.pers.sg.gen.@→N"/>
      <cat-item tags="prn.pers.pl.gen.@→N"/>
      <cat-item tags="prn.pers.sg.gen.@→A"/>
      <cat-item tags="prn.pers.pl.gen.@→A"/>
      <cat-item tags="prn.pers.sg.gen.ind.@→N"/>
      <cat-item tags="prn.pers.pl.gen.ind.@→N"/>
      <cat-item tags="prn.pers.sg.gen.ind.@→A"/>
      <cat-item tags="prn.pers.pl.gen.ind.@→A"/>
      <cat-item tags="prn.ref.gen.@→N"/>
      <cat-item tags="prn.ref.gen.@→A"/>
      <cat-item tags="prn.ind.@→N"/>
      <cat-item tags="prn.ind.@→A"/>
    </def-cat>
    <rule>
      <pattern>
        <pattern-item n="det_mod"/>
        <pattern-item n="det_mod"/>
        <pattern-item n="det_mod"/>
        <pattern-item n="det_mod"/>
        <pattern-item n="det_mod"/>
      </pattern>
      <action>
      </action>
    </rule>

makes the bin at least 4M large. (The sme-nob t1x.bin is now at 15M with "real" rules, giving 3.5s startup time on my Linux machine and 9s on Trond's Mac.)

I see from transfer_data.cc that pattern-items are compiled into an lttoolbox transducer, but the transducer is not minimised before writing the t1x.bin.

Minimising it "solves" the space/startup cost, but now most of the input I give it gets "^default<default>" (i.e. no rule match), or I get obviously wrong rule indexes and things like

    Error in apertium-sme-nob.sme-nob.t1x: line 5264: index >= limit

I was thinking that minimising a transducer should always give a semantically equal transducer, but not in this case. I'm guessing the reason minimising gives wrong behaviour is lines 28 and 31 in trx_reader.cc, which record that the final state corresponds to this rule count:

    td.getTransducer().setFinal(*it);
    …
    td.getFinals()[*it] = count;

So transfer actually numbers the final states, interpreting the nth final as pointing to the nth rule. E.g. we've parsed 33 rules from t1x, then on seeing the next one, the pattern items get turned into a new FST matching, say, "^(prn.ind|prn.ref.gen)$ ?^(prn.ind|prn.ref.gen)$" where the state after the last "$" is, say, 102, and we record that finals[102]=34.
But after minimisation, state numbers change, and we end up with just one final state, so this hack no longer works. Wouldn't it be better to have a special symbol at the end of the pattern, e.g. ɛ:RULE-34? (This would be the only symbol with an epsilon on the left, so you know it's special.) Then we could safely call transducer.minimize(). Or are there better solutions?
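To make the problem and the proposed fix concrete, here is a minimal, self-contained sketch. It does not use lttoolbox: Dfa, add_rule, minimise, rule_for, count_finals and build are all hypothetical names for a toy deterministic acceptor, and the two patterns ("det n" and "adj n") are made up. As a simplification, the marker is an ordinary arc label "RULE-<n>" rather than a real ɛ:RULE-n symbol pair.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy deterministic acceptor; NOT the lttoolbox Transducer class.
struct Dfa {
  std::vector<std::map<std::string, int>> arcs;  // arcs[state][label] = target
  std::vector<bool> is_final;
  Dfa() { add_state(); }  // state 0 is the start state
  int add_state() {
    arcs.emplace_back();
    is_final.push_back(false);
    return static_cast<int>(arcs.size()) - 1;
  }
  // Follow `label` from `state`, creating the target state if needed.
  int step(int state, const std::string& label) {
    auto it = arcs[state].find(label);
    if (it != arcs[state].end()) return it->second;
    int target = add_state();
    arcs[state][label] = target;
    return target;
  }
};

// Without a marker, the rule number lives only in the final state's identity.
// With a marker, an extra "RULE-<n>" arc makes it part of the language.
void add_rule(Dfa& d, const std::vector<std::string>& pattern, int rule,
              bool use_marker) {
  assert(rule > 0);
  int s = 0;
  for (const auto& item : pattern) s = d.step(s, item);
  if (use_marker) s = d.step(s, "RULE-" + std::to_string(rule));
  d.is_final[s] = true;
}

// Moore-style partition refinement, followed by building the quotient DFA.
Dfa minimise(const Dfa& d) {
  const std::size_t n = d.arcs.size();
  std::vector<int> cls(n);
  for (std::size_t s = 0; s < n; ++s) cls[s] = d.is_final[s] ? 1 : 0;
  std::size_t count = 0;
  while (true) {
    // Signature of a state: its class plus the class reached by each label.
    std::map<std::pair<int, std::map<std::string, int>>, int> ids;
    std::vector<int> next(n);
    for (std::size_t s = 0; s < n; ++s) {
      std::map<std::string, int> out;
      for (const auto& arc : d.arcs[s]) out[arc.first] = cls[arc.second];
      auto key = std::make_pair(cls[s], out);
      next[s] = ids.emplace(key, static_cast<int>(ids.size())).first->second;
    }
    cls = next;
    if (ids.size() == count) break;  // no class was split: partition is stable
    count = ids.size();
  }
  Dfa m;  // state 0 is visited first, so the start state keeps class 0
  while (m.arcs.size() < count) m.add_state();
  for (std::size_t s = 0; s < n; ++s) {
    m.is_final[cls[s]] = d.is_final[s];
    for (const auto& arc : d.arcs[s]) m.arcs[cls[s]][arc.first] = cls[arc.second];
  }
  return m;
}

// Walk the pattern from the start state and read the rule number off the
// outgoing "RULE-<n>" arc; -1 if the pattern fails or no marker is present.
int rule_for(const Dfa& d, const std::vector<std::string>& pattern) {
  int s = 0;
  for (const auto& item : pattern) {
    auto it = d.arcs[s].find(item);
    if (it == d.arcs[s].end()) return -1;
    s = it->second;
  }
  for (const auto& arc : d.arcs[s])
    if (arc.first.rfind("RULE-", 0) == 0) return std::stoi(arc.first.substr(5));
  return -1;
}

int count_finals(const Dfa& d) {
  return static_cast<int>(std::count(d.is_final.begin(), d.is_final.end(), true));
}

// Two made-up rules: "det n" is rule 1, "adj n" is rule 2.
Dfa build(bool use_marker) {
  Dfa d;
  add_rule(d, {"det", "n"}, 1, use_marker);
  add_rule(d, {"adj", "n"}, 2, use_marker);
  return d;
}
```

On this toy input, minimise(build(false)) merges both final states (and the two states before them), so nothing is left to tell rule 1 from rule 2. minimise(build(true)) also ends up with a single final state, but the states just before the marker arcs stay distinct because their outgoing labels differ, and rule_for can still read the rule number off the arc.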
_______________________________________________ Apertium-stuff mailing list Apertium-stuff@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/apertium-stuff