The t1x.bin files produced by apertium-preprocess-transfer can easily get
very large; see https://sourceforge.net/p/apertium/tickets/125/

As a simple example,

    <def-cat n="det_mod">
      <cat-item tags="prn.pers.sg.gen.@→N"/>
      <cat-item tags="prn.pers.pl.gen.@→N"/>
      <cat-item tags="prn.pers.sg.gen.@→A"/>
      <cat-item tags="prn.pers.pl.gen.@→A"/>
      <cat-item tags="prn.pers.sg.gen.ind.@→N"/>
      <cat-item tags="prn.pers.pl.gen.ind.@→N"/>
      <cat-item tags="prn.pers.sg.gen.ind.@→A"/>
      <cat-item tags="prn.pers.pl.gen.ind.@→A"/>
      <cat-item tags="prn.ref.gen.@→N"/>
      <cat-item tags="prn.ref.gen.@→A"/>
      <cat-item tags="prn.ind.@→N"/>
      <cat-item tags="prn.ind.@→A"/>
    </def-cat>

    <rule>
      <pattern>
        <pattern-item n="det_mod"/>
        <pattern-item n="det_mod"/>
        <pattern-item n="det_mod"/>
        <pattern-item n="det_mod"/>
        <pattern-item n="det_mod"/>
      </pattern>
      <action>
      </action>
    </rule>

makes the bin at least 4M large. (The sme-nob t1x.bin is now at 15M with
"real" rules, giving 3.5s startup time on my Linux machine and 9s on
Trond's Mac.)

I see from transfer_data.cc that pattern-items are compiled into
an lttoolbox transducer, but the transducer is not minimised before
writing the t1x.bin. Minimising it "solves" the space/startup cost, but
now most of the input I give it gets "^default<default>" (i.e. no rule
match), or I get obviously wrong rule indexes and errors like

    Error in apertium-sme-nob.sme-nob.t1x: line 5264: index >= limit

I had assumed that minimising a transducer always gives a semantically
equivalent transducer, but apparently not in this case. I'm guessing the
reason minimising gives wrong behaviour is lines 28 and 31 in
trx_reader.cc, which record that the final state corresponds to the
current rule count:

          td.getTransducer().setFinal(*it);
          …
            td.getFinals()[*it] = count;

So transfer actually numbers the final states, interpreting the nth
final as pointing to the nth rule. E.g. we've parsed 33 rules from t1x,
then on seeing the next one, the pattern items get turned into a new FST
matching, say, "^(prn.ind|prn.ref.gen)$ ?^(prn.ind|prn.ref.gen)$" where
the state after the last "$" is, say, 102, and we record that
finals[102]=34. But after minimisation, state numbers change, and we end
up with just one final state, so the finals map can no longer tell the
rules apart and this hack stops working.
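
To make the failure concrete, here is a minimal sketch with a toy acceptor.
ToyFst, addPath, signature and finalClassesAfterMinimising are names I made
up for illustration; none of this is the lttoolbox API, and "minimising" is
only approximated by grouping final states that accept the same continuations
(which is what minimisation merges in an acyclic automaton).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy acceptor: arcs[state][label] = target state.
// The rule number is kept in a side table keyed by state number,
// mirroring the finals[state] = count idea described above.
struct ToyFst {
  std::vector<std::map<std::string, int> > arcs;
  std::map<int, int> finals;  // final state -> rule number

  ToyFst() : arcs(1) {}  // state 0 is the initial state

  // Insert one pattern as a path and mark its last state as final.
  int addPath(const std::vector<std::string>& labels, int rule) {
    assert(rule > 0);
    int s = 0;
    for (const std::string& label : labels) {
      if (arcs[s].count(label) == 0) {
        arcs.emplace_back();
        arcs[s][label] = static_cast<int>(arcs.size()) - 1;
      }
      s = arcs[s][label];
    }
    finals[s] = rule;
    return s;
  }

  // Canonical description of what can still be accepted from state s.
  // It sees finality and arc labels only, never the side table.
  std::string signature(int s) const {
    std::string sig = finals.count(s) ? "F(" : "N(";
    for (const auto& arc : arcs[s])
      sig += arc.first + ">" + signature(arc.second) + ";";
    return sig + ")";
  }

  // How many distinct final states survive once equivalent ones merge.
  std::size_t finalClassesAfterMinimising() const {
    std::set<std::string> classes;
    for (const auto& entry : finals) classes.insert(signature(entry.first));
    return classes.size();
  }
};
```

Adding {"prn.ind", "prn.ind"} as rule 33 and {"prn.ref.gen", "prn.ind"} as
rule 34 gives two final states with different numbers, but
finalClassesAfterMinimising() reports a single class: both finals look
identical to the minimiser, so the rule number stored per state is lost.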

Wouldn't it be better to have a special symbol at the end of the
pattern, e.g. ɛ:RULE-34 (this would be the only symbol with an epsilon
on the left, so you know it's special)? Then we could safely call
transducer.minimize().
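
A rough sketch of what I mean, again with a toy acceptor rather than real
lttoolbox code. MarkedFst, addRule and match are hypothetical names, the
marker is written as a plain "RULE-<n>" label instead of a proper ɛ:RULE-n
symbol pair, and the sketch assumes no ordinary label starts with "RULE-".
All rules end in one shared final state, which is the shape minimisation
would produce anyway.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy acceptor: the rule number lives on the label of the
// last arc, not in a state number, so renumbering or merging states
// cannot lose it.
struct MarkedFst {
  enum { kInitial = 0, kFinal = 1 };  // kFinal is the only final state
  std::vector<std::map<std::string, int> > arcs;

  MarkedFst() : arcs(2) {}

  // Insert the pattern as a path, then close it with a "RULE-<n>" arc.
  void addRule(const std::vector<std::string>& pattern, int rule) {
    assert(rule >= 0);
    int s = kInitial;
    for (const std::string& label : pattern) {
      if (arcs[s].count(label) == 0) {
        arcs.emplace_back();
        arcs[s][label] = static_cast<int>(arcs.size()) - 1;
      }
      s = arcs[s][label];
    }
    arcs[s]["RULE-" + std::to_string(rule)] = kFinal;
  }

  // Return the rule number for an exactly matching input, or -1.
  int match(const std::vector<std::string>& input) const {
    int s = kInitial;
    for (const std::string& label : input) {
      auto it = arcs[s].find(label);
      if (it == arcs[s].end()) return -1;
      s = it->second;
    }
    // The rule is read off the marker arc leading into the final state.
    for (const auto& arc : arcs[s]) {
      if (arc.first.compare(0, 5, "RULE-") == 0 && arc.second == kFinal)
        return std::stoi(arc.first.substr(5));
    }
    return -1;
  }
};
```

With rules 33 and 34 added for {"prn.ind", "prn.ind"} and
{"prn.ref.gen", "prn.ind"}, match() recovers 33 and 34 even though both paths
end in the same final state, and an input that stops short of a full pattern
yields -1.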

Or are there better solutions?
