Aaron Rubin <[email protected]> writes:
> Hi all,
>
> I've spoken with Francis a few times over e-mail and been on IRC a bit, but
> I don't think I've introduced myself to the whole listhost. I'm a third-year
> student at the University of Chicago, majoring in linguistics with a minor
> in Comp Sci. Most of my programming experience is doing various analyses of
> text files in C, so it seemed that of all the project ideas, the lint tester
> for suspicious constructs in .dix files would be the best for me (I thought
> about proposing a Japanese-English language pair, but Google does a fairly
> OK job with Japanese as it is, and there's no way I could surpass that in
> three months). I've already written a duplicate tag checker in C and sent it
> out to the listhost earlier today, and I've been thinking about how I'd
> implement some of the other suggestions on the lint tester ideas page, as
> well as a few ideas of my own. The problem, though, is that I'm not sure how
> I'd be able to fill up the whole summer doing it! This is my tentative
> schedule:
>
> Week 1: Redundant Entry Finder
> Week 2: Testing Full Entries in Lemmas where Part of the Lemma is Specified
>   by the Pardef
> Week 3: Testing Misspelled Tags and Pardefs
> Week 4: Testing Incompatible Tags (multiple gender tags instead of combined
>   tags for nouns of ambiguous gender, multiple number tags, a "noun" and
>   "adj" tag on the same entry)
> Week 5: Testing Tag Missing on One Side of Translation Equivalents (a "noun"
>   tag on the English side, but not on the Spanish side)
> Week 6: Testing Missing Gender on Gendered Languages (this would be an
>   intricate one... I'd have to investigate which of the languages in the
>   language pairs have gender or noun class systems and have the program take
>   that into account)
>
> But not all of those would necessarily take up a week, and there's no way
> that all of this will take 12 weeks! So I've been thinking about common
> errors that might show up in transfer rules files, but nothing's really come
> to mind. Has anyone else noticed common mistakes in .dix or transfer rules
> files that would be suitable for this kind of program to look for?

Say you're editing a transfer file that has

  <def-attr n="a_det">
    <attr-item tags="det"/>
    <attr-item tags="det.emph"/>
    <attr-item tags="det.dem"/>
    <attr-item tags="det.itg"/>
    <attr-item tags="det.qnt"/>
    <attr-item tags="det.pos"/>
  </def-attr>
  …
  <not>
    <equal>
      <clip pos="1" side="tl" part="a_det"/> <lit v=""/>
    </equal>
  </not>

(where an empty a_det means it's not a determiner at all) and you want to
make it a more specific requirement, like "it has to be the tag sequence
<det><pos>". It's easy to leave out the "-tag" and write

  <not>
    <equal>
      <clip pos="1" side="tl" part="a_det"/> <lit v="det.pos"/>
    </equal>
  </not>

where the correct version would be

  <not>
    <equal>
      <clip pos="1" side="tl" part="a_det"/> <lit-tag v="det.pos"/>
    </equal>
  </not>

or to mistype the tag as det.poss or something, which would never match since
it's not defined in a_det. Here you could give a warning if the user tests
for a def-attr-defined clip being anything other than 1) empty, 2) a
tag/tag sequence from the def-attr, or 3) a variable.

There are also default clip parts not defined in any def-attr, like "lemh",
"lemq" and "lem", that can be compared with empty or non-empty <lit>s, but
never with tags.

I guess you could also do the same for <begins-with> instead of <equal>.
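
To make that first check concrete, here is a rough sketch in Python of how it could look. Everything in it is illustrative: lint_equal_tests and LEMMA_PARTS are made-up names, it only looks at <equal> (for <begins-with> you would probably want a prefix comparison rather than set membership), and it assumes the transfer file parses as plain XML with the element names used above.

```python
import xml.etree.ElementTree as ET

# Clip parts this message names as built in; they hold lemma text, never tags.
LEMMA_PARTS = {"lem", "lemh", "lemq"}


def lint_equal_tests(xml_text):
    """Return warnings for suspicious <equal> comparisons (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # Map each def-attr name to the set of tag sequences it declares.
    attrs = {
        d.get("n"): {item.get("tags") for item in d.findall("attr-item")}
        for d in root.iter("def-attr")
    }
    warnings = []
    for eq in root.iter("equal"):
        kids = list(eq)
        clips = [k for k in kids if k.tag == "clip"]
        others = [k for k in kids if k.tag in ("lit", "lit-tag")]
        if len(clips) != 1 or len(others) != 1:
            continue  # clip vs. clip, clip vs. variable, etc.: nothing to check
        part = clips[0].get("part")
        other = others[0]
        value = other.get("v", "")
        if part in attrs:
            if other.tag == "lit" and value != "":
                warnings.append(
                    f'{part}: non-empty <lit v="{value}"/> compared with a '
                    "tag attribute; did you mean <lit-tag>?"
                )
            elif other.tag == "lit-tag" and value not in attrs[part]:
                warnings.append(
                    f'{part}: tag sequence "{value}" is not declared in the def-attr'
                )
        elif part in LEMMA_PARTS and other.tag == "lit-tag":
            warnings.append(f"{part}: lemma parts never contain tags")
    return warnings
```

Run over the examples above, this should stay quiet for <lit v=""/> and <lit-tag v="det.pos"/>, and warn for <lit v="det.pos"/> and for <lit-tag v="det.poss"/>.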

You could probably also warn about

  <in>
    <clip part="a_det"/>
    <list n="some-list-that-is-disjoint-from-a_det"/>
  </in>
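
A similarly rough Python sketch for the <in> case follows. The function name lint_disjoint_in is made up, and there is one big assumption: that def-list entries are <list-item v="..."/> elements whose values are written in the same dotted form as the attr-item tags. If real files spell tags differently inside lists, the values would need normalising before comparing.

```python
import xml.etree.ElementTree as ET


def lint_disjoint_in(xml_text):
    """Warn when an <in> test can never succeed (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # Tag sequences declared by each def-attr.
    attrs = {
        d.get("n"): {i.get("tags") for i in d.findall("attr-item")}
        for d in root.iter("def-attr")
    }
    # Values declared by each def-list (assumed to be <list-item v="..."/>).
    lists = {
        d.get("n"): {i.get("v") for i in d.findall("list-item")}
        for d in root.iter("def-list")
    }
    warnings = []
    for test in root.iter("in"):
        clip = test.find("clip")
        lst = test.find("list")
        if clip is None or lst is None:
            continue
        part, name = clip.get("part"), lst.get("n")
        # Only judge pairs where both sides are known; anything else is skipped.
        if part in attrs and name in lists and attrs[part].isdisjoint(lists[name]):
            warnings.append(
                f'<in>: list "{name}" shares no value with def-attr "{part}"'
            )
    return warnings
```

Clips of parts that have no def-attr (like "lem") are deliberately skipped here, since any lemma could be in the list.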

And then there's calling a macro with the wrong number of arguments; the
various VM-for-transfer compilers perform this check, but the standard one
does not, so it wouldn't hurt to put it in.
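
The macro check is probably the simplest of the three. A minimal Python sketch, assuming the declared count is the npar attribute on <def-macro> and that each argument of a <call-macro> is one <with-param> child (lint_macro_calls is a made-up name):

```python
import xml.etree.ElementTree as ET


def lint_macro_calls(xml_text):
    """Warn about macro calls whose argument count is wrong (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # Declared parameter count per macro name.
    npar = {m.get("n"): int(m.get("npar", "0")) for m in root.iter("def-macro")}
    warnings = []
    for call in root.iter("call-macro"):
        name = call.get("n")
        given = len(call.findall("with-param"))
        if name not in npar:
            warnings.append(f'call to undefined macro "{name}"')
        elif given != npar[name]:
            warnings.append(
                f'macro "{name}" expects {npar[name]} argument(s), got {given}'
            )
    return warnings
```

The undefined-macro warning comes for free once the table of definitions exists, so it seemed worth including even though it goes slightly beyond the argument count.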
-Kevin
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff