New feature: ignoring tokens in unification

Marcin Miłkowski Thu, 27 Feb 2014 08:19:06 -0800

Hi all,

I added a new feature to unification today: ignoring tokens. Think 
of punctuation, adverbs that have no gender or number, odd 
idiomatic expressions, or connectives. To silently add these to the 
unified sequence, wrap them in <unify-ignore>:


<unify>
    <feature id="gender"/>
    <feature id="number"/>
    <token>foo</token>
    <unify-ignore>
        <token>,</token>
    </unify-ignore>
    <token>foo</token>
</unify>


The comma will then be added without checking whether it agrees with 
the other tokens (frankly, it cannot, as commas are not inflected). One 
particularly interesting use for Polish -- and probably other inflected 
languages such as German -- is that it becomes relatively easy to 
identify noun groups as a sequence of agreeing tokens with a connective 
or a comma inside. In other words, this could be used to group tokens, 
not only to filter them (alas, we don't have code for adding chunks 
just yet).
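
As a rough, untested sketch of such a noun group: two adjectives joined 
by a comma or a connective, followed by a noun, all agreeing in gender 
and number. The postag values and the connective list below are only 
illustrative assumptions, not taken from an existing rule, and the 
"gender" and "number" features are assumed to be defined in the 
language's unification section:


<unify>
    <feature id="gender"/>
    <feature id="number"/>
    <!-- assumed tag pattern for adjectives -->
    <token postag="adj:.*" postag_regexp="yes"/>
    <unify-ignore>
        <!-- comma or connective: added without agreement checks -->
        <token regexp="yes">,|i|oraz</token>
    </unify-ignore>
    <token postag="adj:.*" postag_regexp="yes"/>
    <!-- assumed tag pattern for nouns -->
    <token postag="subst:.*" postag_regexp="yes"/>
</unify>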

I also found and fixed a small bug in the pattern rule code: the 
unification rules were broken if there were any tokens after </unify> in 
the pattern. Now they are matched correctly.
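
For illustration, a pattern of the affected shape might look like the 
sketch below; the postag values and the trailing token are made-up 
placeholders, not a real rule:


<pattern>
    <unify>
        <feature id="gender"/>
        <feature id="number"/>
        <token postag="adj:.*" postag_regexp="yes"/>
        <token postag="subst:.*" postag_regexp="yes"/>
    </unify>
    <!-- a token after </unify>: this is the case that used to break -->
    <token>foo</token>
</pattern>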

Regards,
Marcin

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
