Re: ignoring certain tokens in rules

Marcin Miłkowski Fri, 06 May 2016 05:05:25 -0700

On 06.05.2016 at 11:19, Jaume Ortolà i Font wrote:
> Hi,
>
> In fact, the problem is a bit more complicated than I expected because
> the disambiguation rules also need to ignore the tokens with quotation
> marks. So it would be necessary to add a lot of <token min="0"...>
> everywhere and it would probably be unmanageable.


The problem is that quotation marks can also mark the beginning of a
new nominal phrase, at least in Polish, where we don't have any articles
before substantives. It's also usual to have generic nouns, such as
'system', in phrases like this:

system "Super 10"

Here, "Super 10" is a proper name, and "system" is a generic noun (=
"Super 10 system" in English). We use the generic noun in inflected
form, and leave "Super 10" uninflected. This may also happen without
quotation marks, but then usually with an initial capital. With a
lowercase initial and without any italics or quotation marks (or camel
case), it may be very difficult to spot. But in general, in Polish it's
extremely rare to see two nouns in the same grammatical case; usually,
the first one requires the other to be in the genitive. So, removing the
quotation marks would require "Super" in "Super 10" to be in the
genitive ("Super" has the same form in all grammatical cases anyway).

So I imagine there may be quite a few problems with this.

In short: Sometimes quotation marks mark a phrase boundary, sometimes 
they don't... Still, if quotation marks are available as a token 
attribute, maybe this is not such a big problem.
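
To make the attribute idea concrete, here is a minimal sketch in Python.
It is not LanguageTool code: Token, left_quote, right_quote and
fold_quotes are hypothetical names that only mirror the proposed
leftQuotationMark/rightQuotationMark fields, and the set of marks is the
character class from the quoted message below, extended with two more
marks as an assumption. It also assumes the input is already tokenized
without whitespace.

```python
from dataclasses import dataclass

# Assumption: marks from the pattern in the thread, plus the closing
# guillemet and the low opening mark.
QUOTES = set('“‘”«"\'') | {"»", "„"}


@dataclass
class Token:
    text: str
    left_quote: str = ""   # mark directly before the word, if any
    right_quote: str = ""  # mark directly after the word, if any


def fold_quotes(words):
    """Drop quote-only tokens and record them on the neighbouring word."""
    tokens = []
    pending = ""    # opening mark waiting for the next word
    inside = False  # True between an opening and a closing mark
    for w in words:
        if w in QUOTES:
            if inside and tokens and not pending:
                # Closing mark: attach it to the previous word.
                tokens[-1].right_quote = w
                inside = False
            else:
                # Opening mark: remember it for the next word.
                pending = w
                inside = True
            continue
        tokens.append(Token(w, left_quote=pending))
        pending = ""
    # A trailing opening mark with no following word is silently dropped.
    return tokens
```

For the tokens of system "Super 10", this leaves three tokens (system,
Super, 10), with the opening mark kept on Super and the closing mark on
10, so a rule could still see that a quoted phrase starts right after
the generic noun.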

I also have other typographical rules that check whether quotation marks 
are used correctly, and they need access to the particular quotation 
marks used, so that would have to be dealt with.
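
As a rough illustration of what such a typographical rule would need,
the Python sketch below checks pairing from per-token attributes alone.
Everything in it is hypothetical: the (text, left_quote, right_quote)
triples stand in for the proposed token fields, and EXPECTED_CLOSE is an
illustrative table, not the one my rules use.

```python
# Hypothetical table: the closing mark each opening mark should get.
EXPECTED_CLOSE = {"„": "”", "«": "»", "“": "”", '"': '"'}


def mismatched_quotes(tokens):
    """Return (opening_word, closing_word) pairs whose marks do not match.

    tokens is a list of (text, left_quote, right_quote) triples.
    """
    problems = []
    stack = []  # (opening mark, word it is attached to)
    for text, left, right in tokens:
        if left:
            stack.append((left, text))
        if right and stack:
            opening, start = stack.pop()
            if EXPECTED_CLOSE.get(opening) != right:
                problems.append((start, text))
    return problems
```

The point is only that the exact characters have to survive on the
tokens; a plain yes/no flag for "quoted" would not be enough here.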

Regards,
Marcin

>
> A more general solution:
> - In AnalyzedSentence remove tokens containing quotation marks only
> in getTokensWithoutWhitespace().
> - Add two fields to AnalyzedTokenReadings: leftQuotationMark,
> rightQuotationMark, which contain the characters adjacent to the word
> (none, one side or both sides).
> - Run everything as usual with the new
> getTokensWithoutWhitespace (disambiguation, grammar rules, etc.).
> - Retrieve leftQuotationMark, rightQuotationMark when necessary, for
> example in suggestions.
>
> Possible difficulties:
> - GenericUnpairedBracketsRule must be modified accordingly.
> - Perhaps some grammar and disambiguation rules should know about the
> quotation marks and new attributes could be necessary (similar to
> spacebefore="yes/no").
> - Whitespace in French.
> - Other unexpected troubles.
>
> Do you think this is a good approach?
>
> I can try to implement it, but I am not really sure if it is worthwhile
> because the problems it solves are relatively rare.
>
> Regards,
> Jaume Ortolà
>
>
>
> 2016-05-05 16:22 GMT+02:00 Jaume Ortolà i Font <jaumeort...@gmail.com
> <mailto:jaumeort...@gmail.com>>:
>
>     Hi,
>
>     I think Marcin talked about this idea some time ago.
>
>     Sometimes tokens like quotation marks (or other characters) should be
>     ignored in some rules. That is, the sentence should be checked as if
>     the token were not present. Any idea how this could be implemented?
>
>     Alternatively, tokens like this one should be added to the patterns:
>
>     [“‘”«"']
>
>     I would need to modify a few dozen rules. But perhaps this is the
>     best solution: it gives more control over the rule, the
>     suggestions, possible false alarms, and so on. What do you think?
>
>     Regards,
>     Jaume Ortolà
>
>
>
>
>
> _______________________________________________
> Languagetool-devel mailing list
> Languagetool-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>


