Re: The SENT_END challenge

Jaume Ortolà i Font Sat, 09 Aug 2014 02:31:27 -0700

Hi,

A simple solution is to write two rules, one for sentences
with ending punctuation:


    <pattern>
        <marker>
            <token regexp="yes">(you|thei|ou)r</token>
        </marker>
        <token regexp="yes">[.?!]</token>
    </pattern>

And another one for sentences without ending punctuation:

    <pattern>
        <marker>
            <token postag="SENT_END" regexp="yes">(you|thei|ou)r</token>
        </marker>
    </pattern>


They are in fact two different patterns, so it is logical to use two
different rules.
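
Applied to your TO_TOO rule, the same split might look like the sketch
below. I have not run it. Placing only "to" inside the marker (so that the
comma survives when the suggestion is applied) is my assumption, not
something taken from grammar.xml, and the examples are only illustrative.

```xml
<!-- Sketch only, untested: two rules meant for the existing TO_TOO group. -->

<!-- Case 1: "to" is followed by sentence-ending punctuation. -->
<rule>
    <pattern>
        <token>,</token>
        <marker>
            <token>to</token>
        </marker>
        <token regexp="yes">[.?!]</token>
    </pattern>
    <message>Did you mean <suggestion>too</suggestion>?</message>
    <short>Possible typo</short>
    <example correction="too" type="incorrect">I want water, <marker>to</marker>.</example>
    <example type="correct">That is the target we should aim to.</example>
</rule>

<!-- Case 2: no ending punctuation, so "to" itself carries SENT_END. -->
<rule>
    <pattern>
        <token>,</token>
        <marker>
            <token postag="SENT_END">to</token>
        </marker>
    </pattern>
    <message>Did you mean <suggestion>too</suggestion>?</message>
    <short>Possible typo</short>
    <example correction="too" type="incorrect">I am tired, <marker>to</marker></example>
    <example type="correct">I want to see, to hear, to feel</example>
</rule>
```

In the second rule the SENT_END condition sits on "to" itself, so a
following word such as "feel" can no longer trigger the match.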

Regards,
Jaume Ortolà




2014-08-08 15:21 GMT+02:00 Juan Martorell <juan.martor...@gmail.com>:

> Hi all,
>
> I've been struggling with some rules in Spanish involving words that
> cannot close a sentence. They work fine with one exception -- when the
> sentence does not end with punctuation.
>
> To illustrate the issue, I'll contribute a brand new rule in English.
>
> The token "to" cannot end a sentence after a comma; in that position it is
> very likely a typo for "too":
>
> I want to be astronaut, to.
> I am contributing, to.
>
> The rule I developed, to be included in the TO_TOO group (line 2040 as of
> today), is the following:
>
>  <rule>
>      <pattern>
>         <marker>
>            <token>,</token>
>            <token>to</token>
>         </marker>
>         <token postag="SENT_END"/>
>      </pattern>
>      <message>Did you mean <suggestion>too</suggestion>?</message>
>      <short>Possible typo</short>
>      <example correction="too" type="incorrect">I want water, <marker>to</marker>.</example>
>      <example type="correct">That is the target we should aim to.</example>
>      <example type="correct">Time to go to bed.</example>
>  </rule>
>
> The rule works correctly:
>
> juan@ubuntu:~/languagetool/lab$ java -jar
> ../dist/languagetool-commandline.jar -c UTF8 --language en
> Expected text language: English
> Working on STDIN...
> I am tired, to.
>
> 1.) Line 1, column 11, Rule ID: TO_TOO[6]
> Message: Did you mean 'too'?
> Suggestion: too
> I am tired, to.
>           ^^^^
>
> But if you omit the full stop, the trouble starts. The rule no longer
> matches, because the </S> tag now sits on "to" itself:
>
> juan@ubuntu:~/languagetool/lab$ java -jar
> ../dist/languagetool-commandline.jar -c UTF8 --language en -v
> Expected text language: English
> Working on STDIN...
> I am tired, to
>
> 1085 rules activated for language English
> <S> I[I/PRP,B-NP-singular|E-NP-singular] am[be/VBP,B-VP]
> tired[tire/VBN,I-VP],[,/,,O] to[to/IN,to/TO,</S>,O] <P/>
> Disambiguator log:
>
> VBN_VBD:1 tired[tired/JJ,tire/VBD,tire/VBN,I-VP] -> tired[tire/VBN,I-VP]
>
>
> And false positives arise like flowers in Spring:
>
> I want to see, to hear, to feel
>
> 1085 rules activated for language English
> <S> I[I/PRP,B-NP-singular|E-NP-singular] want[want/VBP,B-VP]
> to[to/IN,to/TO,I-VP] see[see/NN,see/VB,see/VBP,I-VP],[,/,,O]
> to[to/IN,to/TO,B-VP] hear[hear/VB,hear/VBP,I-VP],[,/,,O]
> to[to/IN,to/TO,B-VP] feel[feel/NN,feel/VB,feel/VBP,</S>,I-VP] <P/>
> Disambiguator log:
>
> SENT_START_PRP_VB_NN:1 want[want/NN:UN,want/VB,want/VBP,B-VP] ->
> want[want/VBP,B-VP]
>
> 1.) Line 7, column 23, Rule ID: TO_TOO[6]
> Message: Did you mean 'too'?
> Suggestion: too
> I want to see, to hear, to feel
>                       ^^^^
>
>
> So we have a problem. Does anyone have an idea how to address this
> challenge?
>
> Some ideas:
>
> There is a workaround in rule THE_SENT_END. It changes the last token node:
>
> <token postag="SENT_END" regexp="yes">\.|\?|!</token>
>
> This limits the false positives, but it also disables detection in titles,
> enumerations and limericks.
>
> There is another rule I wrote, mixing SENT_END with punctuation:
>
> <rule id="PPR_END" name="Possessive at end of sentence" type="typographical">
>     <pattern>
>         <marker>
>             <token regexp="yes">(you|thei|ou)r</token>
>         </marker>
>         <token postag="SENT_END" regexp="yes">\.|\?|!</token>
>     </pattern>
>     <message>Did you mean <suggestion>\1s</suggestion>?</message>
>     <short>Possible typo</short>
>     <example correction="yours" type="incorrect">The choice is <marker>your</marker>.</example>
>     <example correction="ours" type="incorrect">We brought <marker>our</marker>.</example>
>     <example type="correct">They brought their food.</example>
>     <example type="correct">Take your chances</example>
> </rule>
>
> This rule would also be affected.
>
> In my opinion, the key is to add an EOS token when none exists. At
> tokenizing time, if the delimiter is a double CR/LF, that should generate a
> separate token with lemma EOS and POS </S>. For instance:
>
> This is your
>
> <S> This[this/DT,B-NP-singular|E-NP-singular] is[be/VBZ,B-VP]
> your[your/PRP$,</S>,I-VP] <P/>
>
> would become
>
> <S> This[this/DT,B-NP-singular|E-NP-singular]
> is[be/VBZ,B-VP] your[your/PRP$,B-NP-singular|E-NP-singular]EOS[null,</S>,O]
> <P/>
>
> If you play around with this, you'll notice that it also affects
> disambiguation.
>
> Please note that I won't commit the rules, so that I don't mess up the
> grammar.xml file. I kindly ask the maintainer to make the commit.
>
> Best regards,
>
> Juan Martorell
>
>
>
> _______________________________________________
> Languagetool-devel mailing list
> Languagetool-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>
>
