Hi,

A simple solution is to write two rules. One for sentences with ending punctuation:

<pattern>
  <marker>
    <token regexp="yes">(you|thei|ou)r</token>
  </marker>
  <token regexp="yes">[.?!]</token>
</pattern>

And another one for sentences without ending punctuation:

<pattern>
  <marker>
    <token postag="SENT_END" regexp="yes">(you|thei|ou)r</token>
  </marker>
</pattern>

They are in fact two different patterns, so it is logical to use two different rules.

Regards,
Jaume Ortolà

2014-08-08 15:21 GMT+02:00 Juan Martorell <juan.martor...@gmail.com>:

> Hi all,
>
> I've been struggling with some rules in Spanish involving words that
> cannot close a sentence. They work fine with one exception -- when the
> sentence does not end with punctuation.
>
> To illustrate the issue, I'll contribute a brand new rule in English.
>
> The token "to" cannot end a sentence after a comma; it is very likely a
> typo and should be "too":
>
> I want to be astronaut, to.
> I am contributing, to.
>
> The rule I developed, to be included in the TO_TOO group (line 2040 as of
> today), is the following:
>
> <rule>
>   <pattern>
>     <marker>
>       <token>,</token>
>       <token>to</token>
>     </marker>
>     <token postag="SENT_END"/>
>   </pattern>
>   <message>Did you mean <suggestion>too</suggestion>?</message>
>   <short>Possible typo</short>
>   <example correction="too" type="incorrect">I want water, <marker>to</marker>.</example>
>   <example type="correct">That is the target we should aim to.</example>
>   <example type="correct">Time to go to bed.</example>
> </rule>
>
> The rule works correctly:
>
> juan@ubuntu:~/languagetool/lab$ java -jar ../dist/languagetool-commandline.jar -c UTF8 --language en
> Expected text language: English
> Working on STDIN...
> I am tired, to.
>
> 1.) Line 1, column 11, Rule ID: TO_TOO[6]
> Message: Did you mean 'too'?
> Suggestion: too
> I am tired, to.
>           ^^^^
>
> But if you omit the full stop, then comes the trouble:
>
> juan@ubuntu:~/languagetool/lab$ java -jar ../dist/languagetool-commandline.jar -c UTF8 --language en -v
> Expected text language: English
> Working on STDIN...
> I am tired, to
>
> 1085 rules activated for language English
> <S> I[I/PRP,B-NP-singular|E-NP-singular] am[be/VBP,B-VP]
> tired[tire/VBN,I-VP],[,/,,O] to[to/IN,to/TO,</S>,O] <P/>
> Disambiguator log:
>
> VBN_VBD:1 tired[tired/JJ,tire/VBD,tire/VBN,I-VP] -> tired[tire/VBN,I-VP]
>
>
> And false positives arise like flowers in Spring:
>
> I want to see, to hear, to feel
>
> 1085 rules activated for language English
> <S> I[I/PRP,B-NP-singular|E-NP-singular] want[want/VBP,B-VP]
> to[to/IN,to/TO,I-VP] see[see/NN,see/VB,see/VBP,I-VP],[,/,,O]
> to[to/IN,to/TO,B-VP] hear[hear/VB,hear/VBP,I-VP],[,/,,O]
> to[to/IN,to/TO,B-VP] feel[feel/NN,feel/VB,feel/VBP,</S>,I-VP] <P/>
> Disambiguator log:
>
> SENT_START_PRP_VB_NN:1 want[want/NN:UN,want/VB,want/VBP,B-VP] -> want[want/VBP,B-VP]
>
> 1.) Line 7, column 23, Rule ID: TO_TOO[6]
> Message: Did you mean 'too'?
> Suggestion: too
> I want to see, to hear, to feel
>                       ^^^^
>
>
> Thus we have a case. Does anyone have an idea on how to address this
> challenge?
>
> Some ideas:
>
> There is a workaround in rule THE_SENT_END. It changes the last token node:
>
> <token postag="SENT_END" regexp="yes">\.|\?|!</token>
>
> This limits the false positives but detection for titles, enumerations and
> limericks is disabled.
>
> There is another rule I brewed, mixing SENT_END with punctuation:
>
> <rule id="PPR_END" name="Possessive at end of sentence" type="typographical">
>   <pattern>
>     <marker>
>       <token regexp="yes">(you|thei|ou)r</token>
>     </marker>
>     <token postag="SENT_END" regexp="yes">\.|\?|!</token>
>   </pattern>
>   <message>Did you mean <suggestion>\1s</suggestion>?</message>
>   <short>Possible typo</short>
>   <example correction="yours" type="incorrect">The choice is <marker>your</marker>.</example>
>   <example correction="ours" type="incorrect">We brought <marker>our</marker>.</example>
>   <example type="correct">They brought their food.</example>
>   <example type="correct">Take your chances</example>
> </rule>
>
> This rule would also be affected.
>
> In my opinion, the key is to add an EOS token when it does not exist. At
> tokenizing time, if the delimiter is a double CR/LF, that should generate a
> separate token with lemma EOS and POS </S>. For instance:
>
> This is your
>
> <S> This[this/DT,B-NP-singular|E-NP-singular] is[be/VBZ,B-VP]
> your[your/PRP$,</S>,I-VP] <P/>
>
> would become
>
> <S> This[this/DT,B-NP-singular|E-NP-singular]
> is[be/VBZ,B-VP] your[your/PRP$,B-NP-singular|E-NP-singular]EOS[null,</S>,O]
> <P/>
>
> If you play around with this, you'll notice that it also affects
> disambiguation.
>
> Please note that I won't commit the rules so I don't mess up the
> grammar.xml file. I kindly ask the maintainer to perform such commit.
>
> Best regards,
>
> Juan Martorell
>
> _______________________________________________
> Languagetool-devel mailing list
> Languagetool-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
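
For reference, here is a rough, untested sketch of how the two-rule approach could be applied to the ", to" case from the quoted message. The rulegroup id and name are hypothetical (the original rule was meant to go inside the existing TO_TOO group), and the marker is narrowed to "to" alone on the assumption that the suggestion should replace only that word. It also assumes that a token with postag="SENT_END" and the text "to" matches a sentence-final "to" that carries the </S> tag, as the verbose output above suggests.

```xml
<!-- Sketch only: the id and name are hypothetical, not taken from grammar.xml. -->
<rulegroup id="COMMA_TO_SENT_END" name="', to' at the end of a sentence (too)">
  <!-- Case 1: the sentence ends with punctuation, so "to" is followed by it. -->
  <rule>
    <pattern>
      <token>,</token>
      <marker>
        <token>to</token>
      </marker>
      <token regexp="yes">[.?!]</token>
    </pattern>
    <message>Did you mean <suggestion>too</suggestion>?</message>
    <short>Possible typo</short>
    <example correction="too" type="incorrect">I want water, <marker>to</marker>.</example>
    <example type="correct">I want to see, to hear, to feel.</example>
  </rule>
  <!-- Case 2: no ending punctuation, so "to" itself is the SENT_END token. -->
  <rule>
    <pattern>
      <token>,</token>
      <marker>
        <token postag="SENT_END">to</token>
      </marker>
    </pattern>
    <message>Did you mean <suggestion>too</suggestion>?</message>
    <short>Possible typo</short>
    <example correction="too" type="incorrect">I am tired, <marker>to</marker></example>
    <example type="correct">I want to see, to hear, to feel</example>
  </rule>
</rulegroup>
```

Because neither pattern uses a bare <token postag="SENT_END"/> after "to", a sentence such as "I want to see, to hear, to feel" should no longer match: there the SENT_END tag sits on "feel", not on "to".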