Re: Inflecting second token with postag from the first

Andriy Rysin Tue, 13 Sep 2016 20:05:19 -0700

I did a quick proof of concept (POC) that allows back references in match
lemmas. The (rough) patch is attached.


I must say I have mixed feelings about this: the logic in MatchState is
quite complicated. I've modified the code path that my rule exercises, and
the one covered by the new test I've added, but tracing and testing all
possible scenarios there will be quite time-consuming for a person who didn't
write the logic. On the other hand, \\[\d] currently leads to an infinite
loop, so it would be nice to do something about it.
If we decide not to implement this feature, I would say we should add a
sanity check: if there's \\[\d] inside <match>, we should throw an error
saying it's not supported.
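
A minimal sketch of what such a sanity check could look like, assuming it
runs on the text of a <match> element when the rule is loaded. The class
and method names (BackRefCheck, rejectBackRef) are hypothetical and not part
of LanguageTool; only the regex shape is taken from the attached patch.

```java
import java.util.regex.Pattern;

public class BackRefCheck {
  // Same shape as BACKSLASH_REF_PATTERN in the patch: a backslash followed by a digit 1-9.
  private static final Pattern BACKSLASH_REF = Pattern.compile("\\\\[1-9]");

  /** Hypothetical validator: rejects any <match> text that contains a back reference. */
  static void rejectBackRef(String matchText) {
    if (matchText != null && BACKSLASH_REF.matcher(matchText).find()) {
      throw new IllegalArgumentException(
          "Back references are not supported inside <match>: " + matchText);
    }
  }

  public static void main(String[] args) {
    rejectBackRef("plain lemma");   // passes silently
    try {
      rejectBackRef("\\2");         // backslash + digit: rejected
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Failing at load time would turn the current infinite loop into an immediate,
readable error for the rule author.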

Regards,
Andriy

P.S. Dominique is correct - the trick is that the resulting postag is also a
regex and the tags are not symmetrical. To be complete, though, I should have
modified my rule a bit so it would also catch the case where fname has extra
tags that lname does not, like this (without $2):
postag="(noun.*:m.*:)fname(.*)" postag_replace="$1lname.*"
The actual regexp you want depends heavily on the tags you have, though.
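
To illustrate why ".*" in the replacement is meaningful, here is a small
standalone sketch using plain java.util.regex rather than LanguageTool's
own classes. The POS tags in it are made up for illustration and the class
and method names are hypothetical; the two patterns are the ones quoted
above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PostagReplaceDemo {
  private static final Pattern FNAME_TAG = Pattern.compile("(noun.*:m.*:)fname(.*)");

  /** Builds the target POS-tag regex from a first-name tag, or returns null if the tag does not match. */
  static String targetTagRegex(String sourceTag) {
    Matcher m = FNAME_TAG.matcher(sourceTag);
    // "." and "*" are literal in a replacement string, so the result itself ends in ".*".
    return m.matches() ? m.replaceFirst("$1lname.*") : null;
  }

  public static void main(String[] args) {
    // Made-up tag: the first name carries an extra trailing tag that $2 would have copied over.
    String regex = targetTagRegex("noun:m:v_dav:fname:xp1");
    System.out.println(regex);
    // The result is used as a regex, so a last name with different extra tags still matches.
    System.out.println("noun:m:v_dav:lname:extra".matches(regex));
    // A last-name reading in a different case does not match.
    System.out.println("noun:m:v_naz:lname".matches(regex));
  }
}
```

The replacement output is treated as a pattern for selecting the target
reading, which is why the trailing ".*" tolerates tags on lname that fname
does not have.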

2016-09-13 18:05 GMT-04:00 Dominique Pellé <dominique.pe...@gmail.com>:

> Jesper wrote:
>
> > It looks very strange to me to include ".*" in a replacement expression.
>
> I understand that it looks strange.  But in some cases, the result of
> replacement is a regexp. That's why regexp syntax can appear inside the
> regexp_replace="....".
>
> I see other examples in:
>
> - the Polish grammar.xml in rule "RYMY":
>
> <match no="0" regexp_match=".*([aeuóiyęą][^aeuóiyęą]+[aeuóiyęą]+[^aeuóiyęą]*)"
> regexp_replace=".*$1"></match>
>
> - the Breton grammar.xml in rule  KLANV_PE_GLANVOCH:
>
> <match no="0" regexp_match="^.(.*)" regexp_replace="(k|c’h)$1oc’h"/>
> <match no="0" regexp_match=".(.*)" regexp_replace="[tz]$1oc’h"/>
> <match no="0" regexp_match=".(.*)" regexp_replace="[pf]$1oc’h"/>
> <match no="0" regexp_match=".(.*)" regexp_replace="[gk]$1oc’h"/>
>
> I cannot tell whether the Polish one is OK, but the Breton replacements
> look OK - I wrote them :-)  - and the tests pass.
>
> Regards
> Dominique
>
>
> Jesper Hertel <jesper.her...@gmail.com> wrote:
>
> > Yes, but I think those would be included in the $2 which catches
> > everything after fname (.*).
> > It looks very strange to me to include ".*" in a replacement expression.
> >
> > But now I stated my observation so it is up to you if you want to go into
> > it.
> >
> > Best,
> > Jesper
> >
> >
> > 2016-09-13 23:21 GMT+02:00 Andriy Rysin <ary...@gmail.com>:
> >>
> >> There are some cases where last name would have extra tags.
> >>
> >> Regards,
> >> Andriy
> >>
> >>
> >> On Sep 13, 2016 5:05 PM, "Jesper Hertel" <jesper.her...@gmail.com> wrote:
> >>>
> >>> Hi Andriy,
> >>>
> >>> As a beginner in LanguageTool I know almost nothing about this, but I
> >>> do have a few decades of experience in regular expressions, and the .*
> >>> looks strange to me in a replacement expression:
> >>>
> >>> postag_replace="$1lname$2.*"
> >>>
> >>> Are you sure it shouldn't simply be
> >>>
> >>> postag_replace="$1lname$2"
> >>>
> >>> ?
> >>>
> >>> Best,
> >>> Jesper
> >>>
> >>>
> >>>
> >>> 2016-09-13 22:27 GMT+02:00 Andriy Rysin <ary...@gmail.com>:
> >>>>
> >>>> Sorry if this is already written somewhere - I looked at wiki pages
> >>>> but could not find anything relevant.
> >>>>
> >>>> I have two tokens (first name and last name) and in the suggestion I
> >>>> want to inflect second token the same as the first. I tried to do
> >>>> this:
> >>>>
> >>>> <suggestion><match no="1" postag_regexp="yes"
> >>>> postag="(noun.*:m.*:)fname(.*)"
> >>>> postag_replace="$1lname$2.*">\2</match></suggestion>
> >>>>
> >>>> but it sends the tests into 100% CPU loop and I don't have access to
> >>>> my Eclipse to try to debug this.
> >>>>
> >>>> Is there a right way to do this? If not does it make sense to look why
> >>>> we deadloop with logic above and try to fix it?
> >>>>
> >>>> Thanks
> >>>> Andriy
> >>>>
> >>>> P.S. I have similar logic for token inflection agreement in Java rules
> >>>> but it's pretty heavy and this was a simple case I thought I could do
> >>>> in xml
> >>>>
> >>>>
> >>>> ------------------------------------------------------------
> ------------------
> >>>>
> >>>> _______________________________________________
> >>>> Languagetool-devel mailing list
> >>>> Languagetool-devel@lists.sourceforge.net
> >>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
> >>>>

diff --git a/languagetool-core/src/main/java/org/languagetool/rules/patterns/MatchState.java b/languagetool-core/src/main/java/org/languagetool/rules/patterns/MatchState.java
index bcebf8a..41b19c7 100644
--- a/languagetool-core/src/main/java/org/languagetool/rules/patterns/MatchState.java
+++ b/languagetool-core/src/main/java/org/languagetool/rules/patterns/MatchState.java
@@ -45,6 +45,7 @@ import static org.languagetool.JLanguageTool.SENTENCE_START_TAGNAME;
  * @since 2.3
  */
 public class MatchState {
+  private static final Pattern BACKSLASH_REF_PATTERN = Pattern.compile("\\\\[1-9]");
 
   private final Match match;
   private final Synthesizer synthesizer;
@@ -79,6 +80,9 @@ public class MatchState {
    * @param next Position of the next token (the skipped tokens are the ones between the tokens[index] and tokens[next]
    */
   public final void setToken(AnalyzedTokenReadings[] tokens, int index, int next) {
+    
+    replaceBackRef(tokens, index);
+    
     int idx = index;
     if (index >= tokens.length) {
       // TODO: hacky workaround, find a proper solution. See EnglishPatternRuleTest.testBug()
@@ -102,6 +106,28 @@ public class MatchState {
     } else {
       skippedTokens = "";
     }
+
+  }
+
+  private void replaceBackRef(AnalyzedTokenReadings[] tokens, int index) {
+    String lemma = match.getLemma();
+    if( formattedToken != null && lemma != null && BACKSLASH_REF_PATTERN.matcher(lemma).matches() ) {
+      String refNumStr = lemma.substring(1, 2);
+      int refNum = Integer.parseInt(refNumStr);
+
+      //TODO: validate
+      if( refNum == match.getTokenRef() + 1 )
+        throw new IllegalArgumentException("Circular backref in the match " + (match.getTokenRef() + 1));
+
+      int backRefIndex = index + refNum - match.getTokenRef() - 1;
+      if( backRefIndex < 0 || backRefIndex >= tokens.length )
+        throw new IllegalArgumentException("Invalid backref in the match " + (match.getTokenRef() + 1) + ": \\" + refNumStr);
+      
+      // TODO: how do we treat multiple lemmas???
+      String newLemma = tokens[backRefIndex].getAnalyzedToken(0).getLemma();
+      
+      formattedToken = new AnalyzedTokenReadings(new AnalyzedToken(lemma, match.getPosTag(), newLemma), 0);
+    }
   }
 
   public final AnalyzedTokenReadings filterReadings() {
@@ -385,4 +411,6 @@ public class MatchState {
   public Match getMatch() {
     return match;
   }
+  
+  
 }
diff --git a/languagetool-language-modules/pl/src/test/java/org/languagetool/rules/pl/MatchTest.java b/languagetool-language-modules/pl/src/test/java/org/languagetool/rules/pl/MatchTest.java
index 8565b64..89e4378 100644
--- a/languagetool-language-modules/pl/src/test/java/org/languagetool/rules/pl/MatchTest.java
+++ b/languagetool-language-modules/pl/src/test/java/org/languagetool/rules/pl/MatchTest.java
@@ -83,4 +83,22 @@ public class MatchTest {
     matchState = new MatchState(match, polish.getSynthesizer());
     assertEquals("[ASEAN-u]", Arrays.toString(matchState.toFinalString(polish)));
   }
+  
+  @Test
+  public void testBackRefsInMatch() throws Exception {
+    //tests with synthesizer
+    Match match = getMatch("^(.*)$", "subst:sg:gen:m3", true);
+    match.setLemmaString("\\2");
+    Polish polish = new Polish();
+    
+    MatchState matchState = new MatchState(match, polish.getSynthesizer());
+    
+    matchState.setToken(new AnalyzedTokenReadings[] {
+        getAnalyzedTokenReadings("AON", "subst:sg:acc.nom:m3", "AON"),
+        getAnalyzedTokenReadings("\\2", "subst:sg:acc.nom:m3", "\\2"), 
+        getAnalyzedTokenReadings("AON", "subst:sg:acc.nom:m3", "AON")
+    }, 1, 2);
+
+    assertEquals("[AON-u]", Arrays.toString(matchState.toFinalString(polish)));
+  }
 }
