[aspell-devel] Tokenization of words containing hyphens

Ciarán Ó Duibhín Fri, 21 Jun 2013 04:31:19 -0700

This is the third and last part (change #3) of my consideration of apostrophes 
and hyphens in aspell.


Languages may have words containing an internal hyphen, but with the components 
not being themselves words of the language (a possible English example is 
hotch-potch).  In such languages it is well to allow a word-internal hyphen in 
*.dat and put such "compounds" in the dictionary.  No new code is required for 
this.  However, once the hyphen is a word character, every hyphenated 
compound not explicitly listed in the dictionary will be rejected, even if all 
its components are in the dictionary.  To avoid this, languages that allow an 
internal hyphen need new code that examines a rejected word and, if it contains 
an internal hyphen, checks the components separately.  If all the components 
are accepted, so is the compound.  The hyphen itself is not included in the 
separate components on either side of it.
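
As a rough, standalone illustration of the component check described above 
(not aspell code): the sketch below assumes the dictionary is a plain 
std::set of strings, and the names Dictionary and accept_hyphenated are 
hypothetical.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for the dictionary: a plain set of accepted words.
using Dictionary = std::set<std::string>;

// Accept a token if the dictionary lists it as a whole, or if every
// hyphen-separated component is itself in the dictionary.
bool accept_hyphenated(const std::string& token, const Dictionary& dict) {
  if (dict.count(token)) return true;   // whole-token match, e.g. "hotch-potch"
  std::size_t start = 0;
  while (true) {
    const std::size_t pos = token.find('-', start);
    const std::string part =
        (pos == std::string::npos) ? token.substr(start)
                                   : token.substr(start, pos - start);
    // The hyphen is not part of either component.  An empty component
    // (leading, trailing or doubled hyphen) is rejected.
    if (part.empty() || !dict.count(part)) return false;
    if (pos == std::string::npos) break;
    start = pos + 1;
  }
  return true;
}
```

With a dictionary holding spell, check and hotch-potch, this accepts 
spell-check and hotch-potch but rejects hotch-check, because hotch alone is 
not a word.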

There is something else we can do, when a hyphen is found in a token: we can 
check whether the component before AND INCLUDING the hyphen might be a known 
prefix; or whether the component after AND INCLUDING the hyphen might be a 
known suffix.  Thus the dictionary could be allowed to include prefixes 
(including a final hyphen) and suffixes (including an initial hyphen), and we 
can modify *.dat to allow this.  Code must be added to support matching of 
prefixes and suffixes, to be activated if *.dat allows initial/terminal hyphen, 
and when a rejected token contains an internal hyphen.
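
The order of tests can be sketched outside aspell as follows.  This is only an 
illustration under stated assumptions: the dictionary is a std::set, the 
HyphenRules struct is a hypothetical stand-in for the begin/middle/end 
settings of the hyphen in *.dat, and the entries "anti-" and "-like" used in 
the usage note are invented examples.

```cpp
#include <set>
#include <string>

using Dictionary = std::set<std::string>;  // hypothetical dictionary stand-in

// Hypothetical flags mirroring the begin/middle/end settings for '-'.
struct HyphenRules {
  bool begin;   // entries may start with '-' (suffixes)
  bool middle;  // '-' may occur inside a word
  bool end;     // entries may end with '-' (prefixes)
};

// Check a token as a whole; if that fails, try every internal hyphen as a
// split point: prefix + word, word + word, word + suffix.
bool check_token(const std::string& token, const Dictionary& dict,
                 const HyphenRules& r) {
  if (dict.count(token)) return true;
  for (std::size_t i = 1; i + 1 < token.size(); ++i) {
    if (token[i] != '-') continue;              // internal hyphens only
    const std::string head = token.substr(0, i);
    const std::string rest = token.substr(i + 1);
    // Part up to AND INCLUDING the hyphen as a prefix, remainder as a word.
    if (r.end && dict.count(head + "-") && check_token(rest, dict, r))
      return true;
    if (r.middle && dict.count(head)) {
      // Part before the hyphen as a word, remainder as a word ...
      if (check_token(rest, dict, r)) return true;
      // ... or remainder INCLUDING the hyphen as a suffix.
      if (r.begin && check_token("-" + rest, dict, r)) return true;
    }
  }
  return false;
}
```

With entries anti-, -like, war, cat, spell and check and all three flags set, 
this accepts anti-war (prefix + word), cat-like (word + suffix) and 
spell-check (word + word), but rejects anti-like.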

The extra code for processing a token containing an internal hyphen, after the 
token has been rejected as a whole, is positioned in 
modules/speller/default/speller_impl.cpp, in procedure SpellerImpl::check at 
around line 190.  The new code is placed before the checking for two words run 
together without a space, though this may not be the best place for it.  NOTE 
that I don't understand the purpose of parameters 3-6 to procedure check, or 
the corresponding parameters to procedure check2, and probably have not used 
them correctly.  But the concept is shown to work.

Here is the additional code:

    unsigned i=0;
    while (*(word+i) != 0) {
      if ((i > 0) && (i < word_end-word-1) && (*(word+i)=='-')) {
        if (lang_->special('-').end) {
          /* test up to hyphen as prefix, test remainder recursively as word */
          char t = *(word+i+1);
          *(word+i+1) = (char) 0;
          if (check2(word, try_uppercase, *ci, gi)) {
            *(word+i+1) = t;
            if (check(word+i+1, word_end, try_uppercase,
                      run_together_limit, ci, gi))
              return true;
          }
          else
            *(word+i+1) = t;
        }
        if (lang_->special('-').middle) {
          /* test up to hyphen as word, test remainder recursively as word,
             then as suffix */
          *(word+i) = (char) 0;
          if (check2(word, try_uppercase, *ci, gi)) {
            *(word+i) = '-';
            if (check(word+i+1, word_end, try_uppercase,
                      run_together_limit, ci, gi))
              return true;
            else {
              if (lang_->special('-').begin) {
                if (check(word+i, word_end, try_uppercase,
                          run_together_limit, ci, gi))
                  return true;
              }
            }
          }
          else
            *(word+i) = '-';
        }
      }
      ++i;
    }

For this code to work as intended, change #2 is also necessary.  Consider the 
token spell-check .  We must test whether the dictionary contains a prefix 
spell- or a suffix -check, or the plain words spell and check.  We would expect 
to find no such prefix or suffix, but to find the two plain words.  But unless 
change #2 is made, the token spell- will be accepted as matching the dictionary 
form spell, and the process will end prematurely, albeit with the right result 
in this case.

As before, my experiments have been conducted using the Hatier port of aspell 
for Windows at http://www.niversoft.com/downloads/aspell-0.60.5-msvc.tar.bz2 .  
The changes suggested in these three messages have been made to this source and 
compiled using VC++ 2005.  On the evidence so far, the changes appear to be 
working as intended, thereby solving the three problems I reported to 
aspell-user on 19 May 2013, and allowing aspell to treat the tokenization of 
apostrophes and hyphens in a similar way to the MS Word spell-checker.  As far 
as I can see, no existing functionality is adversely affected by these changes.

Ciarán Ó Duibhín

_______________________________________________
Aspell-devel mailing list
Aspell-devel@gnu.org
https://lists.gnu.org/mailman/listinfo/aspell-devel
