I just deleted the "ParseException.java" file, and everything works fine now.
Thank you.

Gilles Moyse

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]]
Sent: Saturday, October 11, 2003 06:04
To: Lucene Users List
Subject: Re: StandardTokenizer Problem

Look at Lucene's build file (the one in CVS) and how it deals with this
situation. It does this:

    <target name="javacc-StandardAnalyzer" depends="init,javacc-check" if="javacc.present">
      <!-- generate this in a build directory so we can exclude ParseException -->
      <mkdir dir="${build.dir}/gen/org/apache/lucene/analysis/standard"/>
      <antcall target="invoke-javacc">
        <param name="target"
               location="src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj"/>
        <param name="output.dir"
               location="${build.dir}/gen/org/apache/lucene/analysis/standard"/>
      </antcall>
      <copy todir="src/java/org/apache/lucene/analysis/standard">
        <fileset dir="${build.dir}/gen/org/apache/lucene/analysis/standard">
          <include name="*.java"/>
          <exclude name="ParseException.java"/>
        </fileset>
      </copy>
    </target>

which ignores the ParseException.java that was generated.
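The reason excluding the generated file works is presumably that the ParseException.java kept in Lucene's source tree subclasses IOException, so the generated throws clause (ParseException, IOException) stays compatible with TokenStream.next(), which declares only IOException. Below is a minimal sketch of that idea, not a copy of Lucene's class: the package name is taken from the thread, and the constructor that the generated parser actually calls is omitted.

    package org.apache.lucene.analysis.standard;

    import java.io.IOException;

    // Sketch only: a ParseException that extends IOException keeps the
    // generated signature "throws ParseException, IOException" compatible
    // with TokenStream.next(), which declares only IOException. Lucene's
    // real class also defines the constructor the generated parser calls
    // (current token, expected token sequences, token images).
    public class ParseException extends IOException {

        public ParseException() {
            super();
        }

        public ParseException(String message) {
            super(message);
        }
    }

By the same reasoning, whatever ParseException the generated MICTokenizer ends up resolving to must be assignable to IOException, or it must be excluded from compilation the way the Ant target above does.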
On Friday, October 10, 2003, at 07:48 AM, MOYSE Gilles (Cetelem) wrote:

> Hi all.
>
> I need to define my own tokenizer so that it detects accented characters.
> So as not to modify the Lucene classes, I made a copy of
> StandardTokenizer.jj in another package, then changed the names
> (StandardTokenizer becomes MICTokenizer; MIC is the name of my
> application).
>
> After running JavaCC, I get the following error while compiling
> MICTokenizer.java:
>
>     Exception ParseException is not compatible with throws clause in
>     org.apache.lucene.analysis.TokenStream.next()
>
> The definition of the next() method in TokenStream is as follows:
>
>     abstract public Token next() throws IOException;
>
> whereas JavaCC generates the following next() method in
> MICTokenizer.java from MICTokenizer.jj:
>
>     final public org.apache.lucene.analysis.Token next() throws
>         ParseException, IOException
>
> The compiler seems right: the next() method in MICTokenizer.java does
> not have a compatible signature.
>
> But why is this error raised for my class MICTokenizer.java, generated
> from MICTokenizer.jj, while StandardTokenizer.java, generated from
> StandardTokenizer.jj, compiles fine with the same next() signature:
>
>     final public org.apache.lucene.analysis.Token next() throws
>         ParseException, IOException
>
> If I remove "ParseException" from the method signature, the compiler
> complains about the jj_consume_token() calls inside the method, which
> throw ParseException.
>
> Any help welcome.
>
> Thanks
>
> Gilles Moyse
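For reference, the compiler message quoted above reflects the standard Java overriding rule: an overriding method may not add checked exceptions that the overridden method does not declare; it may only keep them, drop them, or use subclasses of them. A minimal, self-contained illustration (the class names here are invented for the example):

    import java.io.IOException;

    // The supertype declares only IOException, exactly like TokenStream.next().
    abstract class BaseStream {
        abstract Object next() throws IOException;
    }

    class MyStream extends BaseStream {
        // An override may narrow the throws clause or throw subclasses of the
        // declared exceptions, but it may not add new checked exceptions.
        // Declaring "throws ParseException, IOException" here would compile
        // only if ParseException were a subclass of IOException (or unchecked).
        Object next() throws IOException {
            return null;
        }
    }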
> Here is a copy of my MICTokenizer.jj (for those interested, the Unicode
> breakdown of the accented letters is near the end of the grammar):
>
> /* ====================================================================
>  * The Apache Software License, Version 1.1
>  *
>  * Copyright (c) 2001 The Apache Software Foundation. All rights
>  * reserved.
>  *
>  * Redistribution and use in source and binary forms, with or without
>  * modification, are permitted provided that the following conditions
>  * are met:
>  *
>  * 1. Redistributions of source code must retain the above copyright
>  *    notice, this list of conditions and the following disclaimer.
>  *
>  * 2. Redistributions in binary form must reproduce the above copyright
>  *    notice, this list of conditions and the following disclaimer in
>  *    the documentation and/or other materials provided with the
>  *    distribution.
>  *
>  * 3. The end-user documentation included with the redistribution,
>  *    if any, must include the following acknowledgment:
>  *       "This product includes software developed by the
>  *        Apache Software Foundation (http://www.apache.org/)."
>  *    Alternately, this acknowledgment may appear in the software itself,
>  *    if and wherever such third-party acknowledgments normally appear.
>  *
>  * 4. The names "Apache" and "Apache Software Foundation" and
>  *    "Apache Lucene" must not be used to endorse or promote products
>  *    derived from this software without prior written permission. For
>  *    written permission, please contact [EMAIL PROTECTED]
>  *
>  * 5. Products derived from this software may not be called "Apache",
>  *    "Apache Lucene", nor may "Apache" appear in their name, without
>  *    prior written permission of the Apache Software Foundation.
>  *
>  * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
>  * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
>  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
>  * DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
>  * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
>  * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
>  * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
>  * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
>  * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
>  * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
>  * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
>  * SUCH DAMAGE.
>  * ====================================================================
>  *
>  * This software consists of voluntary contributions made by many
>  * individuals on behalf of the Apache Software Foundation. For more
>  * information on the Apache Software Foundation, please see
>  * <http://www.apache.org/>.
>  */
>
> options {
>   STATIC = false;
>   //IGNORE_CASE = true;
>   //BUILD_PARSER = false;
>   //UNICODE_INPUT = true;
>   USER_CHAR_STREAM = true;
>   OPTIMIZE_TOKEN_MANAGER = true;
>   //DEBUG_TOKEN_MANAGER = true;
> }
>
> PARSER_BEGIN(MICTokenizer)
>
> package com.cetelem.outildecisionnel.mic.analysis.tokenizers;
>
> import java.io.*;
>
> import org.apache.lucene.analysis.standard.FastCharStream;
>
> /** A grammar-based tokenizer constructed with JavaCC.
>  *
>  * <p>This should be a good tokenizer for most European-language documents.
>  *
>  * <p>Many applications have specific tokenizer needs. If this tokenizer
>  * does not suit your application, please consider copying this source code
>  * directory to your project and maintaining your own grammar-based
>  * tokenizer.
>  *
>  * <p>Now supports accents (returns a HAS_ACCENT token type) and integers
>  * (NUMBER token type).
>  */
> public class MICTokenizer extends org.apache.lucene.analysis.Tokenizer {
>
>   /** Constructs a tokenizer for this Reader. */
>   public MICTokenizer(Reader reader) {
>     this(new FastCharStream(reader));
>     this.input = reader;
>   }
> }
>
> PARSER_END(MICTokenizer)
>
> TOKEN : {                        // token patterns
>
>   // number
>   <NUMBER: (<DIGIT>)+ >
>
>   // at least one accented letter
> | <HAS_ACCENT:
>     (<LETTER>)*
>     <ACCENTUATED_LETTER>
>     (<LETTER>)* >
>
>   // basic word: a sequence of digits & letters
> | <ALPHANUM: (<LETTER>|<DIGIT>)+ >
>
>   // internal apostrophes: O'Reilly, you're, O'Reilly's
>   // use a post-filter to remove possessives
> | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
>
>   // acronyms: U.S.A., I.B.M., etc.
>   // use a post-filter to remove dots
> | <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
>
>   // company names like AT&T and Excite@Home
> | <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
>
>   // email addresses
> | <EMAIL: <ALPHANUM> "@" <ALPHANUM> ("." <ALPHANUM>)+ >
>
>   // hostname
> | <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
>
>   // floating point, serial, model numbers, ip addresses, etc.
>   // every other segment must have at least one digit
> | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
>        | <HAS_DIGIT> <P> <ALPHANUM>
>        | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>        | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>        | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>        | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>         )
>   >
>
> | <#P: ("_"|"-"|"/"|"."|",") >
>
>   // at least one digit
> | <#HAS_DIGIT:
>     (<LETTER>|<DIGIT>)*
>     <DIGIT>
>     (<LETTER>|<DIGIT>)* >
>
> | < #ALPHA: (<LETTER>)+ >
>
> | < #LETTER: (<NON_ACCENTUATED_LETTER>|<ACCENTUATED_LETTER>) >
>
> | < #NON_ACCENTUATED_LETTER:     // unicode letters
>       [
>        "\u0041"-"\u005a",        // upper case (A-Z)
>        "\u0061"-"\u007a",        // lower case (a-z)
>        "\u0100"-"\u1fff",        // the following letters may be considered
>                                  // accented, but they don't exist in Latin languages
>        "\u3040"-"\u318f",
>        "\u3300"-"\u337f",
>        "\u3400"-"\u3d2d",
>        "\u4e00"-"\u9fff",
>        "\uf900"-"\ufaff"
>       ]
>   >
>
> | < #ACCENTUATED_LETTER:         // unicode letters
>       [
>        "\u00c0"-"\u00c5",        // accented A
>        "\u00c6",                 // AE
>        "\u00c7",                 // C cedilla
>        "\u00c8"-"\u00cb",        // accented E
>        "\u00cc"-"\u00cf",        // accented I
>        "\u00d1",                 // N tilde
>        "\u00d2"-"\u00d6",        // accented O
>        "\u00d9"-"\u00dc",        // accented U
>        "\u00dd",                 // accented Y
>        "\u00e0"-"\u00e5",        // accented a
>        "\u00e6",                 // ae
>        "\u00e7",                 // c cedilla
>        "\u00e8"-"\u00eb",        // accented e
>        "\u00ec"-"\u00ef",        // accented i
>        "\u00f1",                 // n tilde
>        "\u00f2"-"\u00f6",        // accented o
>        "\u00f9"-"\u00fc",        // accented u
>        "\u00fd"-"\u00ff"         // accented y
>       ]
>   >
>
> | < #DIGIT:                      // unicode digits
>       [
>        "\u0030"-"\u0039",
>        "\u0660"-"\u0669",
>        "\u06f0"-"\u06f9",
>        "\u0966"-"\u096f",
>        "\u09e6"-"\u09ef",
>        "\u0a66"-"\u0a6f",
>        "\u0ae6"-"\u0aef",
>        "\u0b66"-"\u0b6f",
>        "\u0be7"-"\u0bef",
>        "\u0c66"-"\u0c6f",
>        "\u0ce6"-"\u0cef",
>        "\u0d66"-"\u0d6f",
>        "\u0e50"-"\u0e59",
>        "\u0ed0"-"\u0ed9",
>        "\u1040"-"\u1049"
>       ]
>   >
> }
>
> SKIP : {                         // skip unrecognized chars
>  <NOISE: ~[] >
> }
>
> /** Returns the next token in the stream, or null at EOS.
>  * <p>The returned token's type is set to an element of {@link
>  * MICTokenizerConstants#tokenImage}.
>  */
> org.apache.lucene.analysis.Token next() throws IOException :
> {
>   Token token = null;
> }
> {
>   ( token = <ALPHANUM> |
>     token = <APOSTROPHE> |
>     token = <ACRONYM> |
>     token = <COMPANY> |
>     token = <EMAIL> |
>     token = <HOST> |
>     token = <NUM> |
>     token = <NUMBER> |
>     token = <HAS_ACCENT> |
>     token = <EOF>
>    )
>   {
>     if (token.kind == EOF) {
>       return null;
>     } else {
>       return
>         new org.apache.lucene.analysis.Token(token.image,
>                                              token.beginColumn, token.endColumn,
>                                              tokenImage[token.kind]);
>     }
>   }
> }

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
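For completeness, here is a rough sketch of how a tokenizer like this is typically plugged into indexing. The MICAnalyzer class name is hypothetical and not part of the thread; the Analyzer signature assumed here is the one from the Lucene 1.x line discussed in the messages above.

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;

    import com.cetelem.outildecisionnel.mic.analysis.tokenizers.MICTokenizer;

    // Hypothetical glue code: exposes the generated tokenizer through
    // Lucene's Analyzer API so it can be handed to IndexWriter and
    // QueryParser.
    public class MICAnalyzer extends Analyzer {

        public TokenStream tokenStream(String fieldName, Reader reader) {
            // Post-filters (lower-casing, stripping possessives and dots,
            // as the grammar comments suggest) would be chained here.
            return new MICTokenizer(reader);
        }
    }

An index built with new IndexWriter("index", new MICAnalyzer(), true) would then run every analyzed field through this grammar.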
