RE: '-' character not interpreted correctly in field names

Eric Isakson Wed, 09 Jul 2003 10:23:26 -0700

You left out the ~ character in your _FIELDNAME_START_CHAR production. That character 
tells the grammar that it should take all the characters except the ones you specified 
(the complement).


Change:

| <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":", 

To:

| <#_FIELDNAME_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":", 

and it should probably work.

Eric

-----Original Message-----
From: Victor Hadianto [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 09, 2003 4:53 AM
To: Lucene Users List
Subject: Re: '-' character not interpreted correctly in field names


Hi Erik and others,

I'm looking for a similar solution where I need QueryParser not to drop the 
"-" characters from the field name. Hower outside the field I do want the - 
sign interpreted as "not" modifier. 

I'm definitely not an expert in JavaCC and to be honest I only have a limited 
idea about Erik's suggestion work,

Anyway I followed the suggestion and added the following:

| <#_WHITESPACE: ( " " | "\t" ) >
| <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":", 
| "^",
                               "[", "]", "\"", "{", "}", "~", "*", "?" ]
                             | <_ESCAPED_CHAR> ) >
| <#_FIELDNAME_CHAR: ( <_FIELDNAME_START_CHAR> | <_ESCAPED_CHAR> ) >

and again below I added:


| <TERM:      <_TERM_START_CHAR> (<_TERM_CHAR>)*  >
| <FIELDNAME: <_FIELDNAME_START_CHAR> (<_FIELDNAME_CHAR>)*  >

And I changed:

    LOOKAHEAD(2)
    fieldToken=<TERM> <COLON> { field = fieldToken.image; }

to: ...

    LOOKAHEAD(2)
    fieldToken=<FIELDNAME> <COLON> { field = fieldToken.image; }


Well after doing all this mods all the query that involved field names cause 
problem, for example if I searched for

fieldname:hello

The query is blank (yes blank, nothing in it)

and if the fieldname does contain a dash ("-") for example: field-name:hello

They query is: +field -name

hello is dropped.


Does anyone has any idea? Help and suggestions will be much appreciated. I 
really need to get this dash working, changing the field name will be my last 
resort which I won't explore until I really have to.


Thanks,

Victor


On Thu, 15 May 2003 04:54 am, Eric Isakson wrote:
> I think the query parser changes would not be too bad, I've outlined a 
> couple of relavant lines you should look at so you don't have to try 
> and comprehend the productions for the entire QueryParser. I do not 
> think I would like to have to maintain one of those myself though. 
> Your other unmentioned alternative is to choose field names that match 
> the <TERM> production of QueryParser.jj without escapes.
>
> QueryParser.jj line 557:
>     fieldToken=<TERM> <COLON> { field = fieldToken.image; }
>
> and earlier...
>  <#_ESCAPED_CHAR: "\\" [ "\\", "+", "-", "!", "(", ")", ":", "^",
>                           "[", "]", "\"", "{", "}", "~", "*", "?" ] >
>
> | <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":", 
> | "^",
>
>                            "[", "]", "\"", "{", "}", "~", "*", "?" ]
>
>                        | <_ESCAPED_CHAR> ) >
> |
> | <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >
>
> ...
>
> <TERM:      <_TERM_START_CHAR> (<_TERM_CHAR>)*  >
>
> So the characters you need to avoid in your field names are the ones 
> from _ESCAPED_CHAR, [ "\\", "+", "-", "!", "(", ")", ":", "^", "[", 
> "]", "\"", "{", "}", "~", "*", "?" ]
>
> If you need to modify the parser, you will probably want to add a 
> FIELDNAME token and other supporting productions that look really 
> similar to these lines I've copied but modify the complement, ~[...], 
> at the beginning of _FIELDNAME_START_CHAR (you would add this 
> production) so it will match the "-" that you are using in your field 
> names (and fix it to match any other characters you want to use in 
> field names that it doesn't allow right now).
>
> Eric
>
> -----Original Message-----
> From: Jon Pipitone [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, May 14, 2003 2:26 PM
> To: Lucene Users List
> Subject: Re: '-' character not interpreted correctly in field names
>
> Eric Isakson wrote:
> > I just looked at the QueryParser.jj code, your field names
> >
>  > never get processed by the analyzer. It does look like the  > query 
> parser will honor escapes though. I haven't tried  > this, but try a 
> query like "foo\-bar:foo" and have
> >
> > a look at the QueryParser.jj file for how it handles field
> >
>  > names when parsing your query.
>
> Hrm.. that's what I had found too.  So, you're saying that, other than 
> escaping dashes, I'd have to change QueryParser.. ?
>
> I'm not too familiar just yet with JavaCC syntax, so reading through 
> QueryParser is a little tough going.  Thanks Eric,
>
> jp
>
> > -----Original Message-----
> > From: Jon Pipitone [mailto:[EMAIL PROTECTED]
> > Sent: Monday, May 12, 2003 4:03 PM
> > To: Lucene Users List
> > Subject: Re: '-' character not interpreted correctly in field names
> >
> >
> > Hi Otis, Terry,
> >
> >  >>>You can write a custom Analyzer that does not remove dashes from
> > >>>
> > >>>tokens, and use it for both indexing and searching.  >>>  >>>This
> >
> > is a frequent question and answer on this list.
> >
> > Sorry for the noise, but I haven't been able to find a solution in 
> > the mailing list archives, or by writing my own analyzer:
> >
> >     public class MyAnalyzer extends Analyzer {
> >     public TokenStream tokenStream(String fieldName, Reader reader)                
> >  {
> >             return new CharTokenizer(reader) {
> >                     protected boolean isTokenChar(char c) {
> >                             return Character.isLetter(c) || c == '-';
> >                     }
> >             };
> >     }
> >     }
> >
> > I parse a query like this:
> >
> >     String queryString = "foo-bar:foo";
> >     String queryResult =
> >             QueryParser.parse(queryString, "body", new MyAnalyzer())
> >
> > With the output:
> >     body:foo -bar:foo
> >
> > But I would expect the output:
> >      foo-bar:foo
> >
> > If I print out the tokens that MyAnalyzer produces I do get 
> > "foo-bar" and then "foo".
> >
> > Any pointers on what I'm doing wrong?
> >
> > jp
> >
> >>>>--- Jon Pipitone <[EMAIL PROTECTED]> wrote:
> >>>>>Hi all,
> >>>>>
> >>>>>>I believe that the tokenizer treats a dash as a token
> >>>
> >>>separator.
> >>>
> >>>>>>Hence, the only way, as I recall, to eliminate this behavior
> >>>
> >>>is
> >>>
> >>>>>>to modify QueryParser.jj so it doesn't do this.  However,
> >>>
> >>>doing
> >>>
> >>>>>>this can cause some other problems, like hyphenated words at a 
> >>>>>>line break and the like.
> >>>>>
> >>>>>I've recently started using lucene and I'm running into the same 
> >>>>>issue with the query parser.  I'd like to use queries that 
> >>>>>contain
> >>>
> >>>dashes
> >>>
> >>>>>in
> >>>>>the field name, but as far as I can tell it seems that the
> >>>
> >>>current
> >>>
> >>>>>query
> >>>>>grammar treats field names as terms, and so, as Terry notes, a
> >>>
> >>>dash
> >>>
> >>>>>becomes a token seperator.
> >>>>>
> >>>>>Terry suggests modifying the QueryParser.jj -- I would suspect by 
> >>>>>creating a seperate non-terminal for field names.
> >>>>>
> >>>>>Has anyone done any work on this already?  Is modifying 
> >>>>>QueryParser.jj the best approach?
> >>>>>
> >>>>>Thanks,
> >>>>>jp



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: '-' character not interpreted correctly in field names

Reply via email to