You left out the ~ character in your _FIELDNAME_START_CHAR production. That character
tells the grammar that it should take all the characters except the ones you specified
(the complement).
Change:
| <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":",
To:
| <#_FIELDNAME_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":",
and it should probably work.
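As a rough illustration of what the ~ changes, here is a regex analogy in plain Java. It is not JavaCC or Lucene code, and the class and method names are made up for this sketch. It compares the same character list with and without negation, using the list from the _FIELDNAME_START_CHAR production quoted below:

```java
import java.util.regex.Pattern;

// Regex analogy for the JavaCC character lists in this thread; not Lucene code.
final class ComplementDemo {
    // The characters listed in the _FIELDNAME_START_CHAR production,
    // escaped where a regex character class requires it.
    private static final String LISTED = " \\t+\\-!():^\\[\\]\"{}~*?";

    // Like [ ... ] in JavaCC: matches ONLY the listed characters.
    private static final Pattern WITHOUT_TILDE = Pattern.compile("[" + LISTED + "]");

    // Like ~[ ... ] in JavaCC: matches any character EXCEPT the listed ones.
    private static final Pattern WITH_TILDE = Pattern.compile("[^" + LISTED + "]");

    static boolean matchesWithoutTilde(char c) {
        return WITHOUT_TILDE.matcher(String.valueOf(c)).matches();
    }

    static boolean matchesWithTilde(char c) {
        return WITH_TILDE.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        // Print how a letter, a dash and a colon fare under each variant.
        for (char c : new char[] {'f', '-', ':'}) {
            System.out.println("'" + c + "' without ~: " + matchesWithoutTilde(c)
                    + ", with ~: " + matchesWithTilde(c));
        }
    }
}
```

Without the ~, an ordinary letter such as 'f' cannot start a FIELDNAME at all, which would be consistent with fieldname:hello failing to parse as a field. Note that with the ~ in place "-" is still in the excluded list, so for dashed field names the "-" presumably also has to come out of that list, as suggested in the earlier message quoted at the bottom of this thread.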
Eric
-----Original Message-----
From: Victor Hadianto [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 09, 2003 4:53 AM
To: Lucene Users List
Subject: Re: '-' character not interpreted correctly in field names
Hi Eric and others,
I'm looking for a similar solution where I need QueryParser not to drop the
"-" characters from the field name. However, outside the field name I do want
the - sign interpreted as the "not" modifier.
I'm definitely not an expert in JavaCC and to be honest I only have a limited
idea of how Eric's suggestion works.
Anyway I followed the suggestion and added the following:
| <#_WHITESPACE: ( " " | "\t" ) >
| <#_FIELDNAME_START_CHAR: ( [ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
                             "[", "]", "\"", "{", "}", "~", "*", "?" ]
                           | <_ESCAPED_CHAR> ) >
| <#_FIELDNAME_CHAR: ( <_FIELDNAME_START_CHAR> | <_ESCAPED_CHAR> ) >
and again below I added:
| <TERM: <_TERM_START_CHAR> (<_TERM_CHAR>)* >
| <FIELDNAME: <_FIELDNAME_START_CHAR> (<_FIELDNAME_CHAR>)* >
And I changed:
LOOKAHEAD(2)
fieldToken=<TERM> <COLON> { field = fieldToken.image; }
to: ...
LOOKAHEAD(2)
fieldToken=<FIELDNAME> <COLON> { field = fieldToken.image; }
Well, after making all these modifications, every query that involves a field
name causes problems. For example, if I search for
fieldname:hello
the query is blank (yes blank, nothing in it),
and if the field name does contain a dash ("-"), for example: field-name:hello
the query is: +field -name
and hello is dropped.
Does anyone have any idea? Help and suggestions will be much appreciated. I
really need to get this dash working, changing the field name will be my last
resort which I won't explore until I really have to.
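Short of patching the grammar, the escape route suggested earlier in this thread (a query like foo\-bar:foo) could be built with a small helper along these lines. This is only a sketch: escapeFieldName is a hypothetical name, not a Lucene method, the character set is copied from the _ESCAPED_CHAR production quoted below, and nothing in this thread confirms how QueryParser treats the backslash in the resulting field name, so it is untested against the parser itself:

```java
// Hypothetical helper, not part of Lucene.
final class FieldNameEscaper {
    // Characters that the quoted _ESCAPED_CHAR production accepts after a backslash.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?";

    // Prefix every special character in the field name with a backslash.
    static String escapeFieldName(String name) {
        StringBuilder out = new StringBuilder(name.length() + 4);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Build the query string form of a dashed field name.
        System.out.println(escapeFieldName("foo-bar") + ":foo");
    }
}
```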
Thanks,
Victor
On Thu, 15 May 2003 04:54 am, Eric Isakson wrote:
> I think the query parser changes would not be too bad, I've outlined a
> couple of relevant lines you should look at so you don't have to try
> and comprehend the productions for the entire QueryParser. I do not
> think I would like to have to maintain one of those myself though.
> Your other unmentioned alternative is to choose field names that match
> the <TERM> production of QueryParser.jj without escapes.
>
> QueryParser.jj line 557:
> fieldToken=<TERM> <COLON> { field = fieldToken.image; }
>
> and earlier...
> <#_ESCAPED_CHAR: "\\" [ "\\", "+", "-", "!", "(", ")", ":", "^",
> "[", "]", "\"", "{", "}", "~", "*", "?" ] >
>
> | <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
>                            "[", "]", "\"", "{", "}", "~", "*", "?" ]
>                        | <_ESCAPED_CHAR> ) >
> | <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >
>
> ...
>
> <TERM: <_TERM_START_CHAR> (<_TERM_CHAR>)* >
>
> So the characters you need to avoid in your field names are the ones
> from _ESCAPED_CHAR, [ "\\", "+", "-", "!", "(", ")", ":", "^", "[",
> "]", "\"", "{", "}", "~", "*", "?" ]
>
> If you need to modify the parser, you will probably want to add a
> FIELDNAME token and other supporting productions that look really
> similar to these lines I've copied but modify the complement, ~[...],
> at the beginning of _FIELDNAME_START_CHAR (you would add this
> production) so it will match the "-" that you are using in your field
> names (and fix it to match any other characters you want to use in
> field names that it doesn't allow right now).
>
> Eric
>
> -----Original Message-----
> From: Jon Pipitone [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, May 14, 2003 2:26 PM
> To: Lucene Users List
> Subject: Re: '-' character not interpreted correctly in field names
>
> Eric Isakson wrote:
> > I just looked at the QueryParser.jj code, your field names
> > never get processed by the analyzer. It does look like the query
> > parser will honor escapes though. I haven't tried this, but try a
> > query like "foo\-bar:foo" and have a look at the QueryParser.jj
> > file for how it handles field names when parsing your query.
>
> Hrm.. that's what I had found too. So, you're saying that, other than
> escaping dashes, I'd have to change QueryParser.. ?
>
> I'm not too familiar just yet with JavaCC syntax, so reading through
> QueryParser is a little tough going. Thanks Eric,
>
> jp
>
> > -----Original Message-----
> > From: Jon Pipitone [mailto:[EMAIL PROTECTED]
> > Sent: Monday, May 12, 2003 4:03 PM
> > To: Lucene Users List
> > Subject: Re: '-' character not interpreted correctly in field names
> >
> >
> > Hi Otis, Terry,
> >
> > >>>You can write a custom Analyzer that does not remove dashes from
> > >>>tokens, and use it for both indexing and searching.
> > >>>
> > >>>This is a frequent question and answer on this list.
> >
> > Sorry for the noise, but I haven't been able to find a solution in
> > the mailing list archives, or by writing my own analyzer:
> >
> > public class MyAnalyzer extends Analyzer {
> >     public TokenStream tokenStream(String fieldName, Reader reader) {
> >         return new CharTokenizer(reader) {
> >             protected boolean isTokenChar(char c) {
> >                 return Character.isLetter(c) || c == '-';
> >             }
> >         };
> >     }
> > }
> >
> > I parse a query like this:
> >
> > String queryString = "foo-bar:foo";
> > Query queryResult =
> >     QueryParser.parse(queryString, "body", new MyAnalyzer());
> >
> > With the output:
> > body:foo -bar:foo
> >
> > But I would expect the output:
> > foo-bar:foo
> >
> > If I print out the tokens that MyAnalyzer produces I do get
> > "foo-bar" and then "foo".
> >
> > Any pointers on what I'm doing wrong?
> >
> > jp
> >
> >>>>--- Jon Pipitone <[EMAIL PROTECTED]> wrote:
> >>>>>Hi all,
> >>>>>
> >>>>>>I believe that the tokenizer treats a dash as a token separator.
> >>>>>>Hence, the only way, as I recall, to eliminate this behavior is
> >>>>>>to modify QueryParser.jj so it doesn't do this. However, doing
> >>>>>>this can cause some other problems, like hyphenated words at a
> >>>>>>line break and the like.
> >>>>>
> >>>>>I've recently started using Lucene and I'm running into the same
> >>>>>issue with the query parser. I'd like to use queries that
> >>>>>contain dashes in the field name, but as far as I can tell it
> >>>>>seems that the current query grammar treats field names as
> >>>>>terms, and so, as Terry notes, a dash becomes a token separator.
> >>>>>
> >>>>>Terry suggests modifying the QueryParser.jj -- I would suspect by
> >>>>>creating a separate non-terminal for field names.
> >>>>>
> >>>>>Has anyone done any work on this already? Is modifying
> >>>>>QueryParser.jj the best approach?
> >>>>>
> >>>>>Thanks,
> >>>>>jp
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]