> Brian, here is another idea for the query parser. To add the ability to mark
> terms as 'non analyzed'.
> 
> For example
> 
>  +body:xyz +folder:a.b.c.d
> 
> when 'folder' is a non tokenized field will not match if a.b.c.d is
> tokenized.
> 
> A possible syntax may be
> 
>  +body:xyz +folder:'a.b.c.d'

I understand the desire for such a feature (someone else suggested the
same thing.)  I am very wary of creating "new syntax" about which
you'll have to educate your users.  I know it sounds like you're only
asking for one feature, but if you think it'll be the last "special
case" that someone wants, well, I don't believe you.  I can't think of
any syntax that will clearly and unambiguously indicate "no
tokenization please."

It's one thing to add a syntax for the boost stuff, which only very
advanced users will use, but this is something that might be expected
of relatively beginning users -- "you have to put the author's name in
single quotes, but the article title in double quotes."  No way.

I think the request for this underscores an issue that's been bugging
me for a while -- since it's so important that you use the same
analyzer for queries as for indexing, maybe the analyzer should
actually be stored in the index store.

I could see two ways to address this issue:

1 (complicated way): When the index store is created, register an
analyzer for each field (could be the same one.)  A serialized copy of
the analyzer is stored in the index store, and queries on that field
are automatically processed with it.

2 (simpler, less complete way): Have a way of telling the query parser
that "these fields use these analyzers", or at the very least, "these
fields don't get tokenized with an analyzer."
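
To make option 2 concrete, here is a rough sketch of what I have in mind.
It is not Lucene code: the class PerFieldAnalysis and its methods
tokenize, keyword, register and analyze are all hypothetical names made
up for illustration, and the "analyzers" are plain functions from text to
a list of terms. The point is only that the parser looks up the analyzer
by field name, so 'folder' can be left untokenized with no new query
syntax.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a per-field analyzer registry; not a Lucene class. */
public class PerFieldAnalysis {
    private final Map<String, Function<String, List<String>>> byField = new HashMap<>();
    private final Function<String, List<String>> fallback;

    /** The fallback analyzer is used for any field without a registration. */
    public PerFieldAnalysis(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    /** Stand-in for a tokenizing analyzer: lowercases, splits on non-alphanumerics. */
    public static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    /** Stand-in for "no tokenization": the whole value is a single term. */
    public static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    /** Tells the parser that this field uses this analyzer. */
    public PerFieldAnalysis register(String field, Function<String, List<String>> analyzer) {
        byField.put(field, analyzer);
        return this;
    }

    /** Analyzes a query term with whatever analyzer its field was registered with. */
    public List<String> analyze(String field, String text) {
        return byField.getOrDefault(field, fallback).apply(text);
    }

    public static void main(String[] args) {
        PerFieldAnalysis analysis = new PerFieldAnalysis(PerFieldAnalysis::tokenize)
                .register("folder", PerFieldAnalysis::keyword);
        // +body:xyz +folder:a.b.c.d
        System.out.println(analysis.analyze("body", "xyz"));
        System.out.println(analysis.analyze("folder", "a.b.c.d"));
    }
}
```

With something like this, the user still types +folder:a.b.c.d and never
has to know that the field is handled differently from 'body'.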


> BTW, it will be great if the syntax of the query parser will allow
> to describe any query that is supported by Lucene standard
> classes. This will provide a common language to describe queries and
> will provide an alternative, and more intuitive, way to construct
> queries.

Nice goal, and I'm happy to try for it if practical, but I think a
more important rule is that the syntax should be simple and hard to
mess up.  I would -1 adding any syntax which will only be used by 5%
of the users, but which might confuse the other 95%, and the same with
any syntax which will be widely used but which requires more than a
sentence or two of explanation to the "average user."  Remember, the
people who create these queries are used to using Google; we should
support a query language which is familiar (or at least easily
explained) to those users.  Advanced users can still create their own
with the query classes.



_______________________________________________
Lucene-dev mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/lucene-dev
