test case - RE: Slash Problem

Spencer, Dave Mon, 25 Nov 2002 17:19:04 -0800

I'm sure there's something that I'm missing here.
Let's say we have an index of a web site with 2 fields,
"body", and "url".
Body is formed via Field.Text(...,Reader) and the url field by 
Field.Keyword(), thus the URL is not tokenized but is searchable.


I use StandardAnalyzer and I want to find
the Document with a matching URL, and I want
to use QueryParser to parse the user's queries.

I'm using v1.2.

If I'm correct, one design problem is that the Analyzer does not have a
reference to an index, so it doesn't know whether a field has been
tokenized. It probably should not tokenize queries against an
untokenized field. AFAIK queries against untokenized fields are always
tokenized, and there is no way to tell the QueryParser not to tokenize
a field.
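
To make the suggestion concrete, here is a minimal sketch of a
field-aware front end. It is plain Java and uses no Lucene classes: the
class FieldAwareTerms, its two methods, and the crude tokenizer are all
hypothetical names made up for illustration, and the tokenizer is only a
stand-in, not StandardAnalyzer's real grammar. The one idea it shows is
that the caller says which fields were indexed untokenized, and values
for those fields are passed through verbatim.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class FieldAwareTerms {

    // Crude stand-in for an analyzer: lower-cases the text and splits it on
    // anything that is not a letter or digit. NOT StandardAnalyzer's grammar.
    static List<String> crudeTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Turns one clause such as url:"http://www.tropo.com/" into "field:term"
    // strings. The map (an assumption of this sketch) holds each known field
    // name with true for tokenized and false for untokenized.
    static List<String> termsFor(String clause, String defaultField,
                                 Map<String, Boolean> fields) {
        String field = defaultField;
        String value = clause;
        int colon = clause.indexOf(':');
        // Treat the text before the first colon as a field name only if known.
        if (colon > 0 && fields.containsKey(clause.substring(0, colon))) {
            field = clause.substring(0, colon);
            value = clause.substring(colon + 1);
        }
        // Drop one pair of surrounding double quotes, if present.
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        List<String> terms = new ArrayList<>();
        if (fields.getOrDefault(field, Boolean.TRUE)) {
            // Tokenized (or unknown) field: run the value through the tokenizer.
            for (String token : crudeTokenize(value)) {
                terms.add(field + ":" + token);
            }
        } else {
            // Untokenized field: keep the raw value as a single term.
            terms.add(field + ":" + value);
        }
        return terms;
    }

    public static void main(String[] args) {
        Map<String, Boolean> fields = new HashMap<>();
        fields.put("body", Boolean.TRUE);
        fields.put("url", Boolean.FALSE);
        String[] lines = {
            "foo",
            "body:foo",
            "url:http://www.tropo.com/",
            "url:\"http://www.tropo.com/\""
        };
        for (String line : lines) {
            System.out.println("From: " + line);
            System.out.println("To  : " + termsFor(line, "body", fields));
            System.out.println();
        }
    }
}
```

With body marked tokenized and url marked untokenized, both url clauses
come back as the single term url:http://www.tropo.com/ with the colon
and slashes intact. In Lucene itself, a raw value like that could
presumably be wrapped in a TermQuery built from a Term, bypassing
QueryParser for that one field.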

I have attached a test program that shows the behavior and
sample output.
The "From:" lines are user queries.
The "To:" lines are the result of calling QueryParser and then
Query.toString().

The 3rd and 4th From/To lines below are the key ones.
The goal is to enter a query like url:http://www.tropo.com/
or url:"http://www.tropo.com/" and not tokenize the
'http://www.tropo.com/'.
I tried backslashes too, to no avail (url:http\://www.tropo.com/).

      

==========================================================================
C:\proj\tropo_java>java com.tropo.lucene.KeywordProblem
From: foo
To  : foo

From: body:foo
To  : body:foo

From: url:http://www.tropo.com/                        <-- first attempt
To  : http                                             <-- first problem, ok, we gotta quote

From: url:"http://www.tropo.com/"                      <-- second attempt
To  : "http www.tropo.com"                             <-- second problem, colon and slashes missing


==========================================================================
package com.tropo.lucene;

import java.io.*;
import java.util.*;

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.search.*;
import org.apache.lucene.queryParser.*;

public class KeywordProblem
{
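        /**
         * Parses each sample query with QueryParser and StandardAnalyzer,
         * then prints the query before ("From:") and after ("To  :") parsing.
         */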
        public static void main(String[] args)
                throws Throwable
        {
                String body = "body";
                String url = "url";

                String[] lines = new String[] {
                        "foo",
                        "body:foo",
                        "url:http://www.tropo.com/",
                        "url:\"http://www.tropo.com/\""
                };

                Analyzer a = new StandardAnalyzer();
                for ( int i = 0; i < lines.length; i++)
                {
                        Query query = QueryParser.parse( lines[i], url, a);
                        o.println( "From: " + lines[i]);
                        o.println( "To  : " + query.toString( url));
                        o.println();
                }
        }
        private static PrintStream o = System.out;
}




-----Original Message-----
From: Terry Steichen [mailto:[EMAIL PROTECTED]]
Sent: Monday, November 25, 2002 12:13 PM
To: Lucene Users List
Subject: Re: Slash Problem


Dave,

My recent testing suggests that when the field is not tokenized, it is not
split as you suggest.  When I search the "path" field using "path:1102/A*" I
get precisely what I am looking for (though I discovered the lowercase
mechanism isn't applied to this field and the query is case-sensitive;
note the uppercase 'A' above).

Regards,

Terry

----- Original Message -----
From: "Spencer, Dave" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, November 25, 2002 2:58 PM
Subject: RE: Slash Problem


Funny, I have more or less the same question I've been meaning to post.
I think the answer is going to be that the analyzer applies to all parts of
a query, even to untokenized fields, which to me seems wrong.

So I think if you have a query like

body:foo uri:"/alpha/beta"

With 'body' being tokenized and 'uri' not tokenized, I think that
the analyzer applies to "/alpha/beta" and breaks it into "alpha beta"
which is not desired...


-----Original Message-----
From: Terry Steichen [mailto:[EMAIL PROTECTED]]
Sent: Monday, November 25, 2002 9:26 AM
To: Lucene Users List
Subject: Re: Slash Problem


Rob,

I presume that means that you used backslashes (in the url) rather than
forward slashes (in the path).  I had planned to test that as a workaround
and it's good to know that you've already tested that successfully.

But why is this necessary?  Why doesn't the escape ('\') allow the use of a
backslash?

Regards,

Terry

----- Original Message -----
From: "Rob Outar" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, November 25, 2002 12:01 PM
Subject: RE: Slash Problem


> I don't know if this helps, but I had the exact same problem. I then stored
> the URI instead of the path, and I was then able to search on the URI.
>
> Thanks,
>
> Rob
>
>
> -----Original Message-----
> From: Terry Steichen [mailto:[EMAIL PROTECTED]]
> Sent: Monday, November 25, 2002 11:53 AM
> To: Lucene Users Group
> Subject: Slash Problem
>
>
> I've got a Text field (tokenized, indexed, stored) called 'path' which
> contains a string in the form of '1102\A3345-12RT.XML'.  When I submit a
> query like "path:1102*" it works fine.  But, when I try to be more specific
> (such as "path:1102\a*" or "path:1102*a*") it fails.  I've tried escaping
> the slash ("path:1102\\a*") but that also fails.
>
> I'm using the StandardAnalyzer and the default QueryParser.  Could anyone
> suggest what's going wrong here?
>
> Regards,
>
> Terry
>
>
>
> --
> To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
>
>


