Hi Steve

Thanks for your response. I was just wondering whether there is a difference between the regular expression you sent me, i.e.
(i)   \s*(?:\b|(?<=\S)(?=\s)|(?<=\s)(?=\S))\s*

   and
(ii) \\b

as they seem to lead to the same output. For example, splitting the string "testing a-new string=3/4" with either one gives:
Item is :
Item is : testing
Item is : a
Item is : -
Item is : new
Item is : string
Item is : =
Item is : 3
Item is : /
Item is : 4

What I'd like to do, though, is remove the splits over space characters so that the output looks like this:
Item is : testing
Item is : a
Item is : -
Item is : new
Item is : string
Item is : =
Item is : 3
Item is : /
Item is : 4

I'm not great at regular expressions, so I would really appreciate it if you could provide me with some insight into expression (i).
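
In case it's useful, this is roughly the loop I am using to produce those "Item is :" listings (just a sketch; the class and main method here are a simplified stand-in for my real driver):

--
import java.util.regex.Pattern;

public class SplitTest {
    public static void main(String[] args) {
        // expression (i), escaped for a Java string literal
        Pattern p = Pattern.compile("\\s*(?:\\b|(?<=\\S)(?=\\s)|(?<=\\s)(?=\\S))\\s*");
        for (String item : p.split("testing a-new string=3/4")) {
            System.out.println("Item is : " + item);
        }
    }
}
--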

Thanks for all your help
Rahil

Steven Rowe wrote:

Hi Rahil,

Rahil wrote:
I couldn't figure out a valid regular expression for Pattern.compile(String regex)
that can tokenise the string "O/E - visual acuity R-eye=6/24" into
"O", "/", "E", "-", "visual", "acuity", "R", "-", "eye", "=", "6", "/", "24".

The following regular expression should match boundaries between word
and non-word, or between space and non-space, in either order, and
includes contiguous whitespace:

  \s*(?:\b|(?<=\S)(?=\s)|(?<=\s)(?=\S))\s*

Note that with the above regex, the "(%$#!)" in "some (%$#!) text" will
be tokenized as a single token.
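
In case it helps to see how it is put together, here is the same expression
with Java string escaping and a comment per alternative (just an annotated
restatement of the pattern, shown as a fragment rather than a full class):

  import java.util.regex.Pattern;

  Pattern boundary = Pattern.compile(
        "\\s*"                // consume any whitespace before the boundary
      + "(?:\\b"              //   a word/non-word boundary (e.g. between "eye" and "=")
      + "|(?<=\\S)(?=\\s)"    //   or the point just after a non-space, before a space
      + "|(?<=\\s)(?=\\S))"   //   or the point just after a space, before a non-space
      + "\\s*");              // consume any whitespace after the boundary

Because the \s* on either side folds contiguous whitespace into the delimiter
itself, splitting on this pattern should not leave whitespace-only tokens
behind, whereas splitting on a bare \b does.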

Hope it helps,
Steve

Erick Erickson wrote:

Well, I'm not the greatest expert, but a quick look doesn't show me anything
obvious. But I have to ask: wouldn't WhitespaceAnalyzer work for you?
Although I don't remember whether WhitespaceAnalyzer lowercases or not.

It sure looks like you're getting reasonable results given how you're
tokenizing.

If not that, you might want to think about PatternAnalyzer. It's in the
memory contribution section; see import
org.apache.lucene.index.memory.PatternAnalyzer. One note of caution: the
regex identifies what is NOT a token, rather than what is. This threw me
for a bit.
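
If I remember the contrib class correctly, using it looks something like
this, though I'm going from memory on the constructor arguments, so check
the javadocs before trusting the details:

  import java.util.regex.Pattern;
  import org.apache.lucene.index.memory.PatternAnalyzer;

  // the pattern describes the separators between tokens, NOT the tokens themselves
  Pattern separators = Pattern.compile("\\s+");
  PatternAnalyzer analyzer =
      new PatternAnalyzer(separators, true /* lowercase */, null /* no stop words */);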

I still claim that you could break the tokens up like "6", "/", "12", and
make SpanNearQuery work with a span of 0 (or 1, I don't remember right now),
but that may well be more trouble than it's worth; it's up to you of course.
What you get out of this is, essentially, a query that's only satisfied
if the terms you specify are right next to each other. So you'd find both
your documents in your example, since you would have tokenized "6", "/",
"12" in, say, positions 0, 1, 2 in doc1 and 4, 5, 6 in the second doc. But
since they're tokens that are next to each other in each doc, searching with
a SpanNearQuery for "6", "/", and "12" that are "right next to each other",
which you specify with a slop of 0 as I remember, should get you both.
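
Something along these lines, assuming the field is called "TERM" as in your
earlier mails (untested, written straight from memory):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.spans.SpanNearQuery;
  import org.apache.lucene.search.spans.SpanQuery;
  import org.apache.lucene.search.spans.SpanTermQuery;

  SpanQuery[] clauses = new SpanQuery[] {
      new SpanTermQuery(new Term("TERM", "6")),
      new SpanTermQuery(new Term("TERM", "/")),
      new SpanTermQuery(new Term("TERM", "12"))
  };
  // slop 0 and in-order: the three tokens must sit right next to each other
  SpanNearQuery near = new SpanNearQuery(clauses, 0, true);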

Alternatively, if you tokenize it this way, a PhraseQuery might work as
well. Thus, searching for "6 / 12" (as a phrase query, and note the spaces)
might be just what you want. You'd have to tokenize the query, but that's
relatively easy. This is probably much simpler than a SpanNearQuery now
that I think about it.....
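
A sketch of the phrase version, again assuming your "TERM" field and that
the query side tokenizes to "6", "/", "12":

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.PhraseQuery;

  PhraseQuery phrase = new PhraseQuery();
  phrase.add(new Term("TERM", "6"));
  phrase.add(new Term("TERM", "/"));
  phrase.add(new Term("TERM", "12"));
  phrase.setSlop(0); // exact adjacency (0 is also the default)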

Be aware that if you use the *TermEnums we've been talking about, you'll
probably wind up wrapping them in a ConstantScoreQuery. And if you have no
*other* terms, you won't get any relevancy out of your search. This may be
important.....

Anyway, that's as creative as I can be Sunday night <G>. Best of luck....

Erick

On 10/1/06, Rahil <[EMAIL PROTECTED]> wrote:

Hi Erick

Thanks for your response. There's a lot to chew on in your reply and I'm
looking at the suggestions you've made.

Yeah, I have Luke installed and have queried my index, but I'm not getting
any great explanation out of it. A query for "6/12" is sent as "TERM:6/12",
which is quite straightforward. I also ran an explanation of the query in my
code and got some more information, but that wasn't of much help either.
--
Explanation explain = searcher.explain(query,0);

OUTPUT:
query: +TERM:6/12
explain.getDescription() : weight(TERM:6/12 in 0), product of:
Detail 0 : 0.99999994 = queryWeight(TERM:6/12), product of:
 2.0986123 = idf(docFreq=1)
 0.47650534 = queryNorm

Detail 1 : 0.0 = fieldWeight(TERM:6/12 in 0), product of:
 0.0 = tf(termFreq(TERM:6/12)=0)
 2.0986123 = idf(docFreq=1)
 0.5 = fieldNorm(field=TERM, doc=0)

Number of results returned: 1
SampleLucene.displayIndexResults
SCORE    DESCRIPTIONSTATUS    CONCEPTID    TERM
1.0    0    260278007    6/12 (finding)
--

My tokeniser, BaseAnalyzer, extends Analyzer. Since I wanted to retain all
non-whitespace characters and not just letters and digits, I introduced the
following block of code in the overridden tokenStream():
--
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

public class BaseAnalyzer extends Analyzer {

    public TokenStream tokenStream(String fieldName, Reader reader) {

        return new CharTokenizer(reader) {

            protected char normalize(char c) {
                return Character.toLowerCase(c);
            }

            protected boolean isTokenChar(char c) {
                boolean type   = false;
                boolean space  = Character.isWhitespace(c);
                boolean letDig = Character.isLetterOrDigit(c);

                if (letDig && !space)        // letter or digit, and not whitespace
                    type = true;
                else if (!letDig && !space)  // neither letter/digit nor whitespace: retain it
                    type = true;
                else if (!letDig && space)   // whitespace: not part of a token
                    type = false;

                return type;
            }
        };
    }
}
---
The problem is that when the term "6/12 (finding)" is tokenised, two tokens
are generated, viz. '6/12' and '(finding)'. Therefore when I search for
'6/12' this term is returned, since it is in effect an EXACT token match.

However, when the term "R-eye=6/12 (finding)" is tokenised it again results
in two tokens, viz. 'R-eye=6/12' and '(finding)'. So now if I look for
'6/12' it is no longer an exact match, since there is no token with this
EXACT value. A simple searcher.search(query) isn't enough to pull out the
partial token match.

I think it won't be useful to create separate tokens for "6", "/", "12"
or "R", "-", "eye", "=", and so on. I'm having a look at RegexTermEnum
and WildcardTermEnum as they might possibly help.

I would appreciate your comments on the BaseAnalyzer tokenizer and the query
explanation I've received so far.

Thanks
Rahil

Erick Erickson wrote:

Most often, from what I've seen on this e-mail list, unexpected results are
because you're not indexing on the tokens you *think* you're indexing. Or
not searching on them. By that I mean that the analyzers you're using are
behaving in ways you don't expect.

That said, I think you're getting exactly what you should. I suspect you're
indexing tokens as follows:
doc1: "6/12"  and "(finding)"
doc2: "R-eye=6/12" and "(finding)"

So it makes perfect sense that searching on 6/12 returns doc1 and searching
on R-eye=6/12 returns doc2.

So, first question: have you actually used something like Luke (google luke
lucene) to examine your index and see if what you've put in there is what
you expect? What analyzer is your custom analyzer built upon, and is it
doing anything you're unaware of (for instance, lower-casing the 'R' in your
second example)?

Here's what I'd do.
1> get Luke and see what's actually in your index.
2> use searcher.explain (?) to see the query you're actually emitting.
3> if you make no headway, post the smallest code snippets you can that
illustrate the problem. Folks would need the indexing AND searching code.

As far as queries like "contains" in Java.... Well, sure. Write a filter
that filters on regular expressions or wildcards (you'll need
WildcardTermEnum and RegexTermEnum); a rough sketch is below. Or index
things differently (e.g. index "6/12" and "finding" on doc1 and "r", "eye",
"6/12" and "finding" on doc2). Now your searches for "6/12" will work. Or
index "6", "/", "12" and "finding" on doc1, index similarly for doc2, and
use a SpanNearQuery with an appropriate span value. Or....
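
The filter idea would look very roughly like this, using the 2.0-era
Filter.bits() API as I remember it; treat the details as a starting point
to check against the javadocs rather than working code:

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.WildcardTermEnum;

public class WildcardFilter extends Filter {
    private final Term pattern; // e.g. new Term("TERM", "*6/12*")

    public WildcardFilter(Term pattern) {
        this.pattern = pattern;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet result = new BitSet(reader.maxDoc());
        WildcardTermEnum enumerator = new WildcardTermEnum(reader, pattern);
        TermDocs termDocs = reader.termDocs();
        try {
            do {
                Term term = enumerator.term();
                if (term == null) break;
                // mark every document containing a term that matches the wildcard
                termDocs.seek(term);
                while (termDocs.next()) {
                    result.set(termDocs.doc());
                }
            } while (enumerator.next());
        } finally {
            termDocs.close();
            enumerator.close();
        }
        return result;
    }
}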

This is all gobbledygook if you haven't gotten a copy of "Lucene In Action",
which you should read in order to get the most out of Lucene. It's for the
1.4 code base, but the 2.0 Lucene code base isn't that much different. More
importantly, it ties lots of stuff together. Also, the junit tests that come
along with the Lucene code can be invaluable in showing you how to do
something.

Hope this helps
Erick

On 10/1/06, Rahil <[EMAIL PROTECTED]> wrote:

Hi

I have a custom-built Analyzer that retains all non-whitespace characters
when tokenising the field "TERM" (which is the only field being tokenised).

If I now query my index file for a term "6/12" for instance, I get
back
only ONE result

SCORE    DESCRIPTIONSTATUS    CONCEPTID    TERM
1.0    0    260278007    6/12 (finding)

instead of TWO. There is another token in the index file of the form

2561280012    0    163939000    R-eye=6/12 (finding)    0    3    en

At first it wasn't quite obvious to me why this was happening. But after
playing around a bit I realised that if I pass the query "R-eye=6/12"
instead, I get this second result (but not the first one now). Is there a
way to tweak the Query query = parser.parse(searchString) method so that I
can get both records if I query for "6/12"? Something like a 'contains'
query in Java.
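
(For illustration, I imagine something along these lines, built
programmatically since, as I understand it, the QueryParser refuses leading
wildcards by default; the field name matches my index:)

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.WildcardQuery;

  // a "contains 6/12" style query; leading wildcards are legal when the
  // query is built in code, though they can be slow on a large index
  Query contains = new WildcardQuery(new Term("TERM", "*6/12*"));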

Will appreciate all help. Thanks a lot

Regards
Rahil
