Hi Erick
I'm having trouble writing a good regular expression for the
PatternAnalyzer to deal with word and non-word characters. I couldn't
figure out a valid regular expression to pass to
Pattern.compile(String regex) that tokenises the string
"O/E - visual acuity R-eye=6/24" into "O", "/", "E", "-", "visual",
"acuity", "R", "-", "eye", "=", "6", "/", "24". I've given it quite a
few shots but am now totally frustrated with it. I can either tokenise
at \W+ or \w+, but not both.
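Just to show the token stream I'm after, here's a sketch that drives
java.util.regex directly (it seems to produce exactly that list, but I
realise from your note that PatternAnalyzer's regex has to describe the
separators instead, which is where I'm stuck):

--
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenSketch {
    public static void main(String[] args) {
        // a token is either a run of word characters or a single
        // non-word, non-space character such as '/', '-' or '='
        Pattern p = Pattern.compile("\\w+|[^\\w\\s]");
        Matcher m = p.matcher("O/E - visual acuity R-eye=6/24");
        while (m.find())
            System.out.println(m.group());
        // prints O, /, E, -, visual, acuity, R, -, eye, =, 6, /, 24
    }
}
--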
Could you please help.
Thanks a lot. Much appreciate it.
Regards
Rahil
Erick Erickson wrote:
> Well, I'm not the greatest expert, but a quick look doesn't show me
> anything obvious. But I have to ask, wouldn't WhitespaceAnalyzer work
> for you? Although I don't remember whether WhitespaceAnalyzer
> lowercases or not.
>
> It sure looks like you're getting reasonable results given how you're
> tokenizing.
>
> If not that, you might want to think about PatternAnalyzer. It's in
> the memory contrib section; see
> org.apache.lucene.index.memory.PatternAnalyzer. One note of caution:
> the regex identifies what is NOT a token, rather than what is. This
> threw me for a bit.
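>
> Roughly like this, if memory serves (untested, and the 2.0 contrib
> constructor I remember is PatternAnalyzer(Pattern pattern, boolean
> toLowerCase, Set stopWords)):
>
> import java.util.regex.Pattern;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.index.memory.PatternAnalyzer;
>
> // the compiled pattern matches the SEPARATORS between tokens
> // (here: runs of whitespace), not the tokens themselves
> Analyzer analyzer = new PatternAnalyzer(
>         Pattern.compile("\\s+"), true /* toLowerCase */, null /* stop words */);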
>
> I still claim that you could break the tokens up like "6", "/", "12",
> and make SpanNearQuery work with a span of 0 (or 1, I don't remember
> right now), but that may well be more trouble than it's worth; it's up
> to you of course. What you get out of this, essentially, is a query
> that's only satisfied if the terms you specify are right next to each
> other. So you'd find both your documents in your example, since you
> would have tokenized "6", "/", "12" in, say, positions 0, 1, 2 in doc1
> and 4, 5, 6 in the second doc. But since they're tokens that are next
> to each other in each doc, searching with a SpanNearQuery for "6",
> "/", and "12" that are "right next to each other", which you specify
> with a slop of 0 as I remember, should get both.
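>
> In code, something like this (a sketch, assuming your field is called
> TERM as in your explain output):
>
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.spans.*;
>
> SpanQuery[] clauses = new SpanQuery[] {
>         new SpanTermQuery(new Term("TERM", "6")),
>         new SpanTermQuery(new Term("TERM", "/")),
>         new SpanTermQuery(new Term("TERM", "12"))
> };
> // slop 0 and inOrder true: the three tokens must sit right next to
> // each other, in that order
> Query query = new SpanNearQuery(clauses, 0, true);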
>
> Alternatively, if you tokenize it this way, a PhraseQuery might work
> as well. Thus, searching for "6 / 12" (as a phrase query, and note the
> spaces) might be just what you want. You'd have to tokenize the query,
> but that's relatively easy. This is probably much simpler than a
> SpanNearQuery, now that I think about it.....
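>
> That is, roughly (PhraseQuery is in org.apache.lucene.search, and its
> slop defaults to 0, i.e. the terms must be adjacent):
>
> PhraseQuery phrase = new PhraseQuery();
> phrase.add(new Term("TERM", "6"));
> phrase.add(new Term("TERM", "/"));
> phrase.add(new Term("TERM", "12"));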
>
> Be aware that if you use the *TermEnums we've been talking about,
> you'll probably wind up wrapping them in a ConstantScoreQuery. And if
> you have no *other* terms, you won't get any relevancy out of your
> search. This may be important.....
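>
> To be concrete, the wrapping I mean looks roughly like this (untested,
> and the class name WildcardFilter is mine, just for illustration):
>
> import java.io.IOException;
> import java.util.BitSet;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.TermDocs;
> import org.apache.lucene.search.*;
>
> class WildcardFilter extends Filter {
>     private final Term pattern;
>     WildcardFilter(Term pattern) { this.pattern = pattern; }
>
>     public BitSet bits(IndexReader reader) throws IOException {
>         BitSet bits = new BitSet(reader.maxDoc());
>         WildcardTermEnum terms = new WildcardTermEnum(reader, pattern);
>         TermDocs docs = reader.termDocs();
>         try {
>             while (terms.term() != null) {   // for every matching term
>                 docs.seek(terms.term());
>                 while (docs.next())          // mark each doc containing it
>                     bits.set(docs.doc());
>                 if (!terms.next()) break;
>             }
>         } finally {
>             docs.close();
>             terms.close();
>         }
>         return bits;
>     }
> }
>
> // every match gets the same constant score -- hence no relevancy
> Query q = new ConstantScoreQuery(
>         new WildcardFilter(new Term("TERM", "*6/12*")));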
>
> Anyway, that's as creative as I can be Sunday night <G>. Best of
> luck....
>
> Erick
>
> On 10/1/06, Rahil <[EMAIL PROTECTED]> wrote:
>
>>
>> Hi Erick
>>
>> Thanks for your response. There's a lot to chew on in your reply and
>> I'm looking at the suggestions you've made.
>>
>> Yeah, I have Luke installed and have queried my index, but there
>> isn't any great explanation I'm getting out of it. A query for "6/12"
>> is sent as "TERM:6/12", which is quite straightforward. I did an
>> explanation of the query in my code too and got some more
>> information, but that wasn't of much help either.
>> --
>> Explanation explain = searcher.explain(query,0);
>>
>> OUTPUT:
>> query: +TERM:6/12
>> explain.getDescription() : weight(TERM:6/12 in 0), product of:
>> Detail 0 : 0.99999994 = queryWeight(TERM:6/12), product of:
>>     2.0986123 = idf(docFreq=1)
>>     0.47650534 = queryNorm
>>
>> Detail 1 : 0.0 = fieldWeight(TERM:6/12 in 0), product of:
>>     0.0 = tf(termFreq(TERM:6/12)=0)
>>     2.0986123 = idf(docFreq=1)
>>     0.5 = fieldNorm(field=TERM, doc=0)
>>
>> Number of results returned: 1
>> SampleLucene.displayIndexResults
>> SCORE DESCRIPTIONSTATUS CONCEPTID TERM
>> 1.0 0 260278007 6/12 (finding)
>> --
>>
>> My tokeniser, called BaseAnalyzer, extends Analyzer. Since I wanted
>> to retain all non-whitespace characters and not just letters and
>> digits, I introduced the following block of code in the overridden
>> tokenStream():
>>
>> --
>> public TokenStream tokenStream(String fieldName, Reader reader) {
>>     return new CharTokenizer(reader) {
>>         protected char normalize(char c) {
>>             return Character.toLowerCase(c);
>>         }
>>         protected boolean isTokenChar(char c) {
>>             // keep letters, digits and any other non-whitespace
>>             // character; only whitespace ends a token
>>             // (so this reduces to: return !Character.isWhitespace(c);)
>>             boolean space = Character.isWhitespace(c);
>>             boolean letDig = Character.isLetterOrDigit(c);
>>             if (letDig && !space)           // letter or digit
>>                 return true;
>>             else if (!letDig && !space)     // other printable character
>>                 return true;
>>             else                            // whitespace
>>                 return false;
>>         }
>>     };
>> }
>> --
>> The problem is that when the term "6/12 (finding)" is tokenised, two
>> tokens are generated, viz. '6/12' and '(finding)'. Therefore when I
>> search for '6/12' this term is returned, as in a way it is an EXACT
>> token match.
>>
>> However, when the term "R-eye=6/12 (finding)" is tokenised, it again
>> results in two tokens, viz. 'R-eye=6/12' and '(finding)'. So now if I
>> look for '6/12' it's no longer an exact match, since there is no
>> token with this EXACT value. A simple searcher.search(query) isn't
>> useful to pull out the partial token match.
>>
>> I think it won't be useful to create separate tokens for "6", "/",
>> "12" or "R", "-", "eye", "=", and so on. I'm having a look at the
>> RegexTermEnum and WildcardTermEnum as they might possibly help.
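>>
>> For instance, something like this (a sketch I haven't tried yet --
>> QueryParser rejects a leading wildcard, but building the query
>> programmatically should work, though it may be slow on a large
>> index):
>>
>> // org.apache.lucene.search.WildcardQuery, org.apache.lucene.index.Term
>> Query query = new WildcardQuery(new Term("TERM", "*6/12*"));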
>>
>> Would appreciate your comments on the BaseAnalyzer tokenizer and the
>> query explanation I've received so far.
>>
>> Thanks
>> Rahil
>>
>> Erick Erickson wrote:
>>
>> > Most often, from what I've seen on this e-mail list, unexpected
>> > results are because you're not indexing on the tokens you *think*
>> > you're indexing. Or not searching on them. By that I mean that the
>> > analyzers you're using are behaving in ways you don't expect.
>> >
>> > That said, I think you're getting exactly what you should. I
>> > suspect you're indexing tokens as follows:
>> > doc1: "6/12" and "(finding)"
>> > doc2: "R-eye=6/12" and "(finding)"
>> >
>> > So it makes perfect sense that searching on 6/12 returns doc1 and
>> > searching on R-eye=6/12 returns doc2.
>> >
>> > So, first question: have you actually used something like Luke
>> > (google "luke lucene") to examine your index and see if what you've
>> > put in there is what you expect? What analyzer is your custom
>> > analyzer built upon, and is it doing anything you're unaware of
>> > (for instance, lower-casing the 'R' in your second example)?
>> >
>> > Here's what I'd do:
>> > 1> get Luke and see what's actually in your index.
>> > 2> use searcher.explain() to see the query you're actually emitting.
>> > 3> if you make no headway, post the smallest code snippets you can
>> > that illustrate the problem. Folks would need the indexing AND
>> > searching code.
>> >
>> > As far as queries like "contains" in Java.... Well, sure. Write a
>> > filter that filters on regular expressions or wildcards (you'll
>> > need WildcardTermEnum and RegexTermEnum). Or index things
>> > differently: e.g. index "6/12" and "finding" on doc1, and "r",
>> > "eye", "6/12" and "finding" on doc2; now your searches for "6/12"
>> > will work. Or index "6", "/", "12" and "finding" on doc1, index
>> > similarly for doc2, and use a SpanNearQuery with an appropriate
>> > span value. Or....
>> >
>> > This is all gobbledygook if you haven't gotten a copy of "Lucene in
>> > Action", which you should read in order to get the most out of
>> > Lucene. It's for the 1.4 code base, but the 2.0 Lucene code base
>> > isn't that much different. More importantly, it ties lots of stuff
>> > together. Also, the junit tests that come along with the Lucene
>> > code can be invaluable to show you how to do something.
>> >
>> > Hope this helps
>> > Erick
>> >
>> > On 10/1/06, Rahil <[EMAIL PROTECTED]> wrote:
>> >
>> >>
>> >> Hi
>> >>
>> >> I have a custom-built Analyzer where I tokenise all the
>> >> non-whitespace characters available in the field "TERM" (which is
>> >> the only field being tokenised).
>> >>
>> >> If I now query my index file for a term "6/12", for instance, I
>> >> get back only ONE result:
>> >>
>> >> SCORE DESCRIPTIONSTATUS CONCEPTID TERM
>> >> 1.0 0 260278007 6/12 (finding)
>> >>
>> >> instead of TWO. There is another token in the index file of the
>> >> form:
>> >>
>> >> 2561280012 0 163939000 R-eye=6/12 (finding) 0 3 en
>> >>
>> >> At first it wasn't quite obvious to me why this was happening. But
>> >> after playing around a bit I realised that if I pass the query
>> >> "R-eye=6/12" instead, I get this second result (but not the first
>> >> one now). Is there a way to tweak the Query query =
>> >> parser.parse(searchString) call so that I can get both records if
>> >> I query for "6/12"? Something like a 'contains' query in Java.
>> >>
>> >> Will appreciate all help. Thanks a lot
>> >>
>> >> Regards
>> >> Rahil
>> >>
>> >>
>> >
>>
>>
>>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]