Re: Search Expansion - one step closer ... !

Erik Hatcher Sun, 04 Apr 2004 10:42:52 -0700

On Apr 4, 2004, at 12:28 PM, [EMAIL PROTECTED] wrote:

Hi Eric, all,

with a 'k' :))

Several of my terms are in fact keyphrases with 2 or
more words separated by whitespaces, e.g. 'host
defense'.

You've not told us how you are indexing. What field type are you using? From your description it seems you want to analyze text as it may have special characters.

These are the types of decisions that really matter when using Lucene. My first hunch is that you need a domain-aware analyzer that, when it sees "host defense", "Host-Defense", or "Host_DEFENSE", tokenizes each of them as "host defense".
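
To make the idea concrete, here is a minimal sketch of the normalization such an analyzer would have to perform. It is plain Java, not Lucene code: the class KeyphraseNormalizer and its normalize method are hypothetical names made up for illustration, and the rule (lowercase, then treat any run of non-letter, non-digit characters as one space) is only an assumption about what your domain needs.

```java
import java.util.Locale;

/** Hypothetical stand-in for a domain-aware analyzer; not part of Lucene. */
final class KeyphraseNormalizer {

    /**
     * Lowercases the text and collapses every run of characters that are
     * neither letters nor digits (spaces, hyphens, underscores, ...) into
     * a single space, so spelling variants map to one canonical form.
     */
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT)
                   .replaceAll("[^\\p{L}\\p{N}]+", " ")
                   .trim();
    }
}
```

With this rule, "host defense", "Host-Defense" and "Host_DEFENSE" all normalize to "host defense". The same rule would have to be applied at index time and at query time for the terms to match.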

Or perhaps you need an analyzer that does a floating window of two words and bi-grams them into single tokens?
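
A rough sketch of that floating window, again in plain Java rather than as a real Lucene TokenFilter. BigramWindow and bigrams are hypothetical names, and the input is assumed to be a list of words that has already been tokenized and lowercased.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of a two-word sliding window; not Lucene API. */
final class BigramWindow {

    /**
     * Returns one token per pair of adjacent words, joined by a space.
     * A list with fewer than two words yields no bigrams.
     */
    static List<String> bigrams(List<String> words) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            out.add(words.get(i) + " " + words.get(i + 1));
        }
        return out;
    }
}
```

For the words "innate", "host", "defense" this yields the tokens "innate host" and "host defense", so a TermQuery for the single token "host defense" would then have something to match.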

I don't really have any quick and easy answers for you. From what I gather, you're asking for domain-specific common sense in the analysis process; Lucene itself makes this possible but does not give it to you for free.

You could, perhaps, take an easier way out and run text through an Analyzer as you build up your query, without using QueryParser. Look, again, at my AnalysisDemo code in the java.net article.... just pull what you need from there to process a TokenStream out of an Analyzer.
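
The shape of that approach, sketched without any Lucene classes so it stands alone: QueryTermBuilder, analyze and clauses are hypothetical names, and analyze is only a crude stand-in for pulling tokens out of an Analyzer's TokenStream (a real analyzer such as StandardAnalyzer may tokenize differently). The point is the flow: analyze each keyphrase first, then decide what kind of clause to build from the tokens that come back.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of analyzing keyphrases before building a query. */
final class QueryTermBuilder {

    /** Stand-in for reading tokens from an Analyzer's TokenStream. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /**
     * Returns one token list per keyphrase, skipping keyphrases that
     * analyze to nothing. In real code a one-token list would become a
     * TermQuery and a longer list a PhraseQuery added to the BooleanQuery.
     */
    static List<List<String>> clauses(String[] keyphrases) {
        List<List<String>> out = new ArrayList<>();
        for (String phrase : keyphrases) {
            List<String> tokens = analyze(phrase);
            if (!tokens.isEmpty()) {
                out.add(tokens);
            }
        }
        return out;
    }
}
```

Given the keyphrases "Host-Defense" and "immunity", clauses returns [[host, defense], [immunity]]: the first would be a phrase clause, the second a single-term clause.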

Erik

They are obviously not handled properly during the construction of the boolean query, because 'host defense' is not found though it is in the field. Replacing the whitespace between the words by an underscore ('host_defense', which is recognised by the query parser and yields similar results to double quoting, e.g. "host defense") did not retrieve it either ...

I had to convert to lowercase before sending to his function because, unlike in the QueryParser call, no analyzer is used at the moment. Indexing was done with StandardAnalyzer, so I would prefer using an analyser at search time as well. The terms are well formed because they are taken from a domain ontology, but there could be inconsistencies in spelling between what is in the ontology and what is in the field, e.g. 'host-defense', which would need equivalent handling to 'host defense'. I guess this will be dealt with by the analyser, but where do I put it within the current code (see below) with boolean query generation?

Any hints ?
Anyway - thanks a lot so far !

Holger

Code follows:

    public String[] doSearchBQ(String index_path, String[] myquery) {
        // Does query processing without QueryParser, by constructing a boolean query.
        try {
            Searcher searcher = new IndexSearcher(index_path);
            Analyzer analyzer = new StandardAnalyzer(); // created but not yet used

            BooleanQuery query = new BooleanQuery();

            // For each term to add:
            for (int j = 0; j < myquery.length; j++) {
                // required=false, prohibited=false: an optional (OR) clause
                query.add(new TermQuery(new Term("subject", myquery[j])), false, false);
            }

            Hits hits = searcher.search(query);

            lucene_out = new String[hits.length()];
            for (int i = 0; i < hits.length(); i++) {
                Document doc = hits.doc(i);
                String name = doc.get("filename");
                lucene_out[i] = name + "|" + doc.get("subject") + "|" + doc.get("message");
            }
            searcher.close();

        } catch (Exception e) {
            System.out.println(" caught a " + e.getClass() +
                               "\n with message: " + e.getMessage());
        }
        return lucene_out;
    }


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

