RE: Highlighter with Field.Store.NO

Moray McConnachie Tue, 10 Mar 2009 04:09:23 -0700

Assuming only a small percentage of documents is hit by searches in any given 
period, as I would expect in a mail system, it is more efficient for such a 
large database to keep the Lucene index small by not storing the complete 
contents. So you need Field.Store.NO, as you have already established.


Highlighter has no magic way to retrieve the content, so when you use 
Highlighter you will need to pass it the full content for each search result, 
as described by Ben below. I think perhaps when Ben says cache he just means 
load the content from your main content store.

So in your code, instead of 

> string s = resdoc.Get("CONTENT");

you need 

  string s = scontent;

Obviously a trivial example!

In the real world it might be

  string s=GetEmailBody(EmailID);

Or whatever.
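
Put together, the result loop from your original code might look roughly like 
the sketch below. It assumes the ID field is still stored (Field.Store.YES) as 
in your code and that the Lucene.Net.Highlight namespace matches your 
Highlighter.Net build. ContentLoader and getEmailBody are hypothetical 
stand-ins for whatever reads the text out of your own mail store:

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Highlight;   // assumption: namespace of the 2.x contrib Highlighter.Net
using Lucene.Net.Search;

public static class StoreNoHighlighting
{
    // Hypothetical delegate: maps a stored ID to the full text held in your own store.
    public delegate string ContentLoader(string id);

    public static List<string> HighlightHits(Hits hits, Analyzer analyzer,
                                             Highlighter highlighter,
                                             ContentLoader getEmailBody)
    {
        List<string> fragments = new List<string>();
        for (int i = 0; i < hits.Length(); i++)
        {
            Document resdoc = hits.Doc(i);

            // CONTENT is Field.Store.NO, so only the stored ID comes back from the index.
            string id = resdoc.Get("ID");

            // Fetch the full text from the main content store instead of the index.
            string s = getEmailBody(id);
            if (string.IsNullOrEmpty(s))
                continue;   // nothing to highlight for this hit

            // Re-analyze the text, as the original loop does, and let Highlighter pick a fragment.
            TokenStream ts = analyzer.TokenStream("CONTENT", new StringReader(s));
            string hl = highlighter.GetBestFragment(ts, s);
            if (hl != null)
                fragments.Add(hl);   // skip hits for which no fragment came back
        }
        return fragments;
    }
}
```

If you only call this for the hits you actually display on a page, the number 
of reads against the content store stays small.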

The use of a cache in another sense - i.e. a place to temporarily store data as 
it is retrieved from your main store in case the same content is needed again 
soon - might be advisable depending on how expensive it is to retrieve the full 
contents from your store, and how good a job of caching your content retrieval 
system does. Most databases and filesystems will do a good job of caching, so 
if the content is stored in a simple way and there is not significant latency 
between search and content store, you do not need a local cache.
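
If you do find that retrieval is the bottleneck, a cache in that second sense 
could be as small as the sketch below: a bounded in-memory map that evicts the 
least recently used entry. The class name ContentCache, the ContentLoader 
delegate and the capacity are my own illustrative choices, not anything from 
Lucene.Net, and the class is not thread-safe as written:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: a small bounded cache placed in front of the content store.
public class ContentCache
{
    // Hypothetical delegate: loads the full text for an ID from the main store.
    public delegate string ContentLoader(string id);

    private readonly int capacity;
    private readonly ContentLoader loader;
    private readonly Dictionary<string, LinkedListNode<KeyValuePair<string, string>>> map =
        new Dictionary<string, LinkedListNode<KeyValuePair<string, string>>>();
    // Most recently used entries sit at the front of the list.
    private readonly LinkedList<KeyValuePair<string, string>> order =
        new LinkedList<KeyValuePair<string, string>>();

    public ContentCache(int capacity, ContentLoader loader)
    {
        if (capacity < 1) throw new ArgumentOutOfRangeException("capacity");
        if (loader == null) throw new ArgumentNullException("loader");
        this.capacity = capacity;
        this.loader = loader;
    }

    public string Get(string id)
    {
        LinkedListNode<KeyValuePair<string, string>> node;
        if (map.TryGetValue(id, out node))
        {
            // Cache hit: move the entry to the front (most recently used).
            order.Remove(node);
            order.AddFirst(node);
            return node.Value.Value;
        }

        // Cache miss: go to the main store.
        string content = loader(id);

        if (map.Count >= capacity)
        {
            // Evict the least recently used entry (the tail of the list).
            LinkedListNode<KeyValuePair<string, string>> last = order.Last;
            map.Remove(last.Value.Key);
            order.RemoveLast();
        }

        node = order.AddFirst(new KeyValuePair<string, string>(id, content));
        map[id] = node;
        return content;
    }
}
```

You would then write something like cache.Get(EmailID) in place of 
GetEmailBody(EmailID), with GetEmailBody passed in as the loader.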

Yours,
Moray

------------------------------------- 
Moray McConnachie
Head of IS        +44 1865 261 600
Oxford Analytica  http://www.oxan.com

-----Original Message-----
From: Pál Barnabás [mailto:[email protected]] 
Sent: 10 March 2009 10:27
To: [email protected]
Subject: Re: Highlighter with Field.Store.NO

Thanks for the quick answer.
This solution is not possible for me. I want to index millions of e-mails with 
attachments (doc, pdf, etc.). The mails and the files are already stored, so 
saving the text content in a separate cache is not acceptable.
I tried to save the text with the Field.Store.COMPRESS option, but performance 
was very poor (3x indexing time).

2009/3/9 Ben Martz <[email protected]>

> I use the Highlighter class in a shipping product in which I do not 
> store values in the index. Instead I independently load the contents 
> from my own cache and pass that to Highlighter.GetBestFragments(). The 
> only disadvantage is that depending on the size of your contents and 
> the speed of your contents cache this can make Highlighting a very 
> expensive operation so pay very careful attention to how and when you 
> load your contents data.
>
> On Mon, Mar 9, 2009 at 8:14 AM, Pál Barnabás <[email protected]> wrote:
>
> > Hi,
> > I'm trying to highlight the keyword in the search result.
> > This is my code:
> > ------------------------------------------------------------------
> > string indexdir = @"D:\temp\index_testing";
> >            if (System.IO.Directory.Exists(indexdir))
> >                System.IO.Directory.Delete(indexdir, true);
> >
> >            IndexWriter writer = new IndexWriter(indexdir, new 
> > Lucene.Net.Analysis.Standard.StandardAnalyzer(), true);
> >            // demo text
> >            string scontent = "First, we parse the user-entered query string "
> >                + "indicating that we want to match ...";
> >
> >            for (int i = 0; i < 100; i++)
> >            {
> >                Document doc = new Document();
> >
> >                doc.Add(new Field("ID", i.ToString(), 
> > Field.Store.YES, Field.Index.UN_TOKENIZED));
> >                doc.Add(new Field("CONTENT", scontent, 
> > Field.Store.YES, Field.Index.TOKENIZED));
> >
> >                writer.AddDocument(doc);
> >            }
> >
> >            writer.Close();
> >
> >            IndexReader reader = IndexReader.Open(indexdir);
> >            Searcher searcher = new IndexSearcher(reader);
> >            Analyzer analyzer = new
> > Lucene.Net.Analysis.Standard.StandardAnalyzer();
> >
> >            MultiFieldQueryParser parser = new 
> > MultiFieldQueryParser(new string[] { "CONTENT" }, analyzer);
> >
> >            Query query = parser.Parse("indicating");
> >            query = query.Rewrite(reader);
> >            Trace.WriteLine("Searching for: " + query.ToString());
> >
> >            Lucene.Net.Search.Hits hits = searcher.Search(query);
> >
> >            SimpleHTMLFormatter formatter = new 
> > SimpleHTMLFormatter("<b class='term'>", "</b>");
> >
> >            QueryScorer scorer = new QueryScorer(query);
> >
> >            Highlighter highlighter = new Highlighter(formatter, scorer);
> >            highlighter.SetTextFragmenter(new 
> > SimpleFragmenter(2000));
> >
> >            for (int i = 0; i < hits.Length(); i++)
> >            {
> >                Document resdoc = hits.Doc(i);
> >
> >                string s = resdoc.Get("CONTENT");
> >                // s is null if Field.Store is NO
> >                TokenStream tsTitle = analyzer.TokenStream("CONTENT", 
> > new System.IO.StringReader(s));
> >                string hl = highlighter.GetBestFragment(tsTitle, s);
> >            }
> > ------------------------------------------------------------------
> >
> > The problem is that when the content is not stored in the index 
> > (Field.Store.NO), the result document does not contain the value. Is 
> > it possible to use the Highlighter class in this case? Or what is 
> > the best way to highlight the search result? Is it possible to get 
> > all tokens for hits.Doc(i)?
> >
>
>
>
> --
> 13:37 - Someone stole the precinct toilet. The cops have nothing to go on.
> 14:37 - Officers dispatched to a daycare where a three-year-old was 
> resisting a rest.
> 21:11 - Hole found in nudist camp wall. Officers are looking into it.
>
