Interesting problem. As others have pointed out, highlighting is an
expensive operation, so use it sparingly.
A hack approach might be to create a faux highlight at index time: for
example, extract the title and selected sections of the body and
attachment. There are ways to choose that text, such as summarization
techniques, and you then store this pseudo-highlight alongside the
document. Obviously this would *not* be a search-specific highlight.
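To make the index-time idea concrete, here is a minimal sketch in plain C#.
It is only an illustration: PseudoHighlight, Build and maxBodyChars are
names I made up (nothing here is part of Lucene.Net), and it uses the
crudest possible "summarization": the title plus the start of the body, cut
at a word boundary.

```csharp
using System;

// Hypothetical helper (not part of Lucene.Net): builds a fixed,
// query-independent snippet once, at index time.
public static class PseudoHighlight
{
    // Returns the title followed by an excerpt taken from the first
    // maxBodyChars characters of the body, cut at a word boundary.
    public static string Build(string title, string body, int maxBodyChars)
    {
        title = (title ?? string.Empty).Trim();
        body = (body ?? string.Empty).Trim();

        string excerpt = body;
        if (body.Length > maxBodyChars)
        {
            excerpt = body.Substring(0, maxBodyChars);
            // Back up to the last space so a word is not cut in half.
            int lastSpace = excerpt.LastIndexOf(' ');
            if (lastSpace > 0)
                excerpt = excerpt.Substring(0, lastSpace);
            // Mark the excerpt as truncated.
            excerpt = excerpt.TrimEnd() + " ...";
        }

        if (title.Length == 0) return excerpt;
        if (excerpt.Length == 0) return title;
        return title + ": " + excerpt;
    }
}
```

You could then keep the result in its own stored, unindexed field, for
example doc.Add(new Field("SUMMARY", summary, Field.Store.YES,
Field.Index.NO)), where SUMMARY is again a hypothetical field name, and show
that field in the result list instead of calling Highlighter.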
On Tue, Mar 10, 2009 at 7:08 AM, Moray McConnachie
<mmcconna@oxford-analytica.com> wrote:
> Assuming the % of documents hit by search in any particular time period is
> very low, as I would expect in a mail system, then it will be more effective
> for such a large database to keep the Lucene index size down by not storing
> the complete contents - so you need Field.Store.NO, as you already
> established.
>
> Highlighter has no magic way to retrieve the content, so when you use
> Highlighter you will need to pass it the full content for each search
> result, as described by Ben below. I think perhaps when Ben says cache he
> just means load the content from your main content store.
>
> So in your code, instead of
>
> string s = resdoc.Get("CONTENT");
>
> you need
>
> string s = scontent;
>
> Obviously a trivial example!
>
> In the real world it might be
>
> string s = GetEmailBody(EmailID);
>
> Or whatever.
>
> The use of a cache in another sense - i.e. a place to temporarily store
> data as it is retrieved from your main store in case the same content is
> needed again soon - might be advisable depending on how expensive it is to
> retrieve the full contents from your store, and how good a job of caching
> your content retrieval system does. Most databases and filesystems will do a
> good job of caching, so if the content is stored in a simple way and there
> is not significant latency between search and content store, you do not need
> a local cache.
>
> Yours,
> Moray
>
> -------------------------------------
> Moray McConnachie
> Head of IS +44 1865 261 600
> Oxford Analytica http://www.oxan.com
>
> -----Original Message-----
> From: Pál Barnabás [mailto:[email protected]]
> Sent: 10 March 2009 10:27
> To: [email protected]
> Subject: Re: Highlighter with Field.Store.NO
>
> thx for quick answer,
> This solution is not possible for me. I want to index millions of e-mails
> with attachments (doc, pdf, etc). The mails and the files are stored
> already, saving the text content in a separate cache is not acceptable.
> I tried to save the text with the Field.Store.COMPRESS option, but the
> performance was very low (3x indexing time).
>
> 2009/3/9 Ben Martz <[email protected]>
>
> > I use the Highlighter class in a shipping product in which I do not
> > store values in the index. Instead I independently load the contents
> > from my own cache and pass that to Highlighter.GetBestFragments(). The
> > only disadvantage is that depending on the size of your contents and
> > the speed of your contents cache this can make Highlighting a very
> > expensive operation so pay very careful attention to how and when you
> > load your contents data.
> >
> > On Mon, Mar 9, 2009 at 8:14 AM, Pál Barnabás <[email protected]> wrote:
> >
> > > Hi,
> > > I'm trying to highlight the keyword in the search result.
> > > This is my code:
> > > ------------------------------------------------------------------
> > > string indexdir = @"D:\temp\index_testing";
> > > if (System.IO.Directory.Exists(indexdir))
> > >     System.IO.Directory.Delete(indexdir, true);
> > >
> > > IndexWriter writer = new IndexWriter(indexdir,
> > >     new Lucene.Net.Analysis.Standard.StandardAnalyzer(), true);
> > > // demo text
> > > string scontent = "First, we parse the user-entered query string " +
> > >     "indicating that we want to match ...";
> > >
> > > for (int i = 0; i < 100; i++)
> > > {
> > >     Document doc = new Document();
> > >
> > >     doc.Add(new Field("ID", i.ToString(),
> > >         Field.Store.YES, Field.Index.UN_TOKENIZED));
> > >     doc.Add(new Field("CONTENT", scontent,
> > >         Field.Store.YES, Field.Index.TOKENIZED));
> > >
> > >     writer.AddDocument(doc);
> > > }
> > >
> > > writer.Close();
> > >
> > > IndexReader reader = IndexReader.Open(indexdir);
> > > Searcher searcher = new IndexSearcher(reader);
> > > Analyzer analyzer =
> > >     new Lucene.Net.Analysis.Standard.StandardAnalyzer();
> > >
> > > MultiFieldQueryParser parser =
> > >     new MultiFieldQueryParser(new string[] { "CONTENT" }, analyzer);
> > >
> > > Query query = parser.Parse("indicating");
> > > query = query.Rewrite(reader);
> > > Trace.WriteLine("Searching for: " + query.ToString());
> > >
> > > Lucene.Net.Search.Hits hits = searcher.Search(query);
> > >
> > > SimpleHTMLFormatter formatter =
> > >     new SimpleHTMLFormatter("<b class='term'>", "</b>");
> > >
> > > QueryScorer scorer = new QueryScorer(query);
> > >
> > > Highlighter highlighter = new Highlighter(formatter, scorer);
> > > highlighter.SetTextFragmenter(new SimpleFragmenter(2000));
> > >
> > > for (int i = 0; i < hits.Length(); i++)
> > > {
> > >     Document resdoc = hits.Doc(i);
> > >
> > >     string s = resdoc.Get("CONTENT");
> > >     // s is null if Field.Store is NO
> > >     TokenStream tsTitle = analyzer.TokenStream("CONTENT",
> > >         new System.IO.StringReader(s));
> > >     string hl = highlighter.GetBestFragment(tsTitle, s);
> > > }
> > > ------------------------------------------------------------------
> > >
> > > The problem is that when the content is not stored in the index
> > > (Field.Store.NO), the result document does not contain the value. Is
> > > it possible to use the Highlighter class in this case? Or what's
> > > the best way to highlight the search result? Is it possible to get
> > > all tokens for hits.Doc(i)?
> > >
> >
> >
> >
> > --
> > 13:37 - Someone stole the precinct toilet. The cops have nothing to
> > go on.
> > 14:37 - Officers dispatched to a daycare where a three-year-old was
> > resisting a rest.
> > 21:11 - Hole found in nudist camp wall. Officers are looking into it.
> >
>
>