You could store the text contents compressed. I think extracting text from
PDF files is much more time-intensive than decompressing a stored field, and
text-only content often compresses very well. In my opinion, if the
(uncompressed) contents of the docs are not very large (meaning several
megabytes each), I would prefer storing them in the index.
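
To give a rough feel for the trade-off, here is a minimal sketch in plain
Java (java.util.zip only, no Lucene classes) of compressing extracted text
before storing it and decompressing it again at search time. The class and
method names are made up for illustration, and how the resulting bytes are
attached to a stored field depends on your Lucene version, so that part is
left out on purpose.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of Lucene: round-trips document text
// through DEFLATE so it can be kept as a compact binary stored value.
public class StoredTextCompression {

    /** Compresses the extracted text (UTF-8 encoded) with DEFLATE. */
    static byte[] compress(String text) {
        byte[] input = text.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Restores the original text, e.g. right before highlighting a hit. */
    static String decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or invalid compressed data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("invalid compressed data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a PDF; real content will
        // compress less well than this repetitive sample.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 500; i++) {
            sb.append("Lucene highlighting needs the original text of the document. ");
        }
        String text = sb.toString();

        byte[] packed = compress(text);
        String restored = decompress(packed);

        System.out.println("original bytes:   " + text.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("compressed bytes: " + packed.length);
        System.out.println("round trip ok:    " + text.equals(restored));
    }
}
```

Running it on a sample of your own extracted text should tell you fairly
quickly how much the index would grow, and the decompress step is the only
extra work needed per hit before passing the text to the highlighter.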

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Erik Hatcher [mailto:e...@ehatchersolutions.com]
> Sent: Saturday, March 07, 2009 12:46 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene Highlighting and Dynamic Summaries
> 
> It depends :)
> 
> It's a trade-off.  If storing is not prohibitive, I recommend that as
> it makes life easier for highlighting.
> 
>       Erik
> 
> On Mar 7, 2009, at 6:37 AM, Amin Mohammed-Coleman wrote:
> 
> > Hi,
> > that's what I was thinking about.  I would need to get the file,
> > extract the text again and then pass it through the highlighter.  The
> > other option is storing the content in the index, the downside being
> > that the index is going to be large.  Which would be the recommended
> > approach?
> >
> > Cheers
> >
> > Amin
> >
> > On Sat, Mar 7, 2009 at 10:50 AM, Erik Hatcher <e...@ehatchersolutions.com> wrote:
> >
> >> With the caveat that if you're not storing the text you want
> >> highlighted, you'll have to retrieve it somehow and send it into the
> >> Highlighter yourself.
> >>
> >>       Erik
> >>
> >>
> >> On Mar 7, 2009, at 5:40 AM, Michael McCandless wrote:
> >>
> >>
> >>> You should look at contrib/highlighter, which does exactly this.
> >>>
> >>> Mike
> >>>
> >>> Amin Mohammed-Coleman wrote:
> >>>
> >>>> Hi
> >>>> I am currently indexing documents (PDF, MS Word, etc.) that are
> >>>> uploaded. These documents can be searched, and what the search
> >>>> returns to the user are summaries of the documents.  Currently the
> >>>> summaries are extracted when indexing the file (the summary is
> >>>> constructed by taking the first 10 lines of the document and is
> >>>> stored in the index as a field).  This is not ideal (static
> >>>> summary), and I was wondering if it would be possible to create a
> >>>> dynamic summary when a hit is found and highlight the terms found.
> >>>> The content of the document is not stored in the index.
> >>>>
> >>>> So basically what I'm looking to do is:
> >>>>
> >>>> 1) PDF indexed
> >>>> 2) PDF body contains the word "search"
> >>>> 3) Do a search and return the hit
> >>>> 4) Construct a summary with the term "search" included.
> >>>>
> >>>> I'm not sure how to go about doing this (I presume it is
> >>>> possible).  I would be grateful for any advice.
> >>>>
> >>>>
> >>>> Cheers
> >>>> Amin
> >>>>
> >>>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>>
> >>
> >>
> >>
> >>
> 
> 


