You could store the text contents compressed; I think extracting text from PDF files is much more time-intensive than decompressing a stored field, and text-only content often compresses very well. In my opinion, if the (uncompressed) contents of the docs are not very large (I am thinking of several megabytes each), I would prefer storing them in the index.
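To give a rough feel for the compression side of that trade-off, here is a minimal sketch that uses only the JDK's java.util.zip classes. It is not Lucene's own API: the class name StoredTextCompression and its two methods are made up for illustration, and attaching the resulting bytes to a document as a binary stored field is left out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of Lucene: compresses extracted text before
// it is stored and decompresses it again when a hit has to be highlighted.
final class StoredTextCompression {

    // Encode the text as UTF-8 and compress it with DEFLATE (zlib format).
    static byte[] compress(String text) {
        byte[] input = text.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverse of compress(): inflate the bytes and decode them as UTF-8.
    static String decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and not finished: the data is truncated or needs a dictionary.
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported compressed data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("invalid compressed data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Repetitive sample text standing in for the extracted body of a PDF.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("Lucene highlighting needs the original text of the hit. ");
        }
        String text = sb.toString();
        byte[] packed = compress(text);
        System.out.println("original bytes:   " + text.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("compressed bytes: " + packed.length);
        System.out.println("round trip ok:    " + text.equals(decompress(packed)));
    }
}
```

The sample text is deliberately repetitive, so the ratio it prints is flattering; real extracted text will compress less, and it is worth measuring on your own documents before deciding.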
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Erik Hatcher [mailto:e...@ehatchersolutions.com]
> Sent: Saturday, March 07, 2009 12:46 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene Highlighting and Dynamic Summaries
>
> It depends :)
>
> It's a trade-off. If storing is not prohibitive, I recommend that as
> it makes life easier for highlighting.
>
> Erik
>
> On Mar 7, 2009, at 6:37 AM, Amin Mohammed-Coleman wrote:
>
> > hi
> > that's what i was thinking about. i would need to get the file and
> > extract the text again and then pass through the highlighter. The
> > other option is storing the content in the index the downside being
> > index is going to be large. Which would be the recommended approach?
> >
> > Cheers
> >
> > Amin
> >
> > On Sat, Mar 7, 2009 at 10:50 AM, Erik Hatcher
> > <e...@ehatchersolutions.com> wrote:
> >
> >> With the caveat that if you're not storing the text you want
> >> highlighted, you'll have to retrieve it somehow and send it into
> >> the Highlighter yourself.
> >>
> >> Erik
> >>
> >> On Mar 7, 2009, at 5:40 AM, Michael McCandless wrote:
> >>
> >>> You should look at contrib/highlighter, which does exactly this.
> >>>
> >>> Mike
> >>>
> >>> Amin Mohammed-Coleman wrote:
> >>>
> >>>> Hi
> >>>> I am currently indexing documents (pdf, ms word, etc) that are
> >>>> uploaded, these documents can be searched and what the search
> >>>> returns to the user are summaries of the documents. Currently the
> >>>> summaries are extracted when indexing the file (summary
> >>>> constructed by taking the first 10 lines of the document and
> >>>> stored in the index as field). This is not ideal (static
> >>>> summary), and I was wondering if it would be possible to create a
> >>>> dynamic summary when a hit is found and highlight the terms
> >>>> found. The content of the document is not stored in the index.
> >>>>
> >>>> So basically what I'm looking to do is:
> >>>>
> >>>> 1) PDF indexed
> >>>> 2) PDF body contains the word "search"
> >>>> 3) Do a search and return the hit
> >>>> 4) Construct a summary with the term "search" included.
> >>>>
> >>>> I'm not sure how to go about doing this (I presume it is
> >>>> possible). I would be grateful for any advice.
> >>>>
> >>>> Cheers
> >>>> Amin

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org