You can use PDFBOX. 

http://kalanir.blogspot.com/2008/08/indexing-pdf-documents-with-lucene.h
tml 


Sincerely,
Sithu D Sudarsan

sithu.sudar...@fda.hhs.gov
sdsudar...@ualr.edu

-----Original Message-----
From: maxmil [mailto:m...@alwayssunny.com] 
Sent: Friday, December 12, 2008 3:34 AM
To: java-user@lucene.apache.org
Subject: Beginner: Best way to index and display orginal text of pdfs in
search results


Hi,

This is the first time i am using Lucene.

I need to index pdf's with very few fields, title, date and body (long
field) for a web based search.

The results i need to display have to show not only the documents found
but
for each document a snapshot of the text where the search term has been
found. This is analogous to the way google displays search results. That
is
to say

 ... some words and first instance of search Term and some more words
...
some more words second instance of search term and some more words... 

etc.

To do this i would need the original text of the document for each hit.
As
far as i understand Lucene does not save the original text of the
document
in the index.

I am not using a database and would prefer not to have to store the
original
document text elsewhere.

One way i could do this would be to take the hits from Lucene and reopen
each pdf to extract the original text at run time however i fear that
with
many results this would be very slow. 

What would you recommend me to do?

Thanks

max
-- 
View this message in context:
http://www.nabble.com/Beginner%3A-Best-way-to-index-and-display-orginal-
text-of-pdfs-in-search-results-tp20971377p20971377.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to