Hi Aswin,

You can try pdfbox to convert the pdf documents to text and then use
Lucene to index the text.  The code for turning a pdf to text is very
simple:

private static string parseUsingPDFBox(string filename)
        {
            // document reader
            PDDocument doc = PDDocument.load(filename);
            // create stripper (wish I had the power to do that -
wouldn't leave the house)
            PDFTextStripper stripper = new PDFTextStripper();
            // get text from doc using stripper
            return stripper.getText(doc);
        }

Sachin

-----Original Message-----
From: ashwin kumar [mailto:[EMAIL PROTECTED] 
Sent: 08 March 2007 09:37
To: java-user@lucene.apache.org
Subject: indexing pdfs

hi can some one help me by giving any sample programs for indexing pdfs
and .doc files

thanks
regards
ashwin


This message has been scanned for viruses by MailControl - (see
http://bluepages.wsatkins.co.uk/?6875772)


This email and any attached files are confidential and copyright protected. If 
you are not the addressee, any dissemination of this communication is strictly 
prohibited. Unless otherwise expressly agreed in writing, nothing stated in 
this communication shall be legally binding.

The ultimate parent company of the Atkins Group is WS Atkins plc.  Registered 
in England No. 1885586.  Registered Office Woodcote Grove, Ashley Road, Epsom, 
Surrey KT18 5BW.

Consider the environment. Please don't print this e-mail unless you really need 
to. 

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to