Is the only way to index PDFs to convert them to text first and then index
that text?



On 3/8/07, Kainth, Sachin <[EMAIL PROTECTED]> wrote:

Hi Aswin,

You can try PDFBox to convert the PDF documents to text and then use
Lucene to index the text.  The code for turning a PDF into text is very
simple:

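// Package names as of PDFBox 0.7.x; later Apache releases use org.apache.pdfbox.*
import java.io.IOException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

private static String parseUsingPDFBox(String filename) throws IOException
        {
            // load the PDF document
            PDDocument doc = PDDocument.load(filename);
            try {
                // create stripper (wish I had the power to do that - wouldn't leave the house)
                PDFTextStripper stripper = new PDFTextStripper();
                // get text from doc using stripper
                return stripper.getText(doc);
            } finally {
                // release the resources held by the document
                doc.close();
            }
        }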
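
Once you have the text as a plain string, the indexing step no longer cares
that it came from a PDF.  As a rough illustration of that second step, here is
a toy inverted index in plain Java.  This is only a sketch and not Lucene's
API: the class TinyTextIndex and its add/search methods are made up for this
example, and the tokenizing (lower-case, split on non-alphanumerics) is a
simplifying assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each lower-cased word to the documents containing it.
// Hypothetical illustration only; this is not part of Lucene or PDFBox.
class TinyTextIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Index already-extracted text (e.g. the string returned by parseUsingPDFBox).
    void add(String docName, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split can yield a leading empty token
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docName);
        }
    }

    // Return the names of all documents that contain the given word.
    Set<String> search(String word) {
        Set<String> hits = postings.get(word.toLowerCase(Locale.ROOT));
        return hits == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(hits);
    }
}
```

With Lucene itself, the same extracted string would go into a field of a
Document that you add through an IndexWriter, so the extraction code above
stays the same either way.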

Sachin

-----Original Message-----
From: ashwin kumar [mailto:[EMAIL PROTECTED]
Sent: 08 March 2007 09:37
To: java-user@lucene.apache.org
Subject: indexing pdfs

Hi, can someone help me by giving some sample programs for indexing PDFs
and .doc files?

thanks
regards
ashwin


