Re: getting problem while indexing pdf files with pdfbox

Erick Erickson Tue, 17 Jul 2007 06:52:37 -0700

Offhand I'd assume that your problem is using PDFbox. Have you
tried printing out the docText string you get  back from


docText = stripper.getText(new PDDocument(cosDoc))?

I'd recommend you assure yourself that you get valid text back from
the PDF document before worrying about indexing it.

Best
Erick

On 7/17/07, neetika <[EMAIL PROTECTED]> wrote:



http://www.nabble.com/file/p11647342/DRra0026.pdf DRra0026.pdf

hi all,

i am able to convert a pdf in to a text file using pdfbox.
and this is the code that I used, but I am not able to index it

// code for parsing and making index

                public Document getDocument(InputStream is)
                {
                COSDocument cosDoc = null;
                try {
                        PDFParser parser = new PDFParser(is);
                        parser.parse();
                cosDoc = parser.getDocument();
                }
                catch (IOException e) {
                e.printStackTrace();
                                }
                String docText = null;
                try {
                PDFTextStripper stripper = new PDFTextStripper();
                docText = stripper.getText(new PDDocument(cosDoc));
                }
                catch (IOException e) {
                        e.printStackTrace();
                }
                Document doc = new Document();
                if (docText != null) {
                doc.add(new Field("body", docText, Field.Store.YES,
                        Field.Index.TOKENIZED));
                }
                return doc;
                }

                public static void main(String[] args) throws Exception {
                        TestPDFParser handler = new TestPDFParser();

                Document doc = handler.getDocument(new FileInputStream(new
File("D:\\lucenePdf\\DRra0026.pdf")));

                System.out.println(doc);

                //Following code is for making index

                IndexWriter f_writer = new IndexWriter("D:\\lucenePdf",
new
StandardAnalyzer(), true);

                f_writer.addDocument(doc);

                }
                }
//code for searching a particular string..

public static void main(String[] args) throws Exception {
        String indexDir = "D:\\lucenePdf";
        String q = "RA0083";


        Directory fsDir = FSDirectory.getDirectory(indexDir);
        IndexSearcher is = new IndexSearcher(fsDir);

        Query query = new QueryParser("body", new
StandardAnalyzer()).parse(q);

        Hits hits = is.search(query);
        System.out.println("Found " + hits.length() + " documents that
matched query '" + q + "':");
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);

        }
    }


When I run the above code...I get folowing output as a result of running
indexer class


Document<stored/uncompressed,indexed,tokenized<body:000099062000061300000021000000100110468147201102006PAYOUT
: RA0083
000099062000062000000021000000100220468148001102006PAYOUT : RA0083
000099062000063000000021000000100330468153601102006PAYOUT : RA0083
000099062000064700000021000000100440468155401102006PAYOUT : RA0083
000099062000065700000021000000100550468156201102006PAYOUT : RA0083

and following  files are generated in the specified path..

segments.gen
write.lock
segments_4


but when I run the search class it gives the result as:

Found 0 documents that matched query 'RA0083':

I am also attaching the corresponding pdf file for reference.
It seems as the index is not getting created..

Please help me with some of your inputs,it will be very helpfull for me.
--
View this message in context:
http://www.nabble.com/getting-problem-while-indexing-pdf-files-with-pdfbox-tf4096205.html#a11647342
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: getting problem while indexing pdf files with pdfbox

Reply via email to