Hi Erick,

Before indexing I printed the doc, and I have included the output as well. It prints fine. Please check my post again, at the following line:

"System.out.println(doc); //Following code is for making index"

and the corresponding output is...

Document<stored/uncompressed,indexed,tokenized<body:000099062000061300000021000000100110468147201102006PAYOUT: RA0083
000099062000062000000021000000100220468148001102006PAYOUT : RA0083
000099062000063000000021000000100330468153601102006PAYOUT : RA0083
000099062000064700000021000000100440468155401102006PAYOUT : RA0083
0000099062000065700000021000000100550468156201102006PAYOUT : RA0083

which is as expected...but my problem is...index file is not getting generated.

Please help


Erick Erickson wrote:
>
> Offhand I'd assume that your problem is using PDFbox. Have you
> tried printing out the docText string you get back from
>
> docText = stripper.getText(new PDDocument(cosDoc))?
>
> I'd recommend you assure yourself that you get valid text back from
> the PDF document before worrying about indexing it.
>
> Best
> Erick
>
> On 7/17/07, neetika <[EMAIL PROTECTED]> wrote:
>>
>>
>> http://www.nabble.com/file/p11647342/DRra0026.pdf DRra0026.pdf
>>
>> hi all,
>>
>> i am able to convert a pdf in to a text file using pdfbox.
>> and this is the code that I used, but I am not able to index it
>>
>> // code for parsing and making index
>>
>> public Document getDocument(InputStream is)
>> {
>>     COSDocument cosDoc = null;
>>     try {
>>         PDFParser parser = new PDFParser(is);
>>         parser.parse();
>>         cosDoc = parser.getDocument();
>>     }
>>     catch (IOException e) {
>>         e.printStackTrace();
>>     }
>>     String docText = null;
>>     try {
>>         PDFTextStripper stripper = new PDFTextStripper();
>>         docText = stripper.getText(new PDDocument(cosDoc));
>>     }
>>     catch (IOException e) {
>>         e.printStackTrace();
>>     }
>>     Document doc = new Document();
>>     if (docText != null) {
>>         doc.add(new Field("body", docText, Field.Store.YES,
>>                 Field.Index.TOKENIZED));
>>     }
>>     return doc;
>> }
>>
>> public static void main(String[] args) throws Exception {
>>     TestPDFParser handler = new TestPDFParser();
>>
>>     Document doc = handler.getDocument(new FileInputStream(
>>             new File("D:\\lucenePdf\\DRra0026.pdf")));
>>
>>     System.out.println(doc);
>>
>>     //Following code is for making index
>>
>>     IndexWriter f_writer = new IndexWriter("D:\\lucenePdf",
>>             new StandardAnalyzer(), true);
>>
>>     f_writer.addDocument(doc);
>>
>> }
>> }
>> //code for searching a particular string..
>>
>> public static void main(String[] args) throws Exception {
>>     String indexDir = "D:\\lucenePdf";
>>     String q = "RA0083";
>>
>>
>>     Directory fsDir = FSDirectory.getDirectory(indexDir);
>>     IndexSearcher is = new IndexSearcher(fsDir);
>>
>>     Query query = new QueryParser("body",
>>             new StandardAnalyzer()).parse(q);
>>
>>     Hits hits = is.search(query);
>>     System.out.println("Found " + hits.length()
>>             + " documents that matched query '" + q + "':");
>>     for (int i = 0; i < hits.length(); i++) {
>>         Document doc = hits.doc(i);
>>
>>     }
>> }
>>
>>
>> When I run the above code...I get following output as a result of running
>> indexer class
>>
>>
>> Document<stored/uncompressed,indexed,tokenized<body:000099062000061300000021000000100110468147201102006PAYOUT : RA0083
>> 000099062000062000000021000000100220468148001102006PAYOUT : RA0083
>> 000099062000063000000021000000100330468153601102006PAYOUT : RA0083
>> 000099062000064700000021000000100440468155401102006PAYOUT : RA0083
>> 000099062000065700000021000000100550468156201102006PAYOUT : RA0083
>>
>> and following files are generated in the specified path..
>>
>> segments.gen
>> write.lock
>> segments_4
>>
>>
>> but when I run the search class it gives the result as:
>>
>> Found 0 documents that matched query 'RA0083':
>>
>> I am also attaching the corresponding pdf file for reference.
>> It seems as the index is not getting created..
>>
>> Please help me with some of your inputs, it will be very helpful for me.
>> --
>> View this message in context:
>> http://www.nabble.com/getting-problem-while-indexing-pdf-files-with-pdfbox-tf4096205.html#a11647342
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>
>

--
View this message in context: http://www.nabble.com/getting-problem-while-indexing-pdf-files-with-pdfbox-tf4096205.html#a11653883
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
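
One way to check whether a document ever reached the index is to look at the files in the index directory. The listing quoted in the original post shows only segments.gen, write.lock and segments_4. As far as I know, Lucene of this era names its per-segment data files with a leading underscore (for example _0.cfs), so a directory with none of those would suggest that the added document was never flushed to disk. A common cause, though nothing in this thread confirms it, is that IndexWriter.close() is never called after addDocument(). The sketch below uses only the JDK. The class name IndexDirCheck, the method segmentDataFiles and the leading-underscore rule are assumptions made for illustration and are not part of the Lucene API.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IndexDirCheck {

    /**
     * Returns the sorted names of files in dir that look like Lucene
     * segment data files. Assumption: such files start with "_"
     * (e.g. "_0.cfs"), while "segments.gen", "segments_N" and
     * "write.lock" are bookkeeping files and are skipped.
     */
    public static List<String> segmentDataFiles(File dir) {
        List<String> result = new ArrayList<String>();
        String[] names = dir.list();
        if (names == null) {
            // dir does not exist, is not a directory, or could not be read
            return result;
        }
        Arrays.sort(names);
        for (String name : names) {
            if (name.startsWith("_")) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Default is the index directory from the post; pass another path as args[0].
        File dir = new File(args.length > 0 ? args[0] : "D:\\lucenePdf");
        List<String> data = segmentDataFiles(dir);
        if (data.isEmpty()) {
            System.out.println("No segment data files found in " + dir);
        } else {
            System.out.println("Segment data files in " + dir + ": " + data);
        }
    }
}
```

Run it against the index directory after the indexing class has finished. If it reports no segment data files, adding f_writer.close() after f_writer.addDocument(doc) in the indexing main is worth trying before looking further at PDFBox.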