Hi Erick, I am able to get the result fine. The problem was, I forgot to close the writer and so the index file (.cfs) was not getting generated.
Thanks a lot for the timely help. Regards, Neetika Erick Erickson wrote: > > You have NOT supplied an example of the text you extracted > from the document. But let's assume that the interesting > string is exactly what you expect. > > Have you looked at your index with Luke to see if the data is there? > I *strongly* suggest you get a copy of Luke (google lucene luke) to > examine indexes with. > > The existence of the write.lock file suggests that you haven't closed your > index prior to searching it. Although flushing it would probably work. Be > aware that you cannot see changes to an index if the reader you use > has been opened before the indexing operation. Also, there is some period > of time when the indexed data is buffered by the writer, and I'm unsure > (but > doubt) it's available until it's been flushed. > > I suspect that your problem is not related to PDF, but rather to whether > you've properly indexed data and closed your index prior to searching > it..... > > The other possibility is that your analyzer is parsing things > "interestingly". > StandardAnalyzer() does some interesting things when tokenizing, including > lowercasing the input stream. Although that shouldn't have been a problem > since you use the same analyzer for indexing and searching. > > Also, try query.toString to see what is actually searched, that often > gives > insights. The aforementioned Luke will allow you to submit queries to the > index, including explaining what the actual query produced is. > What are the file sizes of your index files? > > > Best > Erick > > On 7/17/07, neetika <[EMAIL PROTECTED]> wrote: >> >> >> Hi Erick, >> >> Befoe indexing I have printed the doc, and I have given the output >> also.It >> is printing well. >> Kindly please check my post again following... >> >> " System.out.println(doc); >> //Following code is for making index" >> >> and the corresponding output is... >> >> >> Document<stored/uncompressed,indexed,tokenized<body:000099062000061300000021000000100110468147201102006PAYOUT: >> RA0083 >> 000099062000062000000021000000100220468148001102006PAYOUT : RA0083 >> 000099062000063000000021000000100330468153601102006PAYOUT : RA0083 >> 000099062000064700000021000000100440468155401102006PAYOUT : RA0083 >> 0000099062000065700000021000000100550468156201102006PAYOUT : RA0083 >> which is as expected...but my problem is...index file is not getting >> generated. >> >> Please help >> >> >> >> Erick Erickson wrote: >> > >> > Offhand I'd assume that your problem is using PDFbox. Have you >> > tried printing out the docText string you get back from >> > >> > docText = stripper.getText(new PDDocument(cosDoc))? >> > >> > I'd recommend you assure yourself that you get valid text back from >> > the PDF document before worrying about indexing it. >> > >> > Best >> > Erick >> > >> > On 7/17/07, neetika <[EMAIL PROTECTED]> wrote: >> >> >> >> >> >> http://www.nabble.com/file/p11647342/DRra0026.pdf DRra0026.pdf >> >> >> >> hi all, >> >> >> >> i am able to convert a pdf in to a text file using pdfbox. >> >> and this is the code that I used, but I am not able to index it >> >> >> >> // code for parsing and making index >> >> >> >> public Document getDocument(InputStream is) >> >> { >> >> COSDocument cosDoc = null; >> >> try { >> >> PDFParser parser = new PDFParser(is); >> >> parser.parse(); >> >> cosDoc = parser.getDocument(); >> >> } >> >> catch (IOException e) { >> >> e.printStackTrace(); >> >> } >> >> String docText = null; >> >> try { >> >> PDFTextStripper stripper = new PDFTextStripper(); >> >> docText = stripper.getText(new PDDocument(cosDoc)); >> >> } >> >> catch (IOException e) { >> >> e.printStackTrace(); >> >> } >> >> Document doc = new Document(); >> >> if (docText != null) { >> >> doc.add(new Field("body", docText, Field.Store.YES, >> >> Field.Index.TOKENIZED)); >> >> } >> >> return doc; >> >> } >> >> >> >> public static void main(String[] args) throws >> Exception >> { >> >> TestPDFParser handler = new TestPDFParser(); >> >> >> >> Document doc = handler.getDocument(new >> >> FileInputStream(new >> >> File("D:\\lucenePdf\\DRra0026.pdf"))); >> >> >> >> System.out.println(doc); >> >> >> >> //Following code is for making index >> >> >> >> IndexWriter f_writer = new >> IndexWriter("D:\\lucenePdf", >> >> new >> >> StandardAnalyzer(), true); >> >> >> >> f_writer.addDocument(doc); >> >> >> >> } >> >> } >> >> //code for searching a particular string.. >> >> >> >> public static void main(String[] args) throws Exception { >> >> String indexDir = "D:\\lucenePdf"; >> >> String q = "RA0083"; >> >> >> >> >> >> Directory fsDir = FSDirectory.getDirectory(indexDir); >> >> IndexSearcher is = new IndexSearcher(fsDir); >> >> >> >> Query query = new QueryParser("body", new >> >> StandardAnalyzer()).parse(q); >> >> >> >> Hits hits = is.search(query); >> >> System.out.println("Found " + hits.length() + " documents that >> >> matched query '" + q + "':"); >> >> for (int i = 0; i < hits.length(); i++) { >> >> Document doc = hits.doc(i); >> >> >> >> } >> >> } >> >> >> >> >> >> When I run the above code...I get folowing output as a result of >> running >> >> indexer class >> >> >> >> >> >> >> Document<stored/uncompressed,indexed,tokenized<body:000099062000061300000021000000100110468147201102006PAYOUT >> >> : RA0083 >> >> 000099062000062000000021000000100220468148001102006PAYOUT : RA0083 >> >> 000099062000063000000021000000100330468153601102006PAYOUT : RA0083 >> >> 000099062000064700000021000000100440468155401102006PAYOUT : RA0083 >> >> 000099062000065700000021000000100550468156201102006PAYOUT : RA0083 >> >> >> >> and following files are generated in the specified path.. >> >> >> >> segments.gen >> >> write.lock >> >> segments_4 >> >> >> >> >> >> but when I run the search class it gives the result as: >> >> >> >> Found 0 documents that matched query 'RA0083': >> >> >> >> I am also attaching the corresponding pdf file for reference. >> >> It seems as the index is not getting created.. >> >> >> >> Please help me with some of your inputs,it will be very helpfull for >> me. >> >> -- >> >> View this message in context: >> >> >> http://www.nabble.com/getting-problem-while-indexing-pdf-files-with-pdfbox-tf4096205.html#a11647342 >> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com. >> >> >> >> >> >> --------------------------------------------------------------------- >> >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> >> For additional commands, e-mail: [EMAIL PROTECTED] >> >> >> >> >> > >> > >> >> -- >> View this message in context: >> http://www.nabble.com/getting-problem-while-indexing-pdf-files-with-pdfbox-tf4096205.html#a11653883 >> Sent from the Lucene - Java Users mailing list archive at Nabble.com. >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] >> >> > > -- View this message in context: http://www.nabble.com/getting-problem-while-indexing-pdf-files-with-pdfbox-tf4096205.html#a11662355 Sent from the Lucene - Java Users mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]