Hi Erick,
Before indexing I printed the doc, and I have included the output as well. It
prints correctly.
Please check my post again, specifically the following:
" System.out.println(doc);
//Following code is for making index"
and the corresponding output is:
Document<stored/uncompressed,indexed,tokenized<body:000099062000061300000021000000100110468147201102006PAYOUT : RA0083
000099062000062000000021000000100220468148001102006PAYOUT : RA0083
000099062000063000000021000000100330468153601102006PAYOUT : RA0083
000099062000064700000021000000100440468155401102006PAYOUT : RA0083
0000099062000065700000021000000100550468156201102006PAYOUT : RA0083
which is as expected. My problem is that the index files are not getting
generated.
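
In case it helps with the diagnosis, here is a minimal sketch (not part of my
indexer; the class name IndexDirCheck and its methods are made up for
illustration) that lists what is actually in the index directory after the
indexer runs. It uses only java.io, no Lucene calls. I am assuming, without
being sure, that a leftover write.lock may mean the IndexWriter was never
closed.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the indexing code in this thread.
public class IndexDirCheck {

    // Returns the sorted names of the regular files directly inside dir.
    // The list is empty when dir does not exist or is not a directory.
    public static List<String> listFileNames(File dir) {
        List<String> names = new ArrayList<String>();
        File[] files = dir.listFiles(); // null for a missing or non-directory path
        if (files != null) {
            for (File f : files) {
                if (f.isFile()) {
                    names.add(f.getName());
                }
            }
        }
        Collections.sort(names);
        return names;
    }

    // True if a file named write.lock is present in dir.
    // Assumption: a lock file left behind after the indexer exits
    // hints that the writer was not closed.
    public static boolean hasWriteLock(File dir) {
        return listFileNames(dir).contains("write.lock");
    }

    public static void main(String[] args) {
        // Defaults to the index path used in my code; pass another path as args[0].
        File dir = new File(args.length > 0 ? args[0] : "D:\\lucenePdf");
        System.out.println("Files in " + dir + ": " + listFileNames(dir));
        System.out.println("write.lock present: " + hasWriteLock(dir));
    }
}
```

Run it right after the indexer finishes and compare the listing with the files
I mentioned in my first post.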
Please help
Erick Erickson wrote:
>
> Offhand I'd assume that your problem is using PDFbox. Have you
> tried printing out the docText string you get back from
>
> docText = stripper.getText(new PDDocument(cosDoc))?
>
> I'd recommend you assure yourself that you get valid text back from
> the PDF document before worrying about indexing it.
>
> Best
> Erick
>
> On 7/17/07, neetika <[EMAIL PROTECTED]> wrote:
>>
>>
>> http://www.nabble.com/file/p11647342/DRra0026.pdf DRra0026.pdf
>>
>> Hi all,
>>
>> I am able to convert a PDF into a text file using PDFBox.
>> This is the code that I used, but I am not able to index it.
>>
>> // code for parsing and making index
>>
>> public Document getDocument(InputStream is)
>> {
>>     COSDocument cosDoc = null;
>>     try {
>>         PDFParser parser = new PDFParser(is);
>>         parser.parse();
>>         cosDoc = parser.getDocument();
>>     }
>>     catch (IOException e) {
>>         e.printStackTrace();
>>     }
>>     String docText = null;
>>     try {
>>         PDFTextStripper stripper = new PDFTextStripper();
>>         docText = stripper.getText(new PDDocument(cosDoc));
>>     }
>>     catch (IOException e) {
>>         e.printStackTrace();
>>     }
>>     Document doc = new Document();
>>     if (docText != null) {
>>         doc.add(new Field("body", docText, Field.Store.YES,
>>                 Field.Index.TOKENIZED));
>>     }
>>     return doc;
>> }
>>
>> public static void main(String[] args) throws Exception {
>>     TestPDFParser handler = new TestPDFParser();
>>
>>     Document doc = handler.getDocument(
>>             new FileInputStream(new File("D:\\lucenePdf\\DRra0026.pdf")));
>>
>>     System.out.println(doc);
>>
>>     //Following code is for making index
>>
>>     IndexWriter f_writer = new IndexWriter("D:\\lucenePdf",
>>             new StandardAnalyzer(), true);
>>
>>     f_writer.addDocument(doc);
>>
>> }
>> }
>> //code for searching a particular string..
>>
>> public static void main(String[] args) throws Exception {
>>     String indexDir = "D:\\lucenePdf";
>>     String q = "RA0083";
>>
>>     Directory fsDir = FSDirectory.getDirectory(indexDir);
>>     IndexSearcher is = new IndexSearcher(fsDir);
>>
>>     Query query = new QueryParser("body", new StandardAnalyzer()).parse(q);
>>
>>     Hits hits = is.search(query);
>>     System.out.println("Found " + hits.length()
>>             + " documents that matched query '" + q + "':");
>>     for (int i = 0; i < hits.length(); i++) {
>>         Document doc = hits.doc(i);
>>     }
>> }
>>
>>
>> When I run the above code, I get the following output from the
>> indexer class:
>>
>>
>> Document<stored/uncompressed,indexed,tokenized<body:000099062000061300000021000000100110468147201102006PAYOUT : RA0083
>> 000099062000062000000021000000100220468148001102006PAYOUT : RA0083
>> 000099062000063000000021000000100330468153601102006PAYOUT : RA0083
>> 000099062000064700000021000000100440468155401102006PAYOUT : RA0083
>> 000099062000065700000021000000100550468156201102006PAYOUT : RA0083
>>
>> and the following files are generated in the specified path:
>>
>> segments.gen
>> write.lock
>> segments_4
>>
>>
>> but when I run the search class it gives the result as:
>>
>> Found 0 documents that matched query 'RA0083':
>>
>> I am also attaching the corresponding PDF file for reference.
>> It seems as if the index is not getting created.
>>
>> Please help me with your input; it will be very helpful for me.
>> --
>> View this message in context:
>> http://www.nabble.com/getting-problem-while-indexing-pdf-files-with-pdfbox-tf4096205.html#a11647342
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>
>
--
View this message in context:
http://www.nabble.com/getting-problem-while-indexing-pdf-files-with-pdfbox-tf4096205.html#a11653883
Sent from the Lucene - Java Users mailing list archive at Nabble.com.