Re: getting problem while indexing pdf files with pdfbox

neetika Tue, 17 Jul 2007 10:26:15 -0700

Hi Erick,

Before indexing I printed the doc, and I included the output in my post. It
prints correctly.
Please check my post again; the relevant part is the following...


" System.out.println(doc);
                 //Following code is for making index"

and the corresponding output is...

Document<stored/uncompressed,indexed,tokenized<body:000099062000061300000021000000100110468147201102006PAYOUT : RA0083
 000099062000062000000021000000100220468148001102006PAYOUT : RA0083
 000099062000063000000021000000100330468153601102006PAYOUT : RA0083
 000099062000064700000021000000100440468155401102006PAYOUT : RA0083
 0000099062000065700000021000000100550468156201102006PAYOUT : RA0083
This is as expected, but my problem is that the index is not getting
generated.

Please help



Erick Erickson wrote:
> 
> Offhand I'd assume that your problem is using PDFbox. Have you
> tried printing out the docText string you get  back from
> 
> docText = stripper.getText(new PDDocument(cosDoc))?
> 
> I'd recommend you assure yourself that you get valid text back from
> the PDF document before worrying about indexing it.
> 
> Best
> Erick
> 
> On 7/17/07, neetika <[EMAIL PROTECTED]> wrote:
>>
>>
>> http://www.nabble.com/file/p11647342/DRra0026.pdf DRra0026.pdf
>>
>> hi all,
>>
>> I am able to convert a PDF into text using PDFBox.
>> This is the code that I used, but I am not able to index it.
>>
>> // code for parsing and making index
>>
>> public Document getDocument(InputStream is)
>> {
>>     COSDocument cosDoc = null;
>>     try {
>>         PDFParser parser = new PDFParser(is);
>>         parser.parse();
>>         cosDoc = parser.getDocument();
>>     }
>>     catch (IOException e) {
>>         e.printStackTrace();
>>     }
>>     String docText = null;
>>     try {
>>         PDFTextStripper stripper = new PDFTextStripper();
>>         docText = stripper.getText(new PDDocument(cosDoc));
>>     }
>>     catch (IOException e) {
>>         e.printStackTrace();
>>     }
>>     Document doc = new Document();
>>     if (docText != null) {
>>         doc.add(new Field("body", docText, Field.Store.YES,
>>                 Field.Index.TOKENIZED));
>>     }
>>     return doc;
>> }
>>
>> public static void main(String[] args) throws Exception {
>>     TestPDFParser handler = new TestPDFParser();
>>
>>     Document doc = handler.getDocument(
>>             new FileInputStream(new File("D:\\lucenePdf\\DRra0026.pdf")));
>>
>>     System.out.println(doc);
>>
>>     //Following code is for making index
>>
>>     IndexWriter f_writer = new IndexWriter("D:\\lucenePdf",
>>             new StandardAnalyzer(), true);
>>
>>     f_writer.addDocument(doc);
>>
>> }
>> }
>> //code for searching a particular string..
>>
>> public static void main(String[] args) throws Exception {
>>         String indexDir = "D:\\lucenePdf";
>>         String q = "RA0083";
>>
>>
>>         Directory fsDir = FSDirectory.getDirectory(indexDir);
>>         IndexSearcher is = new IndexSearcher(fsDir);
>>
>>         Query query = new QueryParser("body",
>>                 new StandardAnalyzer()).parse(q);
>>
>>         Hits hits = is.search(query);
>>         System.out.println("Found " + hits.length()
>>                 + " documents that matched query '" + q + "':");
>>         for (int i = 0; i < hits.length(); i++) {
>>             Document doc = hits.doc(i);
>>
>>         }
>>     }
>>
>>
>> When I run the above code, I get the following output from the
>> indexer class:
>>
>>
>> Document<stored/uncompressed,indexed,tokenized<body:000099062000061300000021000000100110468147201102006PAYOUT : RA0083
>> 000099062000062000000021000000100220468148001102006PAYOUT : RA0083
>> 000099062000063000000021000000100330468153601102006PAYOUT : RA0083
>> 000099062000064700000021000000100440468155401102006PAYOUT : RA0083
>> 000099062000065700000021000000100550468156201102006PAYOUT : RA0083
>>
>> and the following files are generated in the specified path:
>>
>> segments.gen
>> write.lock
>> segments_4
>>
>>
>> but when I run the search class it gives the result as:
>>
>> Found 0 documents that matched query 'RA0083':
>>
>> I am also attaching the corresponding pdf file for reference.
>> It seems the index is not getting created.
>>
>> Please help me with your suggestions; it would be very helpful for me.
> 
> 
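A note on the quoted indexer code above: one likely cause, though not confirmed anywhere in this thread, is that f_writer is never closed. As far as I know, an IndexWriter in this generation of Lucene buffers added documents in memory and only writes them out as a segment when it is flushed or closed, and the leftover write.lock file in the directory listing is consistent with a writer that was never closed. If that is the cause, calling f_writer.close() after f_writer.addDocument(doc) should be enough. The sketch below does not use Lucene at all; it is only an analogy that shows the same effect with a java.io.BufferedWriter standing in for the IndexWriter. The class name CloseBeforeRead and the method bytesVisible are made up for illustration.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Analogy only: BufferedWriter stands in for Lucene's IndexWriter.
// Both hold data in memory until they are flushed or closed.
public class CloseBeforeRead {

    // Writes one record through a buffered writer and returns how many
    // bytes have actually reached the underlying sink afterwards.
    static int bytesVisible(String record, boolean closeWriter) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(sink, StandardCharsets.UTF_8));
        try {
            writer.write(record);   // comparable to f_writer.addDocument(doc)
            if (closeWriter) {
                writer.close();     // comparable to f_writer.close()
            }
            // When closeWriter is false the writer is deliberately left open,
            // which mirrors the quoted indexer.
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.size();
    }

    public static void main(String[] args) {
        String record = "PAYOUT : RA0083";
        System.out.println("without close: " + bytesVisible(record, false) + " bytes");
        System.out.println("with close:    " + bytesVisible(record, true) + " bytes");
    }
}
```

Without the close the sink stays empty even though the write call returned normally, which matches the symptom of addDocument succeeding while the search finds 0 documents. The search class should be run only after the indexer has closed its writer.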
