Re: xpdf parser usage for lucene

Pinky Iyer Tue, 25 Feb 2003 14:26:02 -0800

THis means that i have to use the htmlparser again on the converted document. Is that 
right? Also is there a way to use these without utilizing the filesystem, by way of 
streams or so.
 Michael Wechner <[EMAIL PROTECTED]> wrote:Pinky Iyer wrote:


>Hi !
> I am trying to use xpdf for pdf parser, the problem i encounter is when i encounter 
> a file with .pdf extension, i call the pdftotext script to convert to text, which in 
> turn uses the file system and leaves the same file with .txt extension in same dir. 
> How can i get this as a stream and not use the file system at all. Also How do i 
> access the summary and title info.
>

xpdf has an option to turn the PDF into an HTML instead of txt, which 
allows you to use an HTMLParser
for populating the fields.

Concerning the extension: when you create your Lucene document, you 
could replace the txt extension
by the pdf extension in the case of the "uri" field.

HTH

Michael

> Anybody who has done this before, please help!
>Thanks!
>Pinky Iyer
> 
>
>
>
>---------------------------------
>Do you Yahoo!?
>Yahoo! Tax Center - forms, calculators, tips, and more
> 
>



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------
Do you Yahoo!?
Yahoo! Tax Center - forms, calculators, tips, and more

Re: xpdf parser usage for lucene

Reply via email to