You might want to take a look at WebSearch http://www.i2a.com/websearch/. It
has an _ok_ system going with respect to PDFs. PDFGo supports viewing of PDFs,
but a guy I contacted there says there's no current support for text
extraction, though he's "planning to do it".

Definitely agreed on the PJ resources bit. It doesn't really scale well in
terms of PDF file size.

If you haven't already seen the post, I once did a cursory examination of
the options for extracting text from PDF files via Java and the limitations
of the approaches.
http://www.mail-archive.com/[email protected]/msg00280.html

The Etymon lib is GPL'ed, so I guess that's a nice place to start. As far as
the libs I've seen so far go, most of them are really concerned with the
display and manipulation of PDF pages. Since we're looking for something
less complex (i.e. text extraction), maybe it's not so bad. I've spent a bit
of time in this area before, so feel free to email me offline about this. Not
sure how much help I can be though.
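
For a rough sense of what "just text extraction" means at its very simplest,
here is a minimal sketch. It is not taken from Etymon, PJ, or any other
library mentioned here; the names PdfTextSketch and extractShownStrings are
made up for illustration. It assumes you already have a decompressed page
content stream as a String (real PDFs usually compress these), and it only
picks up literal strings shown with the Tj operator. TJ arrays, hex strings,
font encodings and most escape sequences are ignored, so treat it as a
starting point rather than a working extractor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class PdfTextSketch {

    // Collects literal strings "(...)" that are immediately followed by the
    // Tj (show text) operator in an already-decompressed content stream.
    static List<String> extractShownStrings(String content) {
        List<String> shown = new ArrayList<>();
        int i = 0;
        while (i < content.length()) {
            if (content.charAt(i) != '(') {
                i++;
                continue;
            }
            i++; // step past the opening parenthesis
            StringBuilder sb = new StringBuilder();
            int depth = 1; // PDF literal strings may contain balanced parentheses
            while (i < content.length() && depth > 0) {
                char c = content.charAt(i++);
                if (c == '\\' && i < content.length()) {
                    // Simplification: keep the escaped character as-is.
                    // Correct for \( \) and \\, not for \n or octal escapes.
                    sb.append(content.charAt(i++));
                } else if (c == '(') {
                    depth++;
                    sb.append(c);
                } else if (c == ')') {
                    depth--;
                    if (depth > 0) {
                        sb.append(c);
                    }
                } else {
                    sb.append(c);
                }
            }
            // Skip whitespace, then keep the string only if Tj follows it.
            int j = i;
            while (j < content.length() && Character.isWhitespace(content.charAt(j))) {
                j++;
            }
            if (depth == 0 && content.startsWith("Tj", j)) {
                shown.add(sb.toString());
                i = j + 2;
            }
        }
        return shown;
    }

    public static void main(String[] args) {
        String stream = "BT /F1 12 Tf 72 712 Td (Hello \\(PDF\\) world) Tj ET";
        System.out.println(extractShownStrings(stream));
    }
}
```

The strings that come out of something like this could then be handed to
Lucene as the body text of a document. Everything that makes real-world PDFs
hard (compression filters, encodings, text positioning) sits outside this
sketch.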

----- Original Message -----
From: "petite_abeille" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, May 03, 2002 10:57 PM
Subject: Re: indexing PDF files


> On Friday, May 3, 2002, at 03:16 PM, Moturu,Praveen wrote:
>
> > Can I assume none of the people on the lucene user group had
> > implemented indexing a pdf document using lucene.
>
> Who knows...?!? In any case, it's not public knowledge...
>
> >  If some one has.. Please help me by providing the solution.
>
> I used to believe in Santa Claus also... ;-)
>
> All that said, there seems to be a real demand to do something about pdf
> to text conversion (in java preferably). I'm willing to invest some time
> and brain cell to nail it down, but I'm not sure where to start...
>
> I'm aware of the PJ library, but it's really a pig as far as resources
> go. Anything else?
>
> Any (concrete) pointer appreciated.
>
> Thanks.
>
> PA.
>
>
> --
> To unsubscribe, e-mail:
<mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail:
<mailto:[EMAIL PROTECTED]>
>


