You might want to take a look at WebSearch (http://www.i2a.com/websearch/). It has an _ok_ system going with respect to PDFs. PDFGo supports viewing PDFs, but someone I contacted there says there is currently no support for text extraction, although he's "planning to do it".
Definitely agreed on the PJ resources bit. It doesn't really scale well as PDF file size grows.

If you haven't already seen the post, I once did a cursory examination of the options for extracting text from PDF files via Java, and of the limitations of each approach:
http://www.mail-archive.com/[email protected]/msg00280.html

The Etymon lib is GPL'ed, so I guess that's a nice place to start. Of the libs I've seen so far, most are really concerned with the display and manipulation of PDF pages. Since we're looking for something less complex (i.e., text extraction), maybe it's not so bad.

I've spent a bit of time in this area before, so feel free to email me offline about this. Not sure how much help I can be, though.

----- Original Message -----
From: "petite_abeille" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, May 03, 2002 10:57 PM
Subject: Re: indexing PDF files

> On Friday, May 3, 2002, at 03:16 PM, Moturu,Praveen wrote:
>
> > Can I assume none of the people on the lucene user group had
> > implemented indexing a pdf document using lucene.
>
> Who knows...?!? In any case, it's not public knowledge...
>
> > If some one has.. Please help me by providing the solution.
>
> I used to believe in Santa Claus also... ;-)
>
> All that said, there seems to be a real demand to do something about pdf
> to text conversion (in java preferably). I'm willing to invest some time
> and brain cells to nail it down, but I'm not sure where to start...
>
> I'm aware of the PJ library, but it's really a pig as far as resources
> go. Anything else?
>
> Any (concrete) pointer appreciated.
>
> Thanks.
>
> PA.
