I'm not too familiar with websearch's PDF parsing. I use a nice API, Etymon PJ: http://www.etymon.com/pj/
It doesn't come with the ability to extract text, but that can be coded on top of it. I'll leave that to you because it's kind of fun, but I can provide my code if anyone wants it. I've also implemented it so that searches can be performed on a page-by-page basis. That's pretty cool, I think.

----- Original Message -----
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Friday, November 23, 2001 4:39 PM
Subject: RE: PDF parser for Lucene

> Hello,
>
> We have been using PDFHandler - a pdf parser provided by websearch, to
> search in pdf files. We are trying to get the contents using
> pdfHandler.getContents() to arrive at a context-sensitive summary. However,
> it gives some yen signs and other special symbols in the title, summary and
> contents. If anyone is using the websearch component to parse pdf files and
> have encountered this problem, kindly give your suggestions.
>
> Note - Most of the pdf files are using WinAnsiEncoding, and setting the
> encoding as Win-12xx doesn't help.
>
> Thanks in advance,
>
> Sampreet
> Programmer
>
>
> You could try this one:
> http://www.i2a.com/websearch/
>
> ...and then tell me how it works for you.
> =:o)
>
>
> Anyway, it is simple and Open Source.
>
>
> Have fun,
> Paulo Gaspar
>
>
> --
> To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
>
>
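
For anyone curious about the page-by-page searching mentioned in the reply, here is a minimal sketch of the general idea. It is not the actual implementation referred to there. It assumes the text of each page has already been extracted by some PDF parser, and it uses a small in-memory inverted index instead of Lucene. The class PageIndex and its methods are hypothetical names made up for illustration. With Lucene, the equivalent approach would presumably be to add one Document per page rather than one per file.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: not part of Etymon PJ, websearch, or Lucene.
class PageIndex {
    // Maps a lowercased term to the "file#page" entries that contain it.
    private final Map<String, List<String>> postings = new HashMap<>();

    // Index the already-extracted text of a single page as its own unit.
    void addPage(String file, int pageNumber, String pageText) {
        String hit = file + "#" + pageNumber;
        // Split on anything that is not a letter or digit.
        for (String term : pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue; // split() yields an empty token when the text starts with a separator
            }
            List<String> hits = postings.computeIfAbsent(term, k -> new ArrayList<>());
            if (!hits.contains(hit)) {
                hits.add(hit); // record each page at most once per term
            }
        }
    }

    // Return the pages containing the term, in the order they were indexed.
    List<String> search(String term) {
        List<String> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<String>emptyList() : new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        PageIndex index = new PageIndex();
        index.addPage("manual.pdf", 1, "Installing the parser");
        index.addPage("manual.pdf", 2, "Parser options and encodings");
        System.out.println(index.search("parser"));
    }
}
```

Because each page is indexed as a separate unit, a hit identifies both the file and the page number, which is what makes page-level results possible.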
