Inspired by the Unix "strings" command, I have written a subclass of FilterReader, which I have called BinaryReader. The idea is simply to index any proprietary file format by filtering out all non-printable characters. The assumption is that text is text, i.e. that it is stored in the file as plain, readable characters. The index will end up with more than the "visible" text, but not less. Once I have tested it and put together some examples, I will post it here.
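
Until the real class is posted, here is a rough sketch of the idea described above. It is not the actual BinaryReader: the class name PrintableFilterReader, the choice of printable ASCII (0x20 to 0x7E) as "text", and the collapsing of each run of other characters into a single space (so that neighbouring words do not merge into one token) are all assumptions made for illustration.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical sketch, not the BinaryReader mentioned above.
// Passes printable ASCII through and turns every run of other
// characters (including spaces) into a single space.
class PrintableFilterReader extends FilterReader {

    // True while the last emitted character was a separator;
    // starts as true so that leading junk produces no output.
    private boolean lastWasSpace = true;

    PrintableFilterReader(Reader in) {
        super(in);
    }

    // Assumption: "printable" means 7-bit ASCII from space to tilde.
    private static boolean isPrintable(int c) {
        return c >= 0x20 && c <= 0x7E;
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (isPrintable(c) && c != ' ') {
                lastWasSpace = false;
                return c;
            }
            // Non-printable character or space: emit one separator per run.
            if (!lastWasSpace) {
                lastWasSpace = true;
                return ' ';
            }
        }
        return -1; // end of the underlying stream
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }
}
```

To feed it a binary file, one option is to wrap a FileInputStream in an InputStreamReader using the ISO-8859-1 charset, so that every byte maps to exactly one char, and then wrap that in the filter. The filtered Reader can then be handed to the indexer like any other text source.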

----- Original Message -----
From: Kelvin Tan <[EMAIL PROTECTED]>
To: Lucene Users List <[EMAIL PROTECTED]>
Sent: Friday, November 23, 2001 2:48 AM
Subject: Re: PDF parser for Lucene

> I'm not too familiar with websearch's PDF parsing.
>
> I use a nice API Etymon Pj http://www.etymon.com/pj/
>
> It doesn't come with the ability to extract text, but it can be coded. I'll
> leave you to do it because it's kinda fun, but I could provide it if anyone
> wants it.
>
> I've also implemented it so that the searches can be performed on a
> page-by-page basis. That's pretty cool, i think.
>
> ----- Original Message -----
> From: <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Sent: Friday, November 23, 2001 4:39 PM
> Subject: RE: PDF parser for Lucene
>
> > Hello,
> >
> > We have been using PDFHandler - a pdf parser provided by websearch, to
> > search in pdf files. We are trying to get the contents using
> > pdfHandler.getContents() to arrive at a context-sensitive summary. However,
> > it gives some yen signs and other special symbols in the title, summary and
> > contents. If anyone is using the websearch component to parse pdf files and
> > have encountered this problem, kindly give your suggestions.
> >
> > Note - Most of the pdf files are using WinAnsiEncoding, and setting the
> > encoding as Win-12xx doesn't help.
> >
> > Thanks in advance,
> >
> > Sampreet
> > Programmer
> >
> > You could try this one:
> > http://www.i2a.com/websearch/
> >
> > ...and then tell me how it works for you.
> > =:o)
> >
> > Anyway, it is simple and Open Source.
> >
> > Have fun,
> > Paulo Gaspar
> >
> > --
> > To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> > For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
