Inspired by the Unix "strings" command, I have written a subclass of
FilterReader, which I have called BinaryReader.  The idea is simply to index
any proprietary file format by filtering out all non-printable characters.
The assumption is that text is stored as plain, readable characters inside
the file.  The result will contain more than the "visible" text, but not
less.  Once I have tested it and put together some examples, I will post it
here.
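
For readers who want to try the idea in the meantime, here is a minimal sketch of what such a filter could look like. It is not the BinaryReader described above, which has not been posted yet. Two choices in it are assumptions made for illustration: "printable" is taken to mean ASCII 0x20 through 0x7E, and each non-printable character is replaced by a space rather than dropped, so that text fragments on either side of binary data do not fuse into one token.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical sketch: a FilterReader that passes printable characters
// through and turns everything else into a space.
class BinaryReader extends FilterReader {

    BinaryReader(Reader in) {
        super(in);
    }

    // Assumption for this sketch: printable means ASCII 0x20..0x7E.
    private static boolean isPrintable(int c) {
        return c >= 0x20 && c <= 0x7E;
    }

    @Override
    public int read() throws IOException {
        int c = in.read();
        if (c == -1) {
            return -1;                      // end of stream, pass through
        }
        return isPrintable(c) ? c : ' ';
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            if (!isPrintable(buf[i])) {
                buf[i] = ' ';
            }
        }
        return n;
    }
}
```

The underlying Reader has to turn bytes into characters somehow. Wrapping the file's InputStream in an InputStreamReader with the "ISO-8859-1" charset maps every byte to exactly one character, which suits this kind of filtering. The filtered reader can then be handed to any indexing code that accepts a java.io.Reader.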



----- Original Message -----
From: Kelvin Tan <[EMAIL PROTECTED]>
To: Lucene Users List <[EMAIL PROTECTED]>
Sent: Friday, November 23, 2001 2:48 AM
Subject: Re: PDF parser for Lucene


> I'm not too familiar with websearch's PDF parsing.
>
> I use a nice API, Etymon Pj: http://www.etymon.com/pj/
>
> It doesn't come with the ability to extract text, but it can be coded. I'll
> leave you to do it because it's kinda fun, but I could provide it if anyone
> wants it.
>
> I've also implemented it so that the searches can be performed on a
> page-by-page basis. That's pretty cool, I think.
>
> ----- Original Message -----
> From: <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Sent: Friday, November 23, 2001 4:39 PM
> Subject: RE: PDF parser for Lucene
>
>
> > Hello,
> >
> > We have been using PDFHandler, a PDF parser provided by websearch, to
> > search in PDF files. We are trying to get the contents using
> > pdfHandler.getContents() to arrive at a context-sensitive summary. However,
> > it gives some yen signs and other special symbols in the title, summary and
> > contents. If anyone is using the websearch component to parse PDF files and
> > has encountered this problem, kindly give your suggestions.
> >
> > Note - Most of the pdf files are using WinAnsiEncoding, and setting the
> > encoding as Win-12xx doesn't help.
> >
> > Thanks in advance,
> >
> > Sampreet
> > Programmer
> >
> >
> > You could try this one:
> > http://www.i2a.com/websearch/
> >
> > ...and then tell me how it works for you.
> > =:o)
> >
> >
> > Anyway, it is simple and Open Source.
> >
> >
> > Have fun,
> > Paulo Gaspar
> >
> >
> > --
> > To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> > For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
> >



