That's pretty interesting, because I've done something similar, though I (vainly) try to build some intelligence into it.
First I run the text through a regex which filters out most of the words containing nonsense characters (anything outside [A-Za-z0-9_@]). Then I run each word through a couple of tests which try to guess whether it is English, like testing for 3 consonants in a row, or for 2 digits that are not next to each other (can you think of a real word like that?!). Based on these tests I rank the documents. Finally, if a word is fewer than 5 characters, I check it against a wordlist. The results are *acceptable*. (A rough sketch of these heuristics is at the bottom of this mail.)

The difficulty is that with large documents (>5MB) you end up with a lot of nonsense terms, actually exceeding Lucene's built-in limit of 10,000 terms (there to limit memory usage), so these documents are not indexed completely.

I'd be interested to see how you filter your documents... :)

----- Original Message -----
From: Cecil, Paula New <[EMAIL PROTECTED]>
To: Lucene Users List <[EMAIL PROTECTED]>; Kelvin Tan <[EMAIL PROTECTED]>
Sent: Saturday, November 24, 2001 12:36 AM
Subject: Re: PDF parser for Lucene

> Inspired by the Unix "strings" command, I have written a subclass of
> FilterReader, which I have called BinaryReader. The idea is simply to index
> any proprietary file format by filtering out all non-printable characters.
> The assumption is that text is text. It will end up with more than the
> "visible" text, but not less. After I have tested it and made some
> examples, I will post it here.
>
> ----- Original Message -----
> From: Kelvin Tan <[EMAIL PROTECTED]>
> To: Lucene Users List <[EMAIL PROTECTED]>
> Sent: Friday, November 23, 2001 2:48 AM
> Subject: Re: PDF parser for Lucene
>
> > I'm not too familiar with websearch's PDF parsing.
> >
> > I use a nice API, Etymon Pj: http://www.etymon.com/pj/
> >
> > It doesn't come with the ability to extract text, but that can be coded.
> > I'll leave it to you because it's kind of fun, but I can provide it if
> > anyone wants it.
> >
> > I've also implemented it so that searches can be performed on a
> > page-by-page basis. That's pretty cool, I think.
> >
> > ----- Original Message -----
> > From: <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Cc: <[EMAIL PROTECTED]>
> > Sent: Friday, November 23, 2001 4:39 PM
> > Subject: RE: PDF parser for Lucene
> >
> > > Hello,
> > >
> > > We have been using PDFHandler, a PDF parser provided by websearch, to
> > > search in PDF files. We are trying to get the contents using
> > > pdfHandler.getContents() to arrive at a context-sensitive summary.
> > > However, it gives some yen signs and other special symbols in the
> > > title, summary and contents. If anyone is using the websearch
> > > component to parse PDF files and has encountered this problem, kindly
> > > give your suggestions.
> > >
> > > Note - most of the PDF files use WinAnsiEncoding, and setting the
> > > encoding to Win-12xx doesn't help.
> > >
> > > Thanks in advance,
> > >
> > > Sampreet
> > > Programmer
> > >
> > >
> > > You could try this one:
> > > http://www.i2a.com/websearch/
> > >
> > > ...and then tell me how it works for you.
> > > =:o)
> > >
> > > Anyway, it is simple and Open Source.
> > >
> > > Have fun,
> > > Paulo Gaspar
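
P.S. As promised above, here is a rough sketch of the word heuristics I
described. The exact patterns, thresholds and wordlist are illustrative
stand-ins, not the code I actually run:

    import java.util.Set;
    import java.util.regex.Pattern;

    // Rough sketch of the heuristics described above; all patterns and
    // thresholds are placeholders for illustration.
    public class WordFilter {

        // Words containing characters outside [A-Za-z0-9_@] are dropped.
        private static final Pattern NONSENSE_CHARS =
                Pattern.compile("[^A-Za-z0-9_@]");

        // Heuristic: 3+ consonants in a row (treating y as a vowel).
        private static final Pattern CONSONANT_RUN =
                Pattern.compile("[bcdfghjklmnpqrstvwxz]{3}",
                        Pattern.CASE_INSENSITIVE);

        // Heuristic: two digits separated by at least one non-digit.
        private static final Pattern SPLIT_DIGITS = Pattern.compile("\\d\\D+\\d");

        private final Set<String> wordlist; // small dictionary for short words

        public WordFilter(Set<String> wordlist) {
            this.wordlist = wordlist;
        }

        public boolean looksLikeEnglish(String word) {
            if (NONSENSE_CHARS.matcher(word).find()) {
                return false;
            }
            if (CONSONANT_RUN.matcher(word).find()) {
                return false; // yes, this also kills "strength" -- it's a heuristic
            }
            if (SPLIT_DIGITS.matcher(word).find()) {
                return false;
            }
            // Short words are cheap to check against a real wordlist.
            if (word.length() < 5) {
                return wordlist.contains(word.toLowerCase());
            }
            return true;
        }
    }

As for the 10,000-term ceiling: if I remember correctly, newer Lucene
builds expose it as a maxFieldLength setting on IndexWriter, but raising
it only trades truncated indexing for more memory.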
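For the BinaryReader idea, since it hasn't been posted yet: here is
roughly how I imagine such a "strings"-style FilterReader might look. The
class name and the printable test are my guesses, not the actual code:

    import java.io.FilterReader;
    import java.io.IOException;
    import java.io.Reader;

    // Guess at a "strings"-style reader: every non-printable character is
    // replaced by a space so that word boundaries are preserved.
    public class BinaryReader extends FilterReader {

        public BinaryReader(Reader in) {
            super(in);
        }

        private static boolean printable(int c) {
            // Keep plain ASCII text plus common whitespace; the rest is noise.
            return (c >= 0x20 && c < 0x7F) || c == '\n' || c == '\r' || c == '\t';
        }

        public int read() throws IOException {
            int c = in.read();
            return (c == -1 || printable(c)) ? c : ' ';
        }

        public int read(char[] cbuf, int off, int len) throws IOException {
            int n = in.read(cbuf, off, len);
            for (int i = off; i < off + n; i++) {
                if (!printable(cbuf[i])) {
                    cbuf[i] = ' ';
                }
            }
            return n;
        }
    }

Wrapping whatever Reader you open on the file with this before handing it
to Lucene should then work for arbitrary binary formats.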
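And for the page-by-page searching Kelvin mentioned: I take it that means
one Lucene Document per PDF page? The sketch below is how I would wire
that up, assuming Lucene's classic Field.Text/Field.Keyword API. The
PdfText helper is purely hypothetical -- it stands in for whatever text
extraction one codes on top of Pj, whose API I haven't used:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class PageIndexer {

        // Hypothetical stand-in for the text extraction one would code on
        // top of Etymon Pj (Pj itself ships no text extractor).
        interface PdfText {
            int pageCount();
            String extractPageText(int page);
        }

        public static void indexPages(String indexDir, String file, PdfText pdf)
                throws Exception {
            IndexWriter writer =
                    new IndexWriter(indexDir, new StandardAnalyzer(), true);
            for (int page = 1; page <= pdf.pageCount(); page++) {
                // One Lucene Document per page, so a hit points at a page,
                // not just a file.
                Document doc = new Document();
                doc.add(Field.Keyword("file", file));
                doc.add(Field.Keyword("page", String.valueOf(page)));
                doc.add(Field.Text("contents", pdf.extractPageText(page)));
                writer.addDocument(doc);
            }
            writer.optimize();
            writer.close();
        }
    }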
