shrinath.m <[email protected]> wrote:
> Consider we've offline HTML pages, no parsing while crawling, now what ?
> Any tokenizer someone has built for this ?

In UpLib, which uses PyLucene, I use BeautifulSoup to simplify Web pages by selecting only the text between certain tags before indexing them. These are offline Web pages, as in your application.

Take a look at <http://uplib.parc.com/hg/uplib/file/2a204fc2dd1a/extensions/FilterWebPage.py>.

Bill
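
As a rough illustration of the "keep only text between certain tags" idea, here is a minimal sketch. It is not the code from FilterWebPage.py, and it uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages. The tag list KEEP_TAGS and the names TextBetweenTags and extract_text are hypothetical, chosen only for this example. It also assumes the kept tags are properly closed in the input.

```python
from html.parser import HTMLParser

# Hypothetical choice of tags whose text is considered worth indexing.
KEEP_TAGS = {"title", "h1", "h2", "h3", "p", "li"}


class TextBetweenTags(HTMLParser):
    """Collect text that appears inside any of the kept tags."""

    def __init__(self, keep_tags=KEEP_TAGS):
        super().__init__(convert_charrefs=True)
        self.keep_tags = set(keep_tags)
        self.depth = 0    # number of kept tags currently open
        self.chunks = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag in self.keep_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.keep_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while at least one kept tag is open;
        # collapse runs of whitespace inside each fragment.
        if self.depth > 0 and data.strip():
            self.chunks.append(" ".join(data.split()))


def extract_text(html, keep_tags=KEEP_TAGS):
    """Return the text found inside keep_tags, joined by single spaces."""
    parser = TextBetweenTags(keep_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The string returned by extract_text could then be handed to whatever analyzer is used when adding the document to the index. Real-world pages often leave tags such as <p> or <li> unclosed, which this sketch does not handle; a more forgiving parser like BeautifulSoup copes with that better.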
