We plan to index many websites. Do you have any suggestions on how to drop the
junk without having to do too much work for each site? Do you know anyone who
has a background in this sort of thing? What approaches would you recommend?
Are there existing plugins I should consider using?
On 3/9/07, J. Delgado <[EMAIL PROTECTED]> wrote:
You have to build a special HTML Junk parser.
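
A minimal sketch of what such a junk parser might look like, assuming a generic
heuristic rather than per-site rules: split the page into text blocks and keep
only blocks that are long enough and not dominated by link text. This is not a
Nutch API; the names (JunkFilter, extract_main_text) and both thresholds are
hypothetical and would need tuning against real pages. It is written in Python
for brevity and uses only the standard library.

```python
from html.parser import HTMLParser

# Hypothetical thresholds, chosen for illustration only; tune per corpus.
MIN_WORDS = 10           # blocks with fewer words are treated as junk
MAX_LINK_DENSITY = 0.3   # max share of a block's characters inside <a> tags

BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "article", "section"}
SKIP_TAGS = {"script", "style", "noscript"}


class JunkFilter(HTMLParser):
    """Splits a page into text blocks and keeps those that look like prose."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []       # text blocks that passed the filter
        self._text = []        # text pieces of the block being collected
        self._link_chars = 0   # characters of the current block inside links
        self._in_link = 0      # nesting depth of <a>
        self._skip = 0         # nesting depth of script/style/noscript

    def _flush(self):
        # Close the current block and decide whether to keep it.
        text = " ".join(" ".join(self._text).split())
        link_chars = self._link_chars
        self._text, self._link_chars = [], 0
        if not text:
            return
        density = link_chars / len(text)
        if len(text.split()) >= MIN_WORDS and density <= MAX_LINK_DENSITY:
            self.blocks.append(text)

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(" ".join(data.split()))


def extract_main_text(html):
    """Returns the list of text blocks that survive the junk filter."""
    parser = JunkFilter()
    parser.feed(html)
    parser.close()
    parser._flush()  # flush any trailing text after the last block tag
    return parser.blocks
```

Calling extract_main_text(page_html) returns the surviving blocks in document
order. Navigation bars and ad links tend to fail the link-density test, and
short labels fail the word count. Headlines and bylines are also short, so the
title and author would need separate handling, for example by reading the
title and h1 elements directly.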
2007/3/9, d e <[EMAIL PROTECTED]>:
>
> If I'm indexing a news article, I want to avoid getting the junk (anything
> other than the title, author and article) into the index. I want to avoid
> getting the advertisements, etc. How do I do that sort of thing?
>
> What parts of what manual should I be reading so I will know how to do this
> sort of thing?
>
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers