Re: [Nutch-dev] Indexing the Interesting Part Only...

d e Sat, 10 Mar 2007 12:08:15 -0800

We plan to index many websites. Do you have any suggestions on how to drop
the junk without doing too much work for each site? Do you know anyone with a
background in this sort of thing? What approaches would you recommend?


Are there existing plug-ins I should consider using?


On 3/9/07, J. Delgado <[EMAIL PROTECTED]> wrote:


You have to build a special HTML Junk parser.
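
One common way to approach such a junk parser is a link-density heuristic:
split the page into text blocks, then drop blocks that are very short or that
consist mostly of link text (navigation bars, ad strips, footers). What follows
is only a minimal sketch of that idea in Python, not a Nutch plug-in and not
necessarily what was meant above. The names BlockCollector and
extract_main_text, the tag sets, and the thresholds min_words and
max_link_density are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags whose content is never kept (assumed default set).
SKIP_TAGS = {"script", "style", "noscript"}
# Tags treated as block boundaries when scoring text (assumed set).
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3",
              "article", "section", "ul", "table", "tr", "body"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts link characters per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished (text, link_chars) pairs
        self._text = []        # text fragments of the current block
        self._link_chars = 0   # characters of the current block inside <a>
        self._skip_depth = 0   # > 0 while inside a SKIP_TAGS element
        self._link_depth = 0   # > 0 while inside an <a> element

    def _flush(self):
        # Close the current block, normalizing whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._text.append(data)
        if self._link_depth:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_words=8, max_link_density=0.3):
    """Returns the blocks that look like article text, one per line."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for text, link_chars in collector.blocks:
        if len(text.split()) < min_words:
            continue  # too short to be body text (assumed threshold)
        if link_chars / len(text) > max_link_density:
            continue  # mostly link text, likely navigation or ads
        kept.append(text)
    return "\n".join(kept)
```

Note that short fields such as the title and author fall below the word
threshold, so they would have to be picked up separately (for example from the
title or heading tags), and the thresholds would likely need tuning per site.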

2007/3/9, d e <[EMAIL PROTECTED]>:
>
> If I'm indexing a news article, I want to avoid getting the junk (anything
> other than the title, author and article) into the index. I want to avoid
> getting the advertisements, etc. How do I do that sort of thing?
>
> What parts of what manual should I be reading so I will know how to do this
> sort of thing?
>


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
