d e wrote:

> I'm sorry! I guess I was REALLY not clear. I mean my problem is to drop
> the junk *on each page*. I am indexing news sites. I want to harvest news
> STORIES, not the advertisements and other junk text around the outside of
> each page. Got suggestions for THAT problem?


I guess you control the sites you are referring to, right? If so, you
might want to wrap the content in markers like

<div id="index">... or <div id="no-index">...

although I am not sure whether Nutch's HTML parser supports this kind of
markup.
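
To illustrate the idea outside of Nutch, here is a minimal sketch in
Python (standard library only) that keeps text inside <div id="index">
and drops text inside <div id="no-index">. The class and function names
are made up for this example, the marker ids are just the ones suggested
above, and this is not how Nutch's own parser works.

```python
from html.parser import HTMLParser


class MarkedContentExtractor(HTMLParser):
    """Collects text inside <div id="index">, skipping <div id="no-index">.

    Hypothetical helper for illustration; not part of Nutch.
    """

    def __init__(self):
        super().__init__()
        self._modes = []   # one boolean per open <div>: True means "collect"
        self.chunks = []   # collected text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        div_id = dict(attrs).get("id")
        if div_id in ("index", "no-index"):
            # An explicit marker overrides whatever the parent div said.
            self._modes.append(div_id == "index")
        else:
            # Unmarked divs inherit the mode of the enclosing div.
            self._modes.append(self._modes[-1] if self._modes else False)

    def handle_endtag(self, tag):
        if tag == "div" and self._modes:
            self._modes.pop()

    def handle_data(self, data):
        if self._modes and self._modes[-1]:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_indexable_text(html):
    """Return the text found in the marked 'index' regions, space-joined."""
    parser = MarkedContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    '<html><body>'
    '<div id="no-index">Buy now!</div>'
    '<div id="index"><p>Story text.</p>'
    '<div id="no-index">Ad</div>'
    '<p>More story.</p></div>'
    '</body></html>'
)
print(extract_indexable_text(page))
```

Note that this only helps if the site authors add the markers. It also
assumes reasonably well-formed <div> nesting, and since an id should be
unique within a page, a class attribute might be the better marker when
a page has several such regions.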

HTH

Michael

>
> Thanks!
>
>
> On 3/10/07, Björn Wilmsmann <[EMAIL PROTECTED]> wrote:
>
>>
>>
>> There are quite a few ways to do this. In fact, Google's PageRank is
>> one such approach. Text classification (as done in spam filters, for
>> example) is another. It just depends on what you are going to do.
>>
>> d e wrote:
>>
>> > We plan to index many websites. Got any suggestions on how to drop the
>> > junk without having to do too much work for each such site? Know anyone
>> > who has a background on doing this sort of thing? What sorts of
>> > approaches would you recommend?
>>
>> --
>> Best regards,
>> Bjoern Wilmsmann
>>
>>
>>
>>
>
>


-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
[EMAIL PROTECTED]                        [EMAIL PROTECTED]
+41 44 272 91 61


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
