I'm sorry! I guess I was REALLY not clear. I mean my problem is to drop the
junk *on each page*. I am indexing news sites. I want to harvest news
STORIES, not the advertisements and other junk text around the outside of
each page. Got suggestions for THAT problem?
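(For the archive, one common line of attack on this — not something proposed in this thread — is a text-density heuristic: story paragraphs tend to be long runs of plain text with few links, while ads and navigation are short and link-heavy. A minimal sketch in Python; the tag set and thresholds are illustrative guesses, not tuned values:)

```python
# Text-density sketch for pulling article text out of a news page:
# keep blocks that are long and have a low ratio of link text.
# BLOCK_TAGS, min_len, and max_link_ratio are illustrative choices.
from html.parser import HTMLParser

class BlockExtractor(HTMLParser):
    BLOCK_TAGS = {"p", "div", "td", "li"}

    def __init__(self):
        super().__init__()
        self.blocks = []        # (text, chars_inside_links) per block
        self._text = []
        self._link_chars = 0
        self._in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1
        elif tag in self.BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data)

    def _flush(self):
        # close out the current block of accumulated text
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

def extract_story(html, min_len=80, max_link_ratio=0.3):
    """Return the concatenation of long, link-poor text blocks."""
    p = BlockExtractor()
    p.feed(html)
    p._flush()
    keep = [t for t, link in p.blocks
            if len(t) >= min_len and link / len(t) <= max_link_ratio]
    return "\n".join(keep)
```

The idea is that this needs no per-site templates: the same length/link-density test drops most boilerplate on any news page, at the cost of occasionally losing short opening paragraphs.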
Thanks!
On 3/10/07, Björn Wilmsmann <[EMAIL PROTECTED]> wrote:
There are quite a few ways to do this. In fact, Google's PageRank is
one such approach. Text classification (as done in spam filters, for
example) is another. It just depends on what you are going to do.
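(The spam-filter-style classification mentioned above can be sketched with a tiny multinomial Naive Bayes model over word counts — trained on labeled "content" vs. "junk" blocks. The training snippets below are made-up examples, not real data:)

```python
# Minimal multinomial Naive Bayes for labeling text blocks as
# "content" or "junk", with add-one (Laplace) smoothing.
import math
from collections import Counter

class NaiveBayes:
    def __init__(self):
        self.word_counts = {}       # label -> Counter of word frequencies
        self.doc_counts = Counter() # label -> number of training blocks
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float("-inf")
        for label, counts in self.word_counts.items():
            # log prior + summed log likelihoods with add-one smoothing
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for w in text.lower().split():
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

nb = NaiveBayes()
nb.train("click here to subscribe special offer", "junk")
nb.train("buy now limited time advertisement", "junk")
nb.train("the council voted on the budget yesterday", "content")
nb.train("officials said the investigation continues", "content")
```

In practice you would train on hand-labeled blocks from a handful of sites and hope the vocabulary of ad copy generalizes; it usually does better than per-site rules, but worse than layout-aware heuristics.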
d e wrote:
> We plan to index many websites. Got any suggestions on how to drop the
> junk without having to do too much work for each such site? Know anyone
> who has a background on doing this sort of thing? What sorts of
> approaches would you recommend?
--
Best regards,
Bjoern Wilmsmann
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers