Michael Wechner wrote:

> d e wrote:
>
>> I'm sorry! I guess I was REALLY not clear. I mean my problem is to drop
>> the junk *on each page*. I am indexing news sites. I want to harvest news
>> STORIES, not the advertisements and other junk text around the outside of
>> each page. Got suggestions for THAT problem?
>
>
>
> I guess you control the sites you are referring to, right? If so, you
> might want to add something like
>
> <div id="index">... or, respectively, <div id="no-index">...
>
> although I am not sure whether the HTML parser of Nutch supports this
> kind of marker.
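
To make the marker idea concrete, here is a minimal sketch in Python (not
Nutch code) that keeps only the text inside <div id="index"> and skips any
nested <div id="no-index">. The ids are the ones proposed above; the class
and function names are made up for illustration, and the sketch assumes
reasonably well-nested markup.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IndexDivExtractor(HTMLParser):
    """Collect text inside <div id="index">, skipping <div id="no-index">."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one flag per open element: is its text indexable?
        self.chunks = []  # indexable text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # By default an element inherits the flag of its parent.
        indexable = self.stack[-1] if self.stack else False
        if tag == "div":
            div_id = dict(attrs).get("id")
            if div_id == "index":
                indexable = True
            elif div_id == "no-index":
                indexable = False
        self.stack.append(indexable)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1]:
            self.chunks.append(text)


def extract_indexable_text(html):
    """Return the text that the (hypothetical) markers declare indexable."""
    parser = IndexDivExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Something like extract_indexable_text(page_source) would then be fed to the
indexer instead of the full page text. Pages with unclosed tags would need a
more forgiving parser than this.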


It just occurred to me that I still think the best approach would be
something like

<html>
  <head>
    <link rel="search" href="search-foo.xml" type="application/search+xml"/>

where search-foo.xml would contain all the data that should be indexed
by whatever search engine.

AFAIK no such standard exists (although one might want to consider RDF).
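
A crawler could discover such a document roughly as follows. This is only a
sketch in Python: the rel value and the "application/search+xml" type are
the hypothetical ones from the snippet above, not an existing standard, and
the class and function names are invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical media type taken from the proposal above; not a real standard.
SEARCH_TYPE = "application/search+xml"


class SearchLinkFinder(HTMLParser):
    """Collect href values of <link rel="search" type="application/search+xml">."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing <link .../> elements.
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "search" in rels and attr.get("type") == SEARCH_TYPE and attr.get("href"):
            self.hrefs.append(attr["href"])


def find_search_documents(html, page_url):
    """Return absolute URLs of the search documents a page advertises."""
    finder = SearchLinkFinder()
    finder.feed(html)
    finder.close()
    # Resolve relative hrefs such as "search-foo.xml" against the page URL.
    return [urljoin(page_url, href) for href in finder.hrefs]
```

The indexer would then fetch each returned URL and index its contents
instead of the page text. What goes inside search-foo.xml is left open here,
since no format has been defined.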

Cheers

Michael

>  
> HTH
>
> Michael
>
>>
>> Thanks!
>>
>>
>> On 3/10/07, Björn Wilmsmann <[EMAIL PROTECTED]> wrote:
>>
>>>
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> There are quite a few ways to do this. In fact, Google's PageRank is
>>> one such approach. Text classification (as done in spam filters, for
>>> example) is another. It just depends on what you are going to do.
>>>
>>> d e wrote:
>>>
>>> > We plan to index many websites. Got any suggestions on how to drop the
>>> > junk without having to do too much work for each such site? Know anyone
>>> > who has a background on doing this sort of thing? What sorts of
>>> > approaches would you recommend?
>>>
>>> - --
>>> Best regards,
>>> Bjoern Wilmsmann
>>>
>>>
>>>
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: GnuPG v1.4.7 (Darwin)
>>>
>>> iD8DBQFF812mgz0R1bg11MERAqXCAKCVTfLN7KXJYdAqLGWMI57ChKaM8QCfdQBc
>>> 1CyrQfD+5vCzSBvYbviX17o=
>>> =+TK/
>>> -----END PGP SIGNATURE-----
>>>
>>
>>
>
>


-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
[EMAIL PROTECTED]                        [EMAIL PROTECTED]
+41 44 272 91 61


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
