Hi,
I think the problem is that the content comes from a database rather than
from a file.
So the question is: how do you index a database with Nutch?
Regards,
Robert
Sébastien LE CALLONNEC <[EMAIL PROTECTED]>
08.09.2005 10:46
Please reply to nutch-user
To: [email protected], [EMAIL PROTECTED]
Cc:
Subject: RE: How can I use Nutch 0.7 to crawl the Dynamic news?
Hi,
You need to remove the '?' and the '=' from the following pattern:
-[?*!@=]
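
To see why that one line blocks URLs such as news.asp?id=12345, here is a
minimal Python sketch of the "first matching pattern wins" rule described in
the comments of crawl-urlfilter.txt. It is not Nutch code. It assumes the
redacted line is the stock Nutch rule -[?*!@=], and that a pattern matches
when it is found anywhere in the URL. The helper names load_rules and accepts
are made up for this illustration.

```python
import re

def load_rules(text):
    """Parse filter lines into (include, compiled_regex) pairs.

    Blank lines and '#' comments are skipped; every other line is assumed
    to start with '+' (include) or '-' (exclude).
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in url.

    If no rule matches, the URL is ignored (False).
    """
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False

# Rules from the quoted crawl-urlfilter.txt; the third line is the assumed
# original form of the redacted pattern.
ORIGINAL = r"""
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
-[?*!@=]
+^http://mysite.com/
-.
"""

# Same rules with '?' and '=' removed from the character class.
MODIFIED = ORIGINAL.replace("-[?*!@=]", "-[*!@]")

url = "http://mysite.com/news.asp?id=12345"
print(accepts(load_rules(ORIGINAL), url))
print(accepts(load_rules(MODIFIED), url))
```

With the original rules the '?' in the URL hits the exclude rule before the
+^http://mysite.com/ line is ever reached. With '?' and '=' removed, the URL
falls through to the accept rule for the host.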
Regards,
Sebastien.
--- mu xiaofeng <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'm using the Nutch 0.7 crawler to fetch my site,
> but it only fetches the static HTML files, like:
> xxx.htm, xxx.html, xxx.asp, xxx.php, xxx.js
>
> How can I use it to fetch the dynamic news pages,
> e.g. http://mysite.com/news.asp?id=12345 ?
> My crawl-urlfilter.txt contains:
> -----------------------------------------
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://mysite.com/
>
> # skip everything else
> -.
> -----------------------------------------
>
> Thx all,
>