Hi,

I think the problem is that the content comes from a database and not from
a file?
So the question is: how do you index a database with Nutch?

regards,
robert





Sébastien LE CALLONNEC <[EMAIL PROTECTED]>

08.09.2005 10:46
Please reply to nutch-user
 
        To:      [email protected], [EMAIL PROTECTED]
        Cc: 
        Subject: RE: How can I use Nutch 0.7 to crawl the Dynamic news?


Hi, 

You need to remove the '?' and the '=' from the following pattern:
[EMAIL PROTECTED]
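
To illustrate why that change matters, here is a minimal Python sketch of the first-match-wins evaluation described in the crawl-urlfilter.txt comments quoted below. It is not Nutch's own code. The pattern appears redacted in this thread, so the sketch assumes it is the character-class rule `-[?*!@=]`; `load_rules` and `accepts` are hypothetical helper names, and the use of "search anywhere in the URL" matching is also an assumption.

```python
import re

def load_rules(text):
    """Parse filter text into (accept, compiled_regex) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """The first matching pattern decides; no match means the URL is ignored."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False

# Assumed form of the rules from the quoted file (query-character rule is a guess).
ORIGINAL = r"""
-^(file|ftp|mailto):
-[?*!@=]
+^http://mysite.com/
-.
"""

# Same rules with '?' and '=' removed from the character class.
RELAXED = ORIGINAL.replace("-[?*!@=]", "-[*!@]")

if __name__ == "__main__":
    url = "http://mysite.com/news.asp?id=12345"
    print(accepts(load_rules(ORIGINAL), url))
    print(accepts(load_rules(RELAXED), url))
```

Under these assumptions, the dynamic URL is rejected by the query-character rule before the `+^http://mysite.com/` rule is ever reached, which is why the order of the lines matters as much as their content.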

Regards,
Sebastien.


--- mu xiaofeng <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> I'm using the Nutch 0.7 crawler to fetch my site,
> but it only fetches the static HTML files like:
> xxx.htm, xxx.html, xxx.asp, xxx.php, xxx.js
> 
> How can I use it to fetch the dynamic news pages,
> e.g. http://mysite.com/news.asp?id=12345 ?
> My crawl-urlfilter.txt content is:
> -----------------------------------------
> # The url filter file used by the crawl command.
> 
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
> 
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
> 
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
> 
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
> 
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
> 
> # accept hosts in MY.DOMAIN.NAME
> +^http://mysite.com/
> 
> # skip everything else
> -.
> -----------------------------------------
> 
> Thx all,
> 



 

 
 