Hi,
sorry, that was my mistake.
Of course Nutch indexes all URLs and pages, and reads the file as text/html.
So you are right :-)
I'm quite new to Nutch (yesterday was my first day :-).
regards
robert
Sébastien LE CALLONNEC <[EMAIL PROTECTED]>
08.09.2005 12:35
Please reply to nutch-user
To: [email protected]
Cc:
Subject: RE: Reply: RE: How can I use Nutch 0.7 to crawl the Dynamic news?
Hi,
I am not too sure what you're saying... The ASP pages may be built
from data pulled out from a database, but at the end of the day, what
the browser displays is of text/html content-type, which can be indexed
by Nutch.
Or is your question related to another matter altogether?
Regards,
Sebastien.
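
To illustrate the point about content types, here is a minimal sketch in Python (not ASP, and not part of Nutch). Everything in it is hypothetical: the `news` table, the `news_app` handler and the `fetch` helper are stand-ins for a page like news.asp. It only shows that a page assembled from a database row still reaches the client as text/html, which is all a crawler sees.

```python
import html
import sqlite3
from urllib.parse import parse_qs
from wsgiref.util import setup_testing_defaults

# Hypothetical stand-in for the site's database: one news row, in memory.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE news (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("INSERT INTO news VALUES (12345, 'Example headline')")


def news_app(environ, start_response):
    """Render one news row as HTML, selected by the id= query parameter."""
    news_id = int(parse_qs(environ["QUERY_STRING"])["id"][0])
    row = db.execute("SELECT title FROM news WHERE id = ?", (news_id,)).fetchone()
    page = "<html><body><h1>{}</h1></body></html>".format(html.escape(row[0]))
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [page.encode("utf-8")]


def fetch(query_string):
    """Call the app in-process and return (headers, body) as a client would see them."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["QUERY_STRING"] = query_string
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(news_app(environ, start_response))
    return captured["headers"], body.decode("utf-8")


headers, body = fetch("id=12345")
print(headers["Content-Type"])
print(body)
```

The response carries no trace of the database; whether such a URL gets fetched at all is decided earlier, by the URL filter.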
--- [EMAIL PROTECTED] wrote:
> hi,
>
> I think the problem is that the content comes from a database and not
> from a file?
> So the question is: how do I index a database with Nutch?
>
> regards,
> robert
>
> Sébastien LE CALLONNEC <[EMAIL PROTECTED]>
> 08.09.2005 10:46
> Please reply to nutch-user
>
> To: [email protected], [EMAIL PROTECTED]
> Cc:
> Subject: RE: How can I use Nutch 0.7 to crawl the Dynamic news?
>
>
> Hi,
>
> You need to remove the '?' and the '=' from the following pattern:
> [EMAIL PROTECTED]
>
> Regards,
> Sebastien.
>
>
> --- mu xiaofeng <[EMAIL PROTECTED]> wrote:
>
> > hi ,
> >
> > I'm using the Nutch 0.7 crawler to fetch my site,
> > but it only fetches the static html files, like:
> > xxx.htm, xxx.html, xxx.asp, xxx.php, xxx.js
> >
> > How can I use it to fetch the dynamic news pages,
> > e.g. http://mysite.com/news.asp?id=12345 ?
> > My crawl-urlfilter.txt contains:
> > -----------------------------------------
> > # The url filter file used by the crawl command.
> >
> > # Better for intranet crawling.
> > # Be sure to change MY.DOMAIN.NAME to your domain name.
> >
> > # Each non-comment, non-blank line contains a regular expression
> > # prefixed by '+' or '-'. The first matching pattern in the file
> > # determines whether a URL is included or ignored. If no pattern
> > # matches, the URL is ignored.
> >
> > # skip file:, ftp:, & mailto: urls
> > -^(file|ftp|mailto):
> >
> > # skip image and other suffixes we can't yet parse
> > -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > [EMAIL PROTECTED]
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://mysite.com/
> >
> > # skip everything else
> > -.
> > -----------------------------------------
> >
> > Thx all,
> >
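
For readers of the archive: the filter behaviour discussed in this thread can be sketched in a few lines of Python. This is a rough model of the "first matching pattern wins" rule described in the comments of the quoted crawl-urlfilter.txt, not Nutch's own code. One loud assumption: the archive has masked the query-character rule as [EMAIL PROTECTED], so the pattern `-[?*!@=]` used below is a guess at what that line looked like, and `accepts` is a hypothetical helper name.

```python
import re

# Rules in the same "+regex" / "-regex" form as the quoted crawl-urlfilter.txt.
# ASSUMPTION: the third rule is masked in the archive; -[?*!@=] is a guess.
ORIGINAL_RULES = [
    r"-^(file|ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$",
    r"-[?*!@=]",
    r"+^http://mysite.com/",
    r"-.",
]

# The same rules with '?' and '=' removed from the character class,
# as suggested in the thread.
FIXED_RULES = [r"-[*!@]" if rule == r"-[?*!@=]" else rule for rule in ORIGINAL_RULES]


def accepts(url, rules):
    """Return True if the first rule that matches the URL is a '+' rule.

    A URL that matches no rule at all is ignored, as the file's comments say.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


url = "http://mysite.com/news.asp?id=12345"
print(accepts(url, ORIGINAL_RULES))  # rejected: '?' hits the query-character rule first
print(accepts(url, FIXED_RULES))     # accepted: falls through to the +^http://mysite.com/ rule
```

Under that assumption, the dynamic news URL is rejected by the third rule before it ever reaches the `+^http://mysite.com/` line, which is why removing '?' and '=' from the pattern lets it through.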