[Nutch-general] Reply: RE: Reply: RE: How can I use Nutch 0.7 to crawl the Dynamic news?

Robert . Guggenberger Thu, 08 Sep 2005 04:22:14 -0700

Hi,

Sorry, that was my mistake.

Of course Nutch indexes every URL it fetches and reads the response as
text/html, however the page is generated. So you are right :-)

I'm quite new to Nutch (yesterday was my first day :-).

regards
robert




Sébastien LE CALLONNEC <[EMAIL PROTECTED]>

08.09.2005 12:35
Please reply to nutch-user
 
        To:      [email protected]
        Cc: 
        Subject: RE: Reply: RE: How can I use Nutch 0.7 to crawl the Dynamic news?


Hi,


I am not too sure what you're saying...  The ASP pages may be built
from data pulled out from a database, but at the end of the day, what
the browser displays is of text/html content-type, which can be indexed
by Nutch.

Or is your question related to another matter altogether?


Regards,
Sebastien.
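
As a rough illustration of the filter change suggested earlier in this
thread (quoted below), here is a small Python sketch of the "first matching
pattern wins" rule described in the comments of crawl-urlfilter.txt. It is
not Nutch code. The helper names load_rules and accepts are made up for
this example, the suffix list is shortened, and the query-character rule
-[?*!@=] is an assumption about what the stock file contains, since the
archive has obscured that line. Unanchored (search) matching is also
assumed.

```python
import re

# Shortened rule set modelled on the quoted crawl-urlfilter.txt.
# ASSUMPTION: the query-character rule is "-[?*!@=]"; the archive hides it.
ORIGINAL = r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image suffixes (list shortened for the example)
-\.(gif|jpg|png)$
# skip URLs containing certain characters as probable queries
-[?*!@=]
# accept hosts in MY.DOMAIN.NAME
+^http://mysite.com/
# skip everything else
-.
"""

# Same rules with '?' and '=' removed from the character class.
RELAXED = ORIGINAL.replace("-[?*!@=]", "-[*!@]")


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] in "+-":
            rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """The first pattern found in the URL decides; no match means ignore."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False


if __name__ == "__main__":
    url = "http://mysite.com/news.asp?id=12345"
    print("original rules:", accepts(load_rules(ORIGINAL), url))
    print("relaxed rules: ", accepts(load_rules(RELAXED), url))
```

Under these assumptions the news.asp?id=12345 URL is rejected by the
original rules because the query-character rule matches before the
+^http://mysite.com/ rule is reached, and it is accepted once '?' and '='
are taken out of that rule.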

--- [EMAIL PROTECTED] wrote:

> hi,
> 
> I think the problem is that the content comes from a database and not
> from a file?
> So the question is how to index a database with Nutch?
> 
> regards,
> robert
> 
> 
> 
> 
> 
> Sébastien LE CALLONNEC <[EMAIL PROTECTED]>
> 
> 08.09.2005 10:46
> Please reply to nutch-user
> 
>         To:      [email protected], [EMAIL PROTECTED]
>         Cc: 
>         Subject: RE: How can I use Nutch 0.7 to crawl the Dynamic news?
> 
> 
> Hi, 
> 
> You need to remove the '?' and the '=' from the following pattern:
> [EMAIL PROTECTED]
> 
> Regards,
> Sebastien.
> 
> 
> --- mu xiaofeng <[EMAIL PROTECTED]> wrote:
> 
> > hi ,
> > 
> > I'm using the Nutch 0.7 crawler to fetch my site,
> > but it only fetches static files such as
> > xxx.htm, xxx.html, xxx.asp, xxx.php, xxx.js
> > 
> > How can I use it to fetch dynamic news pages,
> > e.g. http://mysite.com/news.asp?id=12345 ?
> > My crawl-urlfilter.txt contains:
> > -----------------------------------------
> > # The url filter file used by the crawl command.
> > 
> > # Better for intranet crawling.
> > # Be sure to change MY.DOMAIN.NAME to your domain name.
> > 
> > # Each non-comment, non-blank line contains a regular expression
> > # prefixed by '+' or '-'.  The first matching pattern in the file
> > # determines whether a URL is included or ignored.  If no pattern
> > # matches, the URL is ignored.
> > 
> > # skip file:, ftp:, & mailto: urls
> > -^(file|ftp|mailto):
> > 
> > # skip image and other suffixes we can't yet parse
> > -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
> > 
> > # skip URLs containing certain characters as probable queries, etc.
> > [EMAIL PROTECTED]
> > 
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://mysite.com/
> > 
> > # skip everything else
> > -.
> > -----------------------------------------
> > 
> > Thx all,
> > 



 

 
 
