Hi all,

Sorry for my unclear description. I know that Nutch can index text/html files.

My problem is that the Nutch crawler only fetches URLs like
http://mysite.com/test_sample/test.html. It skips all URLs like
http://mysite.com/test_news/news.asp?newsid=123xx.
How can I make it fetch these URLs?

2005/9/8, [EMAIL PROTECTED] <[EMAIL PROTECTED]>:
> hi,
> 
> Sorry, it was my fault.
> 
> Of course Nutch indexes all URLs and pages and reads the file as text/html.
> So you are right :-)
> 
> I'm quite new to Nutch (first day was yesterday :-).
> 
> regards
> robert
> 
> 
> 
> 
> Sébastien LE CALLONNEC <[EMAIL PROTECTED]>
> 
> 08.09.2005 12:35
> Please reply to nutch-user
> 
>        To:     [email protected]
>        Cc:
>        Subject:  RE: Reply: RE: How can I use Nutch 0.7 to crawl the
> Dynamic news?
> 
> 
> Hi,
> 
> 
> I am not too sure what you're saying...  The ASP pages may be built
> from data pulled out from a database, but at the end of the day, what
> the browser displays is of text/html content-type, which can be indexed
> by Nutch.
> 
> Or is your question related to another matter altogether?
> 
> 
> Regards,
> Sebastien.
> 
> --- [EMAIL PROTECTED] wrote:
> 
> > hi,
> >
> > I think the problem is that the content comes from a database and not
> > from a file?
> > So the question is how to index a database with Nutch?
> >
> > regards,
> > robert
> >
> >
> >
> >
> >
> > Sébastien LE CALLONNEC <[EMAIL PROTECTED]>
> >
> > 08.09.2005 10:46
> > Please reply to nutch-user
> >
> >         To:     [email protected], [EMAIL PROTECTED]
> >         Cc:
> >         Subject:  RE: How can I use Nutch 0.7 to crawl the Dynamic
> > news?
> >
> >
> > Hi,
> >
> > You need to remove the '?' and the '=' from the following pattern:
> > [EMAIL PROTECTED]
> >
> > Regards,
> > Sebastien.
> >
> >
> > --- mu xiaofeng <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > >
> > > I'm using the Nutch 0.7 crawler to fetch my site,
> > > but it only fetches static files like:
> > > xxx.htm, xxx.html, xxx.asp, xxx.php, xxx.js
> > >
> > > How can I use it to fetch the dynamic news pages,
> > > e.g. http://mysite.com/news.asp?id=12345 ?
> > > My crawl-urlfilter.txt content is:
> > > -----------------------------------------
> > > # The url filter file used by the crawl command.
> > >
> > > # Better for intranet crawling.
> > > # Be sure to change MY.DOMAIN.NAME to your domain name.
> > >
> > > # Each non-comment, non-blank line contains a regular expression
> > > # prefixed by '+' or '-'.  The first matching pattern in the file
> > > # determines whether a URL is included or ignored.  If no pattern
> > > # matches, the URL is ignored.
> > >
> > > # skip file:, ftp:, & mailto: urls
> > > -^(file|ftp|mailto):
> > >
> > > # skip image and other suffixes we can't yet parse
> > > -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
> > >
> > > # skip URLs containing certain characters as probable queries, etc.
> > > [EMAIL PROTECTED]
> > >
> > > # accept hosts in MY.DOMAIN.NAME
> > > +^http://mysite.com/
> > >
> > > # skip everything else
> > > -.
> > > -----------------------------------------
> > >
> > > Thx all,
> > >

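To make the advice in the quoted thread concrete, here is a rough Python sketch of the first-match rule that the comments in crawl-urlfilter.txt describe, and of what happens when '?' and '=' are taken out of the "probable queries" pattern. Several things here are assumptions rather than facts from the thread: the archive shows that pattern only as [EMAIL PROTECTED], and the sketch guesses it was the character class -[?*!@=]; the names accepts, ORIGINAL_RULES and RELAXED_RULES are made up for illustration; and Python's re.search only stands in for whatever regex matching Nutch itself performs. This is not Nutch code.

```python
import re

# Rules in file order as (sign, regex). The third rule is an ASSUMPTION:
# the archive masks it as "[EMAIL PROTECTED]", so "[?*!@=]" is a guess.
ORIGINAL_RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls"
          r"|gz|rpm|tgz|mov|MOV|exe|png|PNG)$"),
    ("-", r"[?*!@=]"),
    ("+", r"^http://mysite.com/"),
    ("-", r"."),
]

# Same rules with '?' and '=' removed from the (assumed) character class.
RELAXED_RULES = list(ORIGINAL_RULES)
RELAXED_RULES[2] = ("-", r"[*!@]")


def accepts(url, rules):
    """Return True if the first rule whose regex matches has sign '+'.

    Mirrors the comment in the filter file: the first matching pattern
    decides, and a URL that matches no pattern is ignored.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    static_url = "http://mysite.com/test_sample/test.html"
    news_url = "http://mysite.com/test_news/news.asp?newsid=123xx"
    for name, rules in (("original", ORIGINAL_RULES), ("relaxed", RELAXED_RULES)):
        print(name, accepts(static_url, rules), accepts(news_url, rules))
```

Under these assumptions the news URL is rejected by the third rule before the +^http://mysite.com/ line is ever reached, which is why only the order of the rules and the characters in that class matter here.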

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
