Re: Searching parameterized URLs

ogjunk-nutch Wed, 30 Apr 2008 11:01:15 -0700

JSP pages typically render HTML, so you don't need a JSP plugin; you just
need the parse-html plugin enabled in your nutch-site.xml.
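
As a rough way to check this, the sketch below reads a nutch-site.xml and
tests whether a plugin id is covered by the plugin.includes property. The
XML content and its value are assumptions made up for illustration, not
taken from anyone's actual configuration, and the whole-string match is
an assumption about how Nutch compares plugin ids against that regex.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical nutch-site.xml content (an assumption for illustration only).
NUTCH_SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>
</configuration>
"""


def plugin_enabled(site_xml: str, plugin_id: str) -> bool:
    """Return True if plugin_id matches the plugin.includes regex as a whole."""
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "plugin.includes":
            pattern = prop.findtext("value") or ""
            # Assumption: the plugin id must match the whole expression.
            return re.fullmatch(pattern, plugin_id) is not None
    # Property not overridden in this file; the defaults would apply instead.
    return False


print(plugin_enabled(NUTCH_SITE_XML, "parse-html"))
print(plugin_enabled(NUTCH_SITE_XML, "parse-pdf"))
```

If plugin.includes is not overridden in nutch-site.xml at all, the value
from nutch-default.xml applies, so a False result here only says that this
particular file does not enable the plugin.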


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Jasper Kamperman <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Wednesday, April 30, 2008 1:32:29 PM
> Subject: Re: Searching parameterized URLs
> 
> I think the first question is whether the page with URL
> http://www.somesite.com/somepage.jsp?id=someId even made it into your
> index. There are several ways to check this; personally I tend to use
> Luke to look at the index. Tell Luke to open your nutch-0.9/crawl/index
> directory (which is where the index ends up if you follow the
> default instructions for running the crawl).
> 
> If the page is in your index, you can use Luke to see what fields were
> extracted. Hopefully there is some field named "foo" that contains
> "xyz" somewhere. The Nutch demo app should then find the page if you
> specify foo:xyz in the search bar. If "foo" is one of "content",
> "title", "anchor" or "url", then the demo app should find it if you
> simply search for xyz; there is no need to specify any of the default
> fields.
> 
> Since it is a jsp page, it is entirely possible that you either don't  
> have the correct (jsp) plugin configured or that the plugin you have  
> isn't smart enough to get the content out of a jsp page.
> 
> Jasper
> 
> On Apr 30, 2008, at 10:13 AM, Rohit Potnis wrote:
> 
> > Hi,
> >
> > I'm a nutch-newbie and am developing a search-based website.
> >
> > How can I use Nutch to search for parameterized URLs?
> >
> > e.g. I want to search on an item called "xyz". The information on this item
> > is available on http://www.somesite.com/somepage.jsp?id=someId
> > where someId is the databaseId (generated by the host application) for item
> > "xyz".
> >
> > I know that item "xyz" shows up with the above URL when I search using
> > Google but it doesn't appear when I search for it using the sample web
> > application provided with nutch.
> >
> > *Configuration:*
> >
> > I have configured the crawl-urlfilter.txt to :
> >
> > # accept hosts in MY.DOMAIN.NAME
> > *+^http://([a-z0-9]*\.)*somesite.com/*
> >
> > My *urls* folder contains a text file containing :
> > *http://www.somesite.com*
> >
> > and I executed the command: *bin/nutch crawl urls -dir crawldir -depth 3*
> >
> > How can I get: http://www.somesite.com/somepage.jsp?id=someId when I search
> > for "xyz" the same way it shows up during a Google search?
> >
> > Your help would be much appreciated,
> > Rohit
> 
>
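
One more thing that may be worth checking for a URL like
http://www.somesite.com/somepage.jsp?id=someId is whether the URL filter
lets it through at all. The sketch below mimics, in simplified form, how a
crawl-urlfilter.txt style rule list could be applied: rules are tried in
order, the first pattern found in the URL decides, and a URL matching no
rule is rejected. The "-[?*!@=]" rule is an assumption: it does not appear
in the configuration quoted in this thread, but a rule of that kind, which
skips URLs that look like queries, may still be present in the file. The
function name accepts is hypothetical.

```python
import re

# '+' accepts, '-' rejects. The first rule is ASSUMED, not quoted in the thread.
RULES_WITH_QUERY_SKIP = [
    "-[?*!@=]",
    r"+^http://([a-z0-9]*\.)*somesite.com/",
]

# The same list without the assumed query-skipping rule.
RULES_WITHOUT_QUERY_SKIP = [
    r"+^http://([a-z0-9]*\.)*somesite.com/",
]


def accepts(url: str, rules: list[str]) -> bool:
    """Apply rules in order; the first pattern found in the URL decides."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched: treat the URL as rejected.
    return False


url = "http://www.somesite.com/somepage.jsp?id=someId"
print(accepts(url, RULES_WITH_QUERY_SKIP))
print(accepts(url, RULES_WITHOUT_QUERY_SKIP))
```

If such a rule is in place, the page would never be fetched, which would
also explain why it cannot be found in the index.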