Sorry, I was replying to Jasper's comments on searching the index. Any help regarding my last reply in this thread (my alternative approach)?

Also, please ignore the *...* surrounding the Nutch configuration entries
in the previous email (I guess rich-text mail is not supported :)).
E.g. read *+^http://([a-z0-9]*\.)*somesite.com/* as just
+^http://([a-z0-9]*\.)*somesite.com/

Waiting for a reply.

Rohit

On 4/30/08, Rohit Potnis <[EMAIL PROTECTED]> wrote:
>
> Thanks for your replies..
>
> @Otis:
>
> Continuing from our previous email exchange:
>
> The "xyz" value was not in my list of indexes.
>
> So I tried an alternative:
>
> in my urls folder, I changed the url in the urls folder to:
> http://www.somesite.com/somepage.jsp?id=someId
> hoping that this would fetch only one URL.
>
> my crawl-urlfilter.txt was configured for:
> # accept hosts in MY.DOMAIN.NAME <http://my.domain.name/>
> *+^http://([a-z0-9]*\.)*somesite.com/*
>
> and I executed the command: *bin/nutch crawl urls -dir crawldir -depth 10*
>
> This, however, fetched 0 records.
>
> So now I'm wondering if my alternative was correct? If not, can you please
> help me understand the right way to search this?
>
> thanks much,
> Rohit
>
> On 4/30/08, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >
> > JSP pages typically render HTML, so you don't need a JSP plugin, but an
> > parse-html plugin in your nutch-site.xml
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > ----- Original Message ----
> > > From: Jasper Kamperman <[EMAIL PROTECTED]>
> > > To: [email protected]
> > > Sent: Wednesday, April 30, 2008 1:32:29 PM
> > > Subject: Re: Searching parameterized URLs
> > >
> > > I think the first question is to figure out whether the page with URL
> > > http://www.somesite.com/somepage.jsp?id=someId even made it into your
> > > index. There are several ways to do this, personally I tend to use
> > > luke to have a look at the index, tell luke to open your
> > > nutch-0.9/crawl/index directory (which is where it ends up if you
> > > follow the default instructions for running the crawl).
> > >
> > > If the page is in your index you can use luke to see what fields were
> > > extracted, hopefully there is some field named "foo" which would have
> > > "xyz" somewhere. The Nutch demo app should then find the page if you
> > > specify foo:xyz in the searchbar. If "foo" is one of "content",
> > > "title", "anchor" or "url" then the demo app should find it if you
> > > plainly search for xyz, no need to specify any of the default fields.
> > >
> > > Since it is a jsp page, it is entirely possible that you either don't
> > > have the correct (jsp) plugin configured or that the plugin you have
> > > isn't smart enough to get the content out of a jsp page.
> > >
> > > Jasper
> > >
> > > On Apr 30, 2008, at 10:13 AM, Rohit Potnis wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm a nutch-newbie and am developing a search-based website.
> > > >
> > > > How can I use Nutch to search for parameterized URLs?
> > > >
> > > > e.g. I want to search on an item called "xyz". The information on
> > > > this item is available on
> > > > http://www.somesite.com/somepage.jsp?id=someId
> > > > where someId is the databaseId (generated by the host application)
> > > > for item "xyz".
> > > >
> > > > I know that item "xyz" shows up with the above URL when I search
> > > > using Google but it doesn't appear when I search for it using the
> > > > sample web application provided with nutch.
> > > >
> > > > *Configuration:*
> > > >
> > > > I have configured the crawl-urlfilter.txt to :
> > > >
> > > > # accept hosts in MY.DOMAIN.NAME <http://my.domain.name/>
> > > > *+^http://([a-z0-9]*\.)*somesite.com/*
> > > >
> > > > My *urls* folder contains a text file containing :
> > > > *http://www.somesite.com*
> > > >
> > > > and I executed the command:
> > > > *bin/nutch crawl urls -dir crawldir -depth 3*
> > > >
> > > > How can I get: http://www.somesite.com/somepage.jsp?id=someId when
> > > > I search for "xyz" the same way it shows up during a Google search?
> > > >
> > > > Your help would be much appreciated,
> > > > Rohit
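
One thing that may be worth checking for the "fetched 0 records" result is how the filter rules treat a URL with a query string. The sketch below is Python, not Nutch's own Java code, and only illustrates ordered +/- regex rules where the first matching rule decides. It assumes (nothing in the thread confirms this) that the crawl-urlfilter.txt in use still has a skip rule such as -[?*!@=] ahead of the host rule. The names parse_rules and accepts are hypothetical helpers written for this example.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Host rule quoted in the thread (without the *...* rich-text markers).
HOST_RULE = r"+^http://([a-z0-9]*\.)*somesite.com/"
# ASSUMPTION: a skip rule for query-like characters precedes the host rule.
QUERY_SKIP = r"-[?*!@=]"
URL = "http://www.somesite.com/somepage.jsp?id=someId"

with_skip = parse_rules("\n".join([QUERY_SKIP, HOST_RULE]))
host_only = parse_rules(HOST_RULE)

print(accepts(with_skip, URL))   # False: the '?' and '=' hit the skip rule first
print(accepts(host_only, URL))   # True: only the host rule applies
```

If the file in question does contain such a skip rule, the seed URL with ?id=someId would be filtered out before fetching, which would be consistent with an empty crawl. Looking at the actual crawl-urlfilter.txt is the way to confirm or rule this out.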
