Hi all,

Just an update that I finally figured out why I was not getting any
parameterized URLs in my search. In my crawl-urlfilter.txt file, the line

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

was causing every URL containing "?" to be skipped. I replaced it with

[EMAIL PROTECTED]

and voilà! I got my URL!

Thanks much for your help.

Regards,
Rohit

On 4/30/08, Rohit Potnis <[EMAIL PROTECTED]> wrote:
>
> sorry... i was replying to Jasper's comments on searching the index... Any
> help regarding my last reply to this chain (my alternative approach)?
>
> also, please ignore the *...* surrounding the nutch configuration entries
> in the previous email.. (I guess Rich Text mail is not supported :))
> e.g. read *+^http://([a-z0-9]*\.)*somesite.com/* as just
> +^http://([a-z0-9]*\.)*somesite.com/
>
> Waiting for a reply..
>
> Rohit
>
>
> On 4/30/08, Rohit Potnis <[EMAIL PROTECTED]> wrote:
>>
>> Thanks for your replies..
>>
>> @Otis:
>>
>> Continuing from our previous email exchange:
>>
>> The "xyz" value was not in my list of indexes.
>>
>> So I tried an alternative:
>>
>> in my urls folder, I changed the url in the urls folder to:
>> http://www.somesite.com/somepage.jsp?id=someId
>> hoping that this would fetch only one URL.
>>
>> my crawl-urlfilter.txt was configured for:
>> # accept hosts in MY.DOMAIN.NAME <http://my.domain.name/>
>> *+^http://([a-z0-9]*\.)*somesite.com/*
>>
>> and I executed the command: *bin/nutch crawl urls -dir crawldir -depth 10*
>>
>> This, however, fetched 0 records.
>>
>> So now I'm wondering if my alternative was correct? If not, can you please
>> help me understand the right way to search this?
>>
>> thanks much,
>> Rohit
>>
>> On 4/30/08, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>>>
>>> JSP pages typically render HTML, so you don't need a JSP plugin, but a
>>> parse-html plugin in your nutch-site.xml
>>>
>>> Otis
>>> --
>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>
>>> ----- Original Message ----
>>> > From: Jasper Kamperman <[EMAIL PROTECTED]>
>>> > To: [email protected]
>>> > Sent: Wednesday, April 30, 2008 1:32:29 PM
>>> > Subject: Re: Searching parameterized URLs
>>> >
>>> > I think the first question is to figure out whether the page with URL
>>> > http://www.somesite.com/somepage.jsp?id=someId even made it into your
>>> > index. There are several ways to do this, personally I tend to use
>>> > luke to have a look at the index, tell luke to open your
>>> > nutch-0.9/crawl/index directory (which is where it ends up if you
>>> > follow the default instructions for running the crawl).
>>> >
>>> > If the page is in your index you can use luke to see what fields were
>>> > extracted, hopefully there is some field named "foo" which would have
>>> > "xyz" somewhere. The Nutch demo app should then find the page if you
>>> > specify foo:xyz in the searchbar. If "foo" is one of "content",
>>> > "title", "anchor" or "url" then the demo app should find it if you
>>> > plainly search for xyz, no need to specify any of the default fields.
>>> >
>>> > Since it is a jsp page, it is entirely possible that you either don't
>>> > have the correct (jsp) plugin configured or that the plugin you have
>>> > isn't smart enough to get the content out of a jsp page.
>>> >
>>> > Jasper
>>> >
>>> > On Apr 30, 2008, at 10:13 AM, Rohit Potnis wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > > I'm a nutch-newbie and am developing a search-based website.
>>> > >
>>> > > How can I use Nutch to search for parameterized URLs?
>>> > >
>>> > > e.g. I want to search on an item called "xyz". The information on
>>> > > this item is available on
>>> > > http://www.somesite.com/somepage.jsp?id=someId
>>> > > where someId is the databaseId (generated by the host application)
>>> > > for item "xyz".
>>> > >
>>> > > I know that item "xyz" shows up with the above URL when I search
>>> > > using Google but it doesn't appear when I search for it using the
>>> > > sample web application provided with nutch.
>>> > >
>>> > > *Configuration:*
>>> > >
>>> > > I have configured the crawl-urlfilter.txt to :
>>> > >
>>> > > # accept hosts in MY.DOMAIN.NAME <http://my.domain.name/>
>>> > > *+^http://([a-z0-9]*\.)*somesite.com/*
>>> > >
>>> > > My *urls* folder contains a text file containing :
>>> > > *http://www.somesite.com*
>>> > >
>>> > > and I executed the command:
>>> > > *bin/nutch crawl urls -dir crawldir -depth 3*
>>> > >
>>> > > How can I get: http://www.somesite.com/somepage.jsp?id=someId when
>>> > > I search for "xyz" the same way it shows up during a Google search?
>>> > >
>>> > > Your help would be much appreciated,
>>> > > Rohit
>>> >
>>> >
>>>
>>>
>>
>
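
For anyone who finds this thread later: the effect of the skip rule mentioned at the top can be reproduced outside Nutch with a small Python sketch. It is only an illustration under stated assumptions, not Nutch code. It assumes the filter file is read top to bottom, that the first rule whose regex is found anywhere in the URL decides ("+" accepts, "-" rejects), and that a URL matching no rule is rejected. The actual rule text was masked as [EMAIL PROTECTED] in the archive, so the two "-[...]" rules below are assumed stand-ins, and load_rules / accepts are hypothetical helper names.

```python
import re

# Assumed stand-in for the original filter file. The real skip rule was
# masked in the archived message; "-[?*!@=]" is an assumption.
ORIGINAL = r"""
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept hosts in somesite.com
+^http://([a-z0-9]*\.)*somesite.com/
"""

# Assumed stand-in for the relaxed file: "?" and "=" are no longer skipped.
RELAXED = r"""
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
# accept hosts in somesite.com
+^http://([a-z0-9]*\.)*somesite.com/
"""


def load_rules(text):
    """Parse filter lines into (accept, compiled_regex) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule found in the URL, else reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


url = "http://www.somesite.com/somepage.jsp?id=someId"
print(accepts(load_rules(ORIGINAL), url))  # False: the "?" hits the skip rule first
print(accepts(load_rules(RELAXED), url))   # True: falls through to the accept rule
```

Because the skip rule comes before the accept rule, a URL containing "?" never reaches the somesite.com pattern, which would explain why the crawl fetched 0 records until the rule was relaxed.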
