Re: Searching parameterized URLs

Jasper Kamperman Thu, 01 May 2008 09:20:32 -0700

Sorry about that, I was confusing JSP with JavaScript, which requires the parse-js plugin.

On 4/30/08, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

JSP pages typically render HTML, so you don't need a JSP plugin, just the
parse-html plugin enabled in your nutch-site.xml.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----

From: Jasper Kamperman <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, April 30, 2008 1:32:29 PM
Subject: Re: Searching parameterized URLs

I think the first question is to figure out whether the page with URL
http://www.somesite.com/somepage.jsp?id=someId even made it into your
index. There are several ways to do this; personally I tend to use
Luke to have a look at the index. Tell Luke to open your
nutch-0.9/crawl/index directory (which is where the index ends up if you
follow the default instructions for running the crawl).

If the page is in your index you can use Luke to see what fields were
extracted; hopefully there is some field named "foo" which would have
"xyz" somewhere. The Nutch demo app should then find the page if you
specify foo:xyz in the search bar. If "foo" is one of "content",
"title", "anchor" or "url" then the demo app should find it if you
plainly search for xyz, with no need to specify any of the default fields.
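
To make the difference between foo:xyz and a plain xyz search concrete, here is a toy Python model of the behaviour described above. It is only a sketch of the idea, not Nutch's query code: the `matches` helper, the `DEFAULT_FIELDS` tuple and the sample documents are hypothetical names made up for illustration.

```python
# Toy model only: a "document" is a dict of field name -> extracted text.
# ASSUMPTION: the default fields are the four named in the message above.
DEFAULT_FIELDS = ("content", "title", "anchor", "url")


def matches(doc, query):
    """Return True if the toy document matches 'field:term' or a bare 'term'."""
    field, sep, term = query.partition(":")
    if sep:
        # Explicit field given, e.g. "foo:xyz": look in that field only.
        fields = (field,)
    else:
        # Bare term, e.g. "xyz": look in the default fields only.
        term = query
        fields = DEFAULT_FIELDS
    term = term.lower()
    return any(term in doc.get(f, "").lower().split() for f in fields)


# "xyz" was extracted into a non-default field named "foo".
doc_foo = {"title": "Item page", "foo": "xyz"}
# "xyz" was extracted into the default "content" field.
doc_content = {"title": "Item page", "content": "details about xyz"}

print(matches(doc_foo, "foo:xyz"))   # True
print(matches(doc_foo, "xyz"))       # False
print(matches(doc_content, "xyz"))   # True
```

So if Luke shows "xyz" only in a field outside the default ones, a plain search for xyz would come back empty even though the page is indexed.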

Since it is a JSP page, it is entirely possible that you either don't
have the correct (JSP) plugin configured or that the plugin you have
isn't smart enough to get the content out of a JSP page.

Jasper

On Apr 30, 2008, at 10:13 AM, Rohit Potnis wrote:

Hi,

I'm a nutch-newbie and am developing a search-based website.

How can I use Nutch to search for parameterized URLs?

e.g. I want to search on an item called "xyz". The information on this item
is available on http://www.somesite.com/somepage.jsp?id=someId
where someId is the databaseId (generated by the host application) for item
"xyz".

I know that item "xyz" shows up with the above URL when I search using
Google but it doesn't appear when I search for it using the sample web
application provided with nutch.

*Configuration:*

I have configured the crawl-urlfilter.txt to :

# accept hosts in MY.DOMAIN.NAME
*+^http://([a-z0-9]*\.)*somesite.com/*
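
A quick way to sanity-check a filter file like this is to replay its rules by hand. The Python sketch below assumes the usual semantics of this kind of regex URL filter: rules are tried in file order, the first pattern found in the URL decides, and a URL matching no rule is ignored. The `-[?*!@=]` line is an assumption: the stock crawl-urlfilter.txt ships with a skip rule of that shape for probable query URLs, so it is worth checking whether your copy still has it above the accept line. `accepts` and `RULES` are hypothetical names, and the accept rule is the one above with the surrounding emphasis asterisks dropped.

```python
import re

# Rules in file order. '+' accepts, '-' rejects.
RULES = [
    "-[?*!@=]",                               # ASSUMPTION: stock skip rule, verify in your file
    r"+^http://([a-z0-9]*\.)*somesite.com/",  # accept rule from the message above
]


def accepts(url, rules):
    """First rule whose regex is found in the URL decides; no match means reject."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


url = "http://www.somesite.com/somepage.jsp?id=someId"

print(accepts(url, RULES))      # False: the '?' and '=' hit the skip rule first
print(accepts(url, RULES[1:]))  # True: without the skip rule the accept rule matches
```

If that skip rule is present in your file, the parameterized URL would be dropped before it is ever fetched, which would also explain why it never shows up in the index.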

My *urls* folder contains a text file containing :
*http://www.somesite.com*

and I executed the command: *bin/nutch crawl urls -dir crawldir -depth 3*

How can I get: http://www.somesite.com/somepage.jsp?id=someId when I search
for "xyz" the same way it shows up during a Google search?

Your help would be much appreciated,
Rohit
