Sorry about that. I was confusing JSP with JavaScript, which requires the parse-js plugin.

On 4/30/08, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

JSP pages typically render HTML, so you don't need a JSP plugin; you need
the parse-html plugin enabled in your nutch-site.xml.
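
One way to check whether parse-html is actually enabled is to read the
plugin.includes property from nutch-site.xml and match the plugin id against
it. The following is only a rough sketch: the sample XML and its
plugin.includes value are made up for illustration (your file will differ),
and it assumes Nutch treats plugin.includes as a regular expression that must
match the whole plugin id.

```python
import re
import xml.etree.ElementTree as ET


def plugin_includes(site_xml):
    """Return the plugin.includes value from a nutch-site.xml string, or None."""
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "plugin.includes":
            return prop.findtext("value", "").strip()
    return None


def is_enabled(includes, plugin_id):
    """Assumption: a plugin is enabled when the regex matches its whole id."""
    return includes is not None and re.fullmatch(includes, plugin_id) is not None


# Hypothetical file contents, for illustration only.
SAMPLE_SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    includes = plugin_includes(SAMPLE_SITE_XML)
    for plugin_id in ("parse-html", "parse-js"):
        print(plugin_id, is_enabled(includes, plugin_id))
```

If plugin.includes is not set in nutch-site.xml at all, the value from
nutch-default.xml applies, so that file would be the next place to look.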

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Jasper Kamperman <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, April 30, 2008 1:32:29 PM
Subject: Re: Searching parameterized URLs

I think the first step is to figure out whether the page with URL
http://www.somesite.com/somepage.jsp?id=someId even made it into your
index. There are several ways to do this; personally, I tend to use
Luke to have a look at the index. Tell Luke to open your
nutch-0.9/crawl/index directory (which is where the index ends up if you
follow the default instructions for running the crawl).

If the page is in your index, you can use Luke to see what fields were
extracted; hopefully there is some field named "foo" that contains
"xyz" somewhere. The Nutch demo app should then find the page if you
specify foo:xyz in the search bar. If "foo" is one of "content",
"title", "anchor" or "url", then the demo app should find it if you
simply search for xyz; there is no need to name any of the default fields.

Since it is a jsp page, it is entirely possible that you either don't
have the correct (jsp) plugin configured or that the plugin you have
isn't smart enough to get the content out of a jsp page.

Jasper

On Apr 30, 2008, at 10:13 AM, Rohit Potnis wrote:

Hi,

I'm a nutch-newbie and am developing a search-based website.

How can I use Nutch to search for parameterized URLs?

e.g. I want to search on an item called "xyz". The information on this item
is available on http://www.somesite.com/somepage.jsp?id=someId
where someId is the databaseId (generated by the host application) for item
"xyz".

I know that item "xyz" shows up with the above URL when I search using
Google, but it doesn't appear when I search for it using the sample web
application provided with nutch.

*Configuration:*

I have configured crawl-urlfilter.txt as follows:

# accept hosts in MY.DOMAIN.NAME
*+^http://([a-z0-9]*\.)*somesite.com/*

My *urls* folder contains a text file containing:
*http://www.somesite.com*

and I executed the command: *bin/nutch crawl urls -dir crawldir -depth 3*
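
A quick way to see how filter rules like the one above treat a given URL is
to replay them outside Nutch. The sketch below is only an approximation: it
assumes rules are tried top to bottom, that the first regex found anywhere in
the URL decides (+ accepts, - rejects), and that a URL matching no rule is
dropped. It also assumes the file still contains a "-[?*!@=]" rule for
skipping query-style URLs; that rule is not shown in this message, so check
your own crawl-urlfilter.txt.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule whose regex is found in the URL decides."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False  # assumption: URLs that match no rule are dropped


ASSUMED_RULES = r"""
# ASSUMPTION: skip-query rule, not shown in the message
-[?*!@=]
# rule quoted in the message (without the emphasis asterisks)
+^http://([a-z0-9]*\.)*somesite.com/
"""

if __name__ == "__main__":
    rules = parse_rules(ASSUMED_RULES)
    for url in ("http://www.somesite.com/index.html",
                "http://www.somesite.com/somepage.jsp?id=someId"):
        print(url, accepts(rules, url))
```

Under these assumptions the parameterized URL is rejected because it contains
"?" and "=", so it would never be fetched or indexed. Removing or narrowing
that rule, if it is present, would be the first thing to try.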

How can I get http://www.somesite.com/somepage.jsp?id=someId when I search
for "xyz", the same way it shows up during a Google search?

Your help would be much appreciated,
Rohit





