Re: Searching parameterized URLs

Rohit Potnis Thu, 01 May 2008 07:39:01 -0700

Hi all,

Just an update that I finally figured out why I was not getting any
parameterized URLs in my search:
In my crawl-urlfilter.txt file the line


# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

was causing any URL containing "?" to be skipped. I replaced it with
[EMAIL PROTECTED]

and voilà! I got my URL!
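
For anyone hitting the same problem, here is a rough Python sketch of how I understand the filter file to be evaluated: rules are tried top to bottom, the first pattern found anywhere in the URL decides ("-" rejects, "+" accepts), and a URL that matches no rule is dropped. This is a simulation, not Nutch code. The list archive has obscured both filter lines above, so the "before" rule below is assumed to be the stock -[?*!@=] entry and the "after" rule is purely hypothetical; load_rules and accepts are names made up for this example.

```python
import re

def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """The first rule whose pattern is found anywhere in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is dropped

# Assumed stock rule (the real line is obscured in the archive).
BEFORE = r"""
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*somesite.com/
"""

# Hypothetical replacement: "?" and "=" removed from the character class.
AFTER = r"""
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*somesite.com/
"""

URL = "http://www.somesite.com/somepage.jsp?id=someId"

print(accepts(load_rules(BEFORE), URL))  # rejected by the skip rule
print(accepts(load_rules(AFTER), URL))   # falls through to the accept rule
```

Note that in this sketch the "=" has to come out of the character class as well as the "?", because the query string id=someId contains both.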

thanks much for your help.

regards,
Rohit



On 4/30/08, Rohit Potnis <[EMAIL PROTECTED]> wrote:
>
> sorry... I was replying to Jasper's comments on searching the index... Any
> help regarding my last reply to this chain (my alternative approach)?
>
> also, please ignore the *...* surrounding the nutch configuration entries
> in the previous email.. (I guess Rich Text mail is not supported :))
> e.g. read *+^http://([a-z0-9]*\.)*somesite.com/* as just +^http://
> ([a-z0-9]*\.)*somesite.com/
>
> Waiting for a reply..
>
> Rohit
>
>
> On 4/30/08, Rohit Potnis <[EMAIL PROTECTED]> wrote:
>>
>> Thanks for your replies..
>>
>> @Otis:
>>
>> Continuing from our previous email exchange:
>>
>> The "xyz" value was not in my list of indexes.
>>
>> So I tried an alternative:
>>
>> I changed the URL in my urls folder to:
>> http://www.somesite.com/somepage.jsp?id=someId
>> hoping that this would fetch only one URL.
>>
>> my crawl-urlfilter.txt was configured for:
>> # accept hosts in MY.DOMAIN.NAME <http://my.domain.name/>
>> *+^http://([a-z0-9]*\.)*somesite.com/*
>>
>> and I executed the command: *bin/nutch crawl urls -dir crawldir -depth
>> 10*
>>
>> This, however, fetched 0 records.
>>
>> So now I'm wondering if my alternative was correct? If not, can you please
>> help me understand the right way to search this?
>>
>> thanks much,
>> Rohit
>>
>>  On 4/30/08, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>>>
>>> JSP pages typically render HTML, so you don't need a JSP plugin, but a
>>> parse-html plugin in your nutch-site.xml
>>>
>>> Otis
>>> --
>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>
>>> ----- Original Message ----
>>> > From: Jasper Kamperman <[EMAIL PROTECTED]>
>>> > To: [email protected]
>>> > Sent: Wednesday, April 30, 2008 1:32:29 PM
>>> > Subject: Re: Searching parameterized URLs
>>> >
>>> > I think the first question is to figure out whether the page with URL
>>> > http://www.somesite.com/somepage.jsp?id=someId even made it into your
>>> > index. There are several ways to do this, personally I tend to use
>>> > luke to have a look at the index, tell luke to open your nutch-0.9/
>>> > crawl/index directory (which is where it ends up if you follow the
>>> > default instructions for running the crawl).
>>> >
>>> > If the page is in your index you can use luke to see what fields were
>>> > extracted, hopefully there is some field named "foo" which would have
>>> > "xyz" somewhere. The Nutch demo app should then find the page if you
>>> > specify foo:xyz in the searchbar. If "foo" is one of "content",
>>> > "title", "anchor" or "url" then the demo app should find it if you
>>> > plainly search for xyz, no need to specify any of the default fields.
>>> >
>>> > Since it is a jsp page, it is entirely possible that you either don't
>>> > have the correct (jsp) plugin configured or that the plugin you have
>>> > isn't smart enough to get the content out of a jsp page.
>>> >
>>> > Jasper
>>> >
>>> > On Apr 30, 2008, at 10:13 AM, Rohit Potnis wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > > I'm a nutch-newbie and am developing a search-based website.
>>> > >
>>> > > How can I use Nutch to search for parameterized URLs?
>>> > >
>>> > > e.g. I want to search on an item called "xyz". The information on
>>> > > this item
>>> > > is available on http://www.somesite.com/somepage.jsp?id=someId
>>> > > where someId is the databaseId (generated by the host application)
>>> > > for item
>>> > > "xyz".
>>> > >
>>> > >  I know that item "xyz" shows up with the above URL when I search
>>> > > using
>>> > > Google but it doesn't appear when I search for it using the sample
>>> > > web application provided with nutch.
>>> > >
>>> > > *Configuration:*
>>> > >
>>> > > I have configured the crawl-urlfilter.txt to :
>>> > >
>>> > > # accept hosts in MY.DOMAIN.NAME <http://my.domain.name/>
>>> > > *+^http://([a-z0-9]*\.)*somesite.com/*
>>> > >
>>> > > My *urls* folder contains a text file containing :
>>> > > *http://www.somesite.com*
>>> > >
>>> > > and I executed the command: *bin/nutch crawl urls -dir crawldir
>>> > > -depth 3*
>>> > >
>>> > > How can I get: http://www.somesite.com/somepage.jsp?id=someId when
>>> > > I search
>>> > > for "xyz" the same way it shows up during a Google search?
>>> > >
>>> > > Your help would be much appreciated,
>>> > > Rohit
>>> >
>>> >
>>>
>>>
>>>
>>
>
