Hi Philip,

You have www.visitpa.com in your crawl-urlfilter regexp.
If some of your other pages have <something else>.visitpa.com as the host
name, they will be filtered out.
You may want to have just (....)visitpa.com in the regexp in that case.
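For example, an accept rule along these lines (untested, just a sketch with
the dots escaped and the www requirement dropped) would let details.pa pages
on any visitpa.com host through:

+^http://([a-z0-9]+\.)*visitpa\.com/visitpa/details\.pa\?id=

It is also worth checking that this +rule sits above the

-[?*!@=]

line that the stock crawl-urlfilter.txt ships with; as far as I recall, the
first rule that matches a URL wins, and that line rejects anything with a
query string.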
Just a thought.

Nitin Borwankar
http://tagschema.com

spamsucks wrote:

> Hi Yoni,
>
> That was a good thought; however, according to the logging output of 
> the crawl, I see the following...
>
> fetching http://www.visitpa.com/visitpa/details.pa?id=65851
> fetching http://www.visitpa.com/visitpa/details.pa?id=246139
> fetching http://www.visitpa.com/visitpa/details.pa?id=8427
>
> There are at least 100+ of these (too many to count), so it appears 
> that Nutch is fetching these URLs even though they are not unique 
> without the query string.
>
> Building on your thought, perhaps the other "details.pa" pages are 
> being reached through other pages that get indexed, and only one 
> "details.pa" page is actually being followed for crawling purposes.  
> That could be what is happening here, in which case your point is correct.
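>
> To put a number on it, I will try dumping the crawldb and counting the 
> details.pa entries, with something along these lines (the readdb options 
> are from memory, so they may need adjusting):
>
> bin/nutch readdb crawl/crawldb -dump crawldb-dump
> grep -c 'details.pa?id=' crawldb-dump/part-00000
>
> That should show whether those fetched URLs really end up as separate 
> entries in the database.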
>
> I appreciate your response!
> Phillip
>
>
>
> ----- Original Message ----- From: "Yoni Amir" <[EMAIL PROTECTED]>
> To: <nutch-user@lucene.apache.org>
> Sent: Wednesday, December 06, 2006 10:47 AM
> Subject: Re: page1 is crawled, but not pages in page1
>
>
> I think that in the crawldb and linkdb, the URL without the query
> string serves as the primary key (i.e. a URL is considered unique just
> by looking at the part before the query string). Thus, after your
> first page is fetched and you run updatedb, Nutch doesn't think it
> needs to fetch the others, because it already sees an entry for that
> URL in the database.
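>
> One way to check that (I haven't actually tried it, so the exact options 
> may be off) would be to ask the crawldb about two of the URLs directly:
>
> bin/nutch readdb crawl/crawldb -url "http://www.visitpa.com/visitpa/details.pa?id=123"
> bin/nutch readdb crawl/crawldb -url "http://www.visitpa.com/visitpa/details.pa?id=456"
>
> If both show up as separate entries, then the query string is part of the 
> key after all and my theory is wrong.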
>
> I am also new to nutch, so I don't know if there is a solution to your
> problem.
>
> Yoni
>
> On Wed, 2006-12-06 at 10:05 -0500, spamsucks wrote:
>
>> My subject is a pretty good summary.  I see the first 
>> "details.pa?id=123" in my results, but can't search for or find any of 
>> the "details.pa?id=456" links that are in that first page that was a hit.
>>
>> Backgrounder:
>> I have a site that includes a lot of dynamic pages.  I edited the
>> crawl-urlfilter.txt and added the following regex and did
>> a crawl (bin/nutch crawl urls -dir crawl -depth 30 -topN 30000):
>>
>> +^http://([a-z0-9]*\.)*www.visitpa.com/visitpa/details.pa\?id=
>>
>> Now the search will return hits on the dynamic details page.  For 
>> example,
>> here is a search that returns hits on my dynamic pages.
>> http://prhodes.r-effects.com/nutch/search.jsp?query=sunnyledge&hitsPerPage=10&lang=en
>>
>>
>> If you look at the details.pa page that Nutch had a hit on, it contains
>> several links of the same format (details.pa).
>> My problem is that these other detail links are not being 
>> crawled/indexed.
>>
>> I set the depth to "30", so that should not be a limiting factor.  I 
>> also set
>> a "topN" of 30000, because we have around 16K details.pa pages.
>>
>> Any clues on how to proceed and figure out what I need to do to get 
>> Nutch to
>> crawl these missing "details.pa" links?
>>
>>
>>
>>
>>
>
>

