On Fri, Jul 17, 2009 at 15:23, Larsson85<[email protected]> wrote:
>
> Any workaround for this? Making Nutch identify itself as something else, or
> something similar?
>

Also note that, by default, Nutch does not crawl any URL containing '?' or '&'.
Check crawl-urlfilter.txt or regex-urlfilter.txt (depending on whether you
use the crawl command
or the inject/generate/fetch/parse etc. commands).
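For reference, the filter that drops such URLs looks roughly like this in a
stock regex-urlfilter.txt (exact contents may differ between Nutch versions,
so check your own conf/ directory):

```
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# to let query-style URLs through, that line could be relaxed, e.g.:
# -[*!@]
```

The same pattern appears in crawl-urlfilter.txt if you use the one-step crawl
command; whichever file applies, the '?' must be removed from the character
class before URLs like Google search results will pass the filter.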

>
> reinhard schwab wrote:
>>
>> http://www.google.se/robots.txt
>>
>> google disallows it.
>>
>> User-agent: *
>> Allow: /searchhistory/
>> Disallow: /search
>>
>>
>> Larsson85 schrieb:
>>> Why isn't Nutch able to handle links from Google?
>>>
>>> I tried to start a crawl from the following url
>>> http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N
>>>
>>> And all I get is "no more URLs to fetch"
>>>
>>> The reason I want to do this is that I thought maybe I could use Google
>>> to generate my start list of URLs by injecting pages of search
>>> results.
>>>
>>> Why won't this page be parsed and its links extracted so the crawl can start?
>>>
>>
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/Why-cant-I-inject-a-google-link-to-the-database--tp24533162p24533426.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>



-- 
Doğacan Güney
