2009/7/17 Doğacan Güney <[email protected]>:
> On Fri, Jul 17, 2009 at 15:23, Larsson85 <[email protected]> wrote:
>>
>> Any workaround for this? Making nutch identify as something else or something
>> similar?
>>
>
> Also note that nutch does not crawl anything with '?' or '&' in the URL. Check out


Oops. I mean nutch does not crawl any such URL *by default*.

> crawl-urlfilter.txt or regex-urlfilter.txt (depending on whether you
> use crawl command
> or inject/generate/fetch/parse etc. commands).
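
For reference, the default rule that does this in conf/crawl-urlfilter.txt (and
similarly in conf/regex-urlfilter.txt) looks roughly like the following, though
the exact contents may differ between Nutch versions:

  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]

If you want URLs with query strings to be crawled, remove the characters you
want to allow (for example '?' and '=') from that character class or comment
the line out, and make sure there is a '+' rule that accepts the hosts you are
interested in.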
>
>>
>> reinhard schwab wrote:
>>>
>>> http://www.google.se/robots.txt
>>>
>>> google disallows it.
>>>
>>> User-agent: *
>>> Allow: /searchhistory/
>>> Disallow: /search
>>>
>>>
>>> Larsson85 schrieb:
>>>> Why isn't nutch able to handle links from Google?
>>>>
>>>> I tried to start a crawl from the following url
>>>> http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N
>>>>
>>>> And all I get is "no more URLs to fetch"
>>>>
>>>> The reason I want to do this is that I thought maybe I could use Google to
>>>> generate my start list of URLs by injecting pages of search results.
>>>>
>>>> Why won't this page be parsed and its links extracted so the crawl can start?
>>>>
>>>
>>>
>>>
>>
>> --
>> View this message in context: 
>> http://www.nabble.com/Why-cant-I-inject-a-google-link-to-the-database--tp24533162p24533426.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
>
>
>
> --
> Doğacan Güney
>



-- 
Doğacan Güney
