Larsson85, please read past responses. Google is blocking all crawlers, not just yours, from indexing its search results. Because of its robots.txt directives you will not be able to do this.
If you place a sign on your house, "DO NOT ENTER", and I entered, you would be very upset. That is what the robots.txt file does for a site: it tells visiting bots what they may enter and what they may not.

Jake Jacobson
http://www.linkedin.com/in/jakejacobson
http://www.facebook.com/jakecjacobson
http://twitter.com/jakejacobson

Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter. -- ANONYMOUS

On Fri, Jul 17, 2009 at 9:32 AM, Larsson85 <[email protected]> wrote:
>
> I think I need more help on how to do this.
>
> I tried using
>
> <property>
>   <name>http.robots.agents</name>
>   <value>Mozilla/5.0*</value>
>   <description>The agent strings we'll look for in robots.txt files,
>   comma-separated, in decreasing order of precedence. You should
>   put the value of http.agent.name as the first agent name, and keep the
>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>   </description>
> </property>
>
> If I don't have the star at the end I get the same as earlier, "No URLs to
> fetch", and if I do I get "0 records selected for fetching, exiting".
>
> reinhard schwab wrote:
>>
>> Identify nutch as a popular user agent such as Firefox.
>>
>> Larsson85 schrieb:
>>> Any workaround for this? Making nutch identify as something else or
>>> something similar?
>>>
>>> reinhard schwab wrote:
>>>>
>>>> http://www.google.se/robots.txt
>>>>
>>>> Google disallows it:
>>>>
>>>> User-agent: *
>>>> Allow: /searchhistory/
>>>> Disallow: /search
>>>>
>>>> Larsson85 schrieb:
>>>>>
>>>>> Why isn't nutch able to handle links from Google?
>>>>>
>>>>> I tried to start a crawl from the following URL:
>>>>> http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N
>>>>>
>>>>> And all I get is "no more URLs to fetch".
>>>>>
>>>>> The reason why I want to do this is that I thought maybe I could use
>>>>> Google to generate my start list of URLs by injecting pages of
>>>>> search results.
>>>>> Why won't this page be parsed and links extracted so the crawl can
>>>>> start?

> --
> View this message in context:
> http://www.nabble.com/Why-cant-I-inject-a-google-link-to-the-database--tp24533162p24534522.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
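For reference, the http.robots.agents setting discussed in the thread is normally paired with http.agent.name; a sketch of the relevant nutch-site.xml entries (the agent name "MyCrawler" is illustrative, not from the thread). Note that no agent string fixes this particular case, because Google's `User-agent: *` record matches every agent:

```xml
<property>
  <name>http.agent.name</name>
  <value>MyCrawler</value>
  <description>Primary agent name; should appear first in
  http.robots.agents.</description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>MyCrawler,*</value>
  <description>Agent strings checked against robots.txt, in decreasing
  order of precedence; keep the default * at the end.</description>
</property>
```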
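The robots.txt rules quoted in the thread can be checked directly. A minimal sketch using Python's standard-library parser (the rules are copied from the thread; the user-agent string is illustrative) shows why even a browser-like agent is refused: the `User-agent: *` record applies to all crawlers.

```python
from urllib.robotparser import RobotFileParser

# The rules Google serves, as quoted in the thread above.
rules = [
    "User-agent: *",
    "Allow: /searchhistory/",
    "Disallow: /search",
]

rp = RobotFileParser()
rp.parse(rules)

# Even a browser-like agent string falls under the "*" record,
# so the /search path stays off-limits while /searchhistory/ is allowed.
print(rp.can_fetch("Mozilla/5.0", "http://www.google.se/search?q=site:se"))
print(rp.can_fetch("Mozilla/5.0", "http://www.google.se/searchhistory/"))
```

This is why changing http.agent.name or http.robots.agents in Nutch cannot help here: robots.txt matching is by agent name, and the disallow rule is not agent-specific.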
