Hmm, I’m a little confused here. You were first trying to use the
robots.txt whitelist, and now you are talking about Selenium.

 

1. Did the whitelist work?

2. Are you now asking how to use Nutch and Selenium?

 

Cheers,

Chris

From: jyoti aditya <[email protected]>
Date: Thursday, December 1, 2016 at 10:26 PM
To: "Mattmann, Chris A (3010)" <[email protected]>
Subject: Re: Impolite crawling using NUTCH

 

Hi Chris, 

 

Thanks for the response.

I made the changes you suggested.

 

But I am still not able to get all of the content from a web page.

Can you please tell me whether I need to add a Selenium plugin to crawl
dynamic content on a web page?
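
For reference, this is roughly the nutch-site.xml change I have been
considering for the protocol-selenium plugin. It is only a sketch: the
exact plugin.includes value depends on the Nutch version and the plugins
already in use, and selenium.driver must name a browser installed on the
crawl machine:

  <!-- Swap protocol-http for protocol-selenium so pages are fetched
       through a real browser and JavaScript-rendered content is loaded
       before parsing. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>selenium.driver</name>
    <value>firefox</value>
  </property>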

 

My concern is that wiki pages like these are not directly accessible.

There is no other way to reach such useful pages.

Please advise on how to handle this.

With Regards,

Jyoti Aditya

 

On Tue, Nov 29, 2016 at 7:29 PM, Mattmann, Chris A (3010) 
<[email protected]> wrote:

There is a robots.txt whitelist. You can find documentation here:

https://wiki.apache.org/nutch/WhiteListRobots
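
A minimal nutch-site.xml sketch, assuming a release that ships the
robot.rules.whitelist property (Nutch 1.10 or later). The host names below
are placeholders, and the property description warns that it should only
list hosts whose owners have explicitly allowed you to bypass robots.txt:

  <!-- robots.txt parsing is skipped for exactly these hosts;
       every other host is still crawled politely. -->
  <property>
    <name>robot.rules.whitelist</name>
    <value>wiki.example.com,192.168.1.1</value>
  </property>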

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, Open Source Projects Formulation and Development Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



On 11/29/16, 8:57 AM, "Tom Chiverton" <[email protected]> wrote:

    Sure, you can remove the check from the code and recompile.

    Under what circumstances would you need to ignore robots.txt? Would
    something like allowing access by particular IPs or user agents be an
    alternative?
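
    If the site can key access off the crawler's identity, the Nutch side of
    that alternative is just a distinctive agent name in nutch-site.xml. A
    minimal sketch (the value below is a placeholder; pick your own token and
    have the site owner allow it in robots.txt):

        <!-- Primary token Nutch sends in its User-Agent header; the
             site can then grant it an Allow rule in robots.txt. -->
        <property>
          <name>http.agent.name</name>
          <value>MyCompanyCrawler</value>
        </property>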

    Tom


    On 29/11/16 04:07, jyoti aditya wrote:
    > Hi team,
    >
    > Can we use NUTCH to do impolite crawling?
    > Or is there any way by which we can disobey robots.txt?
    >
    >
    > With Regards
    > Jyoti Aditya

-- 

With Regards 

Jyoti Aditya
