Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "AdvancedAjaxInteraction" page has been changed by MichaelJoyce:
https://wiki.apache.org/nutch/AdvancedAjaxInteraction?action=diff&rev1=4&rev2=5

Comment:
Updates regarding available selenium plugins, including new info on the interactiveselenium protocol

  == Lets Begin with a Scenario ==
- So lets say that as a Nutch crawl administrator your client has tasked you with the following '''''"Get me domain specific material a database such as NTIS"''''' (NTIS; the National Technical Information Service, serves as the largest central resource for government-funded scientific, technical, engineering, and business related information available today.)
+ So lets say that as a Nutch crawl administrator your client has tasked you with the following '''''"Get me domain specific material from a database such as NTIS"''''' (NTIS; the National Technical Information Service, serves as the largest central resource for government-funded scientific, technical, engineering, and business related information available today.)

  What this really translates to is the following:
   * use Nutch to log in to a database which requires [[https://wiki.apache.org/nutch/HttpPostAuthentication|HTTP POST authentication]]
   * follow the redirect to the database landing query form
@@ -17, +17 @@

   * use an [[http://nutch.apache.org/apidocs/apidocs-1.9/index.html?org/apache/nutch/parse/HtmlParseFilter.html|HtmlParseFilter]] to obtain high level article/document content
   * submit a GET request to invoke JavaScript which will return a PDF of the full textual content for this document
   * return the full document (PDF) content and metadata along with the HTML parse filter data
+ 
+ == Crawling JavaScript/AJAX sites ==
+ 
+ In order to crawl webpages that rely on JavaScript/AJAX to dynamically load content you will want to use the [[https://github.com/apache/nutch/tree/trunk/src/plugin/protocol-selenium|Protocol-Selenium Plugin]]. This plugin will load the pages that you're crawling in Selenium so that JavaScript will be handled properly.
+ 
+ If you need to interact with the pages that you're crawling (E.g., JavaScript based pagination, clicking elements to dynamically load content) you will want to use the [[https://github.com/apache/nutch/tree/trunk/src/plugin/protocol-interactiveselenium|Protocol-InteractiveSelenium]] plugin. With this plugin you will create Handlers that interact with the pages in a defined way using the Selenium WebDriver interface. With this you'll be able to do any Selenium based interactions that you wish on a per-URL basis.

  == Related Development Issues ==
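
Note on the Protocol-InteractiveSelenium paragraph added in this revision: the idea of "Handlers that interact with the pages ... on a per-URL basis" can be illustrated with a minimal, browser-free sketch. Everything named here (`Page`, `Handler`, `PaginationHandler`, `FakePage`, `fetch`, and the example.org URLs) is hypothetical and exists only for illustration; it does not reproduce the plugin's real handler interface or Selenium's WebDriver API. It only shows the two-step pattern of first asking whether a handler applies to a URL and then letting it drive the page, here by clicking through JavaScript-style pagination.

```java
import java.util.Arrays;
import java.util.List;

public class HandlerSketch {

    /** Hypothetical stand-in for the small part of a browser session used here. */
    interface Page {
        String currentUrl();
        boolean hasNextButton();
        void clickNext();      // loads the next batch of content in place
        String visibleText();
    }

    /** Hypothetical handler contract: decide per URL, then interact with the page. */
    interface Handler {
        boolean shouldProcessUrl(String url);
        String processPage(Page page);
    }

    /** Clicks "next" until it disappears (or a click limit is hit), collecting each view. */
    static class PaginationHandler implements Handler {
        private final String urlPrefix;
        private final int maxClicks;

        PaginationHandler(String urlPrefix, int maxClicks) {
            this.urlPrefix = urlPrefix;
            this.maxClicks = maxClicks;
        }

        @Override
        public boolean shouldProcessUrl(String url) {
            return url.startsWith(urlPrefix);
        }

        @Override
        public String processPage(Page page) {
            StringBuilder collected = new StringBuilder(page.visibleText());
            int clicks = 0;
            while (page.hasNextButton() && clicks < maxClicks) {
                page.clickNext();
                collected.append('\n').append(page.visibleText());
                clicks++;
            }
            return collected.toString();
        }
    }

    /** In-memory page used only to exercise a handler without a real browser. */
    static class FakePage implements Page {
        private final String url;
        private final List<String> views;
        private int index = 0;

        FakePage(String url, List<String> views) {
            this.url = url;
            this.views = views;
        }

        @Override public String currentUrl() { return url; }
        @Override public boolean hasNextButton() { return index < views.size() - 1; }
        @Override public void clickNext() { index++; }
        @Override public String visibleText() { return views.get(index); }
    }

    /** Runs the first handler that claims the URL; otherwise returns the page as loaded. */
    static String fetch(Page page, List<Handler> handlers) {
        for (Handler handler : handlers) {
            if (handler.shouldProcessUrl(page.currentUrl())) {
                return handler.processPage(page);
            }
        }
        return page.visibleText();
    }

    public static void main(String[] args) {
        Page page = new FakePage("https://example.org/search?q=nutch",
                Arrays.asList("results 1-10", "results 11-20", "results 21-30"));
        List<Handler> handlers =
                Arrays.<Handler>asList(new PaginationHandler("https://example.org/search", 10));
        System.out.println(fetch(page, handlers));
    }
}
```

In a real crawl the `Page` role would be played by a Selenium-driven browser and the handler would be registered through the plugin's own configuration; the click limit in the sketch is a design choice to keep a handler from looping forever on a page whose "next" control never goes away.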

