[Nutch Wiki] Update of "AdvancedAjaxInteraction" by MichaelJoyce

Apache Wiki Tue, 15 Sep 2015 10:29:14 -0700

The "AdvancedAjaxInteraction" page has been changed by MichaelJoyce:
https://wiki.apache.org/nutch/AdvancedAjaxInteraction?action=diff&rev1=4&rev2=5

Comment:
Updates regarding available selenium plugins, including new info on the 
interactiveselenium protocol

  
  == Lets Begin with a Scenario ==
  
- So lets say that as a Nutch crawl administrator your client has tasked you 
with the following '''''"Get me domain specific material a database such as 
NTIS"''''' (NTIS; the National Technical Information Service, serves as the 
largest central resource for government-funded scientific, technical, 
engineering, and business related information available today.)
+ So lets say that as a Nutch crawl administrator your client has tasked you 
with the following '''''"Get me domain specific material from a database such 
as NTIS"''''' (NTIS; the National Technical Information Service, serves as the 
largest central resource for government-funded scientific, technical, 
engineering, and business related information available today.)
  What this really translates to is the following:
   * use Nutch to log in to a database which requires 
[[https://wiki.apache.org/nutch/HttpPostAuthentication|HTTP POST 
authentication]]
   * follow the redirect to the database landing query form
@@ -17, +17 @@

   * use an 
[[http://nutch.apache.org/apidocs/apidocs-1.9/index.html?org/apache/nutch/parse/HtmlParseFilter.html|HtmlParseFilter]]
 to obtain high level article/document content
   * submit a GET request to invoke JavaScript which will return a PDF of the 
full textual content for this document
   * return the full document (PDF) content and metadata along with the HTML 
parse filter data
+ 
+ == Crawling JavaScript/AJAX sites ==
+ 
+ To crawl web pages that rely on JavaScript/AJAX to load content dynamically, 
use the 
[[https://github.com/apache/nutch/tree/trunk/src/plugin/protocol-selenium|Protocol-Selenium Plugin]]. 
This plugin loads the pages that you're crawling in Selenium so that 
JavaScript is handled properly.
+ 
+ If you need to interact with the pages that you're crawling (e.g., 
JavaScript-based pagination or clicking elements to dynamically load content), 
use the 
[[https://github.com/apache/nutch/tree/trunk/src/plugin/protocol-interactiveselenium|Protocol-InteractiveSelenium]] 
plugin. With this plugin you create Handlers that interact with the pages in a 
defined way through the Selenium WebDriver interface, so you can perform any 
Selenium-based interaction you wish on a per-URL basis.
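
The per-URL Handler idea described above can be sketched roughly as follows. This is only an illustration of the dispatch pattern, not the plugin's actual API: `FakeBrowser`, `PageHandler`, `PaginationHandler`, the `/search` URL test and the `a.next-page` selector are all hypothetical stand-ins, and in the real plugin a Handler would receive a Selenium WebDriver rather than the fake browser used here.

```java
import java.util.ArrayList;
import java.util.List;

public class HandlerSketch {

    /** Hypothetical stand-in for a browser session (the real plugin uses a Selenium WebDriver). */
    public static class FakeBrowser {
        final String url;
        final List<String> actions = new ArrayList<>();

        FakeBrowser(String url) { this.url = url; }

        /** Records the interaction instead of driving a real browser. */
        void click(String selector) { actions.add("click " + selector); }
    }

    /** Hypothetical Handler contract: first decide per URL, then interact with the page. */
    public interface PageHandler {
        boolean shouldProcessUrl(String url);
        void process(FakeBrowser browser);
    }

    /** Example Handler: only acts on search result pages and clicks the "next page" link. */
    public static class PaginationHandler implements PageHandler {
        public boolean shouldProcessUrl(String url) { return url.contains("/search"); }
        public void process(FakeBrowser browser) { browser.click("a.next-page"); }
    }

    /** Runs every Handler that claims the URL and returns the interactions it recorded. */
    public static List<String> fetch(String url, List<PageHandler> handlers) {
        FakeBrowser browser = new FakeBrowser(url);
        for (PageHandler handler : handlers) {
            if (handler.shouldProcessUrl(url)) {
                handler.process(browser);
            }
        }
        return browser.actions;
    }

    public static void main(String[] args) {
        List<PageHandler> handlers = new ArrayList<>();
        handlers.add(new PaginationHandler());
        System.out.println(fetch("http://example.com/search?q=nutch", handlers)); // [click a.next-page]
        System.out.println(fetch("http://example.com/about", handlers));          // []
    }
}
```

Separating the URL check from the interaction keeps each Handler small: pages that no Handler claims are fetched without any extra interaction.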
  
  == Related Development Issues ==
