Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "AdvancedAjaxInteraction" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/AdvancedAjaxInteraction?action=diff&rev1=1&rev2=2

  
  == Let's Begin with a Scenario ==
  
- xyz
+ Let's say that, as a Nutch crawl administrator, your client has tasked you 
with the following: '''''"Get me domain-specific material from a database such 
as NTIS"''''' (NTIS, the National Technical Information Service, serves as the 
largest central resource for government-funded scientific, technical, 
engineering, and business-related information available today).
+ What this really translates to is the following:
+  * use Nutch to log in to a database which requires [[https://wiki.apache.org/nutch/HttpPostAuthentication|HTTP POST authentication]]
+  * follow the redirect to the database landing query form
+  * submit a query to the form which will return a ranked list of search 
results for the given query
+  * interpret the JavaScript for each result in the ranked list
+  * use an [[http://nutch.apache.org/apidocs/apidocs-1.9/index.html?org/apache/nutch/parse/HtmlParseFilter.html|HtmlParseFilter]] to obtain high-level article/document content
+  * submit a GET request to invoke JavaScript which will return a PDF of the 
full textual content for this document
+  * return the full document (PDF) content and metadata along with the HTML 
parse filter data
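
The HTTP side of the steps above can be sketched outside Nutch to make the request sequence concrete. This is only an illustration in plain Python, not how Nutch does it: the host `db.example.org`, the paths `/login`, `/search` and `/document`, and the field names `username`, `password`, `q`, `id` and `format` are all hypothetical placeholders, and the real database may well use a POST form for the query instead of GET. The sketch only builds the requests; it does not send them, keep session cookies, or interpret any JavaScript.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URL; replace with the real database host.
BASE = "https://db.example.org"


def build_login_request(username, password):
    """Step 1: HTTP POST authentication (field names are assumptions)."""
    body = urlencode({"username": username, "password": password}).encode("ascii")
    return Request(
        BASE + "/login",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def build_query_request(terms):
    """Step 3: submit a query to the landing form (assumed here to be a GET form)."""
    return Request(BASE + "/search?" + urlencode({"q": terms}), method="GET")


def build_pdf_request(doc_id):
    """Step 6: GET request that is expected to return the full-text PDF."""
    query = urlencode({"id": doc_id, "format": "pdf"})
    return Request(BASE + "/document?" + query, method="GET")


if __name__ == "__main__":
    for req in (
        build_login_request("crawler", "secret"),
        build_query_request("solar cells"),
        build_pdf_request("ABC123"),
    ):
        print(req.get_method(), req.full_url)
```

To actually send these requests in order, one would typically use an opener that keeps the session cookie returned by the login step (for example `urllib.request.build_opener` with `HTTPCookieProcessor`), since the later requests depend on the authenticated session.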
  
  == Related Development Issues ==
  
