[Nutch Wiki] Update of "NutchAndFronteraDesignGoals" by ChrisMattmann

Apache Wiki Sun, 14 Feb 2016 22:16:16 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "NutchAndFronteraDesignGoals" page has been changed by ChrisMattmann:
https://wiki.apache.org/nutch/NutchAndFronteraDesignGoals?action=diff&rev1=1&rev2=2

  
  Currently, Frontera is moving towards ease of use: ZeroMQ transport, 
transport layer abstraction, standalone Frontera/Scrapy based crawler in 
Docker, web UI.
  
+ = Major Nutch Use Cases and Future Vision that Align with Frontera =
- = Major Nutch Use Cases and Vision =
- <<TBD>>
  
+  1. Deep Web Extractions - both from a crawler, and using various interactive 
and non-obtrusive JavaScript libraries. We started with Selenium but are now 
looking at HtmlUnit, PhantomJS, and others.
+  2. Measuring Crawl Footprint - I think we need to better understand the 
crawl footprint, and use that information to guide our crawling strategy.
+  3. Adaptive and ML-based crawling algorithms - my team is working on a 
Machine Learning based algorithm for crawling that leverages Naive Bayes and 
RL.
+  4. Content Extraction from more and more formats with Tika. This is one 
potential area we could overlap on, since there are both a Tika Python library 
and a Nutch Python library (originating from DARPA Memex).
+ 
+ It seems that the work we do in the Python libraries and with Tika could 
potentially serve as an initial integration. As for broader
+ crawling, I’m also interested in how Nutch and Spark can work together. Nutch 
over Spark is something a few researchers on my team are
+ working on now.
+
