Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchAndFronteraDesignGoals" page has been changed by ChrisMattmann:
https://wiki.apache.org/nutch/NutchAndFronteraDesignGoals?action=diff&rev1=1&rev2=2

  Currently, Frontera is moving towards ease of use: ZeroMQ transport, transport layer abstraction, standalone Frontera/Scrapy based crawler in Docker, web UI.

+ = Major Nutch Use Cases and Future Vision that Align with Frontera =
- = Major Nutch Use Cases and Vision =
- <<TBD>>
+ 1. Deep Web Extractions - both from a crawler, and using various interactive and non-obtrusive JavaScript libraries. We started with Selenium but are now looking at HtmlUnit, PhantomJS, and others.
+ 2. Measuring Crawl Footprint - I think we need to understand the crawl footprint better, and use that information to guide and strategize crawling.
+ 3. Adaptive and ML-based crawling algorithms - my team is working on a Machine Learning based crawling algorithm that leverages Naive Bayes and reinforcement learning (RL).
+ 4. Content Extraction from more and more formats with Tika. This is one potential area where we could overlap, since there is both a Tika Python library and a Nutch Python library (originating from DARPA Memex).
+
+ It seems that the more work we do in the Python libraries and with Tika, the more that work could serve as an initial integration. As for broader
+ crawling, I’m also interested in how Nutch and Spark can work together. Nutch over Spark is something I have a few researchers on my team
+ working on now.

