The "GoogleSummerOfCode/GraphGeneratorTool" page has been changed by OmkarReddy:
https://wiki.apache.org/nutch/GoogleSummerOfCode/GraphGeneratorTool?action=diff&rev1=2&rev2=3

  <<TableOfContents>>

- ||'''Title :'''|||| GSoC 2016 Proposal ||
+ ||'''Title :'''|||| GSoC 2017 Proposal ||
  ||'''Issue :'''|||| [[https://issues.apache.org/jira/browse/NUTCH-2369|NUTCH-2369 - Graph Generator Tool for Nutch]]||
  ||'''Student :'''||||Omkar Reddy - omkarr [at] apache dot org||
  ||'''Mentor :'''||||Lewis John McGibbney||

  === Abstract ===
- Currently Apache Nutch[0] has the concept of a WebGraph[1] which that builds Web graphs, performs a stable convergent link-analysis, and updates the crawldb with those scores. The main purpose of building a new Graph Generator tool for Nutch is to create a substantiated ‘deep’ graph enabling true traversal, this could be a game changer for how Nutch Crawl data is interpreted. This will involve storage of the crawl data as RDF datasets in the form of serialized n-quad statements. This graph can be used to execute queries on the webpages. Graph generation will be achieved using the Apache Tinkerpop[2] ScriptInputFormat and ScriptOutputFormat’s[3] respectively. There are basically two scenarios to represent the graph as RDF datasets that we discuss in this proposal below.
+ Currently Apache Nutch[0] has the concept of a WebGraph[1] that builds Web graphs, performs a stable convergent link-analysis, and updates the crawldb with those scores. The main purpose of building a new Graph Generator tool for Nutch is to create a substantiated ‘deep’ graph enabling true traversal, this could be a game changer for how Nutch Crawl data is interpreted. This will involve storage of the crawl data as RDF datasets in the form of serialized n-quad statements. This graph can be used to execute queries on the webpages. Graph generation will be achieved using the Apache Tinkerpop[2] ScriptInputFormat and ScriptOutputFormat’s[3] respectively. There are basically two scenarios to represent the graph as RDF datasets that we discuss in this proposal below.

  === Background ===

