[ https://issues.apache.org/jira/browse/NUTCH-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2369:
-----------------------------------
Fix Version/s: (was: 1.15)
> Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph
> ------------------------------------------------------------------------------
>
> Key: NUTCH-2369
> URL: https://issues.apache.org/jira/browse/NUTCH-2369
> Project: Nutch
> Issue Type: Task
> Components: crawldb, graphgenerator, hostdb, linkdb, segment,
> storage, tool
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Labels: gsoc2017, gsoc2018
>
> I've been thinking for quite some time now that a new Tool which writes Nutch
> data out as full graph data would be an excellent addition to the codebase.
> My thinking involves writing data using TinkerPop's ScriptInputFormat and
> ScriptOutputFormat to create Vertex objects representing Nutch Crawl
> Records.
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.html
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptOutputFormat.html
> I envisage that each Vertex object would require the CrawlDB, the LinkDB, a
> Segment and possibly the HostDB in order to be fully populated. Graph
> characteristics, e.g. Edges, would come from those existing data structures
> as well.
> It is my intention to propose this as a GSoC project for 2017 and I have
> already talked offline with a potential student [~omkar20895] about
> taking it on.
> Essentially, if we were able to create a Graph enabling true traversal, this
> could be a game changer for how Nutch Crawl data is interpreted. It is my
> feeling that this issue most likely also involves an entire upgrade of the
> Hadoop APIs from mapred to mapreduce for the master codebase.
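
The vertex-and-edge idea in the description can be sketched in a few lines. This is only an illustration in plain Python, not the proposed tool and not TinkerPop's API: the `Vertex` class, `build_graph`, `reachable`, and the dictionary layouts standing in for the CrawlDB and LinkDB are all hypothetical. It assumes the CrawlDB supplies per-URL properties and the LinkDB supplies inlinks per target URL, which become directed edges that can then be traversed.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """One URL; properties are merged from whichever stores know about it."""
    url: str
    properties: dict = field(default_factory=dict)
    out_edges: list = field(default_factory=list)  # (target_url, anchor_text)


def build_graph(crawldb, linkdb):
    """Merge CrawlDB-style and LinkDB-style records into vertices and edges.

    Both layouts are hypothetical stand-ins for the real Nutch structures:
      crawldb: {url: {property: value}}
      linkdb:  {target_url: [(source_url, anchor_text), ...]}  (inlinks)
    """
    graph = {url: Vertex(url, dict(props)) for url, props in crawldb.items()}
    for target, inlinks in linkdb.items():
        # A URL may be known only through links, so create it on demand.
        graph.setdefault(target, Vertex(target))
        for source, anchor in inlinks:
            graph.setdefault(source, Vertex(source))
            graph[source].out_edges.append((target, anchor))
    return graph


def reachable(graph, start):
    """Breadth-first traversal over outgoing edges; returns URLs in visit order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        url = queue.popleft()
        order.append(url)
        for target, _anchor in graph[url].out_edges:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Made-up sample records for illustration only.
crawldb = {
    "http://a.example/": {"status": "db_fetched", "score": 1.0},
    "http://b.example/": {"status": "db_unfetched", "score": 0.5},
}
linkdb = {
    "http://b.example/": [("http://a.example/", "B")],
    "http://c.example/": [("http://b.example/", "C")],
}

graph = build_graph(crawldb, linkdb)
print(reachable(graph, "http://a.example/"))
```

A real implementation would presumably also pull content and parse metadata from Segments and host-level data from the HostDB, and would emit the vertices through a Hadoop output format rather than holding them in memory.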
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)