[
https://issues.apache.org/jira/browse/NUTCH-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341326#comment-16341326
]
Lewis John McGibbney commented on NUTCH-2369:
---------------------------------------------
Hi [~markus17], the idea here was to export full graph information into
something that could be interpreted by [TinkerPop|http://tinkerpop.apache.org]
and queried using [Gremlin|https://tinkerpop.apache.org/gremlin.html].
> Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph
> ------------------------------------------------------------------------------
>
> Key: NUTCH-2369
> URL: https://issues.apache.org/jira/browse/NUTCH-2369
> Project: Nutch
> Issue Type: Task
> Components: crawldb, graphgenerator, hostdb, linkdb, segment,
> storage, tool
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Labels: gsoc2017, gsoc2018
> Fix For: 1.15
>
>
> I've been thinking for quite some time now that a new Tool which writes Nutch
> data out as full graph data would be an excellent addition to the codebase.
> My thoughts involve writing data using TinkerPop's ScriptInputFormat and
> ScriptOutputFormat to create Vertex objects representing Nutch Crawl
> Records.
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.html
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptOutputFormat.html
> I envisage that each Vertex object would require the CrawlDB, LinkDB, a
> Segment and possibly the HostDB in order to be fully populated. Graph
> characteristics, e.g. Edges, would come from those existing data structures
> as well.
> It is my intention to propose this as a GSoC project for 2017 and I have
> already talked offline with a potential student [~omkar20895] about him
> participating as the student.
> Essentially, if we were able to create a Graph enabling true traversal, this
> could be a game changer for how Nutch Crawl data is interpreted. It is my
> feeling that this issue most likely also involves an entire upgrade of the
> Hadoop APIs from mapred to mapreduce for the master codebase.
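
To make the Vertex/Edge idea in the description above a little more concrete, here is a rough sketch of how vertices and edges might be assembled from CrawlDB-like and LinkDB-like data. It is plain Java only and does not use any Nutch or TinkerPop classes. The class WebGraphSketch, the type PageVertex, both method names and the tab-separated line layout are all hypothetical and chosen purely for illustration. It assumes the LinkDB maps each target URL to its inlinks, so outgoing edges are obtained by inverting that mapping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: plain-Java stand-ins, not Nutch or TinkerPop classes. */
class WebGraphSketch {

    /** Hypothetical vertex: one crawled URL plus its outgoing edges. */
    static final class PageVertex {
        final String url;
        final String status; // illustrative status label, e.g. "db_fetched"
        final List<String> outlinks = new ArrayList<>();

        PageVertex(String url, String status) {
            this.url = url;
            this.status = status;
        }
    }

    /**
     * Builds vertices from a CrawlDB-like map (url -> status) and a
     * LinkDB-like map (target url -> source urls, i.e. inlinks).
     */
    static Map<String, PageVertex> buildGraph(Map<String, String> crawlDb,
                                              Map<String, List<String>> linkDb) {
        Map<String, PageVertex> vertices = new TreeMap<>();
        for (Map.Entry<String, String> e : crawlDb.entrySet()) {
            vertices.put(e.getKey(), new PageVertex(e.getKey(), e.getValue()));
        }
        // Invert inlinks into outgoing edges: source -> target.
        for (Map.Entry<String, List<String>> e : linkDb.entrySet()) {
            String target = e.getKey();
            for (String source : e.getValue()) {
                // URLs seen only as link endpoints get a placeholder status.
                vertices.computeIfAbsent(target, u -> new PageVertex(u, "unknown"));
                vertices.computeIfAbsent(source, u -> new PageVertex(u, "unknown"))
                        .outlinks.add(target);
            }
        }
        return vertices;
    }

    /** Renders one line per vertex: url, status and comma-joined outlinks, tab-separated. */
    static List<String> toAdjacencyLines(Map<String, PageVertex> vertices) {
        List<String> lines = new ArrayList<>();
        for (PageVertex v : vertices.values()) {
            lines.add(v.url + "\t" + v.status + "\t" + String.join(",", v.outlinks));
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, String> crawlDb = new TreeMap<>();
        crawlDb.put("http://a.example/", "db_fetched");
        crawlDb.put("http://b.example/", "db_unfetched");

        Map<String, List<String>> linkDb = new TreeMap<>();
        linkDb.put("http://b.example/", List.of("http://a.example/"));

        toAdjacencyLines(buildGraph(crawlDb, linkDb)).forEach(System.out::println);
    }
}
```

A line-per-vertex text layout like this is the kind of input a parse script could turn into Vertex objects, but the exact format here is made up, and a real tool would need to escape tabs and commas in URLs and carry Segment and HostDB fields as vertex properties.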