Lewis John McGibbney created NUTCH-2369:
-------------------------------------------

             Summary: Create a new GraphGenerator Tool for writing Nutch Records as Full Web Graphs
                 Key: NUTCH-2369
                 URL: https://issues.apache.org/jira/browse/NUTCH-2369
             Project: Nutch
          Issue Type: Task
          Components: graphgenerator, crawldb, hostdb, linkdb, segment, storage, tool
            Reporter: Lewis John McGibbney
            Assignee: Lewis John McGibbney
             Fix For: 1.14


I've been thinking for quite some time now that a new Tool which writes Nutch 
data out as full graph data would be an excellent addition to the codebase.

My thinking involves writing data using TinkerPop's ScriptInputFormat and 
ScriptOutputFormat to create Vertex objects representing Nutch Crawl Records. 

http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.html
http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptOutputFormat.html
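
As I understand it, both script formats work on line-oriented text, one vertex 
per line, with a user-supplied script doing the parsing or stringifying. To make 
that concrete, here is a rough plain-Java sketch of what one such line could 
look like for a crawl record. Everything in it is an assumption for 
illustration: the class VertexLine, the Edge type, the "outlink" label and the 
tab/space/'>' layout are hypothetical and are not part of Nutch or TinkerPop. It 
also assumes URLs are normalized so that they contain no raw tabs, spaces or 
'>' characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical sketch: names and line layout are illustrative only,
// not part of Nutch or TinkerPop.
final class VertexLine {

    /** One outgoing edge: a label plus the URL of the target vertex. */
    static final class Edge {
        final String label;
        final String targetUrl;

        Edge(String label, String targetUrl) {
            this.label = label;
            this.targetUrl = targetUrl;
        }
    }

    /** Encodes a vertex as: url TAB status TAB label>target label>target ... */
    static String stringify(String url, String status, List<Edge> edges) {
        StringJoiner joined = new StringJoiner(" ");
        for (Edge e : edges) {
            joined.add(e.label + ">" + e.targetUrl);
        }
        return url + "\t" + status + "\t" + joined;
    }

    /** Reads the edges back out of a line produced by stringify. */
    static List<Edge> parseEdges(String line) {
        String[] parts = line.split("\t", -1);
        List<Edge> edges = new ArrayList<>();
        if (parts.length < 3 || parts[2].isEmpty()) {
            return edges; // vertex without edges
        }
        for (String token : parts[2].split(" ")) {
            int sep = token.indexOf('>');
            if (sep < 0) {
                throw new IllegalArgumentException("Malformed edge: " + token);
            }
            edges.add(new Edge(token.substring(0, sep), token.substring(sep + 1)));
        }
        return edges;
    }
}
```

In a real job the equivalent logic would presumably live in the script handed 
to the output and input formats, not in a standalone class like this one.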

I envisage that each Vertex object would require the CrawlDB, the LinkDB, a 
Segment and possibly the HostDB in order to be fully populated. Graph 
characteristics, e.g. Edges, would come from those existing data structures as well.
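
A minimal sketch of how one vertex might be assembled from those sources, 
assuming plain in-memory maps as stand-ins for the real data structures. The 
names GraphSketch, PageVertex and build, the property keys, and the choice of 
which source feeds which field are all hypothetical. In practice the data would 
be read from the CrawlDB, LinkDB, Segment and HostDB inside a Hadoop job.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plain maps stand in for the CrawlDB, LinkDB, a Segment
// and the HostDB. None of these types are Nutch or TinkerPop classes.
final class GraphSketch {

    static final class PageVertex {
        final String url;
        final Map<String, String> properties = new LinkedHashMap<>();
        final List<String> inEdges = new ArrayList<>();  // URLs linking to this page
        final List<String> outEdges = new ArrayList<>(); // URLs this page links to

        PageVertex(String url) {
            this.url = url;
        }
    }

    static PageVertex build(String url,
                            Map<String, String> crawlDbStatus,
                            Map<String, List<String>> linkDbInlinks,
                            Map<String, List<String>> segmentOutlinks,
                            Map<String, String> hostDbInfo) {
        PageVertex v = new PageVertex(url);

        // Vertex properties: crawl status per URL, host-level info per host.
        String status = crawlDbStatus.get(url);
        if (status != null) {
            v.properties.put("status", status);
        }
        String host = URI.create(url).getHost();
        String hostInfo = (host == null) ? null : hostDbInfo.get(host);
        if (hostInfo != null) {
            v.properties.put("hostInfo", hostInfo);
        }

        // Edges: incoming links from the LinkDB stand-in, outgoing links
        // from the Segment stand-in.
        v.inEdges.addAll(linkDbInlinks.getOrDefault(url, Collections.emptyList()));
        v.outEdges.addAll(segmentOutlinks.getOrDefault(url, Collections.emptyList()));
        return v;
    }
}
```

A vertex whose URL is missing from one of the sources simply ends up with fewer 
properties or edges, which is one way to read "possibly the HostDB" above.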

It is my intention to propose this as a GSoC project for 2017 and I have 
already talked offline with a potential student [~omkar20895] about him 
participating as the student.

Essentially, if we were able to create a Graph enabling true traversal, this 
could be a game changer for how Nutch Crawl data is interpreted. It is my 
feeling that this issue will most likely also involve an entire upgrade of the 
Hadoop APIs from mapred to mapreduce for the master codebase.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
