Dylan Bethune-Waddell created TINKERPOP-1133:
------------------------------------------------

             Summary: ScriptInputFormat should allow any class implementing RecordReader other than "LineRecordReader"
                 Key: TINKERPOP-1133
                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1133
             Project: TinkerPop
          Issue Type: Improvement
          Components: documentation, hadoop, io
    Affects Versions: 3.2.0-incubating, 3.1.2-incubating
            Reporter: Dylan Bethune-Waddell
            Priority: Minor


To bulk load XML with {{ScriptInputFormat}}, I dropped a slightly modified 
{{XmlRecordReader}} class from the Apache Mahout project into 
{{org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptRecordReader}}. 
My notes on that are here:
https://github.com/dylanht/thamyris

I'm not sure what other formats would need a custom record reader, but why not 
allow it and let any class that implements {{RecordReader}} feed the user's 
groovy script? I was thinking the config would be something like:

{code}
// Enum for <Format>RecordReaders TinkerPop provides, otherwise fully qualified class name
gremlin.hadoop.scriptInputFormat.reader=XML // vs. LINE or a.b.myReader

// omit closing angled bracket to start block split before attributes
gremlin.hadoop.scriptInputFormat.xml.startTag=<myCustomer
gremlin.hadoop.scriptInputFormat.xml.endTag=</myCustomer>
// An idea for later, because the above has big issues with nested elements
gremlin.hadoop.scriptInputFormat.xml.xpath=/top/customer[position()<3]
{code}
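
To make the idea concrete, here is a rough sketch of how the 
{{gremlin.hadoop.scriptInputFormat.reader}} value might be resolved. Everything 
in it is an assumption for illustration: the {{ReaderResolver}} class, the 
short names LINE and XML, and the location of {{XmlRecordReader}} (which does 
not exist in TinkerPop today). It only maps the setting to a class name and 
has no Hadoop dependency.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: none of these names exist in TinkerPop.
final class ReaderResolver {

    // Short names TinkerPop could provide; the XML target class is assumed.
    private static final Map<String, String> BUILT_IN = Map.of(
            "LINE", "org.apache.hadoop.mapreduce.lib.input.LineRecordReader",
            "XML", "org.apache.tinkerpop.gremlin.hadoop.structure.io.script.XmlRecordReader");

    private ReaderResolver() {
    }

    // Returns the reader class name for a config value. A missing or blank
    // value falls back to LINE, a known short name maps to its built-in
    // class, and anything else is treated as a fully qualified class name.
    static String resolveClassName(final String setting) {
        if (setting == null || setting.trim().isEmpty()) {
            return BUILT_IN.get("LINE");
        }
        final String trimmed = setting.trim();
        final String builtIn = BUILT_IN.get(trimmed.toUpperCase(Locale.ROOT));
        return builtIn != null ? builtIn : trimmed;
    }
}
```

The resolved name could then be loaded with {{Class.forName}} and instantiated 
reflectively, for example through Hadoop's {{ReflectionUtils.newInstance}}, 
though I have not tried that part.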

Hadoop's {{RecordReader}} declares the checked {{InterruptedException}} on 
several methods, whereas {{LineRecordReader}} does not declare it on its 
versions of those methods. That's fine as long as {{LineRecordReader}} is 
imported directly, as it is now, or {{XmlRecordReader}} is a weird hidden inner 
class, the way I had it before. But to initialize anything that implements 
{{RecordReader}}, it seems {{LineRecordReader}} and {{XmlRecordReader}} both 
have to end up in the 
{{org.apache.tinkerpop.gremlin.hadoop.structure.io.script}} package with 
something like this added in:

{code}
// same for nextKeyValue, getCurrentKey, getCurrentValue, getProgress
public void initialize(InputSplit split, TaskAttemptContext context)
        throws IOException, InterruptedException {
    // the current code does not enclose things in a try/catch
    try {
        // things
    } catch (InterruptedException e) {
        // restore the interrupt flag before rethrowing as unchecked
        Thread.currentThread().interrupt();
        throw new RuntimeException(e.getMessage(), e);
    }
}
{code}
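
If that try/catch really has to be repeated for each of those methods, one 
option might be to factor it into a small helper. This is only a sketch: 
{{Interruptibles}} and {{InterruptibleCall}} are made-up names, and it uses 
nothing beyond the JDK.

```java
import java.io.IOException;

// Hypothetical helper: not part of TinkerPop or Hadoop.
final class Interruptibles {

    // A call that may throw the same checked exceptions as the reader methods.
    @FunctionalInterface
    interface InterruptibleCall<T> {
        T call() throws IOException, InterruptedException;
    }

    private Interruptibles() {
    }

    // Runs the call. On interruption, restores the thread's interrupt flag
    // and rethrows as an unchecked exception that keeps the original cause.
    static <T> T call(final InterruptibleCall<T> body) throws IOException {
        try {
            return body.call();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e.getMessage(), e);
        }
    }
}
```

A delegating method would then shrink to something like 
{{return Interruptibles.call(() -> reader.nextKeyValue());}} where {{reader}} 
is whatever {{RecordReader}} was configured.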

I don't know how good an idea it is to pull {{LineRecordReader}} and 
{{XmlRecordReader}} into that package, or how best to handle the 
{{InterruptedException}}. If there are other useful "<Format>RecordReader" 
classes that could be implemented, I would like to know about them. I thought I 
would throw this up here before trying a PR. What do you think?

References:
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/RecordReader.java
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
