[ https://issues.apache.org/jira/browse/TINKERPOP-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15458759#comment-15458759 ]
stephen mallette commented on TINKERPOP-1133:
---------------------------------------------

[~dylanht] I was just doing some review of issues and saw this one hanging out there. are you still interested in doing a tutorial on this topic?

> ScriptRecordReader should allow any class implementing/extending RecordReader to bust up blocks, not just LineRecordReader
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TINKERPOP-1133
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1133
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: documentation, hadoop, io
>    Affects Versions: 3.2.0-incubating, 3.1.2-incubating
>            Reporter: Dylan Bethune-Waddell
>            Priority: Minor
>
> I stuck a slightly modified {{XmlRecordReader}} class from the Apache Mahout
> project into
> {{org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptRecordReader}}
> to bulk load XML with ScriptInputFormat, which I have notes on here:
> https://github.com/dylanht/thamyris
> I'm not sure what other formats would need a custom record reader, but why
> not allow it and let any class that implements {{RecordReader}} feed the
> user's groovy script? I was thinking the config would be something like:
> {code}
> // Enum for <Format>RecordReaders TinkerPop provides, otherwise fully qualified class name
> gremlin.hadoop.scriptInputFormat.reader=XML // vs. LINE or a.b.myReader
> // omit closing angled bracket to start block split before attributes
> gremlin.hadoop.scriptInputFormat.xml.startTag=<myCustomer
> gremlin.hadoop.scriptInputFormat.xml.endTag=</myCustomer>
> // An idea for later, because the above has big issues with nested elements
> gremlin.hadoop.scriptInputFormat.xml.xpath=/top/customer[position()<3]
> {code}
> Hadoop's {{RecordReader}} interface has {{InterruptedException}} checked for
> several methods, whereas {{LineRecordReader}} doesn't throw it for the
> respective methods. That's fine if {{LineRecordReader}} is imported directly
> as it is now, or {{XmlRecordReader}} is a weird hidden inner class the way I
> had it before. But to initialize anything that implements {{RecordReader}},
> it seems {{LineRecordReader}} and {{XmlRecordReader}} both have to end up in
> the org.apache.tinkerpop.gremlin.hadoop.structure.io.script package with
> something like this added in:
> {code}
> // same for nextKeyValue, getCurrentKey, getCurrentValue, getProgress
> public void initialize() throws IOException, InterruptedException {
>     // doesn't enclose things in a try/catch as is
>     try {
>         // things
>     } catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>         throw new RuntimeException(e.getMessage(), e);
>     }
> }
> {code}
> I don't know how good an idea pulling {{LineRecordReader}} and
> {{XmlRecordReader}} into that package is, or how to handle the
> InterruptedException, and if there are more useful "<Format>RecordReader"
> classes that could be implemented I would like to know about them, so I
> thought I would throw this up here before trying a PR. What do you think?
> References:
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/RecordReader.java
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
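
For illustration, the reader selection and the InterruptedException handling proposed in the quoted issue might look roughly like the sketch below. This is a minimal sketch under stated assumptions, not TinkerPop or Hadoop code: {{BlockReader}}, {{BuiltIn}}, {{LineReader}}, {{XmlReader}}, {{create}} and {{advance}} are hypothetical stand-ins (Hadoop's real {{RecordReader}} has more methods), and only the property key gremlin.hadoop.scriptInputFormat.reader comes from the issue text.

```java
import java.io.IOException;
import java.util.Properties;

class ReaderSelectionSketch {

    /** Hypothetical stand-in for Hadoop's RecordReader, reduced to one method. */
    public interface BlockReader {
        boolean nextKeyValue() throws IOException, InterruptedException;
    }

    /** Hypothetical placeholder for a line-based reader. */
    public static class LineReader implements BlockReader {
        public boolean nextKeyValue() { return false; }
    }

    /** Hypothetical placeholder for an XML block reader. */
    public static class XmlReader implements BlockReader {
        public boolean nextKeyValue() { return false; }
    }

    /** Short names for the readers a library could ship itself (hypothetical). */
    public enum BuiltIn {
        LINE(LineReader.class), XML(XmlReader.class);

        final Class<? extends BlockReader> type;

        BuiltIn(Class<? extends BlockReader> type) { this.type = type; }
    }

    /** Key taken from the issue text; defaulting to LINE is an assumption. */
    static final String READER_KEY = "gremlin.hadoop.scriptInputFormat.reader";

    /** Resolves the configured value as an enum name first, then as a class name. */
    static BlockReader create(Properties conf) {
        String name = conf.getProperty(READER_KEY, "LINE");
        try {
            Class<? extends BlockReader> type;
            try {
                type = BuiltIn.valueOf(name).type;
            } catch (IllegalArgumentException notBuiltIn) {
                // Not LINE or XML, so treat the value as a fully qualified class name.
                type = Class.forName(name).asSubclass(BlockReader.class);
            }
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot create reader: " + name, e);
        }
    }

    /** Calls the reader, restoring the interrupt flag if the call is interrupted. */
    static boolean advance(BlockReader reader) throws IOException {
        try {
            return reader.nextKeyValue();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e.getMessage(), e);
        }
    }

    public static void main(String[] args) throws IOException {
        Properties conf = new Properties();
        conf.setProperty(READER_KEY, "XML");
        BlockReader reader = create(conf);
        System.out.println(reader.getClass().getSimpleName() + " has next: " + advance(reader));
    }
}
```

Trying the enum name first and falling back to Class.forName keeps the short values (LINE, XML) working while still allowing a custom class, as the config comment in the issue suggests. Wrapping the InterruptedException follows the issue's own snippet; whether an unchecked exception is the right choice there is the open question the reporter raises.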