[jira] [Commented] (TINKERPOP-1133) ScriptRecordReader should allow any class implementing/extending RecordReader to bust up blocks, not just LineRecordReader

Dylan Bethune-Waddell (JIRA) Sun, 07 Feb 2016 11:39:51 -0800

    [ https://issues.apache.org/jira/browse/TINKERPOP-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136392#comment-15136392 ]


Dylan Bethune-Waddell commented on TINKERPOP-1133:
--------------------------------------------------

You know, I thought writing a {{RecordReader}} was a better answer to "What do 
I do to get bunches of <formatX> into my graph?" than writing an 
{{InputFormat}}, but really it's worse. You lose flexibility, TinkerPop has 
more to maintain, and like you said it's really no harder to imitate 
{{ScriptInputFormat}} and {{ScriptRecordReader}} directly. Just some docs on it 
would be a way better solution, I agree.

XML is totally crazy, but in biomedicine there are giant awkward chunks of it I 
needed to bulk load. However, I don't think anyone was interested when I posted 
a link on the (Titan?) mailing list to that repo above detailing how I hacked 
that together. What if I did a tutorial or some docs on how to actually write 
your own {{InputFormat}} and maybe {{OutputFormat}} with record 
readers/writers, and bulk load / dump with it? If XML isn't really that 
interesting to people, is there a format you think would resonate more with 
your user base?
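
For a sense of what the record-reader half of such a tutorial would cover, 
here is a minimal sketch of start/end tag block splitting in plain Java. It is 
an illustration only: {{TagBlockSplitter}} and {{split}} are hypothetical names 
that exist in neither TinkerPop nor Mahout, it works on an in-memory string, 
and it ignores what a real Hadoop reader has to handle (byte offsets, blocks 
that straddle input split boundaries, nested elements).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of TinkerPop, Hadoop or Mahout: cuts raw text
// into records delimited by a start tag and an end tag.
final class TagBlockSplitter {
    private final String startTag;
    private final String endTag;

    TagBlockSplitter(final String startTag, final String endTag) {
        if (startTag.isEmpty() || endTag.isEmpty()) {
            throw new IllegalArgumentException("tags must be non-empty");
        }
        this.startTag = startTag;
        this.endTag = endTag;
    }

    // Returns every block from startTag through the next endTag, inclusive.
    // Nested elements of the same name are not handled: the first endTag
    // found after a startTag closes the block.
    List<String> split(final String text) {
        final List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            final int start = text.indexOf(startTag, from);
            if (start < 0) {
                break; // no further records
            }
            final int end = text.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated block: drop it
            }
            from = end + endTag.length();
            records.add(text.substring(start, from));
        }
        return records;
    }
}
```

Constructed with {{<myCustomer}} and {{</myCustomer>}}, {{split}} hands back 
one string per customer element, which is the unit a parsing script would then 
turn into a vertex.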

> ScriptRecordReader should allow any class implementing/extending RecordReader 
> to bust up blocks, not just LineRecordReader
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TINKERPOP-1133
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1133
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: documentation, hadoop, io
>    Affects Versions: 3.2.0-incubating, 3.1.2-incubating
>            Reporter: Dylan Bethune-Waddell
>            Priority: Minor
>
> I stuck a slightly modified {{XmlRecordReader}} class from the Apache Mahout 
> project into 
> {{org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptRecordReader}}
>  to bulk load XML with ScriptInputFormat, which I have notes on here:
> https://github.com/dylanht/thamyris
> I'm not sure what other formats would need a custom record reader, but why 
> not allow it and let any class that extends {{RecordReader}} feed the 
> user's Groovy script? I was thinking the config would be something like:
> {code}
> # Enum value for the <Format>RecordReaders TinkerPop provides, otherwise a
> # fully qualified class name (XML vs. LINE or a.b.myReader)
> gremlin.hadoop.scriptInputFormat.reader=XML
> # omit closing angle bracket to start block split before attributes
> gremlin.hadoop.scriptInputFormat.xml.startTag=<myCustomer
> gremlin.hadoop.scriptInputFormat.xml.endTag=</myCustomer>
> # An idea for later, because the above has big issues with nested elements
> gremlin.hadoop.scriptInputFormat.xml.xpath=/top/customer[position()<3]
> {code}
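
A rough sketch of how the proposed {{reader}} value could be resolved, using 
only the JDK. Everything here is an assumption for illustration: 
{{ReaderResolver}}, {{resolve}} and the alias map are hypothetical names, and 
in TinkerPop the base type would presumably be Hadoop's {{RecordReader}} 
rather than the generic parameter used below.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: turns a setting such as
// gremlin.hadoop.scriptInputFormat.reader into an instance of a base type.
final class ReaderResolver {
    private ReaderResolver() {}

    // aliases maps short names (e.g. "LINE", "XML") to fully qualified class
    // names; any other value is treated as a fully qualified class name.
    static <T> T resolve(final String value, final Class<T> baseType,
                         final Map<String, String> aliases) {
        final String trimmed = value.trim();
        final String className =
                aliases.getOrDefault(trimmed.toUpperCase(Locale.ROOT), trimmed);
        try {
            final Class<?> clazz = Class.forName(className);
            if (!baseType.isAssignableFrom(clazz)) {
                throw new IllegalArgumentException(
                        className + " does not extend/implement " + baseType.getName());
            }
            // Assumes the class has an accessible no-argument constructor.
            return baseType.cast(clazz.getDeclaredConstructor().newInstance());
        } catch (final ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot create reader: " + className, e);
        }
    }
}
```

Checking the type before instantiating means a typo or a class that isn't a 
record reader fails at configuration time with a readable message instead of 
somewhere inside a running job.
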
> Hadoop's {{RecordReader}} abstract class declares the checked 
> {{InterruptedException}} on several methods, whereas {{LineRecordReader}} 
> doesn't declare it on its overrides. That's fine if {{LineRecordReader}} is 
> imported directly as it is now, or {{XmlRecordReader}} is a weird hidden 
> inner class the way I had it before. But to initialize anything that extends 
> {{RecordReader}}, it seems {{LineRecordReader}} and {{XmlRecordReader}} both 
> have to end up in the 
> {{org.apache.tinkerpop.gremlin.hadoop.structure.io.script}} package with 
> something like this added in:
> {code}
> // same for nextKeyValue, getCurrentKey, getCurrentValue, getProgress
> public void initialize() throws IOException, InterruptedException {
>     // doesn't enclose things in a try/catch as is
>     try {
>         // things
>     } catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>         throw new RuntimeException(e.getMessage(), e);
>     }
> }
> {code}
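
One way to avoid repeating that try/catch in every method is to factor it into 
a small helper. This is only a sketch under assumptions: {{Interrupts}}, 
{{InterruptibleCall}} and {{callUnchecked}} are hypothetical names, and 
rethrowing as {{RuntimeException}} simply mirrors the snippet above rather than 
any settled TinkerPop convention.

```java
import java.io.IOException;

// Hypothetical helper: runs code that may throw InterruptedException and
// converts that exception the same way as the try/catch above.
final class Interrupts {
    private Interrupts() {}

    @FunctionalInterface
    interface InterruptibleCall<T> {
        T call() throws IOException, InterruptedException;
    }

    static <T> T callUnchecked(final InterruptibleCall<T> body) throws IOException {
        try {
            return body.call();
        } catch (final InterruptedException e) {
            // Restore the interrupt flag so callers further up can still see it.
            Thread.currentThread().interrupt();
            throw new RuntimeException(e.getMessage(), e);
        }
    }
}
```

A delegating method could then wrap each call to the underlying reader in 
{{Interrupts.callUnchecked(...)}} and declare only {{IOException}}.
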
> I don't know how good an idea pulling {{LineRecordReader}} and 
> {{XmlRecordReader}} into that package is, or how to handle the 
> {{InterruptedException}}. If there are more useful "<Format>RecordReader" 
> classes that could be implemented, I would like to know about them, so I 
> thought I would throw this up here before trying a PR. What do you think?
> References:
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/RecordReader.java
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
