[jira] [Commented] (TINKERPOP-1133) ScriptRecordReader should allow any class implementing/extending RecordReader to bust up blocks, not just LineRecordReader

Marko A. Rodriguez (JIRA) Sun, 07 Feb 2016 08:49:23 -0800

    [ 
https://issues.apache.org/jira/browse/TINKERPOP-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136309#comment-15136309
 ]


Marko A. Rodriguez commented on TINKERPOP-1133:
-----------------------------------------------

I (personally) don't want to make things more complicated than they are. If 
someone wants to feed XML data into Groovy via an InputFormat, they can just 
write their own InputFormat. It's really not that hard. As you can see, 
{{ScriptInputFormat}} is ~75 lines of code.

https://github.com/apache/incubator-tinkerpop/blob/master/hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.java
 (5 lines of code here)

https://github.com/apache/incubator-tinkerpop/blob/master/hadoop-gremlin/src/main/java/org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptRecordReader.java
 (100 lines of code here w/ 25 being just calling super).

We can make it even less of a chore by pulling out the {{ScriptElementFactory}} 
and making that accessible to others wanting to reuse that chunk (thus only ~50 
lines of code for their format) ... but going through the trouble of complicated 
configuration properties, testing, maintenance, and the like doesn't seem worth 
it to me.

{code}
// omit closing angled bracket to start block split before attributes
gremlin.hadoop.scriptInputFormat.xml.startTag=<myCustomer
gremlin.hadoop.scriptInputFormat.xml.endTag=</myCustomer>
// An idea for later, because the above has big issues with nested elements
gremlin.hadoop.scriptInputFormat.xml.xpath=/top/customer[position()<3]
{code}
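
For anyone weighing the startTag/endTag idea, the block splitting itself can be sketched in plain Java with no Hadoop dependency. This is only a minimal sketch under stated assumptions: it is not the Mahout {{XmlRecordReader}} and not TinkerPop code, the class name {{TagBlockSplitter}} and its {{split}} method are hypothetical, and it scans an in-memory string rather than a Hadoop input split.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of TinkerPop, Hadoop, or Mahout. */
final class TagBlockSplitter {

    private TagBlockSplitter() {}

    /**
     * Returns every substring that begins at startTag and ends with the
     * first endTag found after it. Text outside such blocks is dropped.
     */
    static List<String> split(String input, String startTag, String endTag) {
        List<String> blocks = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = input.indexOf(startTag, from);
            if (start < 0) {
                break; // no further blocks
            }
            int end = input.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated block: ignore the tail
            }
            end += endTag.length();
            blocks.add(input.substring(start, end));
            from = end; // resume scanning after this block
        }
        return blocks;
    }
}
```

Called with the two tag values from the properties above, {{TagBlockSplitter.split(xml, "<myCustomer", "</myCustomer>")}} returns one string per customer element, attributes included. Because the scan stops at the first end tag, a customer element nested inside another one is cut short, which is the nested-element problem the config comment mentions.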

XML is so crazy, with all the configurations needed ... and that's just XML. What 
about when someone says "What about YAML?" ... "What about my custom JSON -- 
why does it have to be only GraphSON?" "What about ............." I think 
TinkerPop provides sufficient coverage of most use cases, plus examples that 
allow people to extend with their custom code as need be. What would be great 
is if we had more JavaDoc and documentation around our 3 formats 
(Gryo/GraphSON/Script) so that others can more easily emulate them for their 
custom needs.


> ScriptRecordReader should allow any class implementing/extending RecordReader 
> to bust up blocks, not just LineRecordReader
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TINKERPOP-1133
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1133
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: documentation, hadoop, io
>    Affects Versions: 3.2.0-incubating, 3.1.2-incubating
>            Reporter: Dylan Bethune-Waddell
>            Priority: Minor
>
> I stuck a slightly modified {{XmlRecordReader}} class from the Apache Mahout 
> project into 
> {{org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptRecordReader}}
>  to bulk load XML with ScriptInputFormat, which I have notes on here:
> https://github.com/dylanht/thamyris
> I'm not sure what other formats would need a custom record reader, but why 
> not allow it and let any class that implements {{RecordReader}} feed the 
> user's groovy script? I was thinking the config would be something like:
> {code}
> // Enum for <Format>RecordReaders TinkerPop provides, otherwise fully
> // qualified class name
> gremlin.hadoop.scriptInputFormat.reader=XML // vs. LINE or a.b.myReader
> // omit closing angled bracket to start block split before attributes
> gremlin.hadoop.scriptInputFormat.xml.startTag=<myCustomer
> gremlin.hadoop.scriptInputFormat.xml.endTag=</myCustomer>
> // An idea for later, because the above has big issues with nested elements
> gremlin.hadoop.scriptInputFormat.xml.xpath=/top/customer[position()<3]
> {code}
> Hadoop's {{RecordReader}} interface has {{InterruptedException}} checked for 
> several methods, whereas {{LineRecordReader}} doesn't throw it for the 
> respective methods. That's fine if {{LineRecordReader}} is imported directly 
> as it is now, or {{XmlRecordReader}} is a weird hidden inner class the way I 
> had it before. But to initialize anything that implements {{RecordReader}}, 
> it seems {{LineRecordReader}} and {{XmlRecordReader}} both have to end up in 
> the org.apache.tinkerpop.gremlin.hadoop.structure.io.script package with 
> something like this added in:
> {code}
> // same for nextKeyValue, getCurrentKey, getCurrentValue, getProgress
> public void initialize() throws IOException, InterruptedException {
>     // doesn't enclose things in a try/catch as is
>     try {
>         // things
>     } catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>         throw new RuntimeException(e.getMessage(), e);
>     }
> }
> {code}
> I don't know how good an idea pulling {{LineRecordReader}} and 
> {{XmlRecordReader}} into that package is, or how to handle the 
> InterruptedException, and if there are more useful "<Format>RecordReader" 
> classes that could be implemented I would like to know about them, so I 
> thought I would throw this up here before trying a PR. What do you think?
> References:
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/RecordReader.java
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
