[
https://issues.apache.org/jira/browse/TINKERPOP-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15514557#comment-15514557
]
Dylan Bethune-Waddell commented on TINKERPOP-1133:
--------------------------------------------------
Hi [~spmallette], I should be looking back into this area fairly soon for a
revamp that may add XPath support to my XmlInputFormat approach. Self-closing
tags and elements of the same name at multiple levels of nesting badly break
the token-based file split on the opening and closing tags. When I get to these
improvements I will do the tutorial at the same time.
I might also have a decent custom "BulkLoader" tutorial I could do - I wrote a
"PrecisionBulkLoader" and "PrecisionBulkLoaderVertexProgram" which allow the
user to define a string traversal that will be executed against the graph
database as a ScriptTraversal to resolve whether an element being loaded
already exists prior to loading. Certain configurable strings give access to
the inV/outV/Edge properties of the elements being loaded so that the traversal
can be dynamic. If I manage to push that out in decent form, it would address
[TINKERPOP-1315|https://issues.apache.org/jira/browse/TINKERPOP-1315] and
[TINKERPOP-1099|https://issues.apache.org/jira/browse/TINKERPOP-1099] too.
Should I open a new ticket for each tutorial and link back to the related
issues so that everything can be closed out cleanly?
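
To illustrate the splitting problem mentioned above, here is a minimal sketch of a token-based split. It is not the actual XmlInputFormat or XmlRecordReader code; the class name NaiveTagSplitter and its split method are hypothetical, and the sketch only assumes that a record runs from a start token to the first end token that follows it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a token-based record splitter; not TinkerPop or Mahout code.
class NaiveTagSplitter {

    // Collects every span that begins at startTag and ends at the first endTag after it.
    static List<String> split(String xml, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(startTag, from);
            if (start < 0) {
                break; // no further opening token
            }
            int end = xml.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // opening token without a closing token
            }
            end += endTag.length();
            records.add(xml.substring(start, end));
            from = end; // resume scanning after this record
        }
        return records;
    }

    public static void main(String[] args) {
        // Same element name at two levels of nesting.
        String nested = "<item id=\"1\"><item id=\"2\"></item></item>";
        System.out.println(split(nested, "<item", "</item>"));

        // A self-closing element followed by a regular one.
        String selfClosing = "<item id=\"1\"/><item id=\"2\">x</item>";
        System.out.println(split(selfClosing, "<item", "</item>"));
    }
}
```

In the nested case the single record stops at the inner closing tag, so the outer element is left unbalanced. In the self-closing case both elements end up merged into one record, because the first closing token found belongs to the second element.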
> ScriptRecordReader should allow any class implementing/extending RecordReader
> to bust up blocks, not just LineRecordReader
> --------------------------------------------------------------------------------------------------------------------------
>
> Key: TINKERPOP-1133
> URL: https://issues.apache.org/jira/browse/TINKERPOP-1133
> Project: TinkerPop
> Issue Type: Improvement
> Components: documentation, hadoop, io
> Affects Versions: 3.2.0-incubating, 3.1.2-incubating
> Reporter: Dylan Bethune-Waddell
> Priority: Minor
>
> I stuck a slightly modified {{XmlRecordReader}} class from the Apache Mahout
> project into
> {{org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptRecordReader}}
> to bulk load XML with ScriptInputFormat, which I have notes on here:
> https://github.com/dylanht/thamyris
> I'm not sure what other formats would need a custom record reader, but why
> not allow it and let any class that implements {{RecordReader}} feed the
> user's groovy script? I was thinking the config would be something like:
> {code}
> // Enum for <Format>RecordReaders TinkerPop provides, otherwise fully
> // qualified class name
> gremlin.hadoop.scriptInputFormat.reader=XML // vs. LINE or a.b.myReader
> // omit closing angled bracket to start block split before attributes
> gremlin.hadoop.scriptInputFormat.xml.startTag=<myCustomer
> gremlin.hadoop.scriptInputFormat.xml.endTag=</myCustomer>
> // An idea for later, because the above has big issues with nested elements
> gremlin.hadoop.scriptInputFormat.xml.xpath=/top/customer[position()<3]
> {code}
> Hadoop's {{RecordReader}} declares the checked {{InterruptedException}} on
> several methods, whereas {{LineRecordReader}} doesn't declare it on the
> respective methods. That's fine if {{LineRecordReader}} is imported directly
> as it is now, or {{XmlRecordReader}} is a weird hidden inner class the way I
> had it before. But to initialize anything that implements {{RecordReader}},
> it seems {{LineRecordReader}} and {{XmlRecordReader}} both have to end up in
> the org.apache.tinkerpop.gremlin.hadoop.structure.io.script package with
> something like this added in:
> {code}
> // same for nextKeyValue, getCurrentKey, getCurrentValue, getProgress
> public void initialize() throws IOException, InterruptedException {
>     // doesn't enclose things in a try/catch as is
>     try {
>         // things
>     } catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>         throw new RuntimeException(e.getMessage(), e);
>     }
> }
> {code}
> I don't know how good an idea pulling {{LineRecordReader}} and
> {{XmlRecordReader}} into that package is, or how to handle the
> InterruptedException, and if there are more useful "<Format>RecordReader"
> classes that could be implemented I would like to know about them, so I
> thought I would throw this up here before trying a PR. What do you think?
> References:
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/RecordReader.java
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java
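
As a rough sketch of how the proposed gremlin.hadoop.scriptInputFormat.reader value from the description could be interpreted, the snippet below maps a short alias to a class name and otherwise treats the value as a fully qualified class name. The class ReaderNameResolver, the fallback to LINE when nothing is configured, and the package chosen for XmlRecordReader are all assumptions made for illustration; none of this exists in TinkerPop.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; it only resolves names and does not touch Hadoop itself.
class ReaderNameResolver {

    private static final Map<String, String> BUILT_IN = new HashMap<>();
    static {
        BUILT_IN.put("LINE", "org.apache.hadoop.mapreduce.lib.input.LineRecordReader");
        // Assumed location, following the package suggested in the issue description.
        BUILT_IN.put("XML", "org.apache.tinkerpop.gremlin.hadoop.structure.io.script.XmlRecordReader");
    }

    // Returns the class name to load for the configured reader value.
    static String resolve(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            return BUILT_IN.get("LINE"); // assumption: keep the current line-based behaviour
        }
        String value = configured.trim();
        String builtIn = BUILT_IN.get(value.toUpperCase(Locale.ROOT));
        // Anything that is not a known alias is treated as a fully qualified class name.
        return builtIn != null ? builtIn : value;
    }

    public static void main(String[] args) {
        System.out.println(resolve("XML"));
        System.out.println(resolve("a.b.myReader"));
    }
}
```

The resolved name could then be loaded reflectively, for example with Class.forName, assuming the reader class has a no-argument constructor. That step is left out here so the sketch stays free of Hadoop dependencies.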
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)