[
https://issues.apache.org/jira/browse/DRILL-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991442#comment-14991442
]
Magnus Pierre commented on DRILL-3878:
--------------------------------------
Hello,
I have a simple implementation of a format converter that converts XML to JSON
and runs the result through Drill's JSONRecordReader, which works fine for the
test data I have available. The concept works well and the performance is
decent, but the converter builds the complete JSON document in memory before
handing it over to the JSONRecordReader, and that is an issue for larger
documents. Currently I am using a home-grown SAX parser that builds the JSON
document using org.json classes. However, there are DOM variants that can also
do XSD validation and so on. In order to plug directly into JSONRecordReader
without having to duplicate the code, embeddedInfo, hadoopPath, and stream
either need to be changed from private to protected, or getters and setters
need to be provided.
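
For readers who want a concrete picture of the approach described above, here
is a minimal sketch of a SAX handler that builds the whole document in memory
and then serializes it as JSON. It is not the converter discussed in this
comment: the class name XmlToJsonSketch, the "@" prefix for attributes, and the
"#text" key are assumptions made for illustration, and plain java.util
collections stand in for the org.json classes so the example only needs the
JDK. It does not touch JSONRecordReader at all.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: XML to JSON via SAX, with the full tree held in memory. */
class XmlToJsonSketch extends DefaultHandler {
    private final Map<String, Object> root = new LinkedHashMap<>();
    private final Deque<Map<String, Object>> stack = new ArrayDeque<>();
    private final Deque<StringBuilder> text = new ArrayDeque<>();

    XmlToJsonSketch() { stack.push(root); }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        Map<String, Object> node = new LinkedHashMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
            node.put("@" + atts.getQName(i), atts.getValue(i)); // attributes become "@name" keys
        }
        stack.push(node);
        text.push(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!text.isEmpty()) text.peek().append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        Map<String, Object> node = stack.pop();
        String body = text.pop().toString().trim();
        Object value = node;
        if (node.isEmpty()) value = body;                  // leaf element: plain string
        else if (!body.isEmpty()) node.put("#text", body); // text next to attributes/children
        add(stack.peek(), qName, value);
    }

    // Repeated sibling names are collected into a JSON array.
    @SuppressWarnings("unchecked")
    private static void add(Map<String, Object> parent, String key, Object value) {
        Object existing = parent.get(key);
        if (existing == null) {
            parent.put(key, value);
        } else if (existing instanceof List) {
            ((List<Object>) existing).add(value);
        } else {
            List<Object> list = new ArrayList<>();
            list.add(existing);
            list.add(value);
            parent.put(key, list);
        }
    }

    static String toJson(Object v) {
        if (v instanceof Map) {
            StringBuilder sb = new StringBuilder("{");
            for (Map.Entry<?, ?> e : ((Map<?, ?>) v).entrySet()) {
                if (sb.length() > 1) sb.append(',');
                sb.append(quote(e.getKey().toString())).append(':').append(toJson(e.getValue()));
            }
            return sb.append('}').toString();
        }
        if (v instanceof List) {
            StringBuilder sb = new StringBuilder("[");
            for (Object o : (List<?>) v) {
                if (sb.length() > 1) sb.append(',');
                sb.append(toJson(o));
            }
            return sb.append(']').toString();
        }
        return quote(v.toString());
    }

    private static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') sb.append('\\').append(c);
            else if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
            else sb.append(c);
        }
        return sb.append('"').toString();
    }

    /** Parses the XML string and returns the JSON text; the whole tree lives in memory. */
    static String convert(String xml) {
        try {
            XmlToJsonSketch handler = new XmlToJsonSketch();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return toJson(handler.root);
        } catch (Exception e) {
            throw new IllegalStateException("XML parse failed", e);
        }
    }
}
```

With this sketch, XmlToJsonSketch.convert("<a id=\"1\"><b>x</b><b>y</b></a>")
yields {"a":{"@id":"1","b":["x","y"]}}. Because nothing is emitted until the
root element closes, memory use grows with document size, which is the
limitation mentioned above.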
Regarding XSDs, I am considering whether the dfs configuration could take an
additional per-workspace option for the XML file type that holds a list/array
of XSDs, so that any document in that workspace must adhere to the referenced
XSDs; otherwise it will not be considered by Drill.
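
As a rough illustration of the filtering idea above, the sketch below checks a
document against a list of XSDs with the standard javax.xml.validation API.
The class and method names (XsdFilterSketch, conforms) are hypothetical, no
such workspace option exists in Drill's dfs configuration as far as this
comment states, and treating the list as one combined schema is an assumption
about how "adhere to the XSDs" would be interpreted.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Hypothetical sketch: decide whether a document should be considered at all. */
class XsdFilterSketch {
    /** Returns true if the XML validates against the schema built from all XSD texts. */
    static boolean conforms(String xml, List<String> xsds) {
        Schema schema;
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Source[] sources = xsds.stream()
                .map(x -> new StreamSource(new StringReader(x)))
                .toArray(Source[]::new);
            schema = factory.newSchema(sources); // assumption: the XSD list forms one schema
        } catch (SAXException e) {
            throw new IllegalArgumentException("invalid XSD in list", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // malformed or non-conforming: the document would be skipped
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A reader plugin could call such a check before conversion and skip any file
for which it returns false.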
I will fill in the document, but I believe adding information in the JIRA
itself makes it more visible to other people in the community.
Best regards,
Magnus
> Support XML Querying (selects/projections, no writing)
> ------------------------------------------------------
>
> Key: DRILL-3878
> URL: https://issues.apache.org/jira/browse/DRILL-3878
> Project: Apache Drill
> Issue Type: New Feature
> Affects Versions: Future
> Reporter: Edmon Begoli
> Labels: features
> Fix For: Future
>
> Original Estimate: 3,360h
> Remaining Estimate: 3,360h
>
> Support querying of XML documents (as read-only selects).
> Writing should be implemented as a different feature that brings its own set
> of challenges.
> To be considered: reading of trivial, schema-less XML documents,
> DTD-oriented ones, and schema-defined ones.
> Also, we should consider direct querying vs. using converter tools to change
> the representation from XML to JSON, CSV, etc.
> Design and Implementation discussion, notes, ideas and implementation
> suggestions should be captured here:
> https://docs.google.com/document/d/1oS-cObSaTlAmuW_XghDLmHbBEorLl0z-axaHnjy7vg0/edit?usp=sharing
>
> (no vandalism, please)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)