I'm not familiar with EDI, but one option might be spark-xml-utils 
(https://github.com/elsevierlabs-os/spark-xml-utils). You could transform your 
XML into the XML format required by the xml-to-json function and then return 
the JSON. Spark-xml-utils wraps the open source Saxon project and supports 
XPath, XQuery, and XSLT. It doesn't parallelize the parsing of an individual 
document, but if your documents are split across a cluster, the processing can 
be parallelized. We use this package extensively within our company to process 
millions of XML records. If you happen to be attending Spark Summit in a few 
months, someone will be presenting on this topic 
(https://databricks.com/session/mining-the-worlds-science-large-scale-data-matching-and-integration-from-xml-corpora).
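
To make the "one document per task" idea concrete, here is a minimal Python sketch. It uses plain PySpark rather than spark-xml-utils, and it rests on several assumptions: each input file holds one complete X12 interchange, segments end with `~`, and elements are separated by `*` (a real parser would read the delimiters from the ISA header). The names `parse_x12`, `to_json_record`, and `run_on_spark` are made up for illustration.

```python
import json


def parse_x12(text, segment_terminator="~", element_separator="*"):
    """Split one X12 interchange into a list of {"id", "elements"} dicts.

    Assumption: the delimiters are passed in by the caller; a real parser
    would take them from the ISA header of the interchange.
    """
    segments = []
    for raw in text.split(segment_terminator):
        raw = raw.strip()
        if not raw:
            continue  # skip the empty piece after the last terminator
        parts = raw.split(element_separator)
        segments.append({"id": parts[0], "elements": parts[1:]})
    return segments


def to_json_record(path_and_text):
    """Turn a (path, content) pair into a single JSON string."""
    path, text = path_and_text
    return json.dumps({"source": path, "segments": parse_x12(text)})


def run_on_spark(input_glob, output_dir):
    """Parse every file matching input_glob, one whole file per record."""
    # Imported here so the pure-Python parser stays usable without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("edi-to-json").getOrCreate()
    # wholeTextFiles yields (path, content) pairs, so a document is never
    # split across tasks; parallelism comes from having many documents.
    docs = spark.sparkContext.wholeTextFiles(input_glob)
    docs.map(to_json_record).saveAsTextFile(output_dir)
    spark.stop()
```

Because the parsing functions are ordinary Python, they can be unit tested without a cluster, and only `run_on_spark` depends on Spark. Parallelism here is across documents only, so a single very large file would still be parsed by one task.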

Below is a snippet for xquery.
    let $retval :=
      <map>
        <string key="doi">{$doi}</string>
        <string key="cid">{$cid}</string>
        <string key="pii">{$pii}</string>
        <string key="contentType">{$content-type}</string>
        <string key="srctitle">{$srctitle}</string>
        <string key="documentType">{$document-type}</string>
        <string key="documentSubtype">{$document-subtype}</string>
        <string key="publicationDate">{$publication-date}</string>
        <string key="articleTitle">{$article-title}</string>
        <string key="issn">{$issn}</string>
        <string key="isbn">{$isbn}</string>
        <string key="lang">{$lang}</string>
        {$tables}
      </map>
    return xml-to-json($retval)
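
For readers who have not used xml-to-json, the following Python sketch gives a rough idea of how a `<map>` of keyed `<string>` elements corresponds to a JSON object. It is only an emulation written for illustration, not Saxon's implementation: it ignores namespaces and escaping rules, and the sample DOI is a placeholder value.

```python
import json
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def element_to_value(elem):
    """Convert one map/array/string/number/boolean/null element."""
    kind = local_name(elem.tag)
    if kind == "map":
        # Each child carries its JSON property name in the "key" attribute.
        return {child.get("key"): element_to_value(child) for child in elem}
    if kind == "array":
        return [element_to_value(child) for child in elem]
    if kind == "string":
        return elem.text or ""
    if kind == "number":
        text = elem.text.strip()
        return float(text) if any(c in text for c in ".eE") else int(text)
    if kind == "boolean":
        return elem.text.strip() in ("true", "1")
    if kind == "null":
        return None
    raise ValueError("unexpected element: " + kind)


def xml_to_json(xml_text):
    """Rough stand-in for xml-to-json, for illustration only."""
    return json.dumps(element_to_value(ET.fromstring(xml_text.strip())))


# Placeholder input in the same shape as the XQuery result above.
sample = """
<map xmlns="http://www.w3.org/2005/xpath-functions">
  <string key="doi">10.1000/example</string>
  <string key="lang">en</string>
</map>
"""
print(xml_to_json(sample))  # {"doi": "10.1000/example", "lang": "en"}
```

In the XPath 3.1 function library these elements live in the http://www.w3.org/2005/xpath-functions namespace, so the XQuery snippet above presumably relies on that namespace being declared elsewhere in the query.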

Darin.
On Tuesday, March 13, 2018, 8:52:42 AM EDT, Aakash Basu 
<aakash.spark....@gmail.com> wrote:

Hi Jörn,

Thanks for the quick reply. I have already built an EDI-to-JSON parser from 
scratch using the 811 and 820 standard mapping document. It can handle any 
standard and any type of EDI. But my implementation is in native Python and 
doesn't leverage Spark's parallel processing, which I want to use for very 
large volumes of EDI data.

Any pointers on that?

Thanks,
Aakash.

On Tue, Mar 13, 2018 at 3:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:

Maybe there are commercial ones. You could also use one of the open source 
parsers for XML.

However, XML is very inefficient and you need to do a lot of tricks to make it 
run in parallel. This also depends on the type of EDI message etc. 
Sophisticated unit testing and performance testing are key.

Nevertheless, it is not as difficult as I may have made it sound.

> On 13. Mar 2018, at 10:36, Aakash Basu <aakash.spark....@gmail.com> wrote:
>
> Hi,
>
> Has anyone built a parallel, large-scale X12 EDI parser to XML or JSON using 
> Spark?
>
> Thanks,
> Aakash.


  
