Ah. Have you tried Jackson?
https://github.com/FasterXML/jackson-dataformat-xml/blob/master/README.md


_____________________________
From: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>
Sent: Friday, August 19, 2016 9:41 PM
Subject: Re: Best way to read XML data from RDD
To: Felix Cheung <felixcheun...@hotmail.com>, user <user@spark.apache.org>


Yes. It accepts an XML file as a source, but not an RDD. The XML data is
embedded inside JSON that is streamed from a Kafka cluster, so I can get it as
an RDD. Right now I am using the scala.xml XML.loadString method inside an RDD
map function, but I am not happy with the performance: it takes 4 minutes to
parse the XML from 2 million messages in a 3-node environment with 100G and
4 CPUs each.
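
For reference, one variation on the per-message parsing described above is to
create the parser factory once per partition and use a streaming (StAX) reader
rather than building a full tree for each message. The following is only a
minimal sketch using the JDK's javax.xml.stream API, not a measured
improvement: the class name PartitionXmlParser, the method extractAll and the
tag name passed to it are hypothetical, and it assumes each message is a
small, well-formed XML string whose target element contains only text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class PartitionXmlParser {

    // Hypothetical helper: returns the text of the first <tag> element found in
    // each XML string, or null when a message has no such element.
    // The XMLInputFactory is created once per call, so when this is invoked
    // once per partition the factory lookup is not repeated for every message.
    public static List<String> extractAll(Iterator<String> xmlMessages, String tag)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        List<String> values = new ArrayList<>();
        while (xmlMessages.hasNext()) {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xmlMessages.next()));
            try {
                String found = null;
                // Stop scanning as soon as the first matching element is read.
                while (found == null && reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && tag.equals(reader.getLocalName())) {
                        // Assumes the element holds text only (no child elements).
                        found = reader.getElementText();
                    }
                }
                values.add(found);
            } finally {
                reader.close();
            }
        }
        return values;
    }
}
```

With Spark's Java API, a method like this would be called from mapPartitions
on the RDD of XML strings; the exact return type mapPartitions expects
(Iterable or Iterator) depends on the Spark version, so that wiring is left
out here. Whether it beats XML.loadString for a given workload would need to
be measured.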




-------- Original message --------
From: Felix Cheung <felixcheun...@hotmail.com>
Date: 20/08/2016 09:49 (GMT+05:30)
To: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>, user <user@spark.apache.org>
Cc:
Subject: Re: Best way to read XML data from RDD

Have you tried spark-xml?
https://github.com/databricks/spark-xml




On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar Dhanuskodi"
<diwakar.dhanusk...@gmail.com> wrote:

Hi,

There is an RDD with JSON data. I can read the JSON data using read.json. The
JSON data has XML data in a couple of key-value pairs.

Which is the best method to read and parse XML from an RDD? Are there any XML
libraries specifically for Spark? Could anyone help with this?

Thanks.

