Hello,
I have a Pig script that extracts sequence files using the SequenceFileLoader()
function. I can extract the XML, but when I try to parse it using the
ElementTree.py or minidom.py scripts, I get an error stating 'an internal error
occurred inside the function while returning'. My question is: can the output
of SequenceFileLoader be parsed by feeding it directly to a UDF, or does the
string need to be formatted before it is passed as an argument? One alternative
is to store the output to HDFS as an .xml file and then use the XMLLoader
function in Pig to parse it, but I want to do it on the fly, bypassing the
store step.
REGISTER /usr/lib/pig/piggybank.jar;
REGISTER '/usr/lib64/python2.6/xml/etree/ElementTree.py' USING jython AS myudf;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
a = LOAD '/data/appl/20142803/hq.seq' USING SequenceFileLoader('/u001') AS
    (key:chararray, value:chararray);
b = FILTER a BY key == 'crt.xml';
c = FOREACH b GENERATE myudf.fromstring(value);
DUMP c;
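If a thin wrapper UDF is what is needed here, I imagine it would look roughly
like the sketch below. This is only an assumption on my part: fromstring()
returns an Element object rather than a Pig type, which may be what triggers
the error when the value is returned to Pig. The function name root_tag, the
choice to return just the root tag, and the output schema are all hypothetical
and only meant to illustrate returning a plain chararray.

```python
import xml.etree.ElementTree as ET

# Pig makes outputSchema available when a script is registered "using jython".
# The fallback below is only so the file can also be run with plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


@outputSchema("root_tag:chararray")
def root_tag(xml_text):
    """Hypothetical UDF: return the root element's tag, or None on bad input."""
    if xml_text is None:
        return None
    try:
        root = ET.fromstring(xml_text)
    except Exception:
        # The exact parse-error class differs between ElementTree versions,
        # so this sketch catches broadly and lets Pig see a null instead.
        return None
    # Return a plain string so Pig can map the result to chararray.
    return root.tag
```

Assuming the file were saved as xml_udf.py (a made-up name), it would be
registered in place of ElementTree.py and called as myudf.root_tag(value).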
Please let me know whether the parsing can be done on the fly as above.
Thank you in advance for your help in this regard.
Thanks,
Debashish Dhar