Part of the problem might be that the regexp has <COMPANY>(.*)<COMPANY> but you need <COMPANY>(.*)</COMPANY>.

Using regexps to parse XML is awfully brittle. An alternative is to use a UDF that calls out to an XML parser. I use ElementTree from Python UDFs.

Will Dowling

________________________________________
From: Muni mahesh [[email protected]]
Sent: Wednesday, August 21, 2013 6:58 AM
To: [email protected]; [email protected]
Subject: can't parse the values using XML loader

*Input file:*

<CATALOG>
<CD>
<TITLE>hadoop developer</TITLE>
<ARTIST>ajay</ARTIST>
<COUNTRY>india</COUNTRY>
<COMPANY>ITC</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>2013</YEAR>
</CD>
</CATALOG>

============================================================

*Pig Script:*

register /usr/lib/pig/piggybank.jar;

A = load '/home/sudeep/Desktop/CATALOG.xml' using org.apache.pig.piggybank.storage.XMLLoader('CATALOG') as (x: chararray);

B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'<CATALOG>\\n*<CD>\\n<TITLE>(.*)</TITLE>\\n\\s*<ARTIST>(.*)</ARTIST>\\n\\s*<COUNTRY>(.*)</COUNTRY>\\n\\s*<COMPANY>(.*)<COMPANY>\\n\\s*<PRICE>(.*)</PRICE>\\n\\s*<YEAR>(.*)</YEAR>\\n\\s*</CD>\\n*</CATALOG>')) as (id: int, name:chararray);

*Output Expected:*

(hadoop, ajay, india, ITC, 10.90, 2013)

*Issue:*

But the output I am getting is:

()

I suspect it is not able to parse the values between the tags.
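For reference, the ElementTree approach mentioned in the reply could look roughly like the sketch below. This is not the code from the reply, only a minimal illustration: the function name parse_catalog, the FIELDS tuple, and the SAMPLE string are made up here, and the field names are taken from the sample input in the question. It assumes the whole <CATALOG> record arrives as one string, which is what XMLLoader('CATALOG') is expected to hand over.

```python
import xml.etree.ElementTree as ET

# Child element names, taken from the sample input in the question.
FIELDS = ("TITLE", "ARTIST", "COUNTRY", "COMPANY", "PRICE", "YEAR")

# Sample record copied from the question (hypothetical test input).
SAMPLE = """<CATALOG>
<CD>
<TITLE>hadoop developer</TITLE>
<ARTIST>ajay</ARTIST>
<COUNTRY>india</COUNTRY>
<COMPANY>ITC</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>2013</YEAR>
</CD>
</CATALOG>"""


def parse_catalog(xml_text):
    """Return one tuple of field values per <CD> child of the root element."""
    root = ET.fromstring(xml_text)
    rows = []
    # findall("CD") matches direct <CD> children of the <CATALOG> root.
    for cd in root.findall("CD"):
        # findtext returns None if a child element is missing,
        # so an incomplete record does not raise an error.
        rows.append(tuple(cd.findtext(name) for name in FIELDS))
    return rows


if __name__ == "__main__":
    print(parse_catalog(SAMPLE))
```

To call something like this from Pig, the function would also need an output schema declared for it and the script registered as a Jython UDF; those details are left out here because they depend on the Pig version. Unlike the regexp, the parser does not care about newlines or indentation between the tags, and a mismatched closing tag raises a parse error instead of silently producing an empty tuple.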
