Hello,

Moreover, REGEX_EXTRACT_ALL uses Matcher.matches(), which tries to match the entire input string against the pattern rather than just a part of it. You may want to write your own regex UDF (if you are not going the route suggested by Will) that uses Matcher.find() instead of Matcher.matches().
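
To illustrate the difference, here is a small standalone Java sketch. It is not the Pig UDF itself: the class name FindVsMatches, the two helper methods, and the shortened sample record are made up for illustration. It only shows how the two java.util.regex.Matcher calls behave on the same input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Pig or piggybank.
class FindVsMatches {

    // Returns group 1 only if the WHOLE input matches the pattern, else null.
    static String extractWithMatches(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    // Returns group 1 of the first match found ANYWHERE in the input, else null.
    static String extractWithFind(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened sample record (an assumption, not the full CATALOG.xml).
        String record = "<CD>\n<TITLE>hadoop developer</TITLE>\n<COMPANY>ITC</COMPANY>\n</CD>";
        String regex = "<COMPANY>(.*)</COMPANY>";

        System.out.println(extractWithMatches(record, regex)); // null
        System.out.println(extractWithFind(record, regex));    // ITC
    }
}
```

With matches(), the pattern has to cover everything from the first character of the record to the last, so a pattern for a single tag returns nothing. With find(), the same pattern picks up the tag wherever it occurs. Note also that "." does not match newlines by default, so (.*) stays within one line unless Pattern.DOTALL is used.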

Regards,
Amit

________________________________
From: "[email protected]" <[email protected]>
To: [email protected]; [email protected]
Sent: Wednesday, August 21, 2013 12:19 PM
Subject: RE: can't parse the values using XML loader

Part of the problem might be that the regexp has <COMPANY>(.*)<COMPANY> but you need <COMPANY>(.*)</COMPANY>

Using regexps to parse XML is awfully brittle. An alternative is to use a UDF that calls out to an XML parser. I use ElementTree from Python UDFs.

Will Dowling

________________________________________
From: Muni mahesh [[email protected]]
Sent: Wednesday, August 21, 2013 6:58 AM
To: [email protected]; [email protected]
Subject: can't parse the values using XML loader

*Input file :*

<CATALOG>
<CD>
<TITLE>hadoop developer</TITLE>
<ARTIST>ajay</ARTIST>
<COUNTRY>india</COUNTRY>
<COMPANY>ITC</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>2013</YEAR>
</CD>
</CATALOG>

============================================================================================================================================

*Pig Script:*

register /usr/lib/pig/piggybank.jar;

A = load '/home/sudeep/Desktop/CATALOG.xml' using org.apache.pig.piggybank.storage.XMLLoader('CATALOG') as (x: chararray);

B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'<CATALOG>\\n*<CD>\\n<TITLE>(.*)</TITLE>\\n\\s*<ARTIST>(.*)</ARTIST>\\n\\s*<COUNTRY>(.*)</COUNTRY>\\n\\s*<COMPANY>(.*)<COMPANY>\\n\\s*<PRICE>(.*)</PRICE>\\n\\s*<YEAR>(.*)</YEAR>\\n\\s*</CD>\\n*</CATALOG>')) as (id: int, name:chararray);

*Output Expected :*

(hadoop, ajay, india, ITC, 10.90, 2013)

*Issue :*

But the output I am getting is:

()

I suspect it is not able to parse the values between the tags.
