Hello,
Moreover, REGEX_EXTRACT_ALL uses Matcher.matches(), which tries to match the 
entire input string against the pattern rather than parts of it. You may want 
to write your own regex UDF (if you are not going the route suggested by Will) 
that uses Matcher.find() instead of Matcher.matches().
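
To illustrate the difference, here is a small sketch in Python rather than 
Java. Python's re.fullmatch() requires the whole string to match, much like 
Matcher.matches(), while re.search() looks for a match anywhere, much like 
Matcher.find(). The sample record and the names below are made up for 
illustration and are not the actual XMLLoader output.

```python
import re

# Hypothetical sample record (assumption: not the real XMLLoader output).
record = "<CD>\n<COMPANY>ITC</COMPANY>\n<PRICE>10.90</PRICE>\n</CD>"
pattern = re.compile(r"<COMPANY>(.*)</COMPANY>")

# Whole-string match, comparable to Matcher.matches(): fails because the
# pattern does not account for the text around the COMPANY element.
full = pattern.fullmatch(record)

# Substring search, comparable to Matcher.find(): succeeds anywhere in the record.
found = pattern.search(record)

full_result = full.group(1) if full else None
found_result = found.group(1) if found else None
print(full_result, found_result)
```

With whole-string matching, the pattern has to describe every character of 
the record, including newlines and whitespace, which is why a small mismatch 
anywhere makes the whole match fail.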


 
Regards,
Amit



________________________________
 From: "[email protected]" <[email protected]>
To: [email protected]; [email protected] 
Sent: Wednesday, August 21, 2013 12:19 PM
Subject: RE: can't parse the values using XML loader
 

Part of the problem might be that the regexp has

<COMPANY>(.*)<COMPANY>

but you need
<COMPANY>(.*)</COMPANY>

Using regexps to parse XML is awfully brittle. An alternative is to use a UDF 
that calls out to an XML parser. I use ElementTree from Python UDFs.
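
As a rough sketch of that approach, the parsing part could look something 
like the following. The function name parse_catalog and the FIELDS tuple are 
assumptions for illustration, and the sample record is copied from the input 
file in the original question. A real Pig Python UDF would also need an 
output schema declaration and registration in the script, which are left out 
here so the sketch runs standalone.

```python
import xml.etree.ElementTree as ET

# Sample record copied from the input file in the original question.
CATALOG_XML = """<CATALOG>
<CD>
<TITLE>hadoop developer</TITLE>
<ARTIST>ajay</ARTIST>
<COUNTRY>india</COUNTRY>
<COMPANY>ITC</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>2013</YEAR>
</CD>
</CATALOG>"""

# Child elements to extract from each <CD>, in output order (illustrative).
FIELDS = ("TITLE", "ARTIST", "COUNTRY", "COMPANY", "PRICE", "YEAR")


def parse_catalog(xml_text):
    """Return one tuple of field values per <CD> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [
        tuple(cd.findtext(name) for name in FIELDS)
        for cd in root.iter("CD")
    ]


rows = parse_catalog(CATALOG_XML)
print(rows)
```

Because the parser handles whitespace and closing tags itself, a typo like 
the missing slash above would not silently produce an empty tuple.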

Will Dowling

________________________________________
From: Muni mahesh [[email protected]]
Sent: Wednesday, August 21, 2013 6:58 AM
To: [email protected]; [email protected]
Subject: can't parse the values using XML loader

*Input file :*

<CATALOG>
<CD>
<TITLE>hadoop developer</TITLE>
<ARTIST>ajay</ARTIST>
<COUNTRY>india</COUNTRY>
<COMPANY>ITC</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>2013</YEAR>
</CD>
</CATALOG>
============================================================================================================================================
*Pig Script:*

register /usr/lib/pig/piggybank.jar;

A = load '/home/sudeep/Desktop/CATALOG.xml' using
org.apache.pig.piggybank.storage.XMLLoader('CATALOG') as (x:
chararray);


B = foreach A GENERATE
FLATTEN(REGEX_EXTRACT_ALL(x,'<CATALOG>\\n*<CD>\\n<TITLE>(.*)</TITLE>\\n\\s*<ARTIST>(.*)</ARTIST>\\n\\s*<COUNTRY>(.*)</COUNTRY>\\n\\s*<COMPANY>(.*)<COMPANY>\\n\\s*<PRICE>(.*)</PRICE>\\n\\s*<YEAR>(.*)</YEAR>\\n\\s*</CD>\\n*</CATALOG>'))
as (id: int, name:chararray);


*Output Expected :*

(hadoop, ajay, india, ITC, 10.90, 2013)

*Issue :*

But the output I am getting is:

()

I think it is not able to parse the values between the tags.
