Hi,

Is there any REGEX UDF available for this sort of problem? Thanks in advance.


On Wed, Aug 21, 2013 at 10:36 PM, Amit <[email protected]> wrote:

> Hello,
> Moreover, REGEX_EXTRACT_ALL uses Matcher.matches(), which tries to match the
> entire input string against the pattern rather than a part of it. You may
> want to write your own REGEX UDF (if you are not going the route suggested
> by Will) that uses Matcher.find() instead of Matcher.matches().
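
To illustrate the difference described above, here is a minimal sketch in Python rather than Java. It assumes that Python's re.fullmatch() and re.search() are close enough analogues of Matcher.matches() and Matcher.find() for this purpose; it is not how Pig implements REGEX_EXTRACT_ALL. The sample record is copied from the input file quoted further down, and the variable names are only illustrative.

```python
import re

# Sample record, copied from the input file quoted later in this thread.
record = "\n".join([
    "<CATALOG>",
    "<CD>",
    "<TITLE>hadoop developer</TITLE>",
    "<ARTIST>ajay</ARTIST>",
    "<COUNTRY>india</COUNTRY>",
    "<COMPANY>ITC</COMPANY>",
    "<PRICE>10.90</PRICE>",
    "<YEAR>2013</YEAR>",
    "</CD>",
    "</CATALOG>",
])

# A pattern that describes only one tag, not the whole record.
company = re.compile(r"<COMPANY>(.*)</COMPANY>")

# Whole-string match (the analogue of Matcher.matches()): the pattern must
# cover the record from its first character to its last, so this fails.
whole = company.fullmatch(record)

# Substring search (the analogue of Matcher.find()): the pattern may match
# anywhere inside the record, so this succeeds.
part = company.search(record)

print(whole)          # None
print(part.group(1))  # ITC
```

A UDF built on find() could therefore pull out single tags without a pattern that spells out the entire record.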
>
>
>
> Regards,
> Amit
>
>
>
> ________________________________
>  From: "[email protected]" <
> [email protected]>
> To: [email protected]; [email protected]
> Sent: Wednesday, August 21, 2013 12:19 PM
> Subject: RE: can't parse the values using XML loader
>
>
> Part of the problem might be that the regexp has
>
> <COMPANY>(.*)<COMPANY>
>
> but you need
> <COMPANY>(.*)</COMPANY>
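
The effect of the missing slash can be checked outside Pig. The sketch below uses Python's re module on the sample input quoted further down. It assumes the record handed over by XMLLoader has exactly this line layout, which may not hold in practice. The pattern is the one from the script, written with single backslashes because it sits in a Python raw string rather than a Pig string literal.

```python
import re

# Sample record, copied from the input file quoted later in this thread.
record = "\n".join([
    "<CATALOG>",
    "<CD>",
    "<TITLE>hadoop developer</TITLE>",
    "<ARTIST>ajay</ARTIST>",
    "<COUNTRY>india</COUNTRY>",
    "<COMPANY>ITC</COMPANY>",
    "<PRICE>10.90</PRICE>",
    "<YEAR>2013</YEAR>",
    "</CD>",
    "</CATALOG>",
])

# Pattern from the script, with the closing tag corrected to </COMPANY>.
corrected = (
    r"<CATALOG>\n*<CD>\n<TITLE>(.*)</TITLE>\n\s*<ARTIST>(.*)</ARTIST>"
    r"\n\s*<COUNTRY>(.*)</COUNTRY>\n\s*<COMPANY>(.*)</COMPANY>"
    r"\n\s*<PRICE>(.*)</PRICE>\n\s*<YEAR>(.*)</YEAR>\n\s*</CD>\n*</CATALOG>"
)

# The pattern as originally posted: the COMPANY closing tag lacks its slash.
as_posted = corrected.replace("</COMPANY>", "<COMPANY>")

# fullmatch() requires the whole record to match, like Matcher.matches().
good = re.fullmatch(corrected, record)
bad = re.fullmatch(as_posted, record)

print(good.groups())
# ('hadoop developer', 'ajay', 'india', 'ITC', '10.90', '2013')
print(bad)
# None
```

Note that the pattern captures six groups, so the "as" clause in the script would need six fields as well.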
>
> Using regexps to parse XML is awfully brittle. An alternative is to use a
> UDF that calls out to an XML parser. I use ElementTree from Python UDFs.
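
A minimal sketch of the parser-based approach, using ElementTree from the Python standard library on the sample input quoted further down. The function name parse_catalog and the FIELDS tuple are hypothetical, and the registration of the function as a Pig UDF is deliberately left out, so this only shows the parsing step.

```python
import xml.etree.ElementTree as ET

# Child tags to extract from each <CD> element, in output order.
FIELDS = ("TITLE", "ARTIST", "COUNTRY", "COMPANY", "PRICE", "YEAR")


def parse_catalog(xml_text):
    """Return one tuple of field values per <CD> element in xml_text."""
    root = ET.fromstring(xml_text)
    rows = []
    for cd in root.iter("CD"):
        # findtext() returns the default when a child tag is missing.
        rows.append(tuple(cd.findtext(tag, default="") for tag in FIELDS))
    return rows


# Sample record, copied from the input file quoted later in this thread.
sample = """<CATALOG>
<CD>
<TITLE>hadoop developer</TITLE>
<ARTIST>ajay</ARTIST>
<COUNTRY>india</COUNTRY>
<COMPANY>ITC</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>2013</YEAR>
</CD>
</CATALOG>"""

print(parse_catalog(sample))
# [('hadoop developer', 'ajay', 'india', 'ITC', '10.90', '2013')]
```

Unlike the regexp, this does not depend on where the newlines and spaces fall between the tags.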
>
> Will Dowling
>
> ________________________________________
> From: Muni mahesh [[email protected]]
> Sent: Wednesday, August 21, 2013 6:58 AM
> To: [email protected]; [email protected]
> Subject: can't parse the values using XML loader
>
> *Input file :*
>
> <CATALOG>
> <CD>
> <TITLE>hadoop developer</TITLE>
> <ARTIST>ajay</ARTIST>
> <COUNTRY>india</COUNTRY>
> <COMPANY>ITC</COMPANY>
> <PRICE>10.90</PRICE>
> <YEAR>2013</YEAR>
> </CD>
> </CATALOG>
>
> ============================================================================================================================================
> *Pig Script:*
>
> register /usr/lib/pig/piggybank.jar;
>
> A = load '/home/sudeep/Desktop/CATALOG.xml' using
> org.apache.pig.piggybank.storage.XMLLoader('CATALOG') as (x:
> chararray);
>
>
> B = foreach A GENERATE
>
> FLATTEN(REGEX_EXTRACT_ALL(x,'<CATALOG>\\n*<CD>\\n<TITLE>(.*)</TITLE>\\n\\s*<ARTIST>(.*)</ARTIST>\\n\\s*<COUNTRY>(.*)</COUNTRY>\\n\\s*<COMPANY>(.*)<COMPANY>\\n\\s*<PRICE>(.*)</PRICE>\\n\\s*<YEAR>(.*)</YEAR>\\n\\s*</CD>\\n*</CATALOG>'))
> as (id: int, name:chararray);
>
>
> *Output Expected :*
>
> (hadoop, ajay, india, ITC, 10.90, 2013)
>
> *Issue:*
>
> But the output I am getting is:
>
> ()
>
> I think it is not able to parse the values between the tags.
>
