Hi, Is there any regex UDF available for this sort of problem? Thanks in advance.
On Wed, Aug 21, 2013 at 10:36 PM, Amit <[email protected]> wrote:

> Hello,
> Moreover, REGEX_EXTRACT_ALL uses Matcher.matches(), which tries to match the
> entire input string against the pattern, not just parts of it. You may want to
> write your own regex UDF (if you are not going the route suggested by Will)
> which uses Matcher.find() instead of Matcher.matches().
>
> Regards,
> Amit
>
> ________________________________
> From: "[email protected]" <[email protected]>
> To: [email protected]; [email protected]
> Sent: Wednesday, August 21, 2013 12:19 PM
> Subject: RE: can't parse the values using XML loader
>
> Part of the problem might be that the regexp has
>
> <COMPANY>(.*)<COMPANY>
>
> but you need
>
> <COMPANY>(.*)</COMPANY>
>
> Using regexps to parse XML is awfully brittle. An alternative is to use a
> UDF that calls out to an XML parser. I use ElementTree from Python UDFs.
>
> Will Dowling
>
> ________________________________________
> From: Muni mahesh [[email protected]]
> Sent: Wednesday, August 21, 2013 6:58 AM
> To: [email protected]; [email protected]
> Subject: can't parse the values using XML loader
>
> *Input file:*
>
> <CATALOG>
> <CD>
> <TITLE>hadoop developer</TITLE>
> <ARTIST>ajay</ARTIST>
> <COUNTRY>india</COUNTRY>
> <COMPANY>ITC</COMPANY>
> <PRICE>10.90</PRICE>
> <YEAR>2013</YEAR>
> </CD>
> </CATALOG>
>
> ============================================================================================================================================
> *Pig Script:*
>
> register /usr/lib/pig/piggybank.jar;
>
> A = load '/home/sudeep/Desktop/CATALOG.xml' using
> org.apache.pig.piggybank.storage.XMLLoader('CATALOG') as (x: chararray);
>
> B = foreach A GENERATE
> FLATTEN(REGEX_EXTRACT_ALL(x,'<CATALOG>\\n*<CD>\\n<TITLE>(.*)</TITLE>\\n\\s*<ARTIST>(.*)</ARTIST>\\n\\s*<COUNTRY>(.*)</COUNTRY>\\n\\s*<COMPANY>(.*)<COMPANY>\\n\\s*<PRICE>(.*)</PRICE>\\n\\s*<YEAR>(.*)</YEAR>\\n\\s*</CD>\\n*</CATALOG>'))
> as (id: int, name:chararray);
>
> *Output expected:*
>
> (hadoop, ajay, india, ITC, 10.90, 2013)
>
> *Issue:*
>
> But the output I am getting is:
>
> ()
>
> I suspect it is not able to parse the values between the tags.
>
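For reference, here is a rough Python sketch of the two suggestions in this thread: parsing the record with ElementTree, as Will describes, and searching with a regex that may match anywhere in the text (re.search, roughly analogous to Java's Matcher.find()). The names FIELDS, SAMPLE, parse_cds and extract_field are made up for illustration. Wiring this into Pig as a UDF (registration, output schema) is not shown here and would need to follow the Pig documentation.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical field order, taken from the sample input in this thread.
FIELDS = ("TITLE", "ARTIST", "COUNTRY", "COMPANY", "PRICE", "YEAR")

# The sample record posted in the original message.
SAMPLE = """<CATALOG>
<CD>
<TITLE>hadoop developer</TITLE>
<ARTIST>ajay</ARTIST>
<COUNTRY>india</COUNTRY>
<COMPANY>ITC</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>2013</YEAR>
</CD>
</CATALOG>"""


def parse_cds(xml_text):
    """Return one tuple of field values per <CD> element, using an XML parser."""
    root = ET.fromstring(xml_text)
    rows = []
    for cd in root.iter("CD"):
        # findtext() returns None when a child tag is missing.
        rows.append(tuple(cd.findtext(tag) for tag in FIELDS))
    return rows


def extract_field(xml_text, tag):
    """Regex alternative: search() looks for a match anywhere in the text,
    instead of requiring the pattern to cover the whole string."""
    m = re.search(r"<%s>(.*?)</%s>" % (tag, tag), xml_text)
    return m.group(1) if m else None


if __name__ == "__main__":
    print(parse_cds(SAMPLE))
    print(extract_field(SAMPLE, "COMPANY"))
```

The parser-based version does not depend on newlines or whitespace between tags, which is the brittleness Will points out. The regex version only works for simple, non-nested tags.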
