siddhartha Pattanaik created PIG-3119:
-----------------------------------------

             Summary: REGEX_EXTRACT_ALL custom with aggregation function
                 Key: PIG-3119
                 URL: https://issues.apache.org/jira/browse/PIG-3119
             Project: Pig
          Issue Type: Bug
          Components: build, grunt
    Affects Versions: 0.9.1
         Environment: OS version
================================
Linux version 2.6.18-194.3.1.el5 (mockbu...@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48))

software installed
=======================
hadoop-1.0.4
pig-0.9.1

Hardware details
====================================
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
stepping : 4
cpu MHz : 2800.098
cache size : 8192 KB
fpu : yes
fpu_exception : yes
cpuid level : 11
            Reporter: siddhartha Pattanaik
            Priority: Critical
             Fix For: 0.9.1


Hi,

I have a use case in my project. The input file consists of lines in the following pattern:

192.168.90.36 - - [16/May/2012:16:00:11 -0700] "GET /img/explore/encyclopedia/characters/yoda_card.jpg HTTP/1.1" 200 22620 "http://www.starwars.com/explore/encyclopedia/characters/2/featured/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)" "Wookie-Cookie=474ca6b302a46696a1ec55f4b656f8c3; __utma=181359608.119611689.1337206567.1337206567.1337206567.1; __utmb=181359608.79.9.1337209104786; __utmc=181359608; __utmz=181359608.1337206567.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); JSESSIONID=aHX_NQheRq08" "-" 0

I want to run an aggregate function together with REGEX_EXTRACT_ALL to extract the desired data. Although the input file is parsed, the aggregate function does not work on the parsed result.

Please find the Pig script below:

***************Ip_adress-count************************
Ip_adress_count.pig

    A = LOAD 'starwar_log1' USING TextLoader AS (line:chararray);
    B = FOREACH A GENERATE FLATTEN (REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" "([^"]*)" (\\S+) ') )
        AS (
            remoteAddr: chararray,
            remoteLogname: chararray,
            user: chararray,
            time: chararray,
            request: chararray,
            status: int,
            bytes_string: chararray,
            referrer: chararray,
            Mozilla: chararray,
            wookie_cookie: chararray,
            browser3: chararray,
            acess_status:int
        );
    C = group B by remoteAddr;
    D = foreach C generate COUNT(B) as ip_adress_count;
    E = order D by ip_adress_count;
    F = STORE E INTO 'ip_adress_count/' using PigStorage(',');

Expected output
===========================
ip_adress_count

remoteAddr,ip_adress_count
192.168.90.36,19
192.168.90.37,1

There is no parsing issue, but the aggregate function COUNT() does not work on the output of REGEX_EXTRACT_ALL. Please do the needful. The requirement is that I need the count of each IP address in the input data.

Thanks,
siddharth
contact - 8763666372

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
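
One way to narrow the problem down is to check the pattern passed to REGEX_EXTRACT_ALL against the sample line outside of Pig. The sketch below does this in Python, whose `re` module is used only as a stand-in for the java.util.regex engine that Pig relies on. It assumes that REGEX_EXTRACT_ALL needs the pattern to cover the whole line, which is worth confirming against the Pig 0.9.1 source. The names REPORTED, EXTENDED and SAMPLE are illustrative, and EXTENDED is a hypothetical variant that is not part of the report.

```python
import re

# Pattern as it appears in the report, with Pig's doubled backslashes
# reduced to single ones. Note the trailing space after the last group.
REPORTED = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\S+) (\S+) '
    r'"([^"]*)" "([^"]*)" "([^"]*)" (\S+) '
)

# Hypothetical variant (an assumption, not from the report): one more
# group after the trailing space, so the last field on the line is captured.
EXTENDED = re.compile(REPORTED.pattern + r'(\S+)')

# Sample log line from the report.
SAMPLE = (
    '192.168.90.36 - - [16/May/2012:16:00:11 -0700] '
    '"GET /img/explore/encyclopedia/characters/yoda_card.jpg HTTP/1.1" '
    '200 22620 '
    '"http://www.starwars.com/explore/encyclopedia/characters/2/featured/" '
    '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)" '
    '"Wookie-Cookie=474ca6b302a46696a1ec55f4b656f8c3; '
    '__utma=181359608.119611689.1337206567.1337206567.1337206567.1; '
    '__utmb=181359608.79.9.1337209104786; __utmc=181359608; '
    '__utmz=181359608.1337206567.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); '
    'JSESSIONID=aHX_NQheRq08" "-" 0'
)

# The reported pattern has 11 capture groups; the AS clause names 12 fields.
print("groups in reported pattern:", REPORTED.groups)

# Whole-line match with the reported pattern: the final "0" is left over.
print("reported, whole line:", REPORTED.fullmatch(SAMPLE))

# Whole-line match with the extended pattern: all 12 fields are captured.
fields = EXTENDED.fullmatch(SAMPLE)
print("extended, first and last field:", fields.group(1), fields.group(12))
```

If the whole-line assumption holds, every row produced by the reported pattern would be empty for lines shaped like the sample, which would leave COUNT with nothing to count per remoteAddr even though the script itself runs.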