[ https://issues.apache.org/jira/browse/PIG-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Ciemiewicz updated PIG-826:
---------------------------------

    Summary: DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig  (was: DISTINCT as "Function" rather than statement - High Level Pig)

> DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig
> -------------------------------------------------------------------------------
>
>                 Key: PIG-826
>                 URL: https://issues.apache.org/jira/browse/PIG-826
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: David Ciemiewicz
>
> In SQL, a user would think nothing of doing something like:
> {code}
> select
>     COUNT(DISTINCT(user)) as user_count,
>     COUNT(DISTINCT(country)) as country_count,
>     COUNT(DISTINCT(url)) as url_count
> from
>     server_logs;
> {code}
> But in Pig, we'd need to do something like the following. This is about the
> most compact version I could come up with.
> {code}
> Logs = load 'log' using PigStorage()
>     as ( user: chararray, country: chararray, url: chararray);
> DistinctUsers = distinct (foreach Logs generate user);
> DistinctCountries = distinct (foreach Logs generate country);
> DistinctUrls = distinct (foreach Logs generate url);
> DistinctUsersCount = foreach (group DistinctUsers all) generate
>     group, COUNT(DistinctUsers) as user_count;
> DistinctCountriesCount = foreach (group DistinctCountries all) generate
>     group, COUNT(DistinctCountries) as country_count;
> DistinctUrlCount = foreach (group DistinctUrls all) generate
>     group, COUNT(DistinctUrls) as url_count;
> AllDistinctCounts = cross
>     DistinctUsersCount, DistinctCountriesCount, DistinctUrlCount;
> Report = foreach AllDistinctCounts generate
>     DistinctUsersCount::user_count,
>     DistinctCountriesCount::country_count,
>     DistinctUrlCount::url_count;
> store Report into 'log_report' using PigStorage();
> {code}
> It would be good if there were a higher-level version of Pig that permitted
> code to be written as:
> {code}
> Logs = load 'log' using PigStorage()
>     as ( user: chararray, country: chararray, url: chararray);
> Report = overall Logs generate
>     COUNT(DISTINCT(user)) as user_count,
>     COUNT(DISTINCT(country)) as country_count,
>     COUNT(DISTINCT(url)) as url_count;
> store Report into 'log_report' using PigStorage();
> {code}
> I do want this in Pig and not as SQL. I'd expect High Level Pig to generate
> Lower Level Pig.
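
For comparison, a somewhat shorter form may already be possible using a nested DISTINCT inside a FOREACH block. This is only a sketch and not part of the original report: it assumes the Pig version in use supports nested blocks with DISTINCT, and the alias names (Grouped, u, c, r, users, countries, urls) are illustrative. It still falls short of the single-expression COUNT(DISTINCT(...)) form requested above.
{code}
Logs = load 'log' using PigStorage()
    as ( user: chararray, country: chararray, url: chararray);
-- Put every record into one group so the counts cover the whole input.
Grouped = group Logs all;
Report = foreach Grouped {
    -- Project each column from the grouped bag, then de-duplicate it.
    u = Logs.user;
    users = distinct u;
    c = Logs.country;
    countries = distinct c;
    r = Logs.url;
    urls = distinct r;
    generate
        COUNT(users) as user_count,
        COUNT(countries) as country_count,
        COUNT(urls) as url_count;
};
store Report into 'log_report' using PigStorage();
{code}
Because the three DISTINCT operations run over the bag of a single "all" group, this avoids the CROSS of three one-row relations, though all records for that group are likely to be handled together, which may matter for large inputs.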