This is currently a little tricky - to my knowledge there's no really great way to do this.
If you know the values you're looking for, it's fairly straightforward with a UDF. Write a UDF called OCCURRENCES which takes the column and the value you're looking for and returns the number of occurrences. Then write the query as SELECT OCCURRENCES(col1, 'a'), OCCURRENCES(col1, 'b')... If you really need to deal with unknown column types, there's current work on an explode function which will help (see http://issues.apache.org/jira/browse/HIVE-655, http://issues.apache.org/jira/browse/HIVE-510). Once that's done, you'll be able to do something vaguely like: SELECT exploded.val, COUNT(1) FROM (SELECT EXPLODE(SPLIT_TO_ARRAY(col1)) AS val) exploded GROUP BY exploded.val where SPLIT_TO_ARRAY is a UDF that takes your column and parses it into an array. The explode outputs a row for every value in the array, so: ['a','b','c'] ['j','k','l'] becomes: a b c j k l Then the outer query would group and count to give you: a 5 b 100 c 34 j 1 k 22 l 55 Until that work is complete, you could do the same thing with a custom transform script that takes your column, parses it into values and outputs a row per value. See http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform and http://dev.bizo.com/2009/07/custom-map-scripts-and-hive.html. Best, Dave On 10/19/09 4:45 PM, "Bobby Rullo" <[email protected]> wrote: > You could do something like use regexp_extract to extract just commas, > and then get the length of that. But that would be a little ugly. > > But there is also a built in Array type, and there is a "size" udf > that operates on it...so maybe that could work for you. > > http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF#Collection_Functions > > On Oct 19, 2009, at 1:22 PM, Ryan LeCompte wrote: > >> Hello all, >> >> Another Hive query question... :) >> >> If I have a column in a table of type STRING, and it can take on a >> comma-delimited set of values (arbitrary, and unknown at query >> time)... For example: col1 = 'a','b','c' in one row and another row >> has col1='j','k','l'... Is it possible to write a query to get the >> counts of each comma-delimited value of col1? For example, something >> like this: >> >> >> a b c j k l >> 5 100 34 1 22 54 >> >> I guess I would have to know the full set of possible values of >> col1, no? Or maybe a user-defined function? >> >> Thanks, >> Ryan >> >
