Fantastic -- the transform is exactly what I need for this case! Thanks, and I'll also keep an eye on HIVE-655 and HIVE-510!
Ryan On Mon, Oct 19, 2009 at 5:23 PM, David Lerman <[email protected]> wrote: > This is currently a little tricky - to my knowledge there's no really great > way to do this. > > If you know the values you're looking for, it's fairly straightforward with > a UDF. Write a UDF called OCCURRENCES which takes the column and the value > you're looking for and returns the number of occurrences. Then write the > query as > > SELECT OCCURRENCES(col1, 'a'), OCCURRENCES(col1, 'b')... > > If you really need to deal with unknown column types, there's current work > on an explode function which will help (see > http://issues.apache.org/jira/browse/HIVE-655, > http://issues.apache.org/jira/browse/HIVE-510). Once that's done, you'll > be > able to do something vaguely like: > > SELECT exploded.val, COUNT(1) FROM > (SELECT EXPLODE(SPLIT_TO_ARRAY(col1)) AS val) exploded > GROUP BY exploded.val > > where SPLIT_TO_ARRAY is a UDF that takes your column and parses it into an > array. The explode outputs a row for every value in the array, so: > > ['a','b','c'] > ['j','k','l'] > > becomes: > > a > b > c > j > k > l > > Then the outer query would group and count to give you: > > a 5 > b 100 > c 34 > j 1 > k 22 > l 55 > > Until that work is complete, you could do the same thing with a custom > transform script that takes your column, parses it into values and outputs > a > row per value. See > http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform and > http://dev.bizo.com/2009/07/custom-map-scripts-and-hive.html. > > Best, > Dave > > On 10/19/09 4:45 PM, "Bobby Rullo" <[email protected]> wrote: > > > You could do something like use regexp_extract to extract just commas, > > and then get the length of that. But that would be a little ugly. > > > > But there is also a built in Array type, and there is a "size" udf > > that operates on it...so maybe that could work for you. > > > > > http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF#Collection_Functions > > > > On Oct 19, 2009, at 1:22 PM, Ryan LeCompte wrote: > > > >> Hello all, > >> > >> Another Hive query question... :) > >> > >> If I have a column in a table of type STRING, and it can take on a > >> comma-delimited set of values (arbitrary, and unknown at query > >> time)... For example: col1 = 'a','b','c' in one row and another row > >> has col1='j','k','l'... Is it possible to write a query to get the > >> counts of each comma-delimited value of col1? For example, something > >> like this: > >> > >> > >> a b c j k l > >> 5 100 34 1 22 54 > >> > >> I guess I would have to know the full set of possible values of > >> col1, no? Or maybe a user-defined function? > >> > >> Thanks, > >> Ryan > >> > > > >
