Ah, I misunderstood the question.
I thought he was just looking for a count of the items in each row.
On Oct 19, 2009, at 2:23 PM, David Lerman wrote:
This is currently a little tricky - to my knowledge there's no
really great
way to do this.
If you know the values you're looking for, it's fairly
straightforward with
a UDF. Write a UDF called OCCURRENCES which takes the column and
the value
you're looking for and returns the number of occurrences. Then
write the
query as
SELECT OCCURRENCES(col1, 'a'), OCCURRENCES(col1, 'b')...
If you really need to deal with unknown column types, there's
current work
on an explode function which will help (see
http://issues.apache.org/jira/browse/HIVE-655,
http://issues.apache.org/jira/browse/HIVE-510). Once that's done,
you'll be
able to do something vaguely like:
SELECT exploded.val, COUNT(1) FROM
(SELECT EXPLODE(SPLIT_TO_ARRAY(col1)) AS val) exploded
GROUP BY exploded.val
where SPLIT_TO_ARRAY is a UDF that takes your column and parses it
into an
array. The explode outputs a row for every value in the array, so:
['a','b','c']
['j','k','l']
becomes:
a
b
c
j
k
l
Then the outer query would group and count to give you:
a 5
b 100
c 34
j 1
k 22
l 55
Until that work is complete, you could do the same thing with a custom
transform script that takes your column, parses it into values and
outputs a
row per value. See
http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform and
http://dev.bizo.com/2009/07/custom-map-scripts-and-hive.html.
Best,
Dave
On 10/19/09 4:45 PM, "Bobby Rullo" <[email protected]> wrote:
You could do something like use regexp_extract to extract just
commas,
and then get the length of that. But that would be a little ugly.
But there is also a built in Array type, and there is a "size" udf
that operates on it...so maybe that could work for you.
http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF#Collection_Functions
On Oct 19, 2009, at 1:22 PM, Ryan LeCompte wrote:
Hello all,
Another Hive query question... :)
If I have a column in a table of type STRING, and it can take on a
comma-delimited set of values (arbitrary, and unknown at query
time)... For example: col1 = 'a','b','c' in one row and another row
has col1='j','k','l'... Is it possible to write a query to get the
counts of each comma-delimited value of col1? For example, something
like this:
a b c j k l
5 100 34 1 22 54
I guess I would have to know the full set of possible values of
col1, no? Or maybe a user-defined function?
Thanks,
Ryan