Re: Query with comma-delimited value of a column

Bobby Rullo Mon, 19 Oct 2009 15:22:19 -0700

Ah, I misunderstood the question.

I thought he was just looking for a count of the items in each row.
On Oct 19, 2009, at 2:23 PM, David Lerman wrote:

This is currently a little tricky - to my knowledge there's noreally great

way to do this.

If you know the values you're looking for, it's fairlystraightforward witha UDF. Write a UDF called OCCURRENCES which takes the column andthe valueyou're looking for and returns the number of occurrences. Thenwrite the

query as

SELECT OCCURRENCES(col1, 'a'), OCCURRENCES(col1, 'b')...

If you really need to deal with unknown column types, there'scurrent work

on an explode function which will help (see
http://issues.apache.org/jira/browse/HIVE-655,

http://issues.apache.org/jira/browse/HIVE-510). Once that's done,you'll be

able to do something vaguely like:

SELECT exploded.val, COUNT(1) FROM
(SELECT EXPLODE(SPLIT_TO_ARRAY(col1)) AS val) exploded
GROUP BY exploded.val

where SPLIT_TO_ARRAY is a UDF that takes your column and parses itinto an

array.  The explode outputs a row for every value in the array, so:

['a','b','c']
['j','k','l']

becomes:

a
b
c
j
k
l

Then the outer query would group and count to give you:

a 5
b 100
c 34
j 1
k 22
l 55

Until that work is complete, you could do the same thing with a custom

transform script that takes your column, parses it into values andoutputs a

row per value.  See
http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform and
http://dev.bizo.com/2009/07/custom-map-scripts-and-hive.html.

Best,
Dave

On 10/19/09 4:45 PM, "Bobby Rullo" <[email protected]> wrote:

You could do something like use regexp_extract to extract justcommas,

and then get the length of that. But that would be a little ugly.

But there is also a built in Array type, and there is a "size" udf
that operates on it...so maybe that could work for you.

http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF#Collection_Functions

On Oct 19, 2009, at 1:22 PM, Ryan LeCompte wrote:

Hello all,

Another Hive query question... :)

If I have a column in a table of type STRING, and it can take on a
comma-delimited set of values (arbitrary, and unknown at query
time)... For example: col1 = 'a','b','c' in one row and another row
has col1='j','k','l'... Is it possible to write a query to get the
counts of each comma-delimited value of col1? For example, something
like this:


a  b   c    j   k   l
5 100 34  1  22 54

I guess I would have to know the full set of possible values of
col1, no? Or maybe a user-defined function?

Thanks,
Ryan

Re: Query with comma-delimited value of a column

Reply via email to