Re: Query with comma-delimited value of a column

David Lerman Mon, 19 Oct 2009 14:24:02 -0700

This is currently a little tricky - to my knowledge there's no really great
way to do this.

If you know the values you're looking for, it's fairly straightforward with
a UDF.  Write a UDF called OCCURRENCES which takes the column and the value
you're looking for and returns the number of occurrences.  Then write the
query as

SELECT OCCURRENCES(col1, 'a'), OCCURRENCES(col1, 'b')...

If you really need to deal with unknown column types, there's current work
on an explode function which will help (see
http://issues.apache.org/jira/browse/HIVE-655,
http://issues.apache.org/jira/browse/HIVE-510).  Once that's done, you'll be
able to do something vaguely like:

SELECT exploded.val, COUNT(1) FROM
(SELECT EXPLODE(SPLIT_TO_ARRAY(col1)) AS val) exploded
GROUP BY exploded.val

where SPLIT_TO_ARRAY is a UDF that takes your column and parses it into an
array.  The explode outputs a row for every value in the array, so:

['a','b','c']
['j','k','l']

becomes:

a
b
c
j
k
l

Then the outer query would group and count to give you:

a 5
b 100
c 34
j 1
k 22
l 55

Until that work is complete, you could do the same thing with a custom
transform script that takes your column, parses it into values and outputs a
row per value.  See
http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform and
http://dev.bizo.com/2009/07/custom-map-scripts-and-hive.html.

Best,
Dave

On 10/19/09 4:45 PM, "Bobby Rullo" <[email protected]> wrote:

> You could do something like use regexp_extract to extract just commas,
> and then get the length of that. But that would be a little ugly.
> 
> But there is also a built in Array type, and there is a "size" udf
> that operates on it...so maybe that could work for you.
> 
> http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF#Collection_Functions
> 
> On Oct 19, 2009, at 1:22 PM, Ryan LeCompte wrote:
> 
>> Hello all,
>> 
>> Another Hive query question... :)
>> 
>> If I have a column in a table of type STRING, and it can take on a
>> comma-delimited set of values (arbitrary, and unknown at query
>> time)... For example: col1 = 'a','b','c' in one row and another row
>> has col1='j','k','l'... Is it possible to write a query to get the
>> counts of each comma-delimited value of col1? For example, something
>> like this:
>> 
>> 
>> a  b   c    j   k   l
>> 5 100 34  1  22 54
>> 
>> I guess I would have to know the full set of possible values of
>> col1, no? Or maybe a user-defined function?
>> 
>> Thanks,
>> Ryan
>> 
>

Re: Query with comma-delimited value of a column

Reply via email to