Fantastic -- the transform is exactly what I need for this case!

Thanks, and I'll also keep an eye on HIVE-655 and HIVE-510!

Ryan


On Mon, Oct 19, 2009 at 5:23 PM, David Lerman <[email protected]> wrote:

> This is currently a little tricky - to my knowledge there's no really great
> way to do this.
>
> If you know the values you're looking for, it's fairly straightforward with
> a UDF.  Write a UDF called OCCURRENCES which takes the column and the value
> you're looking for and returns the number of occurrences.  Then write the
> query as
>
> SELECT OCCURRENCES(col1, 'a'), OCCURRENCES(col1, 'b')...
>
> If you really need to deal with unknown column types, there's current work
> on an explode function which will help (see
> http://issues.apache.org/jira/browse/HIVE-655,
> http://issues.apache.org/jira/browse/HIVE-510).  Once that's done, you'll
> be
> able to do something vaguely like:
>
> SELECT exploded.val, COUNT(1) FROM
> (SELECT EXPLODE(SPLIT_TO_ARRAY(col1)) AS val) exploded
> GROUP BY exploded.val
>
> where SPLIT_TO_ARRAY is a UDF that takes your column and parses it into an
> array.  The explode outputs a row for every value in the array, so:
>
> ['a','b','c']
> ['j','k','l']
>
> becomes:
>
> a
> b
> c
> j
> k
> l
>
> Then the outer query would group and count to give you:
>
> a 5
> b 100
> c 34
> j 1
> k 22
> l 55
>
> Until that work is complete, you could do the same thing with a custom
> transform script that takes your column, parses it into values and outputs
> a
> row per value.  See
> http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform and
> http://dev.bizo.com/2009/07/custom-map-scripts-and-hive.html.
>
> Best,
> Dave
>
> On 10/19/09 4:45 PM, "Bobby Rullo" <[email protected]> wrote:
>
> > You could do something like use regexp_extract to extract just commas,
> > and then get the length of that. But that would be a little ugly.
> >
> > But there is also a built in Array type, and there is a "size" udf
> > that operates on it...so maybe that could work for you.
> >
> >
> http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF#Collection_Functions
> >
> > On Oct 19, 2009, at 1:22 PM, Ryan LeCompte wrote:
> >
> >> Hello all,
> >>
> >> Another Hive query question... :)
> >>
> >> If I have a column in a table of type STRING, and it can take on a
> >> comma-delimited set of values (arbitrary, and unknown at query
> >> time)... For example: col1 = 'a','b','c' in one row and another row
> >> has col1='j','k','l'... Is it possible to write a query to get the
> >> counts of each comma-delimited value of col1? For example, something
> >> like this:
> >>
> >>
> >> a  b   c    j   k   l
> >> 5 100 34  1  22 54
> >>
> >> I guess I would have to know the full set of possible values of
> >> col1, no? Or maybe a user-defined function?
> >>
> >> Thanks,
> >> Ryan
> >>
> >
>
>

Reply via email to