Hi Lars, I had a similar need and came across https://issues.apache.org/jira/browse/HIVE-1304 but haven't got round to trying it yet.
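
In case the JIRA doesn't work out, here is a rough sketch of your custom reducer idea that would not need a single reducer. Each reducer hands out ids from its own interleaved range, so they never collide. That is fine here since you said the ids don't need any order or pattern. Everything in it is my assumption rather than anything Hive provides: the function name `assign_ids`, the tab-separated `key_name<TAB>count` input, and the idea that each script instance is somehow told its own `reducer_index` and the total `num_reducers`. The sample rows are made up.

```python
def assign_ids(lines, reducer_index, num_reducers):
    """Yield (id, key_name, count) tuples with ids unique across reducers.

    Assumption: every input line is "key_name<TAB>count", and each reducer
    is given a distinct reducer_index in the range [0, num_reducers).
    """
    for n, line in enumerate(lines):
        key_name, count = line.rstrip("\n").split("\t")
        # Reducer r emits r, r + N, r + 2N, ... so the id ranges of
        # different reducers interleave and never overlap.
        yield n * num_reducers + reducer_index, key_name, int(count)


# Made-up sample rows, as reducer 1 out of 4 might see them.
rows = ["highway\t120", "name\t95"]
for row in assign_ids(rows, reducer_index=1, num_reducers=4):
    print("\t".join(map(str, row)))
```

The ids come out with gaps and in no particular order, and how you get the reducer index into the script depends on your setup.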
Cheers,
Tim

On Tue, Aug 10, 2010 at 12:41 AM, Lars Francke <[email protected]> wrote:
> Hi,
>
> I have a problem and I hope someone has an idea on how to solve it.
>
> My dataset consists of just very simple key-value pairs of strings
> coming from PostgreSQL using Sqoop.
>
> 1) I need to count how often a key occurs -> Easy
> 2) I need to count how often a key-value pair occurs -> Easy
>
> I need to output this data to PostgreSQL again, into two tables:
>
> a) "keys" with the columns: id, key_name, count
> b) "values" with the columns: id, key_id, value_name, count
>
> Now the ids I'm referring to don't exist yet and I'm looking into
> solutions to generate them. They have to be integers/longs but they
> don't have to be in any order/pattern. I'm not concerned about
> performance either as this query will be run monthly at most.
>
> Do you have any idea how I could introduce this new column into the
> output of query 1)? I could easily introduce it into 2) with a join
> then. I thought about using a custom reducer script but apart from the
> fact that I've never done it so far it would require that there is
> only one reducer so that I can simulate an auto-incrementer. My
> current best idea is to write a regular MR job that processes the Hive
> output but I'd love to do everything in Hive if possible.
>
> I might very well approach this problem completely wrong so don't
> hesitate to propose a better solution or bash me for my poor
> understanding of Hive :)
>
> Thanks for any input and help.
>
> Cheers,
> Lars
