Dear Hive Experts,


I am relatively new to Hadoop and Hive. I have written some Hive scripts for
gathering a subset of data. I now have a new requirement and would like your
advice on how to proceed, along with references to any sample scripts that do
similar work.



-          The Foo_hive table consists of 150 million records with a key and
some other columns.

-          There are many sequence files, organized in the following
directory structure:

o   /foo/year/month/day/attempt_number/part_1…n

The data in these files is organized as key, column1, column2, … (a rough
sketch of how I am thinking of mapping these files into Hive follows this
list).

-          Besides the key column, there are other columns that are the same
between the sequence files and the Hive table.

-          One of the columns in the sequence files contains a large amount
of encoded data (10-30K).

-          Multiple sequence files might contain rows with the same key
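
To make this layout concrete, here is a rough, untested sketch of how I was
thinking of exposing the sequence files to Hive as an external partitioned
table. The table name foo_seq, the column names, and the partition columns
(yr, mo, dy, attempt) are placeholders of my own, and I am not sure how the
key inside the files is best mapped to a Hive column, so please treat this
only as an illustration:

-- Placeholder schema; column names and types are my guesses, not the real ones.
CREATE EXTERNAL TABLE foo_seq (
  key        STRING,
  column1    STRING,
  column2    STRING,
  big_column STRING   -- the column holding the 10-30K of encoded data
)
PARTITIONED BY (yr STRING, mo STRING, dy STRING, attempt STRING)
STORED AS SEQUENCEFILE
LOCATION '/foo';

-- Since the directories are not in the key=value form Hive expects, each
-- attempt directory would presumably have to be registered explicitly, e.g.:
ALTER TABLE foo_seq ADD PARTITION (yr='2010', mo='07', dy='25', attempt='attempt_1')
LOCATION '/foo/2010/07/25/attempt_1';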





Requirements:





-          Select a subset of data from the Hive table based on the values
of its columns, and join the selected rows with the columns from the
sequence files, satisfying the following (a rough sketch of what I am
imagining follows this list):

o   Only the latest row for each key should be selected from the sequence
files. Example: once key1 has been matched in 07/25/attempt_1, the program
should stop looking for key1 in earlier files.

-          Ability to specify a directory prefix as input for the sequence
files. Example: /foo/2010/07 should limit the join to the sequence files for
the month of July 2010 only.

-          Ability to partition/cluster the output data by certain criteria.
Example: all matching output rows with certain values of the Hive table
columns should go to a particular partition. The reason is that the total
size of the matching output could be a few TB, and I would like to split the
output data evenly so that subsequent processes can be parallelized.
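
To make the three points above concrete, here is a rough, untested sketch of
the kind of job I am imagining, assuming the foo_seq external table sketched
earlier. The Hive table column names (col1, bucket_col), the filter value,
and the output table are all placeholders of my own, not the real schema:

-- Allow the output to be split across partitions derived from the data.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Placeholder output table, partitioned by a Hive-table column so that the
-- multi-TB result is split up for downstream parallel processing.
CREATE TABLE foo_output (
  key        STRING,
  hive_col1  STRING,
  seq_col1   STRING,
  big_column STRING
)
PARTITIONED BY (bucket_col STRING);

INSERT OVERWRITE TABLE foo_output PARTITION (bucket_col)
SELECT h.key, h.col1, s.column1, s.big_column, h.bucket_col
FROM
  -- the subset of the Hive table, selected on its own columns
  (SELECT key, col1, bucket_col
   FROM foo_hive
   WHERE col1 = 'some_value') h
JOIN
  -- sequence-file rows, restricted to July 2010 by partition pruning
  (SELECT key, column1, big_column,
          CONCAT(yr, mo, dy, attempt) AS version
   FROM foo_seq
   WHERE yr = '2010' AND mo = '07') s
ON (s.key = h.key)
JOIN
  -- latest version per key within the same range; the attempt part of the
  -- version string may need padding so that it sorts correctly
  (SELECT key, MAX(CONCAT(yr, mo, dy, attempt)) AS latest_version
   FROM foo_seq
   WHERE yr = '2010' AND mo = '07'
   GROUP BY key) latest
ON (s.key = latest.key AND s.version = latest.latest_version);

One thing I am not happy about in this sketch is that the latest row per key
is found with a GROUP BY over the whole date range rather than by stopping
the scan as soon as a key is matched, so pointers to a better way of doing
that would be very welcome.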



I am more than happy to give more details about my requirement. Thanks for
your help.



-Raman
