Dear Hive Experts,
I am relatively new to Hadoop and Hive. I have written some Hive scripts for gathering a subset of data. I now have a new requirement and I need your advice on how to proceed, along with references to any sample scripts doing similar work.

Current setup:
- The Foo_hive table consists of 150 million records with a key and some other columns.
- There are many sequence files organized in the following directory structure:
  o /foo/year/month/day/attempt_number/part_1…n
  The data in these files is organized as key, column1, column2, …
- Other than the key column, several columns are common to the sequence files and the Hive table.
- One of the columns in the sequence files contains a large amount of encoded data (10-30K).
- Multiple sequence files might contain rows with the same key.

Requirement:
- Select a subset of data from the Hive table based on the values of its columns, and join the selected rows with the columns from the sequence files, satisfying:
  o Only the latest row from the sequence files needs to be selected… Example: if no file before 07/25/attempt_1 has a matching key1, then the program should stop looking for key1.
- Ability to specify a directory structure as input for the sequence files. Example: /foo/2010/07 should limit the join to the sequence files for the month of July 2010 only.
- Ability to partition/cluster the output data by certain criteria. Example: all matching output with certain values of the Hive table columns should go to a particular partition. The reason is that the total size of the matching output could be a few TB, and I would like to split the output data evenly so that the subsequent processes can be parallelized.

I am more than happy to give more details about my requirement. Thanks for your help.

-Raman
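
P.S. To make the requirement a bit more concrete, here is a rough HiveQL sketch of the shape of what I have been imagining. Everything other than Foo_hive and the /foo paths is made up (seq_data, joined_out, col1, big_col, hive_col, some_col, bucket), and I am assuming the sequence-file values can be read with the default SerDe, which may well not be the case. I also do not see how to express the "only the latest matching row per key" rule here, which is part of why I am asking.

-- External table laid over the existing sequence files
-- (table/column names are made up; the real files may need a
-- custom SerDe or InputFormat instead of the defaults).
CREATE EXTERNAL TABLE seq_data (
  key      STRING,
  col1     STRING,
  big_col  STRING    -- the large encoded column
)
PARTITIONED BY (year STRING, month STRING, day STRING, attempt STRING)
STORED AS SEQUENCEFILE;

-- Point partitions at the directories to include in the join,
-- e.g. one attempt directory for 07/25:
ALTER TABLE seq_data
  ADD PARTITION (year='2010', month='07', day='25', attempt='attempt_1')
  LOCATION '/foo/2010/07/25/attempt_1';

-- Output table, partitioned so the downstream processing can be
-- split evenly and run in parallel.
CREATE TABLE joined_out (
  key      STRING,
  hive_col STRING,
  big_col  STRING
)
PARTITIONED BY (bucket STRING);

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Join the filtered Hive rows with the sequence-file rows and send
-- each result to a partition derived from a Hive-table column
-- (the dynamic partition column goes last in the select list).
INSERT OVERWRITE TABLE joined_out PARTITION (bucket)
SELECT f.key, f.hive_col, s.big_col, f.some_col AS bucket
FROM Foo_hive f
JOIN seq_data s ON (f.key = s.key)
WHERE f.some_col = 'some_value';

Is something along these lines a reasonable direction, or is a custom MapReduce job a better fit for the "latest row per key" and even-sized-output requirements?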
