Hi all, I am processing a large tab file to format it suitable for loading into a database with a predefined schema. I have a tab file with a column that I need to normalize out to another table and reference it with a foreign key from the original file. I would like to hear if my proposed processing is suitable:
Original file has: ID, .... Institution_Code .... I need to normalize out the distinct Institution Codes to a separate table and have an Institution_Code_ID in the original. Job 1: Generate a file of Institution_Code:<ID> Job 2: Use output of Job1 and using the row number as the Institution_Code_ID, generate ID:rowNumber Job 3: do a reduce side join of the original input and the output of job 2 to generate ID, ..... InstitutionCodeID .... Job 1 output can then be used to populate the institutionCodes table, and Job 3 will have the appropriate FK so can populate the main table. Is this approach sane? Are there better approaches? Many thanks, Tim
