All,
I'm looking for a little bit of advice on how to format files.
The problem is that I have log files from a number of different sources.
The data elements between log files overlaps by about 80%, but there are unique
data items in each of the log files that I want to keep and be able to access
from my Map/Reduce jobs. There also isn't a single obvious key to the log file
entries. A quick example: log file 1 has three columns of data types A, B,
and C and is tab-delimited. Log file 2 has data types A, B, C, and D and is
pipe-delimited. I'd like to pre-process them into files so that in the
map/reduce job I could consistently access data element A across both types
of log files, and also access element D if it exists.
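Roughly, the pre-processing I have in mind would look something like this (a Python sketch; the file labels, delimiters, and field names are made up for illustration):

```python
# Normalize two log formats -- tab-delimited A,B,C and pipe-delimited
# A,B,C,D -- into a common per-line dictionary keyed by field name.
FORMATS = {
    "log1": {"delimiter": "\t", "fields": ["A", "B", "C"]},
    "log2": {"delimiter": "|", "fields": ["A", "B", "C", "D"]},
}

def parse_line(line, fmt):
    spec = FORMATS[fmt]
    values = line.rstrip("\n").split(spec["delimiter"])
    # Pair each value with its field name; fields absent from a shorter
    # format simply never appear in the resulting dict.
    return dict(zip(spec["fields"], values))

rec1 = parse_line("a1\tb1\tc1", "log1")
rec2 = parse_line("a2|b2|c2|d2", "log2")
# Both records expose element A the same way; D exists only in rec2.
```

The downstream job would then check for optional elements with a plain dictionary lookup rather than caring which source format the line came from.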
I suspect the best answer would be to pre-process the files into a common
file format that allows for variable data values within a log line. What I'm
wondering is, has anyone else solved this type of problem and did you find a
solution you liked?
So far I've been looking at SequenceFiles. Since there isn't a logical
key, my thought was to just use a line number as the SequenceFile key,
similar to the default text input format, although that feels a little
weird. For the value, since I want somewhat arbitrary key/value pairs per
record, my thought was to store the value as a serialized HashMap.
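Concretely, the record shape I'm picturing is something like the sketch below (Python and JSON standing in for the serialized HashMap purely to show the layout; on the Hadoop side I assume something like a Writable map type, e.g. MapWritable, would be the idiomatic value class):

```python
import json

# Sketch of the SequenceFile record layout: key = line number, value =
# the per-line field map serialized to bytes. JSON is used here only to
# make the idea concrete; it is not the actual Hadoop serialization.
def to_records(parsed_lines):
    records = []
    for line_no, fields in enumerate(parsed_lines):
        key = line_no
        value = json.dumps(fields, sort_keys=True).encode("utf-8")
        records.append((key, value))
    return records

recs = to_records([{"A": "a1", "B": "b1"}, {"A": "a2", "D": "d2"}])
# Each record is (line_number, serialized_field_map); optional fields
# like D are simply present or absent in the deserialized map.
```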
Any thoughts as to if I'm trying to re-invent the wheel here or going off
in a strange direction?
Thanks
Andy