All,
I'm looking for a little bit of advice on how to format files.
The problem is that I have log files from a number of different sources.
The data elements between log files overlaps by about 80%, but there are unique
data items in each of the log files that I want to keep and be able to access
from my Map/Reduce jobs. There also isn't a single obvious key to the log file
entries. A quick example: log file 1 has three columns of data types A, B,
and C and is tab-delimited. Log file 2 has data types A, B, C, and D and is
pipe-delimited. I'd like to pre-process them into files so that in the
map/reduce job I could consistently access data element A across both types
of log files, and also access element D if it exists.
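Roughly, the pre-processing I have in mind would look something like this (a Python sketch; the file labels, delimiters, and field names are made up for illustration):

```python
# Normalize two log formats -- tab-delimited A,B,C and pipe-delimited
# A,B,C,D -- into a common per-line dictionary keyed by field name.
FORMATS = {
    "log1": {"delimiter": "\t", "fields": ["A", "B", "C"]},
    "log2": {"delimiter": "|", "fields": ["A", "B", "C", "D"]},
}

def parse_line(line, fmt):
    spec = FORMATS[fmt]
    values = line.rstrip("\n").split(spec["delimiter"])
    # Pair each value with its field name; fields absent from a shorter
    # format simply never appear in the resulting dict.
    return dict(zip(spec["fields"], values))

rec1 = parse_line("a1\tb1\tc1", "log1")
rec2 = parse_line("a2|b2|c2|d2", "log2")
# Both records expose element A the same way; D exists only in rec2.
```

The downstream job would then check for optional elements with a plain dictionary lookup rather than caring which source format the line came from.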
I suspect the best answer would be to pre-process the files into a common
file format that allows for variable data values within a log line. What I'm
wondering is, has anyone else solved this type of problem and did you find a
solution you liked?
So far I've been looking at SequenceFiles. Since there isn't a logical
key, my thought was to just use a line number as the SequenceFile key,
similar to the default text input format, although that feels a little
weird. For the value, since I want somewhat arbitrary key/value pairs per
record, my thought was to store the value as a serialized HashMap.
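Concretely, the record shape I'm picturing is something like the sketch below (Python and JSON standing in for the serialized HashMap purely to show the layout; on the Hadoop side I assume something like a Writable map type, e.g. MapWritable, would be the idiomatic value class):

```python
import json

# Sketch of the SequenceFile record layout: key = line number, value =
# the per-line field map serialized to bytes. JSON is used here only to
# make the idea concrete; it is not the actual Hadoop serialization.
def to_records(parsed_lines):
    records = []
    for line_no, fields in enumerate(parsed_lines):
        key = line_no
        value = json.dumps(fields, sort_keys=True).encode("utf-8")
        records.append((key, value))
    return records

recs = to_records([{"A": "a1", "B": "b1"}, {"A": "a2", "D": "d2"}])
# Each record is (line_number, serialized_field_map); optional fields
# like D are simply present or absent in the deserialized map.
```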
Any thoughts as to if I'm trying to re-invent the wheel here or going off
in a strange direction?
Thanks
Andy