Hi all,
After reading the Chukwa docs, my understanding of the log data flow is:
adaptor --> agent --> collector --> sink file --> ...
The docs say, "Data in the sink may include duplicate and
omitted chunks."
They also say it is not recommended to write MapReduce jobs that directly
examine the data sink, because "jobs will likely discard most of their input".
Here are my questions:
1. Why does the data in the sink files include duplicate and omitted chunks?
Is it because of the distributed environment?
2. How is this problem solved? The Simple Archiver generates the archive
files with duplicates removed, so the Simple Archiver only solves the
duplicate data (not the omitted chunks), right?
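
For what it's worth, my naive mental model of how such deduplication could
be done in a MapReduce job is roughly the sketch below. It is only an
illustration of the general "group records by a unique chunk id and keep one
copy" idea; the tab-separated record layout and the "chunk id" field are my
own assumptions, not the actual sink file format or the real Simple Archiver
code.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DedupSketch {

  // Map: key each record by an (assumed) unique identifier, here the first
  // tab-separated field, and pass the whole record through as the value.
  public static class DedupMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      String id = line.split("\t", 2)[0]; // hypothetical "chunk id" field
      context.write(new Text(id), value);
    }
  }

  // Reduce: all copies with the same id arrive together; keep only one.
  public static class DedupReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text v : values) {
        context.write(key, v); // write the first copy
        break;                 // drop the duplicates
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "dedup sketch");
    job.setJarByClass(DedupSketch.class);
    job.setMapperClass(DedupMapper.class);
    job.setReducerClass(DedupReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

If that is roughly what the Simple Archiver does, I can see how it removes
duplicates, but I do not see how any job could recover chunks that were
omitted before reaching the sink.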
--
Best regards,
Ivy Tang