I see. In that case I'm changing the approach: I'll use backfill and the regular demux to convert the logs to records. What I want to skip is the agent/collector part, since I will be using syslog-ng to gather the data. I thought that if the data is more or less in order, I could skip the map/reduce step and insert ChukwaRecords directly into HDFS, but if that would be slower it makes no sense.
The backfill works pretty well. It renames the files once they are processed, so I can easily tell when it is safe to remove them from the local disk.

One related thing: I want to change the "cluster" where the files are put, because we will receive syslog data with several types of events that we want to store in different clusters so we can analyze, back up, and archive them separately. I have seen that you can modify Record.tagsField, and that a regexp is used to extract the destination cluster from it. This is a bit awkward, isn't it? I don't want to keep a tagsField just for that. I'm using a field "event_type", and I have modified extraction/engine/RecordUtil.java so that, if that field exists, "event_" + <event_type> is used as the cluster. Is this the proper way to go, or is there a better solution?

Another question: where could I start looking to learn how to build reports and aggregated results from the custom ChukwaRecords I'm inserting?

--
Guille -ℬḭṩḩø- <bi...@tuenti.com>
:wq
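
For illustration, the "event_type" fallback described above could look roughly like the sketch below. It is not the actual RecordUtil code: a plain Map stands in for the record's fields, and the class name ClusterNameSketch, the method clusterFor, and the fallback value "defaultCluster" are all hypothetical names chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the cluster-name logic; not the real Chukwa RecordUtil.
final class ClusterNameSketch {

    // Assumed name of the custom field carrying the event type.
    static final String EVENT_TYPE_FIELD = "event_type";

    // Returns "event_" + <event_type> when the field is present and non-empty,
    // otherwise the supplied fallback cluster name.
    static String clusterFor(Map<String, String> fields, String fallbackCluster) {
        String eventType = fields.get(EVENT_TYPE_FIELD);
        if (eventType != null && !eventType.isEmpty()) {
            return "event_" + eventType;
        }
        return fallbackCluster;
    }

    public static void main(String[] args) {
        Map<String, String> withType = new HashMap<>();
        withType.put(EVENT_TYPE_FIELD, "login");

        Map<String, String> withoutType = new HashMap<>();

        System.out.println(clusterFor(withType, "defaultCluster"));    // prints "event_login"
        System.out.println(clusterFor(withoutType, "defaultCluster")); // prints "defaultCluster"
    }
}
```

Keeping the decision in one small function like this makes it easy to see that records without the field still end up in whatever cluster they would have used before.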