I think it's fine to have the semantics remain in the Demux. In that case, perhaps the Demux processors can look at the sar-generated column labels to determine the units, and standardize the units of the output?

Jiaqi
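For concreteness, label-driven unit normalization in a Demux processor might look roughly like the sketch below. The class and the label strings it checks are illustrative only (newer sysstat emits e.g. rxkB/s where older versions emitted rxbyt/s); this is not existing Chukwa code:

    // Hypothetical helper -- not existing Chukwa code.
    public class UnitNormalizer {
        // Convert a sar value to canonical bytes/sec based on its column label,
        // e.g. "rxkB/s" (newer sysstat) vs. "rxbyt/s" (older sysstat).
        public static double toBytesPerSec(String columnLabel, double value) {
            if (columnLabel.contains("kB/s")) {
                return value * 1024.0;
            } else if (columnLabel.contains("MB/s")) {
                return value * 1024.0 * 1024.0;
            } else if (columnLabel.contains("byt/s")) {
                return value;
            }
            // Unknown label: pass the value through rather than guess.
            return value;
        }
    }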
On Fri, May 22, 2009 at 10:56 AM, Eric Yang <[email protected]> wrote:
> Many solutions have been suggested in the past year, but there isn't one that fits all. Most of the promising libraries are in the GPL camp. Unfortunately, we can't use those. The closest thing in the Apache camp is the ganglia metrics library. There are two bugs that they need to fix in the metrics library. First, it uses float to store all values, hence the accuracy becomes somewhat questionable for large values. Second, one of the metrics only includes the value from the first device; I forget whether it's the network device or the disk. I dropped integration of the ganglia metrics library after discovering those bugs. However, we might want to revisit this if it has been improved. For the Windows camp, we may need a completely different solution for measuring system metrics.
>
> I believe all parsing logic and data schematics should happen in the demux parser rather than in MDL. Personally, I believe MDL should have zero configuration. MDL's purpose is to load data into the database by knowing RecordType=Table, Key=Column, Value=Value. This will definitely reduce the number of places where we maintain data transformations. The data schematics should live in the demux parser and database_create_table.sql only. What do you guys think?
>
> Regards,
> Eric
>
> On 5/21/09 11:01 PM, "Ariel Rabkin" <[email protected]> wrote:
>
>> Howdy.
>>
>> I agree with your diagnosis -- this is the peril of external dependencies. There was discussion, back in the day, about doing something better. Poking at /proc is certainly one option. Another would be finding some Apache-licensed library that does this. Sigar would fit the bill, but it's GPLed and so we can't link against it. Though there was discussion under HADOOP-4959 about a license exemption. That might solve our problem.
>>
>> There's a Java standard approach that does some subset of what we want --
>> http://java.sun.com/javase/6/docs/jre/api/management/extension/com/sun/management/UnixOperatingSystemMXBean.html
>>
>> What's peculiar about this issue is that right now, the actual Demux processors are largely independent of the versions -- those processors make assumptions about the syntax of the input, but almost none about the semantics. If the data comes in columns with headers, they do basically the right thing. However, when it comes time to do the database insert, the column names don't match the ones in mdl.xml, and so things start to fail.
>>
>> It seems a pity to dirty up the currently clean Java code with lots of special cases for canonicalizing data formats. I'm okay doing some sort of parameterization, but I think in a lot of cases we can do something very simpleminded and still be okay. Perhaps as simple as "if you see field x in a SystemMetrics record, output field y as follows."
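The MXBean approach Ari links to covers only a small subset of system metrics (load average, file descriptor counts, and the like). A standalone demo of what it exposes:

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;
    import com.sun.management.UnixOperatingSystemMXBean;

    public class JmxOsMetrics {
        public static void main(String[] args) {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            System.out.println("load avg = " + os.getSystemLoadAverage());
            // On Sun JVMs on Unix, the bean also implements the com.sun extension.
            if (os instanceof UnixOperatingSystemMXBean) {
                UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
                System.out.println("open fds = " + unix.getOpenFileDescriptorCount());
                System.out.println("max fds  = " + unix.getMaxFileDescriptorCount());
            }
        }
    }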
>>
>> On Thu, May 21, 2009 at 10:27 PM, Jiaqi Tan <[email protected]> wrote:
>>> Hi Ari,
>>>
>>> I think the real problem here is that sar metrics are being picked up by an Exec adaptor which calls sar, and there's no control over which sar gets called (or at least not right now); sar is ultimately an external dependency which is currently just assumed to be sitting there.
>>>
>>> Also, sar directly emits unstructured plain text, so there's no self-describing data format (a la some XML) that says what the units are. If sar changes its output units and such, then the parser in the Demux needs to take care of that too. Even more generally, any change at all to sar's output would require an update to the Demux.
>>>
>>> I think the fundamental problem is that having an Exec adaptor which pulls in the unstructured output of an external program, plus a Demux processor that makes assumptions about what that output looks like and what it means, makes the whole workflow dependent on something not under Chukwa's control.
>>>
>>> I can imagine one way of working around that would be to not use sar and instead write custom parsers for /proc, so that Chukwa is itself aware of what the proc data actually means without having to make assumptions about the output of an external parser; it's reinventing the wheel somewhat, but it gives a cleaner end-to-end solution.
>>>
>>> The other answer would perhaps be the "web services" answer of having a whole standardized way of passing data around in a structured form, but then that starts to look like a generalized pub/sub system.
>>>
>>> But in the meantime, maybe the sar version on the system being monitored could be picked up in some way (metadata in the Chunk?), and the various Demux processors dependent on such external programs (e.g. IoStat, Df, etc.) could be parameterized to handle output from different versions/variants of the source program. Or, to be even more general, the Exec adaptor could send along an MD5 hash of the program it's calling, and then you'd have a whole bunch of processors for every possible variant of the program you want to support. That sounds terribly hackish to me, but at least that way the identity of the external dependency can be pinned down.
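A minimal sketch of that last idea: fingerprint the binary the Exec adaptor runs so downstream processors can pick a matching parser. The class, and the notion of attaching the hash as Chunk metadata, are hypothetical rather than existing Chukwa APIs:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Hypothetical helper -- not an existing Chukwa API.
    public class ExecFingerprint {
        // MD5 of the program file, e.g. md5Of("/usr/bin/sar"), suitable for
        // attaching to each Chunk so Demux can dispatch on the exact variant.
        public static String md5Of(String path)
                throws IOException, NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("MD5");
            FileInputStream in = new FileInputStream(path);
            try {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    md.update(buf, 0, n);
                }
            } finally {
                in.close();
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }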

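And a rough sketch of the /proc-parsing alternative raised earlier in the thread: reading /proc/meminfo directly means the field names come straight from the kernel, with no external program's output format to second-guess. Illustrative only, not existing Chukwa code:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical parser -- not existing Chukwa code.
    public class ProcMeminfoParser {
        // Parses lines of the form "MemTotal:  16331712 kB" into key -> value.
        public static Map<String, Long> parse() throws IOException {
            Map<String, Long> metrics = new HashMap<String, Long>();
            BufferedReader in = new BufferedReader(new FileReader("/proc/meminfo"));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split("\\s+");
                    if (parts.length >= 2) {
                        metrics.put(parts[0].replace(":", ""),
                                    Long.parseLong(parts[1]));
                    }
                }
            } finally {
                in.close();
            }
            return metrics;
        }
    }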