I think Nick brings up some good points. Would there ever be a reason to not use UTF8 as the default from parsing a message on? All the tools we use for analytics work with UTF8 (am I wrong?).
The only case I can see needing a configurable charset would be if a message coming from a sensor were encoded in a charset other than UTF8. In that case there would need to be a configurable charset per parser (in parser config?) for decoding but the message could be encoded/decoded with UTF8 after that. Or we could just make UTF encoding a prerequisite for sending messages to Metron. On Fri, Apr 21, 2017 at 10:32 AM, Nick Allen <n...@nickallen.org> wrote: > Per (2), I think it makes sense to make the charset configurable, but with > the proposal of 3 separate settings, wouldn't things blow up horribly if > the Parsers are producing UTF-8, but Enrichment is expecting UTF-16? They > are not even speaking the same language, no? > > This makes me think that we need to understand the scenarios under which a > user would want to change the charset, before we know what kinds of levers > to bake-in. What sort of use case would drive someone to change the > charset? Would there be a particular sensor producing telemetry with a > different charset from others? If one source of telemetry is different > than the others, would the entire system even work with varying charsets? > > Without a good understanding of use cases, I think the only mildly safe > thing to do right now, is to have a single, global charset setting. The > user would have to ensure that their entire environment and all the JVMs > driving it are set to use that charset. > > Perhaps my questioning comes from a lack of understanding of charsets. I > can't remember ever having to deviate from the default. Please chime in > and educate me, if I am off base. > > > > > > > On Fri, Apr 21, 2017 at 8:50 AM, JJ Meyer <jjmey...@gmail.com> wrote: > > > Hello everybody, > > > > Currently our build has a significant amount of warnings (most from the > > error prone plugin). A lot of this has been documented here: > > https://issues.apache.org/jira/browse/METRON-617 > > > > I want to continue to work on this Jira. I really want to reduce the > number > > of warnings in our build. As the Jira points out, a lot of them are > > unchecked casts or the implicit use of default charset. > > > > When starting this, I want to start with a specific module. I plan on > > starting with `metron-interface` as it's fairly self contained and is > > pretty new. Below I want to layout what I plan on doing. Please let me > know > > what you all think: > > > > 1. Suppress warnings where generics are not supported or checking type is > > not possible. > > 2. For all unit tests that require supplying a charset I'll supply the > > UTF-8 charset. > > 3. Update the API to have a charset configuration for each resource that > > needs one. > > 4. Remove @SuppressWarnings("ALL") on all unit tests. I found out error > > prone doesn't support this level of suppression. Which is probably a good > > thing. We will need to explicitly state what we want to suppress instead. > > > > Two big questions came up right away when I was thinking about the above: > > > > 1. When suppressing warnings. I see we sometimes cast a JSONObject to > > Map<String, Object>. I know it extends Map, but is it really safe to do > > something like the following? Should we have a utility that truly builds > a > > map that uses generics? I plan on doing some testing around this, but if > > anyone has any experience with this it would be greatly appreciated. Here > > is an example of what I am talking about: > > > > JSONObject message = ...; > > @SuppressWarnings("unchecked") > > Map<String, Object> state = (Map<String, Object>) message; > > > > > > 2. This one is about configuring charset (#3 above). Specifically in > > metron-rest. Right now, I believe there are 3 sensor resources (index, > > enrichment, and parser) that throw warnings due to implicitly using the > > default charset. I propose that we have 3 configs (listed below). These > > configs would take any valid charset, default, or not set. If not set > then > > UTF-8 would be default. Does this seem fair? > > > > sensor: > > index.encoding: UTF-8 > > enrichment.encoding: UTF-8 > > parser.encoding: UTF-8 > > > > > > Does anyone see any problems that may occur if we go about it this way? > Any > > help on this would be very much appreciated. > > > > Thanks, > > JJ > > >