Personally, I think it makes sense to have an inbound/outbound charset
setting for the parsers. We should generally standardize on UTF-8 across
Metron and its topologies, but accept potentially different charsets from
the inbound sensors. That is to say, standardize the defaults for the
things we do control, and handle edge cases for the things we don't. And
UTF-8 is fully compatible with ASCII without requiring a mapping, so this
should generally work.
http://stackoverflow.com/questions/15965811/why-utf8-is-compatible-with-ascii.
Even with defaults, I think we should still let users configure this to
whatever they want, but offer the warning of what will happen if the
expected charsets don't match up.

On Fri, Apr 21, 2017 at 9:32 AM, Nick Allen <n...@nickallen.org> wrote:

> Per (2), I think it makes sense to make the charset configurable, but with
> the proposal of 3 separate settings, wouldn't things blow up horribly if
> the Parsers are producing UTF-8, but Enrichment is expecting UTF-16?  They
> are not even speaking the same language, no?
>
> This makes me think that we need to understand the scenarios under which a
> user would want to change the charset, before we know what kinds of levers
> to bake-in.  What sort of use case would drive someone to change the
> charset?  Would there be a particular sensor producing telemetry with a
> different charset from others?  If one source of telemetry is different
> than the others, would the entire system even work with varying charsets?
>
> Without a good understanding of use cases, I think the only mildly safe
> thing to do right now, is to have a single, global charset setting.  The
> user would have to ensure that their entire environment and all the JVMs
> driving it are set to use that charset.
>
> Perhaps my questioning comes from a lack of understanding of charsets.  I
> can't remember ever having to deviate from the default.  Please chime in
> and educate me, if I am off base.
>
>
>
>
>
>
> On Fri, Apr 21, 2017 at 8:50 AM, JJ Meyer <jjmey...@gmail.com> wrote:
>
> > Hello everybody,
> >
> > Currently our build has a significant amount of warnings (most from the
> > error prone plugin). A lot of this has been documented here:
> > https://issues.apache.org/jira/browse/METRON-617
> >
> > I want to continue to work on this Jira. I really want to reduce the
> number
> > of warnings in our build. As the Jira points out, a lot of them are
> > unchecked casts or the implicit use of default charset.
> >
> > When starting this, I want to start with a specific module. I plan on
> > starting with `metron-interface` as it's fairly self contained and is
> > pretty new. Below I want to layout what I plan on doing. Please let me
> know
> > what you all think:
> >
> > 1. Suppress warnings where generics are not supported or checking type is
> > not possible.
> > 2. For all unit tests that require supplying a charset I'll supply the
> > UTF-8 charset.
> > 3. Update the API to have a charset configuration for each resource that
> > needs one.
> > 4. Remove @SuppressWarnings("ALL") on all unit tests. I found out error
> > prone doesn't support this level of suppression. Which is probably a good
> > thing. We will need to explicitly state what we want to suppress instead.
> >
> > Two big questions came up right away when I was thinking about the above:
> >
> > 1. When suppressing warnings. I see we sometimes cast a JSONObject to
> > Map<String, Object>. I know it extends Map, but is it really safe to do
> > something like the following? Should we have a utility that truly builds
> a
> > map that uses generics? I plan on doing some testing around this, but if
> > anyone has any experience with this it would be greatly appreciated. Here
> > is an example of what I am talking about:
> >
> > JSONObject message = ...;
> > @SuppressWarnings("unchecked")
> > Map<String, Object> state = (Map<String, Object>) message;
> >
> >
> > 2. This one is about configuring charset (#3 above). Specifically in
> > metron-rest. Right now, I believe there are 3 sensor resources (index,
> > enrichment, and parser) that throw warnings due to implicitly using the
> > default charset. I propose that we have 3 configs (listed below). These
> > configs would take any valid charset, default, or not set. If not set
> then
> > UTF-8 would be default. Does this seem fair?
> >
> > sensor:
> >   index.encoding: UTF-8
> >   enrichment.encoding: UTF-8
> >   parser.encoding: UTF-8
> >
> >
> > Does anyone see any problems that may occur if we go about it this way?
> Any
> > help on this would be very much appreciated.
> >
> > Thanks,
> > JJ
> >
>

Reply via email to