Re: [DISCUSS] Charset PR merge effects

2019-02-25 Thread Michael Miklavcic
Justin, I like your proposal. I just had a couple thoughts that I think are
worth pointing out. I'm not 100% clear about this point, but I don't
believe we currently handle the possibility of various external data
sources with heterogeneous encoding schemes in our Kafka topics. As it
stands, I believe it's up to users to ensure their external systems have
the same encoding as the machines hosting Metron (since we use the default
charset). It seems we haven't run into issues with this yet, but it should
be noted that if we change to UTF-8 while the external systems continue to
use a different default, e.g. ISO-8859-1, things will work fine for ASCII
chars 0-127, and then break for 128-255 where ISO-8859-1 still uses a
single-byte representation, but UTF-8 uses a 2-byte representation.
https://en.wikipedia.org/wiki/ISO/IEC_8859-1
https://en.wikipedia.org/wiki/UTF-8

I don't believe this change breaks anything - HDFS/Hadoop already uses
UTF-8 as its default encoding and we'd likely have seen or heard about
issues, but it's worth mentioning that there may be some gotchas at the
edges outside of Metron. I did a quick search in your PR and didn't see any
README updates - the only thing I might recommend is a release notice that
Metron is changing our encoding handling to be more explicitly opinionated
and will use UTF-8 rather than the default charset.

Best,
Mike


On Wed, Feb 20, 2019 at 8:19 AM Justin Leet  wrote:

> Hey all,
>
> I wanted to bring a bit of attention to a change and its effects before I
> push to master.
>
> PR#1341  Removes all uses of
> the default charset (which is platform dependent) and moves everything to
> UTF-8. This PR currently has a +1, but obviously any new input is more than
> welcome.
>
> Going forward, it will be a compiler error to use the default charset (e.g.
> getBytes without StandardCharsets.UTF_8, or using APIs that don't allow an
> encoding to be passed. I want to call attention to the fact that existing
> PRs may break after this merge.
>
> Post merging this PR, I would send another email to the list (and Slack and
> so on), to note that existing PRs should merge in master (or at least
> determine if they need to).
>
> Does anyone have any objections or concerns? Should I be more proactive
> about a heads up on PRs, or are we okay with what I'm proposing?
>


[DISCUSS] Charset PR merge effects

2019-02-20 Thread Justin Leet
Hey all,

I wanted to bring a bit of attention to a change and its effects before I
push to master.

PR#1341  Removes all uses of
the default charset (which is platform dependent) and moves everything to
UTF-8. This PR currently has a +1, but obviously any new input is more than
welcome.

Going forward, it will be a compiler error to use the default charset (e.g.
getBytes without StandardCharsets.UTF_8, or using APIs that don't allow an
encoding to be passed. I want to call attention to the fact that existing
PRs may break after this merge.

Post merging this PR, I would send another email to the list (and Slack and
so on), to note that existing PRs should merge in master (or at least
determine if they need to).

Does anyone have any objections or concerns? Should I be more proactive
about a heads up on PRs, or are we okay with what I'm proposing?