Github user JonZeolla commented on the issue:
https://github.com/apache/incubator-metron/pull/358
Right, I did take a look to see how realistic it would be for 4-byte
characters to be used in real-world URIs, and it didn't look very concerning,
which is why I went with 10922 instead of 8191. When I evaluated 8191, I
looked through 1 week of my HTTP logs and found numerous legitimate URI fields
that exceeded it, but none that exceeded 10922 (not that the latter doesn't
happen, it's just much rarer).
Specifically, I considered that legitimate URIs with multi-byte characters will
usually contain CJK characters along with the standard URI delimiters. CJK
characters only use up to 3 bytes (afaict), and non-CJK characters that need 4
bytes are only rarely used (i.e. if enough of them are used that this problem
is encountered, it is probably nefarious, and in any case an extremely rare
event). Given that, and the fact that 8191 seems to truncate some legitimate
URI fields, I went with 10922.
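For reference, here is a rough sketch of where the two numbers come from. It assumes the constraint is Lucene's 32766-byte maximum term length (not restated in this comment, so treat it as an assumption) and that the field is UTF-8 encoded; the names `MAX_TERM_BYTES`, `max_safe_chars`, and `utf8_len` are made up for illustration.

```python
# Assumption: the indexing limit is Lucene's 32766-byte maximum term length.
MAX_TERM_BYTES = 32766


def max_safe_chars(bytes_per_char: int) -> int:
    """Largest character count that always fits if no character needs
    more than bytes_per_char bytes in UTF-8."""
    return MAX_TERM_BYTES // bytes_per_char


def utf8_len(s: str) -> int:
    """Length of s in bytes once encoded as UTF-8."""
    return len(s.encode("utf-8"))


limit_3_byte = max_safe_chars(3)  # safe if every character is <= 3 bytes
limit_4_byte = max_safe_chars(4)  # safe for any UTF-8 character

# A URI-length string made only of a 3-byte CJK character still fits
# at the larger limit...
cjk_uri = "\u4e2d" * limit_3_byte
# ...but the same number of 4-byte characters does not.
four_byte_uri = "\U0001F600" * limit_3_byte

print(limit_3_byte, limit_4_byte)
print(utf8_len(cjk_uri) <= MAX_TERM_BYTES)
print(utf8_len(four_byte_uri) <= MAX_TERM_BYTES)
```

So 10922 only overflows when a field is both very long and dominated by 4-byte characters, which is the case I'm arguing is rare.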
Of course, if the project is looking to drop its BETA flag before
[METRON-542](https://issues.apache.org/jira/browse/METRON-542) (which would
truncate based on URI length upstream from indexing and have other mitigations
in place) has been implemented, I would agree that this should be 8191 for now.
If you'd still prefer 8191 after considering the points above, I'd be
more than happy to swap it out, but I wanted to note that it may
prevent a higher percentage of legitimate URI fields from being indexed.