mmiklavc commented on issue #1409: METRON-2112 Normalize parser original_string handling URL: https://github.com/apache/metron/pull/1409#issuecomment-494991158 > I think we should simply add a new field for the source bytes or whatever, and never have the parsers even know about it. > > original_string, though perhaps poorly named should remain the domain of the parsers. > > the we documented it like > " original_string is to be representative of the string parsed to produce the record it resides in, such that the parse, if passed that string again would produce the same result, although being the domain of the parser that generated it, it may be some other markup or preparsed stateā It might be worth going back to the DISCUSS thread with this convo, but here are my current thoughts. First, we're potentially adding the raw source bytes 2x if we do this. It's also not immediately obvious to me all the implications of converting the current `original_string` to a new purpose while also adding a new `original_string` without doing a fairly involved investigation. I do think that's worthwhile, ultimately. But getting back to the original problem at hand, I don't think it makes sense to make this the immediate solution to fix our regression. If we don't like this configuration approach, I would like to offer a couple much simpler solutions: 1. Skip the global change proposed by @nickwallen and @simonellistonball for now. Offer this config option purely for the JsonMapParser, as that's the only place we have a must-fix regression currently 2. Don't offer any config option, instead just remove the specialized original_string handling from JsonMapParser and use the raw bytes. I do think it makes sense to have a raw source bytes/string field. But I'm not sure I see much value in an additional `original_string` field if we went that route. Obviously, if custom parsers are written or created via Stellar, you can already add whatever new fields you want, but I think there's more to it than that. Right now we have quite a bit wrapped up in `original_string`: 1. BasicParser validation 2. parser chaining 3. indexing - both Solr and ES templates 4. Alerts UI 5. Kibana dashboards 6. Zeppelin dashboards? 7. Other custom functionality users may have wrapped up in `original_string` currently? 8. How about upgrades, index migrations, and the like? What about historic data in Kafka and HDFS? Here are usages found via IDE inspection: ``` Found usages (15 usages found) Production (5 usages found) BasicAsaParser (1 usage found) parse(byte[]) (1 usage found) 162 metronJson.put(Constants.Fields.ORIGINAL.getName(), logLine); EnvelopedRawMessageStrategy (2 usages found) mergeMetadata(JSONObject, Map<String, Object>, boolean, Map<String, Object>) (2 usages found) 116 String originalStringFromMetadata = (String)metadata.get(MetadataUtil.INSTANCE.prefixKey(prefix, Constants.Fields.ORIGINAL.getName())); 124 message.put(Constants.Fields.ORIGINAL.getName(), originalStringFromMetadata); LEEFParser (1 usage found) parseOptionalResult(byte[]) (1 usage found) 198 obj.put(Fields.ORIGINAL.getName(), originalMessage); RegularExpressionsParser (1 usage found) parse(byte[]) (1 usage found) 174 parsedJson.put(Constants.Fields.ORIGINAL.getName(), originalMessage); ``` Obviously, we're a far cry from using our constants consistently since this shows only a few of the parsers. Here's the remaining results I get via a massaged text search: ``` Occurrences of 'original_string' in Project Production Files (Except Comments) Found Occurrences (55 usages found) Production (55 usages found) BaseSyslogParser.java (3 usages found) validate(JSONObject) (2 usages found) 92 if (!(message.containsKey("original_string"))) { 93 LOG.trace("[Metron] Message does not have original_string: {}", message); parseOptionalResult(byte[]) (1 usage found) 120 jsonObject.put("original_string", originalString); BasicBroParser.java (1 usage found) parse(byte[]) (1 usage found) 105 payload.put("original_string", originalString); BasicFireEyeParser.java (1 usage found) parseMessage(String) (1 usage found) 146 toReturn.put("original_string", toParse); BasicLancopeParser.java (1 usage found) parse(byte[]) (1 usage found) 83 payload.put("original_string", message); BasicLogstashParser.java (1 usage found) parse(byte[]) (1 usage found) 61 rawJson = mutate(rawJson, "message", "original_string"); BasicPaloAltoFirewallParser.java (1 usage found) parse(byte[]) (1 usage found) 185 outputMessage.put("original_string", toParse); BasicParser.java (2 usages found) validate(JSONObject) (2 usages found) 36 if (!(value.containsKey("original_string"))) { 37 LOG.trace("[Metron] Message does not have original_string: {}", message); BasicSnortParser.java (1 usage found) parse(byte[]) (1 usage found) 174 jsonMessage.put("original_string", csvMessage); BasicSourcefireParser.java (1 usage found) parse(byte[]) (1 usage found) 114 payload.put("original_string", originalString); CEFParser.java (1 usage found) parse(byte[]) (1 usage found) 178 obj.put("original_string", cefString); Constants.java (1 usage found) 58 ,ORIGINAL("original_string") Constants.java (1 usage found) 41 ,ORIGINAL("original_string") CSVParser.java (1 usage found) parse(byte[]) (1 usage found) 61 value.put("original_string", msg); GrokParser.java (2 usages found) parseMultiLine(byte[]) (1 usage found) 177 message.put("original_string", originalMessage); parseSingleLine(byte[]) (1 usage found) 226 message.put("original_string", originalMessage); JSONMapParser.java (1 usage found) parse(byte[]) (1 usage found) 192 ret.put("original_string", originalJsonObject.toJSONString()); SelectTransformation.java (1 usage found) 32 private static final List<String> systemFields = Arrays.asList("timestamp", "original_string", "source.type"); ValidationUtils.java (1 usage found) assertJsonEqual(String, String) (1 usage found) 57 if (((String) k).equals("original_string")) { adallom.schema (2 usages found) 5 "original_string": { 34 "original_string", "timestamp", adallom.schema (2 usages found) 5 "original_string": { 34 "original_string", "timestamp", cyberark.json (1 usage found) 15 original_string cyberark.json (1 usage found) 15 original_string cyberark.schema (2 usages found) 11 "original_string": { 36 "required": ["original_string", "timestamp", cyberark.schema (2 usages found) 11 "original_string": { 36 "required": ["original_string", "timestamp", dashboard-bulkload.json (4 usages found) 2 original_string 20 original_string 38 original_string 40 original_string LancopeSchema.json (2 usages found) 11 "original_string": { 27 "required": ["ip_src_addr", "ip_dst_addr", "original_string","@version", "timestamp", "type","host"] LancopeSchema.json (2 usages found) 11 "original_string": { 27 "required": ["ip_src_addr", "ip_dst_addr", "original_string","@version", "timestamp", "type","host"] palo.schema (2 usages found) 11 "original_string": { 36 "required": ["original_string", "timestamp", palo.schema (2 usages found) 11 "original_string": { 36 "required": ["original_string", "timestamp", sample.schema (2 usages found) 5 "original_string": { 25 "original_string", "timestamp", SourcefireSchema.json (3 usages found) 17 "original_string": { 20 "original_string": { 33 "required": ["ip_src_addr", "ip_dst_addr", "ip_src_port", "ip_dst_port","protocol","original_string","key","timestamp"] SourcefireSchema.json (3 usages found) 17 "original_string": { 20 "original_string": { 33 "required": ["ip_src_addr", "ip_dst_addr", "ip_src_port", "ip_dst_port","protocol","original_string","key","timestamp"] waf.schema (2 usages found) 17 "original_string": { 63 "required": ["ip_src_addr", "ip_dst_addr", "ip_src_port", "ip_dst_port", "original_string", "timestamp", waf.schema (2 usages found) 17 "original_string": { 63 "required": ["ip_src_addr", "ip_dst_addr", "ip_src_port", "ip_dst_port", "original_string", "timestamp", ``` Of course, I'm happy to talk through possible solutions, but I wanted to get as much of the possible impact on the table as possible before we decide how to move forward. **TL;DR** - I'm personally inclined towards the following actions: 1. In this PR I simply fixing the regression by using the raw source in JsonMapParser for `original_string`. For good or for worse, we currently prioritize this field as a system-level field. There will be no sub-message original_strings right now. 2. Create a Jira for assessing our original_string usage in the system and providing a path forward for managing both `raw_bytes` globally as a system field, and `original_string` per parser as a parser-owned field.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
