mmiklavc commented on issue #1409: METRON-2112 Normalize parser original_string 
handling
URL: https://github.com/apache/metron/pull/1409#issuecomment-494991158
 
 
   > I think we should simply add a new field for the source bytes or whatever, 
and never have the parsers even know about it.
   > 
   > original_string, though perhaps poorly named should remain the domain of 
the parsers.
   > 
   > the we documented it like
   > " original_string is to be representative of the string parsed to produce 
the record it resides in, such that the parse, if passed that string again 
would produce the same result, although being the domain of the parser that 
generated it, it may be some other markup or preparsed stateā€
   
   It might be worth going back to the DISCUSS thread with this convo, but here 
are my current thoughts.
   
   First, we're potentially adding the raw source bytes 2x if we do this. It's 
also not immediately obvious to me all the implications of converting the 
current `original_string` to a new purpose while also adding a new 
`original_string` without doing a fairly involved investigation. I do think 
that's worthwhile, ultimately. But getting back to the original problem at 
hand, I don't think it makes sense to make this the immediate solution to fix 
our regression. If we don't like this configuration approach, I would like to 
offer a couple much simpler solutions:
   
   1. Skip the global change proposed by @nickwallen and @simonellistonball  
for now. Offer this config option purely for the JsonMapParser, as that's the 
only place we have a must-fix regression currently
   2. Don't offer any config option, instead just remove the specialized 
original_string handling from JsonMapParser and use the raw bytes.
   
   I do think it makes sense to have a raw source bytes/string field. But I'm 
not sure I see much value in an additional `original_string` field if we went 
that route. Obviously, if custom parsers are written or created via Stellar, 
you can already add whatever new fields you want, but I think there's more to 
it than that.
   
   Right now we have quite a bit wrapped up in `original_string`:
   
   1. BasicParser validation
   2. parser chaining
   3. indexing - both Solr and ES templates
   4. Alerts UI
   5. Kibana dashboards
   6. Zeppelin dashboards?
   7. Other custom functionality users may have wrapped up in `original_string` 
currently?
   8. How about upgrades, index migrations, and the like? What about historic 
data in Kafka and HDFS?
   
   Here are usages found via IDE inspection:
   
   ```
   Found usages  (15 usages found)
       Production  (5 usages found)
           BasicAsaParser  (1 usage found)
               parse(byte[])  (1 usage found)
                   162 metronJson.put(Constants.Fields.ORIGINAL.getName(), 
logLine);
           EnvelopedRawMessageStrategy  (2 usages found)
               mergeMetadata(JSONObject, Map<String, Object>, boolean, 
Map<String, Object>)  (2 usages found)
                   116 String originalStringFromMetadata = 
(String)metadata.get(MetadataUtil.INSTANCE.prefixKey(prefix, 
Constants.Fields.ORIGINAL.getName()));
                   124 message.put(Constants.Fields.ORIGINAL.getName(), 
originalStringFromMetadata);
           LEEFParser  (1 usage found)
               parseOptionalResult(byte[])  (1 usage found)
                   198 obj.put(Fields.ORIGINAL.getName(), originalMessage);
           RegularExpressionsParser  (1 usage found)
               parse(byte[])  (1 usage found)
                   174 parsedJson.put(Constants.Fields.ORIGINAL.getName(), 
originalMessage);
   ```
   
   Obviously, we're a far cry from using our constants consistently since this 
shows only a few of the parsers. Here's the remaining results I get via a 
massaged text search:
   
   ```
       Occurrences of 'original_string' in Project Production Files (Except 
Comments)
   Found Occurrences  (55 usages found)
       Production  (55 usages found)
           BaseSyslogParser.java  (3 usages found)
               validate(JSONObject)  (2 usages found)
                   92 if (!(message.containsKey("original_string"))) {
                   93 LOG.trace("[Metron] Message does not have 
original_string: {}", message);
               parseOptionalResult(byte[])  (1 usage found)
                   120 jsonObject.put("original_string", originalString);
           BasicBroParser.java  (1 usage found)
               parse(byte[])  (1 usage found)
                   105 payload.put("original_string", originalString);
           BasicFireEyeParser.java  (1 usage found)
               parseMessage(String)  (1 usage found)
                   146 toReturn.put("original_string", toParse);
           BasicLancopeParser.java  (1 usage found)
               parse(byte[])  (1 usage found)
                   83 payload.put("original_string", message);
           BasicLogstashParser.java  (1 usage found)
               parse(byte[])  (1 usage found)
                   61 rawJson = mutate(rawJson, "message", "original_string");
           BasicPaloAltoFirewallParser.java  (1 usage found)
               parse(byte[])  (1 usage found)
                   185 outputMessage.put("original_string", toParse);
           BasicParser.java  (2 usages found)
               validate(JSONObject)  (2 usages found)
                   36 if (!(value.containsKey("original_string"))) {
                   37 LOG.trace("[Metron] Message does not have 
original_string: {}", message);
           BasicSnortParser.java  (1 usage found)
               parse(byte[])  (1 usage found)
                   174 jsonMessage.put("original_string", csvMessage);
           BasicSourcefireParser.java  (1 usage found)
               parse(byte[])  (1 usage found)
                   114 payload.put("original_string", originalString);
           CEFParser.java  (1 usage found)
               parse(byte[])  (1 usage found)
                   178 obj.put("original_string", cefString);
           Constants.java  (1 usage found)
               58 ,ORIGINAL("original_string")
           Constants.java  (1 usage found)
               41 ,ORIGINAL("original_string")
           CSVParser.java  (1 usage found)
               parse(byte[])  (1 usage found)
                   61 value.put("original_string", msg);
           GrokParser.java  (2 usages found)
               parseMultiLine(byte[])  (1 usage found)
                   177 message.put("original_string", originalMessage);
               parseSingleLine(byte[])  (1 usage found)
                   226 message.put("original_string", originalMessage);
           JSONMapParser.java  (1 usage found)
               parse(byte[])  (1 usage found)
                   192 ret.put("original_string", 
originalJsonObject.toJSONString());
           SelectTransformation.java  (1 usage found)
               32 private static final List<String> systemFields = 
Arrays.asList("timestamp", "original_string", "source.type");
           ValidationUtils.java  (1 usage found)
               assertJsonEqual(String, String)  (1 usage found)
                   57 if (((String) k).equals("original_string")) {
           adallom.schema  (2 usages found)
               5                "original_string": {
               34       "original_string", "timestamp", 
           adallom.schema  (2 usages found)
               5                "original_string": {
               34       "original_string", "timestamp", 
           cyberark.json  (1 usage found)
               15 original_string
           cyberark.json  (1 usage found)
               15 original_string
           cyberark.schema  (2 usages found)
               11               "original_string": {
               36       "required": ["original_string", "timestamp", 
           cyberark.schema  (2 usages found)
               11               "original_string": {
               36       "required": ["original_string", "timestamp", 
           dashboard-bulkload.json  (4 usages found)
               2 original_string
               20 original_string
               38 original_string
               40 original_string
           LancopeSchema.json  (2 usages found)
               11 "original_string": {
               27 "required": ["ip_src_addr", "ip_dst_addr", 
"original_string","@version", "timestamp", "type","host"]
           LancopeSchema.json  (2 usages found)
               11 "original_string": {
               27 "required": ["ip_src_addr", "ip_dst_addr", 
"original_string","@version", "timestamp", "type","host"]
           palo.schema  (2 usages found)
               11               "original_string": {
               36       "required": ["original_string", "timestamp", 
           palo.schema  (2 usages found)
               11               "original_string": {
               36       "required": ["original_string", "timestamp", 
           sample.schema  (2 usages found)
               5     "original_string": {
               25     "original_string", "timestamp",
           SourcefireSchema.json  (3 usages found)
               17 "original_string": {
               20 "original_string": {
               33 "required": ["ip_src_addr", "ip_dst_addr", "ip_src_port", 
"ip_dst_port","protocol","original_string","key","timestamp"]
           SourcefireSchema.json  (3 usages found)
               17 "original_string": {
               20 "original_string": {
               33 "required": ["ip_src_addr", "ip_dst_addr", "ip_src_port", 
"ip_dst_port","protocol","original_string","key","timestamp"]
           waf.schema  (2 usages found)
               17               "original_string": {
               63       "required": ["ip_src_addr", "ip_dst_addr", 
"ip_src_port", "ip_dst_port", "original_string", "timestamp", 
           waf.schema  (2 usages found)
               17               "original_string": {
               63       "required": ["ip_src_addr", "ip_dst_addr", 
"ip_src_port", "ip_dst_port", "original_string", "timestamp", 
   ```
   
   Of course, I'm happy to talk through possible solutions, but I wanted to get 
as much of the possible impact on the table as possible before we decide how to 
move forward. 
   
   **TL;DR** - I'm personally inclined towards the following actions:
   
   1. In this PR I simply fixing the regression by using the raw source in 
JsonMapParser for `original_string`. For good or for worse, we currently 
prioritize this field as a system-level field. There will be no sub-message 
original_strings right now.
   2. Create a Jira for assessing our original_string usage in the system and 
providing a path forward for managing both `raw_bytes` globally as a system 
field, and `original_string` per parser as a parser-owned field.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to