Matt Wise created FLUME-2768:
--------------------------------

             Summary: New ElasticSearch "structured" log behavior is wrong, and 
dangerous.
                 Key: FLUME-2768
                 URL: https://issues.apache.org/jira/browse/FLUME-2768
             Project: Flume
          Issue Type: Bug
          Components: Sinks+Sources
    Affects Versions: v1.6.0
            Reporter: Matt Wise


The new behavior introduced in Flume 1.6.0 to _automatically_ treat all JSON 
log messages as structured data 
(https://issues.apache.org/jira/browse/FLUME-2649, later fixed in 
https://issues.apache.org/jira/browse/FLUME-2126) is dangerous, 
underdocumented, and not controllable by a configuration switch.

*ElasticSearch Schema Change for the @message field*
The change that was made is pretty dangerous -- it assumes that if you're 
passing _any_ JSON data, you must be _only_ passing JSON data. Why does that 
matter? Because as soon as you pass in {{@message}} as an {{Object}}, 
ElasticSearch will refuse any future data for the {{@message}} field that 
comes in {{String}} format. As soon as this happens, _your log events get 
dropped on the floor_.
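
To make the failure mode concrete, here is a toy model in Python. It is not 
Flume or ElasticSearch code: {{ToyIndex}}, {{json_kind}} and {{serialize}} are 
made-up names, and the "first kind seen wins" rule is only a simplified 
stand-in for the mapping behavior described above.

```python
import json


def json_kind(value):
    """Classify a decoded value as 'object' or 'scalar' (toy categories)."""
    return "object" if isinstance(value, dict) else "scalar"


class ToyIndex:
    """Simplified stand-in for an index: the first kind seen for a field wins."""

    def __init__(self):
        self.mapping = {}
        self.rejected = []

    def index(self, doc):
        kinds = {field: json_kind(value) for field, value in doc.items()}
        # Reject the whole document if any field disagrees with the mapping.
        for field, kind in kinds.items():
            if self.mapping.get(field, kind) != kind:
                self.rejected.append(doc)
                return False
        self.mapping.update(kinds)
        return True


def serialize(body):
    """Mimic the described behavior: a JSON-object body becomes a structured @message."""
    try:
        parsed = json.loads(body)
    except ValueError:
        return {"@message": body}
    return {"@message": parsed if isinstance(parsed, dict) else body}


idx = ToyIndex()
first = idx.index(serialize('{"user": "alice", "action": "login"}'))
second = idx.index(serialize("plain text log line"))
print(first, second)
```

In this model the first (JSON) event fixes {{@message}} as an object, and the 
plain-text event that follows is rejected.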

*Assumes stable field-names and types*
Similar to the first issue, but more likely to bite you later on ... this 
change assumes that your field names are stable and always contain the same 
type of data. That is, if you pass in {{"duration": "5 seconds"}} then a field 
in ElasticSearch named {{duration}} will be created with the {{"string"}} type. 
Now imagine another app writes a log message with {{"duration": 5.0}} -- 
you're stuck, ElasticSearch cannot index that data and drops it on the floor 
because it violates the schema.
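
The kind of drift described above can be spotted by comparing the JSON type of 
each field across events. The sketch below is only an illustration: 
{{json_type}} and {{find_type_conflicts}} are hypothetical helpers, and the 
type labels are my own, not necessarily the names ElasticSearch uses.

```python
def json_type(value):
    """Return an illustrative type label for a decoded JSON value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    return "null"


def find_type_conflicts(events):
    """Map each field name to the set of types it was seen with, if more than one."""
    seen = {}
    for event in events:
        for field, value in event.items():
            seen.setdefault(field, set()).add(json_type(value))
    return {field: types for field, types in seen.items() if len(types) > 1}


events = [
    {"duration": "5 seconds", "host": "app1"},
    {"duration": 5.0, "host": "app2"},
]
conflicts = find_type_conflicts(events)
print(conflicts)
```

Here {{duration}} is reported as a conflict (seen as both a string and a 
double), while {{host}} is not.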

*Finally ... it's an undocumented behavior change*
This is the really big one here -- this change is not documented anywhere other 
than the commit messages. Also, _you can't turn it off!_ At the very least 
this new behavior should be _optional_, controlled by a configuration switch, 
and _disabled by default_.

*Lastly ... a fix?*
I plan to release the ElasticSearchLogStashStructuredEventSerializer that we 
use here at Nextdoor that handles all of the above issues silently. It never 
touches the {{@message}} field and it automatically handles all structured log 
data by dynamically renaming fields to include {{__<field type>}} in their 
name. 
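
As a rough illustration of the renaming idea (this is a Python sketch, not the 
actual serializer; {{type_name}}, {{serialize}} and the type labels are 
assumptions made for the example), the raw body stays in {{@message}} as a 
string and each structured field gets its type appended to its name:

```python
import json


def type_name(value):
    """Return an illustrative type label used as the field-name suffix."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    return "null"


def serialize(body):
    """Keep @message as the raw string; add top-level JSON fields as name__type."""
    doc = {"@message": body}
    try:
        parsed = json.loads(body)
    except ValueError:
        return doc  # not JSON: nothing structured to add
    if isinstance(parsed, dict):
        for name, value in parsed.items():
            doc[f"{name}__{type_name(value)}"] = value
    return doc


a = serialize('{"duration": "5 seconds"}')
b = serialize('{"duration": 5.0}')
c = serialize("plain text log line")
print(a, b, c, sep="\n")
```

With this scheme the two {{duration}} values land in {{duration__string}} and 
{{duration__double}}, so they never compete for the same field, and 
{{@message}} is always a string.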



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
