[
https://issues.apache.org/jira/browse/FLUME-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13804456#comment-13804456
]
Rotem Hermon commented on FLUME-2220:
-------------------------------------
Hi Dib,
v1 will certainly make the schema less noisy, but this issue is not caused by
the v0 schema; it just seems to be a bug in the sink. The serializer creates a
map of headers, extracts some fields from this map and sets them as top-level
fields, and then goes over all the items in the map and adds them under the
"@fields" field. So the items that were extracted earlier and already added as
logstash fields are added again under "@fields". This is redundant. Items
from the map that were promoted to top-level fields should be removed from the
map before the generic add, so they won't appear twice.
Hope I managed to be clear. If I get to it, I'll try to attach a code fix
(still trying to understand the procedure for submitting code to an Apache
project...).
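To make the idea concrete, here is a rough sketch of the "remove on promote"
approach. It is not the actual ElasticSearchLogStashEventSerializer code: the
class and method names (HeaderDedupSketch, toLogstashDocument, promote) are
made up for illustration, the header-to-field mapping only covers the three
headers mentioned in this issue, and the "@source" target name is an
assumption. Values are kept as plain strings to keep the example short.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not the real Flume serializer.
class HeaderDedupSketch {

    // Builds a logstash-style document from event headers without
    // writing promoted headers a second time under "@fields".
    static Map<String, Object> toLogstashDocument(Map<String, String> eventHeaders) {
        // Work on a copy so the caller's header map is left untouched.
        Map<String, String> headers = new HashMap<>(eventHeaders);
        Map<String, Object> doc = new LinkedHashMap<>();

        // Assumed mapping from header name to top-level logstash field.
        promote(headers, "timestamp", "@timestamp", doc);
        promote(headers, "host", "@source_host", doc);
        promote(headers, "source", "@source", doc);

        // Only headers that were NOT promoted are still in the map here.
        doc.put("@fields", new LinkedHashMap<>(headers));
        return doc;
    }

    // Moves one header to a top-level field: remove() both returns the
    // value and deletes the key, so it cannot be emitted again later.
    static void promote(Map<String, String> headers, String headerName,
                        String topLevelField, Map<String, Object> doc) {
        String value = headers.remove(headerName);
        if (value != null) {
            doc.put(topLevelField, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("timestamp", "1382600000000");
        headers.put("host", "web01");
        headers.put("custom", "x");
        System.out.println(toLogstashDocument(headers));
    }
}
```

The key point is using Map.remove() instead of Map.get() when extracting a
header, so the later generic loop over the map never sees it.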
> ElasticSearch sink - duplicate fields in indexed document
> ---------------------------------------------------------
>
> Key: FLUME-2220
> URL: https://issues.apache.org/jira/browse/FLUME-2220
> Project: Flume
> Issue Type: Bug
> Affects Versions: v1.4.0
> Reporter: Rotem Hermon
> Priority: Minor
> Labels: ElasticSearch, sink
>
> The default serializer for the ElasticSearch sink
> (ElasticSearchLogStashEventSerializer) duplicates fields that are mapped to
> default logstash fields.
> For instance: timestamp, source, host. Those appear both as logstash fields
> ("@timestamp", "@source_host" etc.) and as fields under "@fields"
> ("@fields.timestamp", "@fields.host").
> When a field from the headers is inserted as a logstash system field, it
> should be removed from the dictionary so it isn't written again under the
> "@fields" field.