[jira] [Commented] (FLUME-2649) Elasticsearch sink doesn't handle JSON fields correctly

Edward Sargisson (JIRA) Thu, 16 Apr 2015 08:54:40 -0700

    [ 
https://issues.apache.org/jira/browse/FLUME-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498199#comment-14498199
 ]


Edward Sargisson commented on FLUME-2649:
-----------------------------------------

I've reviewed the latest patch (#2) and my only addition is to suggest a broad 
code comment, which I copy here for the others on this work item to review:
"Elasticsearch will accept JSON directly but we need to validate that the 
incoming event is JSON first. Sadly, the elasticsearch JSON parser is a stream 
parser so we need to instantiate it, parse the event to validate it, then 
instantiate it again to provide the JSON to elasticsearch.
If validation fails then the incoming event is submitted to elasticsearch as 
plain text."
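
To make the intended behaviour concrete, here is a minimal sketch of the 
"validate first, fall back to plain text" decision. It is not the patch 
itself: the class name JsonFieldSketch, the helpers quote() and field(), and 
the isValidJson predicate are hypothetical stand-ins, and in the real sink the 
validation is done by the Elasticsearch parser rather than a caller-supplied 
predicate.

```java
import java.util.function.Predicate;

// Hypothetical illustration only; not part of Flume or Elasticsearch.
final class JsonFieldSketch {

    // Escape a value so it can be embedded as a JSON string literal.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \\uXXXX form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    // Build a one-field document. If the body validates as JSON it is
    // embedded as structure; otherwise it is submitted as an escaped string.
    // isValidJson is an assumed stand-in for the parser-based validation.
    static String field(String name, String body, Predicate<String> isValidJson) {
        String value = isValidJson.test(body) ? body : quote(body);
        return "{" + quote(name) + ":" + value + "}";
    }
}
```

With a body of {"foo":"bar"}, a validator that accepts it yields 
{"@message":{"foo":"bar"}}, while a validator that rejects it yields 
{"@message":"{\"foo\":\"bar\"}"}, which is the escaped form described in the 
issue.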

I reviewed FLUME-2126 and it looks like this will cover the cases in that work 
item too.

I'll give kudos to [~faelenor] for telling us how to fix it - and continuing 
to tell us until we did. :-)

[~hshreedharan] While I'm happy with the patch I'm going to comment on 
FLUME-2126 to see if anybody wants to test their particular use case before we 
commit.

> Elasticsearch sink doesn't handle JSON fields correctly
> -------------------------------------------------------
>
>                 Key: FLUME-2649
>                 URL: https://issues.apache.org/jira/browse/FLUME-2649
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>            Reporter: Francis
>            Assignee: Benjamin Fiorini
>         Attachments: FLUME-2649-0.patch, FLUME-2649-1.patch, 
> FLUME-2649-2.patch, FLUME-2649-3.patch, FLUME-2649-4.patch, FLUME-2649-5.patch
>
>
> JSON attributes are treated like normal strings and are escaped by the sink. 
> For example, if the body or a header contains the following value:
> {code:javascript}
> {"foo":"bar"}
> {code}
> It will be added like this in Elasticsearch:
> {code:javascript}
> {"@message": "{\"foo\":\"bar\"}"}
> {code}
> We end up with a plain string instead of a valid JSON field.
> I think I found how to fix this bug. The problem is caused by the way a 
> "complex field" is added. The ES XContent classes are used to parse 
> the data in the detected format, but then, instead of adding the parsed data, 
> the string() method is called and it converts it back to a string that is the 
> same as the initial data! Here is the current code with added comments:
> {code}
> XContentBuilder tmp = jsonBuilder(); // This tmp builder is completely useless.
> parser = XContentFactory.xContent(contentType).createParser(data);
> parser.nextToken();
> // This copies the whole parsed data into this tmp builder.
> tmp.copyCurrentStructure(parser);
> // Here, by calling tmp.string(), we get the parsed data converted back to a
> // string. This means that tmp.string() == String(data)!
> // All this parsing for nothing...
> // And then, as the field(String, String) method is called on the builder,
> // and the builder being a jsonBuilder, the string will be escaped according
> // to the JSON specifications.
> builder.field(fieldName, tmp.string());
> {code}
> If we really want to take advantage of the XContent classes, we have to add 
> the parsed data to the builder. To do this, it is as simple as:
> {code}
> parser = XContentFactory.xContent(contentType).createParser(data);
> parser.nextToken();
> // Add the field name, but not the value.
> builder.field(fieldName);
> // This will add the whole parsed content as the value of the field.
> builder.copyCurrentStructure(parser);
> {code}
> I tried this and it works as expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
