I have unit tests for this MR job, and they show that the JSON output is a string as I'd expect, so Gson is most likely not the cause.
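This isn't the real code, just a rough sketch of the document flow in the job, using plain java.util maps and stopping short of the Gson serialization step. The class name, the method names, the "logEntry" key, and the sample values are hypothetical stand-ins. Only "csUriParams" is taken from the error message in the thread. URL-decoding of the parameters is left out to keep it short.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: not the production mapper.
final class LogDocSketch {

    // Split "a=1&b=2" into key-value pairs. Every value stays a String.
    // Only the first '=' separates key from value, so base64 padding survives.
    static Map<String, String> parseQuery(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return params;
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(key, value);
        }
        return params;
    }

    // Build document Z from the log fields (X) and the query parameters (Y),
    // dropping any explicitly filtered keys from both.
    static Map<String, Object> buildDocument(Map<String, String> logFields,
                                             String query,
                                             Set<String> filteredKeys) {
        Map<String, String> x = new LinkedHashMap<>(logFields);
        Map<String, String> y = parseQuery(query);
        x.keySet().removeAll(filteredKeys);
        y.keySet().removeAll(filteredKeys);

        Map<String, Object> z = new LinkedHashMap<>();
        z.put("logEntry", x);     // hypothetical field name
        z.put("csUriParams", y);  // field name seen in the error message
        return z;
    }

    public static void main(String[] args) {
        Map<String, String> logFields = new LinkedHashMap<>();
        logFields.put("cs(User-Agent)", "ExampleAgent/1.0");
        Map<String, Object> z =
                buildDocument(logFields, "d=Y2lkPURF&page=2", Collections.singleton("page"));
        System.out.println(z);
    }
}
```

The point of the sketch is that an unexpected parameter name such as d lands in the nested csUriParams document as an ordinary String, with nothing in the job marking it as any other type.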
I'm hesitant to show more code (it's owned by my workplace), but I can describe it a little further:

- The mapper gets a W3C log entry.
- The log entry is broken into its components, which are put into document X.
- The request URL is then broken down into its query parameters, and the key-value pairs are put into document Y.
- Some elements are then explicitly filtered from X and Y.
- Those two documents are placed inside document Z, which is ultimately what is serialized and sent to ES.

We do have a base64-encoded param that we expect and handle appropriately. In this case, someone most likely sent it under the wrong param name, which is why it's making its way into document Y without further processing. Since it's being sent under a name that's not listed in the mapping, I expect it to just be treated as a string.

The only reason I chose the Gson route over building MapWritables is that building MapWritables is terribly verbose. It also comes with the overhead of having to wrap each String in a Text type, which just seems silly. Using the built-in JSON serializer is just not convenient in this case.

Brian

On Thu, Mar 20, 2014 at 11:18 AM, Costin Leau <[email protected]> wrote:

> My guess is that GSON adds the said field in its result. The base64
> suggests that there's some binary data in the mix.
>
> By the way, can you show more of your code? Any reason why you create
> the JSON yourself rather than just pass logEntryMap to es-hadoop?
> It can create the JSON for you, which is what I recommend; unless you
> have the JSON in HDFS, it's best to rely on es-hadoop to do it instead of
> an external tool.
>
> Cheers,
>
> On 3/20/14 4:48 PM, Brian Stempin wrote:
>
>> Hi,
>> All I'm doing is building a map and passing that to Gson for
>> serialization. A snippet from my map method:
>>
>> logEntryMap.put("cs(User-Agent)", values[9]);
>> context.write(NullWritable.get(), new Text(gson.toJson(logEntryMap)));
>>
>> values[] is a String array. Everything that goes into the map that gets
>> serialized is a string.
>>
>> I do have es.input.json set to true. This failure doesn't occur until
>> >100,000,000 records are in the index, so it's happening late in the
>> load process. The part that I find strange is that the field in question
>> isn't in my mapping, and I've not touched the default mapping. I'm not
>> sure why it would try to parse it as anything other than a string.
>>
>> I'll turn on TRACE logging and see what happens.
>>
>> Brian
>>
>> On Wed, Mar 19, 2014 at 5:35 PM, Costin Leau <[email protected]> wrote:
>>
>>     Hi,
>>
>>     How do you pass the JSON to es-hadoop? Do you have an example? By the
>>     way, you can enable TRACE logging on org.elasticsearch.hadoop and see
>>     everything that es-hadoop does, including the data that goes over the
>>     wire.
>>     My guess is that the conversion of logs to JSON creates some extra
>>     artifacts which are later on interpreted as a Writable object (instead
>>     of raw JSON) by es-hadoop.
>>     Make sure you tell es-hadoop that its source is JSON (through
>>     es.input.json set to true).
>>     The logs will likely confirm (or not) the above :)
>>
>>     Cheers,
>>
>>     On 3/19/14 11:14 PM, Brian Stempin wrote:
>>
>>         Hi List,
>>         I have an ES cluster that takes in some data from our logs. We
>>         use Hadoop to parse the individual log entries into JSON strings,
>>         which does a bulk insert using ES's output format. For whatever
>>         reason, ES attempts to parse base64 strings as dates and fails.
>>         Here's a line from one of my Hadoop logs:
>>
>>         java.lang.IllegalStateException: Found unrecoverable error [Bad Request(400) -
>>         MapperParsingException[failed to parse [csUriParams.d]]; nested:
>>         MapperParsingException[failed to parse date field [REDACTED BASE64 STRING],
>>         tried both date format [dateOptionalTime], and timestamp number with locale []];
>>         nested: IllegalArgumentException[Invalid format:
>>         "Y2lkPURFJml0ZW1zPWE2NTJjLXgxZT__Fj..."]; ]; Bailing out..
>>             at org.elasticsearch.hadoop.rest.RestClient.retryFailedEntries(RestClient.java:145)
>>             at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:120)
>>             at org.elasticsearch.hadoop.rest.RestRepository.sendBatch(RestRepository.java:147)
>>         <SNIP>
>>
>>         csUriParams.d does not appear in my mapping, so I never
>>         explicitly asked for it to be treated as a date.
>>
>>         Any idea why ES is trying to treat it as a date?
>>
>>         Thanks,
>>         Brian
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CANB1ciBy6jCC8YVT4FPi03g9TgGkt-QhB%2BUQKfWvDioYBnRopQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
