That's the problem -- it's a web log that contains a URL that could have literally anything in it. Anyone could put a base64 value in a random query parameter. I could have the M/R job ignore all fields that I don't explicitly expect, but that's not very flexible and prevents me from spotting possible abuse or user error. Is there any way for me to disable ES's type-guessing or to provide a default guess? I'd rather have ES default to a string than fail an M/R job because its type guess was wrong.
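For reference, a minimal sketch of the kind of mapping that would address this, assuming an Elasticsearch 1.x-era index. Setting `date_detection` to `false` stops ES from guessing dates for newly seen string fields, and a dynamic template maps anything under `csUriParams` to a plain string. The type name `logentry` and the template name `uri_params_as_strings` are made up for illustration, and `not_analyzed` is only one possible choice; none of this is taken from the actual cluster.

```json
{
  "mappings": {
    "logentry": {
      "date_detection": false,
      "dynamic_templates": [
        {
          "uri_params_as_strings": {
            "path_match": "csUriParams.*",
            "mapping": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      ]
    }
  }
}
```

A mapping like this only affects fields that have not been mapped yet, so it would need to be in place (at index creation or through an index template) before the first document carrying the field arrives. A field that ES has already mapped as a date keeps that type.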
Brian

On Thu, Mar 20, 2014 at 12:26 PM, Costin Leau <[email protected]> wrote:

> Then what you could do is to minimize the bulk size to, say, 100 documents,
> turn on logging and run your data through.
> This way you can catch the 'special' document in the act.
>
> As for expectations - Elasticsearch tries to guess the field type by
> looking at its value - it seems the base64 entry looks like a date, hence
> the error. You can avoid this by defining the field (either directly or
> through a template) in your mapping so it always gets mapped to a string.
> As a rule of thumb, whenever you want full control over the index, mapping
> is the way to do it.
>
> On 3/20/14 6:10 PM, Brian Stempin wrote:
>
>> I have unit tests for this MR job, and they show that the JSON output is
>> a string as I'd expect, so Gson is most likely not the cause.
>>
>> I'm hesitant to show more code (owned by the work-place), but I can
>> describe it a little bit further:
>>
>> * The mapper gets a W3C log entry
>> * The log entry is broken into its components and put into document X
>> * The request URL is then taken and broken down into its query
>>   parameters and the key-value pairs are put into document Y
>> * Some elements are then explicitly filtered from X and Y
>> * Those two documents are placed inside of document Z, which is
>>   ultimately what is serialized and sent to ES
>>
>> We do have a base64 encoded param that we expect and handle
>> appropriately. In this case, someone most likely sent it under the wrong
>> param name, which is why it's making its way into document Y without
>> further processing. Since it's being sent under a name that's not listed
>> in the mapping, I expect it to just be treated as a string.
>>
>> The only reason that I chose to go the Gson route vs building
>> MapWritables is that building MapWritables is terribly verbose. Also, it
>> comes with the overhead of having to wrap each String with a Text type,
>> which just seems silly. Using the built-in JSON serializer is just not
>> convenient in this case.
>>
>> Brian
>>
>> On Thu, Mar 20, 2014 at 11:18 AM, Costin Leau <[email protected]> wrote:
>>
>>> My guess is that GSON adds the said field in its result. The base64
>>> suggests that there's some binary data in the mix.
>>>
>>> By the way, can you show more of your code - any reason why you create
>>> the JSON yourself rather than just pass logEntryMap to es-hadoop?
>>> It can create the JSON for you - which is what I recommend; unless you
>>> have the JSON in HDFS, it's best to rely on es-hadoop to do it instead
>>> of an external tool.
>>>
>>> Cheers,
>>>
>>> On 3/20/14 4:48 PM, Brian Stempin wrote:
>>>
>>>> Hi,
>>>> All I'm doing is building a map and passing that to Gson for
>>>> serialization. A snippet from my map method:
>>>>
>>>>     logEntryMap.put("cs(User-Agent)", values[9]);
>>>>     context.write(NullWritable.get(), new Text(gson.toJson(logEntryMap)));
>>>>
>>>> values[] is a String array. Everything that goes into the map that
>>>> gets serialized is a string.
>>>>
>>>> I do have es.input.json set to true. This failure doesn't occur until
>>>> >100,000,000 records are in the index, so it's happening late in the
>>>> load process. The part that I find strange is that the field in
>>>> question isn't in my mapping, and I've not touched the default
>>>> mapping. I'm not sure why it would try to parse it as anything other
>>>> than a string.
>>>>
>>>> I'll turn on TRACE logging and see what happens.
>>>>
>>>> Brian
>>>>
>>>> On Wed, Mar 19, 2014 at 5:35 PM, Costin Leau <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> How do you pass the JSON to es-hadoop? Do you have an example? By
>>>>> the way, you can enable TRACE logging on org.elasticsearch.hadoop
>>>>> and see everything that es-hadoop does, including the data that goes
>>>>> over the wire.
>>>>> My guess is that the conversion of logs to JSON creates some extra
>>>>> artifacts which are later on interpreted as a Writable object
>>>>> (instead of raw JSON) by es-hadoop.
>>>>> Make sure you tell es-hadoop that its source is JSON (through
>>>>> es.input.json set to true).
>>>>> The logs will likely confirm (or not) the above :)
>>>>>
>>>>> Cheers,
>>>>>
>>>>> On 3/19/14 11:14 PM, Brian Stempin wrote:
>>>>>
>>>>>> Hi List,
>>>>>> I have an ES cluster that takes in some data from our logs. We use
>>>>>> Hadoop to parse the individual log entries into JSON strings and
>>>>>> then bulk-insert them using ES's output format. For whatever
>>>>>> reason, ES attempts to parse base64 strings as dates and fails.
>>>>>> Here's a line from one of my Hadoop logs:
>>>>>>
>>>>>> java.lang.IllegalStateException: Found unrecoverable error [Bad Request(400) -
>>>>>> MapperParsingException[failed to parse [csUriParams.d]]; nested:
>>>>>> MapperParsingException[failed to parse date field [REDACTED BASE64 STRING],
>>>>>> tried both date format [dateOptionalTime], and timestamp number with locale []];
>>>>>> nested: IllegalArgumentException[Invalid format:
>>>>>> "Y2lkPURFJml0ZW1zPWE2NTJjLXgxZTFj..."]; ]; Bailing out..
>>>>>>     at org.elasticsearch.hadoop.rest.RestClient.retryFailedEntries(RestClient.java:145)
>>>>>>     at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:120)
>>>>>>     at org.elasticsearch.hadoop.rest.RestRepository.sendBatch(RestRepository.java:147)
>>>>>>     <SNIP>
>>>>>>
>>>>>> csUriParams.d does not appear in my mapping, so I never explicitly
>>>>>> asked for it to be treated as a date.
>>>>>>
>>>>>> Any idea why ES is trying to treat it as a date?
>>>>>>
>>>>>> Thanks,
>>>>>> Brian
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CANB1ciCw1u-CpdTcxvXBShuaLNDEAWgyk4Jvq4ifbxujNMiT4A%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
