That's the problem -- it's a web log that contains a URL that could have
literally anything in it.  Anyone could put a base64 value in a random
query parameter.  I could have the M/R job ignore all fields that I don't
explicitly expect, but that's not very flexible and prevents me from
spotting possible abuse or user error.  Is there any way for me to disable
ES's type-guessing or to provide a default guess?  I'd rather have ES
default to a string than fail an M/R job because its type guess was wrong.

Brian


On Thu, Mar 20, 2014 at 12:26 PM, Costin Leau <[email protected]> wrote:

> Then what you could do is reduce the bulk size to, say, 100 documents,
> turn on logging and run your data through.
> This way you can catch the 'special' document in the act.
>
> As for expectations - Elasticsearch tries to guess the field type by
> looking at its value - it seems the base64 entry looks like a date, hence
> the error. You can avoid this by defining the field (either directly or
> through a template) in your mapping so it always gets mapped to a string.
> As a rule of thumb, whenever you want full control over the index, mapping
> is the way to do it.
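
A minimal sketch of the mapping approach described above, assuming ES 1.x-era mapping syntax and a hypothetical type name `logentry` (only `csUriParams` comes from this thread). It turns off date detection for the type and uses a dynamic template to map every field under `csUriParams` to a string:

```json
{
  "logentry": {
    "date_detection": false,
    "dynamic_templates": [
      {
        "uri_params_as_strings": {
          "path_match": "csUriParams.*",
          "mapping": { "type": "string" }
        }
      }
    ]
  }
}
```

Either setting on its own should keep a new base64 value from being guessed as a date. A field that has already been mapped keeps its type, so this would need to be in place before the first document is indexed. Later Elasticsearch versions replace `string` with `text`/`keyword`, so adjust for the version in use.
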
>
>
>
> On 3/20/14 6:10 PM, Brian Stempin wrote:
>
>> I have unit tests for this MR job, and they show that the JSON output is a
>> string as I'd expect, so Gson is most likely not the cause.
>>
>> I'm hesitant to show more code (owned by the work-place), but I can
>> describe it a little bit further:
>>
>>   * The mapper gets a W3C log entry
>>   * The log entry is broken into its components and put into document X
>>   * The request URL is then taken and broken down into its query
>>     parameters, and the key-value pairs are put into document Y
>>   * Some elements are then explicitly filtered from X and Y
>>   * Those two documents are placed inside of document Z, which is
>>     ultimately what is serialized and sent to ES
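
The steps in the list above can be sketched roughly as follows. This is not the actual job: the class and method names, the `logEntry` key, and the filtered parameter name `sessionToken` are all hypothetical, and Hadoop and Gson are left out so the sketch runs on its own. Only `cs(User-Agent)` and `csUriParams` come from this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LogEntrySketch {

    // Breaks a raw query string such as "a=1&b=2" into key-value pairs.
    // Values are kept as raw Strings (no percent-decoding in this sketch).
    static Map<String, String> parseQuery(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return params;
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            // Split on the first '=' only, so base64 padding in the value survives.
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(key, value);
        }
        return params;
    }

    // Builds document Z from the log components (X) and the query parameters (Y).
    static Map<String, Object> buildDocument(String userAgent, String uriQuery) {
        Map<String, String> logEntry = new LinkedHashMap<>();  // document X
        logEntry.put("cs(User-Agent)", userAgent);

        Map<String, String> uriParams = parseQuery(uriQuery);  // document Y
        uriParams.remove("sessionToken");                      // hypothetical filter step

        Map<String, Object> document = new LinkedHashMap<>();  // document Z
        document.put("logEntry", logEntry);                    // hypothetical key
        document.put("csUriParams", uriParams);
        return document;
    }

    public static void main(String[] args) {
        // An unexpected parameter "d" carrying a base64 value still ends up as a String.
        System.out.println(buildDocument("ExampleAgent/1.0", "d=Y2lkPURF&sessionToken=abc"));
    }
}
```

Every value that reaches document Y is a String here, which matches the description above; any type guessing happens later, on the Elasticsearch side.
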
>>
>>
>> We do have a base64 encoded param that we expect and handle
>> appropriately.  In this case, someone most likely sent it under the
>> wrong param name, which is why it's making its way into document Y
>> without further processing.  Since it's being sent under a name that's
>> not listed in the mapping, I expect it to just be treated as a string.
>>
>> The only reason that I chose to go the Gson route vs building
>> MapWritables is that building MapWritables is terribly verbose.  It
>> also comes with the overhead of having to wrap each String with a Text
>> type, which just seems silly.  Using the built-in JSON serializer is
>> just not convenient in this case.
>>
>> Brian
>>
>>
>> On Thu, Mar 20, 2014 at 11:18 AM, Costin Leau <[email protected]> wrote:
>>
>>     My guess is that GSON adds the said field in its result. The base64
>>     suggests that there's some binary data in the mix.
>>
>>     By the way, can you show more of your code? Is there any reason why
>>     you create the JSON yourself rather than just pass logEntryMap to
>>     es-hadoop? It can create the JSON for you, which is what I recommend;
>>     unless you already have the JSON in HDFS, it's best to rely on
>>     es-hadoop to do it instead of an external tool.
>>
>>     Cheers,
>>
>>
>>     On 3/20/14 4:48 PM, Brian Stempin wrote:
>>
>>         Hi,
>>         All I'm doing is building a map and passing that to Gson for
>>         serialization.  A snippet from my map method:
>>
>>         logEntryMap.put("cs(User-Agent)", values[9]);
>>         context.write(NullWritable.get(), new Text(gson.toJson(logEntryMap)));
>>
>>
>>         values[] is a String array.  Everything that goes into the map
>>         that gets serialized is a string.
>>
>>         I do have es.input.json set to true.  This failure doesn't occur
>>         until >100,000,000 records are in the index, so it's happening
>>         late in the load process.  The part that I find strange is that
>>         the field in question isn't in my mapping, and I've not touched
>>         the default mapping.  I'm not sure why it would try to parse it
>>         as anything other than a string.
>>
>>         I'll turn on TRACE logging and see what happens.
>>
>>         Brian
>>
>>
>>         On Wed, Mar 19, 2014 at 5:35 PM, Costin Leau <[email protected]> wrote:
>>
>>              Hi,
>>
>>              How do you pass the JSON to es-hadoop? Do you have an
>>              example? By the way, you can enable TRACE logging on
>>              org.elasticsearch.hadoop and see everything that es-hadoop
>>              does, including the data that goes over the wire.
>>              My guess is that the conversion of logs to JSON creates some
>>              extra artifacts which are later interpreted as a Writable
>>              object (instead of raw JSON) by es-hadoop.
>>              Make sure you tell es-hadoop that its source is JSON
>>              (through es.input.json set to true).
>>              The logs will likely confirm (or not) the above :)
>>
>>              Cheers,
>>
>>
>>              On 3/19/14 11:14 PM, Brian Stempin wrote:
>>
>>                  Hi List,
>>                  I have an ES cluster that takes in some data from our
>>                  logs.  We use Hadoop to parse the individual log entries
>>                  into JSON strings, which are then bulk-inserted using
>>                  ES's output format.  For whatever reason, ES attempts to
>>                  parse base64 strings as dates and fails.  Here's a line
>>                  from one of my Hadoop logs:
>>
>>                       java.lang.IllegalStateException: Found unrecoverable error [Bad Request(400) -
>>                       MapperParsingException[failed to parse [csUriParams.d]]; nested:
>>                       MapperParsingException[failed to parse date field [REDACTED BASE64 STRING],
>>                       tried both date format [dateOptionalTime], and timestamp number with locale []];
>>                       nested: IllegalArgumentException[Invalid format:
>>                       "Y2lkPURFJml0ZW1zPWE2NTJjLXgxZTFj..."]; ]; Bailing out..
>>                           at org.elasticsearch.hadoop.rest.RestClient.retryFailedEntries(RestClient.java:145)
>>                           at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:120)
>>                           at org.elasticsearch.hadoop.rest.RestRepository.sendBatch(RestRepository.java:147)
>>
>>
>>
>>                           <SNIP>
>>
>>
>>                  csUriParams.d does not appear in my mapping, so I never
>>                  explicitly asked for it to be treated as a date.
>>
>>                  Any idea why ES is trying to treat it as a date?
>>
>>                  Thanks,
>>                  Brian
>>
>>                  --
>>                  You received this message because you are subscribed to
>>                  the Google Groups "elasticsearch" group.
>>                  To unsubscribe from this group and stop receiving emails
>>                  from it, send an email to
>>                  elasticsearch+unsubscribe@googlegroups.com.
>>                  To view this discussion on the web visit
>>                  https://groups.google.com/d/msgid/elasticsearch/49e5fe0b-cec3-4914-b8d6-99440dd5fb69%40googlegroups.com.
>>                  For more options, visit https://groups.google.com/d/optout.
>>
>>
>>              --
>>              Costin
>>
>>
>>     --
>>     Costin
>>
>>
>
> --
> Costin
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CANB1ciCw1u-CpdTcxvXBShuaLNDEAWgyk4Jvq4ifbxujNMiT4A%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
