I have unit tests for this MR job, and they show that the JSON output is a
string as I'd expect, so Gson is most likely not the cause.

I'm hesitant to show more code (it's owned by my workplace), but I can
describe it a little further:

   - The mapper gets a W3C log entry
   - The log entry is broken into its components and put into document X
   - The request URL is then taken and broken down into its query
   parameters and the key-value pairs are put into document Y
   - Some elements are then explicitly filtered from X and Y
   - Those two documents are placed inside of document Z, which is
   ultimately what is serialized and sent to ES
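
A rough, stdlib-only sketch of the steps above, not the actual job code.
Every name in it (LogEntryDocuments, parseQueryParams, buildDocument, the
"logEntry" key, the excludedKeys set) is hypothetical.  Only "csUriParams" is
borrowed from the field name in the error message, and placing document Y
under that key is an assumption.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the mapper's document-building steps; all names are illustrative.
final class LogEntryDocuments {

    // Breaks a request URL's query string into key-value pairs (document Y).
    static Map<String, String> parseQueryParams(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = url.indexOf('?');
        if (q < 0 || q == url.length() - 1) {
            return params; // no query string present
        }
        for (String pair : url.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(key), decode(value));
        }
        return params;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Builds document Z from the log fields (X) and the URL's query parameters (Y).
    static Map<String, Object> buildDocument(Map<String, String> logFields,
                                             String requestUrl,
                                             Set<String> excludedKeys) {
        Map<String, String> x = new LinkedHashMap<>(logFields);
        Map<String, String> y = parseQueryParams(requestUrl);
        // Explicitly filter unwanted elements from both documents.
        x.keySet().removeAll(excludedKeys);
        y.keySet().removeAll(excludedKeys);
        Map<String, Object> z = new LinkedHashMap<>();
        z.put("logEntry", x);     // hypothetical key for document X
        z.put("csUriParams", y);  // assumed key for document Y
        return z;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("cs(User-Agent)", "curl/7.0");
        fields.put("c-ip", "10.0.0.1");
        Set<String> excluded = new HashSet<>(Arrays.asList("c-ip"));
        System.out.println(buildDocument(fields, "/page?d=Y2lk&lang=en", excluded));
    }
}
```

Every value in the resulting map is a String or a map of Strings, so a
misnamed base64 param lands in the nested map as an ordinary string.  In the
real job the map would then be serialized with gson.toJson(...) and written
out as a Text value.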

We do have a base64-encoded param that we expect and handle appropriately.
In this case, someone most likely sent it under the wrong param name, which
is why it's making its way into document Y without further processing.  Since
it's being sent under a name that's not listed in the mapping, I expect it to
just be treated as a string.

The only reason that I chose to go the Gson route vs building MapWritables
is that building MapWritables is terribly verbose.  Also, it comes with the
overhead of having to wrap each String with a Text type, which just seems
silly.  Using the built-in JSON serializer is just not convenient in this
case.

Brian


On Thu, Mar 20, 2014 at 11:18 AM, Costin Leau <[email protected]> wrote:

> My guess is that GSON adds the said field in its result. The base64
> suggests that there's some binary data in the mix.
>
> By the way, can you show more of your code? Is there any reason why you
> create the JSON yourself rather than just pass logEntryMap to es-hadoop?
> It can create the json for you - which is what I recommend; unless you
> have the JSON in HDFS, it's best to rely on es-hadoop to do it instead of
> an external tool.
>
> Cheers,
>
>
> On 3/20/14 4:48 PM, Brian Stempin wrote:
>
>> Hi,
>> All I'm doing is building a map and passing that to Gson for
>> serialization.  A snippet from my map method:
>>
>> logEntryMap.put("cs(User-Agent)", values[9]);
>> context.write(NullWritable.get(), new Text(gson.toJson(logEntryMap)));
>>
>> values[] is a String array.  Everything that goes into the map that gets
>> serialized is a string.
>>
>> I do have es.input.json set to true.  This failure doesn't occur until
>> more than 100,000,000 records are in the index, so it's happening late in
>> the load process.  The part that I find strange is that the field in
>> question isn't in my mapping, and I've not touched the default mapping.
>> I'm not sure why it would try to parse it as anything other than a string.
>>
>> I'll turn on TRACE logging and see what happens.
>>
>> Brian
>>
>>
>> On Wed, Mar 19, 2014 at 5:35 PM, Costin Leau <[email protected]> wrote:
>>
>>     Hi,
>>
>>     How do you pass the JSON to es-hadoop? Do you have an example? By the
>>     way, you can enable TRACE logging on org.elasticsearch.hadoop and see
>>     everything that es-hadoop does, including the data that goes over the
>>     wire.
>>     My guess is that the conversion of logs to JSON creates some extra
>>     artifacts, which are later interpreted as Writable objects (instead of
>>     raw JSON) by es-hadoop.
>>     Make sure you tell es-hadoop that its source is JSON (by setting
>>     es.input.json to true).
>>     The logs will likely confirm (or not) the above :)
>>
>>     Cheers,
>>
>>
>>     On 3/19/14 11:14 PM, Brian Stempin wrote:
>>
>>         Hi List,
>>         I have an ES cluster that takes in some data from our logs.  We use
>>         Hadoop to parse the individual log entries into JSON strings, which
>>         are then bulk-inserted using ES's output format.  For whatever
>>         reason, ES attempts to parse base64 strings as dates and fails.
>>         Here's a line from one of my Hadoop logs:
>>
>>             java.lang.IllegalStateException: Found unrecoverable error
>>             [Bad Request(400) - MapperParsingException[failed to parse
>>             [csUriParams.d]]; nested: MapperParsingException[failed to parse
>>             date field [REDACTED BASE64 STRING], tried both date format
>>             [dateOptionalTime], and timestamp number with locale []]; nested:
>>             IllegalArgumentException[Invalid format:
>>             "Y2lkPURFJml0ZW1zPWE2NTJjLXgxZT__Fj..."]; ]; Bailing out..
>>                 at org.elasticsearch.hadoop.rest.RestClient.retryFailedEntries(RestClient.java:145)
>>                 at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:120)
>>                 at org.elasticsearch.hadoop.rest.RestRepository.sendBatch(RestRepository.java:147)
>>
>>
>>                  <SNIP>
>>
>>
>>         csUriParams.d does not appear in my mapping, so I never explicitly
>>         asked for it to be treated as a date.
>>
>>         Any idea why ES is trying to treat it as a date?
>>
>>         Thanks,
>>         Brian
>>
>>         --
>>         You received this message because you are subscribed to the Google
>>         Groups "elasticsearch" group.
>>         To unsubscribe from this group and stop receiving emails from it,
>>         send an email to elasticsearch+unsubscribe@googlegroups.com.
>>         To view this discussion on the web visit
>>         https://groups.google.com/d/msgid/elasticsearch/49e5fe0b-cec3-4914-b8d6-99440dd5fb69%40googlegroups.com.
>>         For more options, visit https://groups.google.com/d/optout.
>>
>>
>>     --
>>     Costin
>>
>>
>>
>>
>>
>
> --
> Costin
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CANB1ciBy6jCC8YVT4FPi03g9TgGkt-QhB%2BUQKfWvDioYBnRopQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
