What you could do then is reduce the bulk size to, say, 100 documents, turn on logging, and run your data through.
That way you can catch the 'special' document in the act.
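
For reference, a rough sketch of the settings involved. It assumes that es.batch.size.entries is the es-hadoop property that caps the number of documents per bulk request (please check the configuration reference for your version). The class name and the plain map are only illustrative; in the job itself these pairs would be set on the Hadoop Configuration. The logging line in the comment assumes a log4j properties file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: collects the es-hadoop settings discussed in this thread.
// In a real job each pair would be applied with conf.set(key, value).
final class DebugSettingsSketch {

    static Map<String, String> debugSettings() {
        Map<String, String> settings = new LinkedHashMap<>();
        // Assumed property name: limits each bulk request to 100 documents,
        // so a failing batch is small enough to inspect by hand.
        settings.put("es.batch.size.entries", "100");
        // The job writes pre-built JSON strings, so es-hadoop must not re-serialize them.
        settings.put("es.input.json", "true");
        return settings;
    }

    public static void main(String[] args) {
        // TRACE logging is configured separately, e.g. in log4j.properties (assumption):
        //   log4j.logger.org.elasticsearch.hadoop=TRACE
        for (Map.Entry<String, String> e : debugSettings().entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

With small batches and TRACE enabled, the request that triggers the 400 should show up in the task log next to the document that caused it.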

As for expectations: Elasticsearch tries to guess a field's type by looking at its value. It seems a value of this field was taken for a date, hence the error on the base64 entry. You can avoid this by defining the field in your mapping (either directly or through a template) so that it always gets mapped to a string.
As a rule of thumb, whenever you want full control over the index, mapping is the way to do it.
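
A minimal sketch of such a mapping, written as a small Java helper that builds the index-creation body. The type name "logEntry" is hypothetical, and the "string" / "not_analyzed" syntax is an assumption that fits Elasticsearch releases from around the time of this thread; later versions use different type names. The date_detection flag switches off date guessing for new string fields, which may be useful on its own if arbitrary query parameters keep showing up.

```java
// Sketch only: builds a mapping body that pins csUriParams.d to a plain string.
// "logEntry" is a made-up type name; the field path comes from the error in this thread.
final class MappingSketch {

    static String mappingBody() {
        return "{\n"
             + "  \"mappings\": {\n"
             + "    \"logEntry\": {\n"
             + "      \"date_detection\": false,\n"
             + "      \"properties\": {\n"
             + "        \"csUriParams\": {\n"
             + "          \"properties\": {\n"
             + "            \"d\": { \"type\": \"string\", \"index\": \"not_analyzed\" }\n"
             + "          }\n"
             + "        }\n"
             + "      }\n"
             + "    }\n"
             + "  }\n"
             + "}";
    }

    public static void main(String[] args) {
        // Print the body so it can be sent when the index is created.
        System.out.println(mappingBody());
    }
}
```

The mapping has to be in place before the first document with that field is indexed, since the type of an existing field cannot simply be switched afterwards.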


On 3/20/14 6:10 PM, Brian Stempin wrote:
I have unit tests for this MR job, and they show that the JSON output is a string as I'd expect, so Gson is most likely not the cause.

I'm hesitant to show more code (it's owned by my workplace), but I can describe it a little further:

  * The mapper gets a W3C log entry
  * The log entry is broken into its components and put into document X
  * The request URL is then taken and broken down into its query parameters, and the key-value pairs are put into document Y
  * Some elements are then explicitly filtered from X and Y
  * Those two documents are placed inside of document Z, which is ultimately what is serialized and sent to ES
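
To make the query-parameter step concrete, here is a rough sketch of how a request's query string might be turned into the key-value pairs of document Y. This is not the actual job code; the class and method names are invented, and it assumes UTF-8 form-style encoding. Every value stays a String, so an unexpected parameter (such as a base64 blob sent under the wrong name) passes straight through to the output.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: splits "a=1&b=2" into an ordered map of decoded Strings.
final class QueryParamsSketch {

    static Map<String, String> parseQuery(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return params;
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate "a=1&&b=2"
            }
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(key), decode(value));
        }
        return params;
    }

    private static String decode(String s) {
        try {
            // Note: form decoding turns '+' into a space, which can alter base64 values.
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        System.out.println(parseQuery("q=shoes&d=Y2lkPURF"));
    }
}
```

Since nothing here inspects the values, any type decision for a parameter that is not in the mapping is left entirely to Elasticsearch.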

We do have a base64-encoded param that we expect and handle appropriately. In this case, someone most likely sent it under the wrong param name, which is why it's making its way into document Y without further processing. Since it's being sent under a name that's not listed in the mapping, I expect it to just be treated as a string.

The only reason that I chose to go the Gson route vs building MapWritables is that building MapWritables is terribly verbose. Also, it comes with the overhead of having to wrap each String with a Text type, which just seems silly. Using the built-in JSON serializer is just not convenient in this case.

Brian


On Thu, Mar 20, 2014 at 11:18 AM, Costin Leau <[email protected]> wrote:

    My guess is that Gson adds the said field in its result. The base64 suggests that there's some binary data in the mix.

    By the way, can you show more of your code? Is there any reason why you create the JSON yourself rather than just pass
    logEntryMap to es-hadoop?
    It can create the JSON for you, which is what I recommend; unless you already have the JSON in HDFS, it's best to rely on
    es-hadoop to do it instead of an external tool.

    Cheers,


    On 3/20/14 4:48 PM, Brian Stempin wrote:

        Hi,
        All I'm doing is building a map and passing that to Gson for 
serialization.  A snippet from my map method:

        logEntryMap.put("cs(User-Agent)", values[9]);
        context.write(NullWritable.get(), new Text(gson.toJson(logEntryMap)));

        values[] is a String array. Everything that goes into the map that gets serialized is a string.

        I do have es.input.json set to true. This failure doesn't occur until >100,000,000 records are in the index, so it's
        happening late in the load process. The part that I find strange is that the field in question isn't in my mapping, and
        I've not touched the default mapping. I'm not sure why it would try to parse it as anything other than a string.

        I'll turn on TRACE logging and see what happens.

        Brian


        On Wed, Mar 19, 2014 at 5:35 PM, Costin Leau <[email protected]> wrote:

             Hi,

             How do you pass the JSON to es-hadoop? Do you have an example? By the way, you can enable TRACE logging on
             org.elasticsearch.hadoop and see everything that es-hadoop does, including the data that goes over the wire.
             My guess is that the conversion of logs to JSON creates some extra artifacts which are later interpreted as
             a Writable object (instead of raw JSON) by es-hadoop.
             Make sure you tell es-hadoop that its source is JSON (by setting es.input.json to true).
             The logs will likely confirm (or not) the above :)

             Cheers,


             On 3/19/14 11:14 PM, Brian Stempin wrote:

                 Hi List,
                 I have an ES cluster that takes in some data from our logs. We use Hadoop to parse the individual log
                 entries into JSON strings and then bulk insert them using ES's output format. For whatever reason, ES
                 attempts to parse base64 strings as dates and fails. Here's a line from one of my Hadoop logs:

                      java.lang.IllegalStateException: Found unrecoverable error [Bad Request(400) -
                      MapperParsingException[failed to parse [csUriParams.d]]; nested: MapperParsingException[failed to parse date
                      field [REDACTED BASE64 STRING], tried both date format [dateOptionalTime], and timestamp number with locale []];
                      nested: IllegalArgumentException[Invalid format: "Y2lkPURFJml0ZW1zPWE2NTJjLXgxZTFj..."]; ]; Bailing out..
                          at org.elasticsearch.hadoop.rest.RestClient.retryFailedEntries(RestClient.java:145)
                          at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:120)
                          at org.elasticsearch.hadoop.rest.RestRepository.sendBatch(RestRepository.java:147)


                          <SNIP>


                 csUriParams.d does not appear in my mapping, so I never 
explicitly asked for it to be treated as a date.

                 Any idea why ES is trying to treat it as a date?

                 Thanks,
                 Brian

                 --
                 You received this message because you are subscribed to the Google Groups "elasticsearch" group.
                 To unsubscribe from this group and stop receiving emails from it, send an email to
                 [email protected].
                 To view this discussion on the web visit
                 https://groups.google.com/d/msgid/elasticsearch/49e5fe0b-cec3-4914-b8d6-99440dd5fb69%40googlegroups.com.
                 For more options, visit https://groups.google.com/d/optout.


             --
             Costin




    --
    Costin


--
Costin
