Re: ES Hadoop--Index only new documents without killing job from exceptions?

James Campbell Mon, 07 Jul 2014 08:27:09 -0700

Thanks, Costin.  That makes sense; I've also commented on the issue you
mentioned on GitHub.


Having more control over when to fail a job or ignore certain
errors would definitely be a great feature from my perspective. I've
encountered a few different areas where I think extra control would be
valuable:

(1) Ability to fail on indexing failures (that persist despite the retry
policy). Currently, bulk requests that still fail after multiple retries
are reported only via a counter. Since job control programs such as Oozie
don't make it easy to fail a workflow based on a counter, I think it makes
more sense to be able to fail a job in which batches failed completely;
otherwise the documents may never become searchable in Elasticsearch.

(2) DocumentAlreadyExists exceptions with the "create" write mode. Given
the batch nature of Hadoop, there are cases (e.g. building autocomplete)
where it may make sense to update an index only with new data. To avoid the
cost of reindexing, it would be nice to have a job succeed even if ES
throws a DocumentAlreadyExists exception, so we can just send data over to
ES, let it check whether each document exists, and ignore the request if
it does.

(3) Malformed/bad data. Despite (2) above, it would be ideal to still be
able to throw errors and fail a job in the case of invalid data,
particularly genuinely invalid JSON (such as unescaped special characters
that may have occurred in data that is being batch processed from a binary
container format in HDFS).
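
A minimal sketch of the policy described in (1) through (3), written against
the per-item "status" field of an Elasticsearch bulk response. This is not
ES-Hadoop's API: the function names, the IGNORABLE_STATUSES set, the index
name, and the error strings in the sample are all hypothetical, and the only
assumption about ES is that an already-existing document under "create" is
reported with status 409.

```python
import json

# Assumption: statuses the job may safely skip. 409 is the conflict
# reported when "create" hits a document that already exists.
IGNORABLE_STATUSES = {409}


def classify_bulk_response(body):
    """Split bulk item results into succeeded, ignored, and fatal lists."""
    succeeded, ignored, fatal = [], [], []
    for item in json.loads(body).get("items", []):
        # Each item is a one-key dict such as {"create": {...}}.
        _operation, result = next(iter(item.items()))
        status = result.get("status", 0)
        if 200 <= status < 300:
            succeeded.append(result.get("_id"))
        elif status in IGNORABLE_STATUSES:
            ignored.append(result.get("_id"))
        else:
            # Bad data, or failures that persisted after retries.
            fatal.append((result.get("_id"), status, result.get("error")))
    return succeeded, ignored, fatal


def should_fail_job(fatal):
    """Fail the job if any item failed for a non-ignorable reason."""
    return len(fatal) > 0


# Hypothetical bulk response: one new doc, one duplicate, one bad doc.
sample = json.dumps({
    "took": 3,
    "errors": True,
    "items": [
        {"create": {"_index": "autocomplete", "_id": "1", "status": 201}},
        {"create": {"_index": "autocomplete", "_id": "2", "status": 409,
                    "error": "DocumentAlreadyExistsException"}},
        {"create": {"_index": "autocomplete", "_id": "3", "status": 400,
                    "error": "MapperParsingException"}},
    ],
})
succeeded, ignored, fatal = classify_bulk_response(sample)
```

With this split, duplicates from (2) are dropped quietly, while the cases in
(1) and (3) land in the fatal list and can be turned into a job failure
instead of a counter.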


On Sun, Jul 6, 2014 at 4:38 PM, Costin Leau <[email protected]> wrote:

> I would recommend indexing the document since it's a 'cheap' operation per
> document and it covers the potential differences between the docs. Also,
> from a performance POV you are not going to lose much, since you are
> sending the doc to ES anyway, which does the hashing and returns the error
> to the user. So the only thing you save, and _might_ potentially notice, is
> the actual indexing, which should become a problem only when dealing with
> large amounts of docs.
>
> That being said, there's already an issue open [1] for
> trapping/handling errors during a job (to prevent it from being cancelled)
> which can potentially be used for such a purpose as well. Feel free to add
> your comments to it.
>
> [1] https://github.com/elasticsearch/elasticsearch-hadoop/issues/160
>
>
> On Thu, Jul 3, 2014 at 8:49 PM, James Campbell <[email protected]> wrote:
>
>> Hi ES-Hadoop users--
>>
>> I have a large list of simple documents that I would like to index for an
>> autocomplete feature. At batch processing time, I do not know which values
>> are new (never seen before) and which are not (some other part of the
>> update process changed, but the autocomplete-relevant portion of the
>> document did not).
>>
>> I believe I could simply write all of the documents to the index whenever
>> I run a new batch with the default es.write.operation=index, but that will
>> cause ES to reindex the document each time even if it wasn't updated.
>>
>> On the other hand, if I choose to use es.write.operation=create, then any
>> existing documents will cause the job to fail.
>>
>> Is there a way to combine those behaviors, so that I can allow
>> elasticsearch to simply ignore requests to reindex existing documents
>> (based on _id) but not to throw an exception that kills the entire job?
>>
>> James Campbell
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>>
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/2e5b93ef-0c42-4068-bc2c-33e4efbe429b%40googlegroups.com
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
