Re: Rivers are reimporting data at each ElasticSearch restart

Tanguy Bernard Wed, 25 Jun 2014 09:00:22 -0700

Hello,
This post interested me.
Is there a way to know when indexing has finished, so that the 
XDELETE on _river can be triggered at that point?


On Wednesday, 25 June 2014 17:54:01 UTC+2, Jörg Prante wrote:
>
> It is up to the river implementation how the data import is handled.
>
> The JDBC river, in the "simple" strategy, imports data when the river is 
> started, regardless of existing cluster or index. It is possible to 
> implement other strategies, for example, a strategy that performs a check 
> before indexing.
>
> River implementations get no support for node start/stop control or for 
> deciding how to behave in those cases. The JDBC river tries to compensate 
> for this by persisting JDBC-river-specific state. This state is useful for 
> flow control.
>
> If you no longer need the river, you can delete it with curl 
> -XDELETE; this shuts down the river instance threads gracefully and releases 
> resources.
>
> If you delete the _river index with curl -XDELETE, you wipe all data that 
> is used by rivers. Active river instances are not stopped and are not aware 
> of what happened, so this is an unfriendly way to terminate river runs; all 
> kinds of river errors may occur.
>
> Jörg
>
>
>
> On Wed, Jun 25, 2014 at 5:38 PM, Stéphane Seng <[email protected]> wrote:
>
>> Hello,
>>
>> I have a question about the fact that, when rivers are used to import 
>> data into ElasticSearch, rivers are also reimporting data at each 
>> ElasticSearch restart.
>>
>> In our project, what we are doing is as follows :
>>
>>    - Raw data is imported into ElasticSearch from a MySQL database using 
>>    the JDBC river (https://github.com/jprante/elasticsearch-river-jdbc); 
>>    - Some updates are executed directly on the newly imported data in 
>>    ElasticSearch using POST requests;
>>    - In the end, the final data stored in ElasticSearch is not the same 
>>    as the imported raw data.
>>    
>> The problem we are facing is that when ElasticSearch is restarted, the 
>> JDBC river reimports the raw data, thus overwriting the transformations 
>> made.
>> We suppose that this is intentional behavior of ElasticSearch rivers.
>> One solution to avoid the reimporting of data is to delete the 
>> corresponding _river index, which is supposed to store the state of the 
>> rivers.
>>
>> Our questions are as follows :
>>
>>    - Is the reimporting of data from rivers at each restart a 
>>    standard use case? Is it useful for some applications?
>>    - What is the point of the _river index state saving? 
>>       - Is there a way to avoid the reimporting of data without having 
>>       to delete the corresponding _river index?
>>       - Are there any downsides (for our use case) to deleting the 
>>       corresponding _river index?
>>       
>> Thanks,
>> Stéphane.
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/a59ade79-e474-466b-bf54-1476a7c506bb%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
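
The two deletions that Jörg distinguishes above are easy to mix up, because they differ only in the URL path. Here is a minimal Python sketch of that difference. It assumes a node at localhost:9200, a hypothetical river named my_jdbc_river, and the usual river URL layout (_river/<name>/). The helper names are made up for illustration, and the requests are only built, not sent.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumption: a local node listening on the default HTTP port.
ES_URL = "http://localhost:9200"


def delete_river_request(river_name, base_url=ES_URL):
    """Build the DELETE request that removes ONE river.

    Roughly: curl -XDELETE 'localhost:9200/_river/<river_name>/'
    """
    if not river_name:
        # An empty name would no longer point at a single river.
        raise ValueError("river_name must not be empty")
    url = "{}/_river/{}/".format(base_url, quote(river_name, safe=""))
    return Request(url, method="DELETE")


def delete_river_index_request(base_url=ES_URL):
    """Build the DELETE request that wipes the WHOLE _river index.

    Roughly: curl -XDELETE 'localhost:9200/_river'
    """
    return Request("{}/_river".format(base_url), method="DELETE")


if __name__ == "__main__":
    # "my_jdbc_river" is a hypothetical river name.
    for req in (delete_river_request("my_jdbc_river"),
                delete_river_index_request()):
        print(req.get_method(), req.full_url)
    # To actually send one of them: urllib.request.urlopen(req)
```

Only the first request corresponds to the graceful shutdown described above. The second one removes the state of every river while the river instances keep running.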
