Re: Rivers are reimporting data at each ElasticSearch restart

Stéphane Seng Thu, 26 Jun 2014 01:55:34 -0700

Thanks for your quick reply,

I need some clarifications about what you meant by "delete the river", 
"delete the _river index" and by "this state is useful for flow control".


From what I have understood from your reply, and supposing that I have 
imported data into a "documents" river using the JDBC river:

   - "Delete the river" means "DELETE _river/documents" (and does not mean 
   "DELETE documents"):
      - This does not affect the already imported data.
      - The data is not reimported into ElasticSearch at restart.
      - Everything is fine for our use case.
   - "Delete the _river index" means "DELETE _river":
      - This does not affect the already imported data.
      - The data is not reimported into ElasticSearch at restart.
      - This should not be done because it affects all the rivers at the 
      same time (for the documents river, it is equivalent to doing "DELETE 
      _river/documents").
   - "This state is useful for flow control" means that:
      - The state keeps track of what data is already imported so that the 
      same raw data (left untouched in ElasticSearch) is not reimported 
      multiple times?
      - OR the state keeps a trace of the SQL query so that, in case of an 
      error during a node start/stop, the river can be automatically replayed?
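
To make the three DELETE targets above concrete, here is a minimal sketch 
that only builds the requests and does not send them. The node address 
(localhost:9200) and the helper name delete_request are assumptions for 
illustration; the river and index names follow the "documents" example above.

```python
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumption: default local node address


def delete_request(path: str) -> Request:
    """Build (but do not send) an HTTP DELETE request for the given path."""
    return Request(f"{ES_URL}/{path.lstrip('/')}", method="DELETE")


# "Delete the river": targets only the river named "documents".
delete_river = delete_request("_river/documents")

# "Delete the _river index": targets the data of all rivers at once.
delete_river_index = delete_request("_river")

# Targets the imported data itself (the "documents" index), not the river.
delete_data = delete_request("documents")

for req in (delete_river, delete_river_index, delete_data):
    print(req.get_method(), req.full_url)
```

Passing one of these objects to urllib.request.urlopen would actually send 
it to a running node, so only the first one is something I would expect to 
use in our case.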
      
Thanks again,
Stéphane.

On Wednesday, June 25, 2014 6:08:52 PM UTC+2, Jörg Prante wrote:
>
> Because each river can freely implement the data fetch, ES does not offer 
> river monitoring.
>
> For JDBC river, I implemented some primitive river state query commands 
> that allow polling for river state changes.
>
> Jörg
>
>
> On Wed, Jun 25, 2014 at 6:00 PM, Tanguy Bernard <[email protected]> wrote:
>
>> Hello,
>> This post interested me.
>> Is there a way to know when indexing has finished, so that the 
>> XDELETE _river can be triggered?
>>
>> On Wednesday, June 25, 2014 5:54:01 PM UTC+2, Jörg Prante wrote:
>>>
>>> It is up to the river implementation how the data import is handled.
>>>
>>> The JDBC river, in the "simple" strategy, imports data when the river is 
>>> started, regardless of existing cluster or index. It is possible to 
>>> implement other strategies, for example, a strategy that performs a check 
>>> before indexing.
>>>
>>> River implementations get no support for node start/stop control or 
>>> for how to behave in that case. JDBC river tries to compensate for this 
>>> by persisting a JDBC river specific state. This state is useful for flow 
>>> control.
>>>
>>> If you no longer need the river, you can delete the river with curl 
>>> -XDELETE; this shuts down river instance threads gracefully and releases 
>>> resources.
>>>
>>> If you delete the _river index with curl -XDELETE, you wipe all data 
>>> that is used by rivers. Active river instances are not stopped and are not 
>>> aware of what happened, so this is an unfriendly way to terminate river 
>>> runs; all kinds of river errors may occur.
>>>
>>> Jörg
>>>
>>>
>>>
>>> On Wed, Jun 25, 2014 at 5:38 PM, Stéphane Seng <[email protected]> 
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a question about the fact that, when rivers are used to import 
>>>> data into ElasticSearch, rivers are also reimporting data at each 
>>>> ElasticSearch restart.
>>>>
>>>> In our project, what we are doing is as follows :
>>>>
>>>>    - Raw data is imported into ElasticSearch from a MySQL database 
>>>>    using the JDBC river 
>>>>    (https://github.com/jprante/elasticsearch-river-jdbc);
>>>>    - Some updates are executed directly on the newly imported data in 
>>>>    ElasticSearch using POST requests;
>>>>    - In the end, the final data stored in ElasticSearch is not the 
>>>>    same as the imported raw data.
>>>>    
>>>> The problem we are facing is that when ElasticSearch is restarted, the 
>>>> JDBC river reimports the raw data, thus overwriting the transformations 
>>>> made.
>>>> We suppose that this is an intentional behavior from ElasticSearch 
>>>> rivers.
>>>> One solution to avoid the reimporting of data is to delete the 
>>>> corresponding _river index, which is supposed to store the state of the 
>>>> rivers.
>>>>
>>>> Our questions are as follows :
>>>>
>>>>    - Is the reimporting of data from rivers at each restart a 
>>>>    standard use case? Is it useful for some applications?
>>>>    - What is the point of the _river index state saving?
>>>>       - Is there a way to avoid the reimporting of data without having 
>>>>       to delete the corresponding _river index?
>>>>       - Are there any downsides (for our use case) to deleting the 
>>>>       corresponding _river index?
>>>>       
>>>> Thanks,
>>>> Stéphane.
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>>
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/elasticsearch/a59ade79-e474-466b-bf54-1476a7c506bb%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>
>

