Re: Retries in write-behind store

2018-08-29 Thread Gaurav Bajaj
In addition to that, how about generating an event when updates fail? It
could be listened to, so that custom logic can be added to handle the
failures.

On Wed, Aug 29, 2018 at 6:56 AM, Denis Magda  wrote:

> Val,
>
> Sounds like a handy configuration option. I would allow setting a number of
> retries. If the number is set to 0 then a failed update is discarded right
> away.
>
> --
> Denis
>
> On Tue, Aug 28, 2018 at 9:14 PM Valentin Kulichenko <
> valentin.kuliche...@gmail.com> wrote:
>
> > Folks,
> >
> > Is there a way to limit or disable retries of failed updates in the
> > write-behind store? I can't find one. It looks like if an update fails, it
> > is moved to the end of the queue and then eventually retried. If it fails
> > again, the process is repeated.
> >
> > Such behavior *might* be OK if failures are caused by the database being
> > temporarily unavailable. But what if an update fails deterministically, for
> > example due to a constraint violation? There is absolutely no reason to
> > retry it, and at the same time it can cause stability and performance
> > issues when the buffer fills up with such "broken" updates.
> >
> > Does it make sense to add an option that would limit the number of
> > retries (or even disable them)?
> >
> > -Val
> >
>
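The options proposed in this thread (a retry limit where 0 means "discard right away", plus an event when an update is finally given up on) can be sketched as a bounded retry loop. This is a plain-Java model under stated assumptions: `BoundedRetryWriteBehind`, `maxRetries` and the `onDiscard` listener are hypothetical names, not Ignite configuration or API, and the database write is stood in for by a predicate.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.BiConsumer;
import java.util.function.Predicate;

/** Hypothetical write-behind queue with a bounded retry count (a model, not Ignite's API). */
public class BoundedRetryWriteBehind<V> {
    /** A pending update plus the number of times writing it has failed. */
    private static final class Pending<T> {
        final T value;
        int failures;

        Pending(T value) { this.value = value; }
    }

    private final Deque<Pending<V>> queue = new ArrayDeque<>();
    private final int maxRetries;                   // 0 = discard after the first failure
    private final Predicate<V> store;               // true if the database write succeeded
    private final BiConsumer<V, Integer> onDiscard; // hypothetical "update failed for good" listener

    public BoundedRetryWriteBehind(int maxRetries, Predicate<V> store,
                                   BiConsumer<V, Integer> onDiscard) {
        if (maxRetries < 0) throw new IllegalArgumentException("maxRetries must be >= 0");
        this.maxRetries = maxRetries;
        this.store = store;
        this.onDiscard = onDiscard;
    }

    public void enqueue(V value) { queue.addLast(new Pending<>(value)); }

    public int pending() { return queue.size(); }

    /** Tries every currently queued update once. */
    public void flushOnce() {
        int n = queue.size();
        for (int i = 0; i < n; i++) {
            Pending<V> p = queue.pollFirst();
            if (store.test(p.value)) continue;          // written, drop it from the queue
            p.failures++;
            if (p.failures > maxRetries) {
                onDiscard.accept(p.value, p.failures);  // notify the listener, then drop it
            } else {
                queue.addLast(p);                       // re-queue at the tail, as described above
            }
        }
    }

    public static void main(String[] args) {
        // Values containing "bad" stand in for deterministic failures (e.g. a constraint violation).
        BoundedRetryWriteBehind<String> wb = new BoundedRetryWriteBehind<>(2,
                v -> !v.contains("bad"),
                (v, attempts) -> System.out.println("discarded " + v + " after " + attempts + " attempts"));
        wb.enqueue("ok-1");
        wb.enqueue("bad-2");
        for (int round = 0; round < 3; round++) wb.flushOnce();
        System.out.println("pending=" + wb.pending());
    }
}
```

With `maxRetries = 2`, the "bad" update is attempted three times in total and then handed to the listener, while the healthy one is written on the first pass. Counting retries after the first attempt keeps the `0` setting meaning "no retries at all".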


Re: Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))

2018-03-17 Thread Gaurav Bajaj
1. Data piece size (like event or entity size in bytes)

-> 1 KB

2. What is your write rate (like entities per second)

-> 8K/sec

3. How do you evict (delete) data from the cache

-> We don't evict/delete.

4. How many caches (differ by Ignite cache name) do you have

-> 3 Caches

5. What kind of storage do you have (network, HDD, SSD, etc.)

-> SSD

6. If you can provide a solid reproducer, I'd like to investigate it.

-> We read files containing the data and stream them into the caches using the
Ignite data streamer. We are not sure at this time about the steps to reproduce
this consistently.
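Taken together, answers 1 and 2 put the write bandwidth at roughly 8 MB/s, and the ~500k records after which the streamer failed (mentioned in the 16 March message quoted below) would be reached in about a minute. A small calculation of those figures, assuming the 8K/sec rate is sustained and using decimal units (1 MB = 1000 KB):

```java
/** Back-of-the-envelope ingest figures for the workload described in the answers above. */
public class IngestEstimate {
    /** Write bandwidth in KB/s for a given entry size and rate. */
    static double kbPerSecond(double entryKb, double entriesPerSecond) {
        return entryKb * entriesPerSecond;
    }

    /** Seconds needed to stream a given number of records at a constant rate. */
    static double secondsFor(double records, double entriesPerSecond) {
        return records / entriesPerSecond;
    }

    public static void main(String[] args) {
        double entryKb = 1.0;      // answer 1: ~1 KB per entry
        double rate = 8_000;       // answer 2: ~8K entries per second (assumed sustained)
        double records = 500_000;  // the streamer failed after ~500k records

        System.out.println("ingest KB/s: " + kbPerSecond(entryKb, rate));             // 8000.0
        System.out.println("seconds to 500k records: " + secondsFor(records, rate)); // 62.5
        System.out.println("KB written by then: " + records * entryKb);              // 500000.0
    }
}
```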

On 17-Mar-2018 7:36 AM, "Arseny Kovalchuk" <arseny.kovalc...@synesis.ru>
wrote:

Hi Gaurav.

Could you please share some details about your environment?
1. Data piece size (like event or entity size in bytes)
2. What is your write rate (like entities per second)
3. How do you evict (delete) data from the cache
4. How many caches (differ by Ignite cache name) do you have
5. What kind of storage do you have (network, HDD, SSD, etc.)
6. If you can provide a solid reproducer, I'd like to investigate it.

Sincerely

Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16
LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>

On 16 March 2018 at 22:40, Gaurav Bajaj <gauravhba...@gmail.com> wrote:

> Hi,
>
> We also got the exact same error. Our setup doesn't use Kubernetes. We are
> using the Ignite data streamer to put data into caches. After streaming around
> 500k records, the streamer failed with the exception mentioned in the original email.
>
> Thanks,
> Gaurav
>
> On 16-Mar-2018 4:44 PM, "Arseny Kovalchuk" <arseny.kovalc...@synesis.ru>
> wrote:
>
>> Hi Dmitry.
>>
>> Thanks for your attention to this issue.
>>
>> I changed the repository to jcenter and set the Ignite version to 2.4.
>> Unfortunately, the reproducer starts with the same error message in the log
>> (see attached).
>>
>> I cannot say whether the behavior of the whole cluster will change on 2.4,
>> i.e. whether the cluster can start on corrupted data on 2.4, because we
>> wiped the data and restarted the cluster where the problem occurred.
>> We'll move to 2.4 next week and continue testing our software. We are
>> moving forward to production in April/May, and it would be good to get
>> some clue about how to deal with such a situation with the data in the future.
>>
>>
>>
>> Arseny Kovalchuk
>>
>> On 16 March 2018 at 17:03, Dmitry Pavlov <dpavlov@gmail.com> wrote:
>>
>>> Hi Arseny,
>>>
>>> I've noticed in the reproducer:
>>> ignite_version=2.3.0
>>>
>>> Could you check whether it is reproducible on our latest release, 2.4.0?
>>>
>>> I'm not sure about the ticket number, but it is quite possible the issue is
>>> already fixed.
>>>
>>> Sincerely,
>>> Dmitriy Pavlov
>>>
>>> On Thu, 15 Mar 2018 at 19:34, Dmitry Pavlov <dpavlov@gmail.com> wrote:
>>>
>>>> Hi Alexey,
>>>>
>>>> This may be a serious issue. Could you recommend an expert here who can
>>>> pick this up?
>>>>
>>>> Sincerely,
>>>> Dmitriy Pavlov
>>>>
>>>> On Thu, 15 Mar 2018 at 19:25, Arseny Kovalchuk <
>>>> arseny.kovalc...@synesis.ru> wrote:
>>>>
>>>>> Hi, guys.
>>>>>
>>>>> I've got a reproducer for a problem which is generally reported as
>>>>> "Caused by: java.lang.IllegalStateException: Failed to get page IO
>>>>> instance (page content is corrupted)". Strictly speaking, it reproduces
>>>>> the result, not the cause: I have no idea how the data got corrupted, but
>>>>> the cluster node refuses to start with this data.
>>>>>
>>>>> We got the issue again when some of the server nodes were restarted
>>>>> several times by Kubernetes. I suspect that the data got corrupted during
>>>>> such restarts. But the main thing we really want is that the cluster
>>>>> DOESN'T HANG during the next restart even if the data is corrupted!
>>>>> Anyway, there is no tool that can help to correct such data, and as a
>>>>> result we wipe all data manually to start the cluster. So, having
>>>>> warnings abo

Re: Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))

2018-03-16 Thread Gaurav Bajaj
Hi,

We also got the exact same error. Our setup doesn't use Kubernetes. We are
using the Ignite data streamer to put data into caches. After streaming around
500k records, the streamer failed with the exception mentioned in the original email.

Thanks,
Gaurav

On 16-Mar-2018 4:44 PM, "Arseny Kovalchuk" 
wrote:

> Hi Dmitry.
>
> Thanks for your attention to this issue.
>
> I changed the repository to jcenter and set the Ignite version to 2.4.
> Unfortunately, the reproducer starts with the same error message in the log
> (see attached).
>
> I cannot say whether the behavior of the whole cluster will change on 2.4,
> i.e. whether the cluster can start on corrupted data on 2.4, because we
> wiped the data and restarted the cluster where the problem occurred.
> We'll move to 2.4 next week and continue testing our software. We are
> moving forward to production in April/May, and it would be good to get
> some clue about how to deal with such a situation with the data in the future.
>
>
>
> Arseny Kovalchuk
>
> On 16 March 2018 at 17:03, Dmitry Pavlov  wrote:
>
>> Hi Arseny,
>>
>> I've noticed in the reproducer:
>> ignite_version=2.3.0
>>
>> Could you check whether it is reproducible on our latest release, 2.4.0?
>>
>> I'm not sure about the ticket number, but it is quite possible the issue is
>> already fixed.
>>
>> Sincerely,
>> Dmitriy Pavlov
>>
>> On Thu, 15 Mar 2018 at 19:34, Dmitry Pavlov wrote:
>>
>>> Hi Alexey,
>>>
>>> This may be a serious issue. Could you recommend an expert here who can
>>> pick this up?
>>>
>>> Sincerely,
>>> Dmitriy Pavlov
>>>
>>> On Thu, 15 Mar 2018 at 19:25, Arseny Kovalchuk <
>>> arseny.kovalc...@synesis.ru> wrote:
>>>
 Hi, guys.

 I've got a reproducer for a problem which is generally reported as
 "Caused by: java.lang.IllegalStateException: Failed to get page IO
 instance (page content is corrupted)". Strictly speaking, it reproduces the
 result, not the cause: I have no idea how the data got corrupted, but the
 cluster node refuses to start with this data.

 We got the issue again when some of the server nodes were restarted several
 times by Kubernetes. I suspect that the data got corrupted during such
 restarts. But the main thing we really want is that the cluster DOESN'T HANG
 during the next restart even if the data is corrupted! Anyway, there is no
 tool that can help to correct such data, and as a result we wipe all data
 manually to start the cluster. So the expected behavior is warnings about
 corrupted data in the logs and a cluster that just keeps working.

 How to reproduce:
 1. Download the data from https://storage.googleapis.com/pub-data-0/data5.tar.gz (~200 MB).
 2. Download and import the Gradle project from https://storage.googleapis.com/pub-data-0/project.tar.gz (~100 KB).
 3. Unpack the data into your home folder, say /home/user1. You should get
 a path like */home/user1/data5*. Inside data5 you should have
 binary_meta, db and marshaller.
 4. Open *src/main/resources/data-test.xml* and put the absolute path
 of the unpacked data into the *workDirectory* property of the *igniteCfg5*
 bean. In this example it should be */home/user1/data5*. Do not
 edit consistentId! The consistentId is ignite-instance-5, so the real data
 is in the data5/db/ignite_instance_5 folder.
 5. Start the application from ru.synesis.kipod.DataTestBootApp.
 6. Enjoy
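Steps 3 and 4 hinge on the unpacked directory having the right shape. A stdlib-only pre-flight check that could be run before step 5, assuming exactly the layout described above (binary_meta, db and marshaller under data5, node files under db/ignite_instance_5). `DataDirCheck` is a hypothetical helper, not part of the reproducer project, and its consistentId-to-folder rule (every character other than letters, digits and '_' becomes '_') is an assumption that merely reproduces the ignite-instance-5 to ignite_instance_5 example from step 4, not Ignite's documented algorithm.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-flight check for the unpacked reproducer data (not part of the project). */
public class DataDirCheck {
    /** Assumed consistentId-to-folder mapping: anything but [A-Za-z0-9_] becomes '_'. */
    static String folderFor(String consistentId) {
        return consistentId.replaceAll("[^A-Za-z0-9_]", "_");
    }

    /** Lists the expected sub-directories that are missing under workDirectory. */
    static List<String> missing(Path workDirectory, String consistentId) {
        List<String> expected = Arrays.asList(
                "binary_meta", "marshaller", "db", "db/" + folderFor(consistentId));
        List<String> absent = new ArrayList<>();
        for (String rel : expected) {
            if (!Files.isDirectory(workDirectory.resolve(rel))) absent.add(rel);
        }
        return absent;
    }

    public static void main(String[] args) {
        // Usage: java DataDirCheck /home/user1/data5 ignite-instance-5
        Path dir = Paths.get(args.length > 0 ? args[0] : "/home/user1/data5");
        String id = args.length > 1 ? args[1] : "ignite-instance-5";
        List<String> absent = missing(dir, id);
        System.out.println(absent.isEmpty()
                ? "layout looks OK, use " + dir.toAbsolutePath() + " as workDirectory"
                : "missing under " + dir + ": " + absent);
    }
}
```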

 Hope it helps.

 Arseny Kovalchuk

 On 26 December 2017 at 21:15, Denis Magda  wrote:

> Cross-posting to the dev list.
>
> Ignite persistence maintainers please chime in.
>
> —
> Denis
>
 On Dec 26, 2017, at 2:17 AM, Arseny Kovalchuk <
> arseny.kovalc...@synesis.ru> wrote:
>
> Hi guys.
>
> Another issue when using Ignite 2.3 with native persistence enabled.
> See details below.
>
> We deploy Ignite along with our services in Kubernetes (v1.8) on
> premises. The Ignite cluster is a StatefulSet of 5 Pods (5 instances) of
> Ignite version 2.3. Each Pod mounts a PersistentVolume backed by CEPH RBD.
>
> We put about 230 events/second into Ignite; 70% of the events are ~200KB
> in size and 30% are 5000KB. The smaller events have indexed fields, and we
> query them via SQL.
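The size mix above implies a heavy write load. A quick calculation, treating KB and MB as decimal units and assuming the two size classes are exact: the mean event is 0.7 × 200 + 0.3 × 5000 = 1640 KB, so 230 events/s comes to roughly 377 MB/s, almost all of it (1500 of the 1640 KB) from the 30% of large events.

```java
/** Weighted average event size and aggregate ingest rate for the workload described above. */
public class WorkloadEstimate {
    /** Weighted mean size; fractions are assumed to sum to 1. */
    static double meanKb(double[] fractions, double[] sizesKb) {
        double mean = 0;
        for (int i = 0; i < fractions.length; i++) {
            mean += fractions[i] * sizesKb[i];
        }
        return mean;
    }

    public static void main(String[] args) {
        double eventsPerSecond = 230;
        double[] fractions = {0.7, 0.3};
        double[] sizesKb = {200, 5000};

        double mean = meanKb(fractions, sizesKb);     // 0.7*200 + 0.3*5000 = 1640 KB
        double kbPerSecond = mean * eventsPerSecond;  // 1640 * 230 = 377,200 KB/s
        System.out.printf("mean event ~%.0f KB, ingest ~%.0f MB/s%n", mean, kbPerSecond / 1000);
    }
}
```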
>
> The cluster is activated from a client node, which also streams events
> into Ignite from Kafka. We use a custom streamer implementation that
> uses the cache.putAll() API.
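A minimal sketch of a putAll-based streamer like the one described above, assuming it buffers entries and flushes them in fixed-size batches. `PutAllStreamer`, `batchSize` and the flush policy are illustrative, not the implementation used here. Keys are kept in a `TreeMap` so every batch touches keys in the same order, which is commonly recommended for bulk `putAll` on transactional caches; in a real client the sink could be `cache::putAll` on an `IgniteCache`.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

/** Hypothetical batching streamer: buffers entries and flushes them via a putAll-style sink. */
public class PutAllStreamer<K extends Comparable<K>, V> implements AutoCloseable {
    private final int batchSize;
    private final Consumer<Map<K, V>> putAll;        // e.g. cache::putAll in a real deployment
    private TreeMap<K, V> buffer = new TreeMap<>();  // sorted keys => consistent lock order

    public PutAllStreamer(int batchSize, Consumer<Map<K, V>> putAll) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        this.batchSize = batchSize;
        this.putAll = putAll;
    }

    /** Buffers one entry and flushes once the batch is full. */
    public void add(K key, V value) {
        buffer.put(key, value);
        if (buffer.size() >= batchSize) flush();
    }

    /** Sends the buffered entries as one bulk call and starts a new buffer. */
    public void flush() {
        if (buffer.isEmpty()) return;
        TreeMap<K, V> batch = buffer;
        buffer = new TreeMap<>();
        putAll.accept(batch);
    }

    /** Flushes whatever is left, so no tail entries are lost on shutdown. */
    @Override
    public void close() { flush(); }

    public static void main(String[] args) {
        try (PutAllStreamer<Integer, String> s = new PutAllStreamer<>(3,
                batch -> System.out.println("putAll " + batch.keySet()))) {
            for (int i = 5; i >= 1; i--) s.add(i, "event-" + i);
        }
    }
}
```

Running `main` prints `putAll [3, 4, 5]` followed by `putAll [1, 2]`: the first batch is flushed when it reaches three entries, the remainder when the streamer is closed.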
>
> We 

Re: Partition loss policy - how to use?

2018-03-14 Thread Gaurav Bajaj
Alexey,

Thanks. I wonder, why not the Web Console?

Thanks,
Gaurav

On 14-Mar-2018 1:28 AM, "Alexey Kuznetsov" <akuznet...@apache.org> wrote:

> Gaurav,
>
> I think it makes sense to add this to the tools.
> Created issue: https://issues.apache.org/jira/browse/IGNITE-7940
>
> On Wed, Mar 14, 2018 at 1:44 AM, Gaurav Bajaj <gauravhba...@gmail.com>
> wrote:
>
> > Hi Denis,
> > Thanks. The document certainly looks useful. Do we have a ticket for an
> > improvement in Web Console/Visor for invoking resetLostPartitions()?
> >
> >
> > Regards,
> > Gaurav
> >
> > On 13-Mar-2018 7:42 PM, "Denis Magda" <dma...@apache.org> wrote:
> >
> > For those interested, here is a doc we put together on the partition loss
> > policies, which covers the extra improvements released in 2.4:
> > https://apacheignite.readme.io/v2.4/docs/partition-loss-policies
> >
> > --
> > Denis
> >
> > On Tue, Mar 6, 2018 at 11:19 AM, Denis Magda <dma...@apache.org> wrote:
> >
> > > Hi,
> > >
> > > Here is the documentation we prepared for the 2.4 release:
> > > https://apacheignite.readme.io/v2.3/docs/cache-modes-24#partition-loss-policies
> > >
> > > It's hidden for now and will become visible to everyone once the Ignite
> > > 2.4 vote passes (in progress).
> > >
> > > --
> > > Denis
> > >
> > > On Tue, Mar 6, 2018 at 6:59 AM, gauravhb <gauravhba...@gmail.com>
> > > wrote:
> > >
> > >> Hi,
> > >>
> > >> Is there any update on this topic?
> > >> Have any tickets been created for the points mentioned by Valentin?
> > >>
> > >> Thanks.
> > >>
> > >>
> > >>
> > >> --
> > >> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
> > >>
> > >
> > >
> >
>
>
>
> --
> Alexey Kuznetsov
>
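Which operations remain possible after partitions are lost is exactly what the policies decide, and `resetLostPartitions()` is what clears the lost state once the data is back. A small plain-Java model of that, under assumptions: the policy names mirror Ignite's `PartitionLossPolicy`, but the read/write table is a simplified reading of the 2.4 documentation linked above, the modulo "affinity" stands in for Ignite's real affinity function, and `LossPolicyModel` is not Ignite API (the real call is `Ignite.resetLostPartitions(Collection<String>)`).

```java
import java.util.HashSet;
import java.util.Set;

/** Simplified model of partition loss policies (not Ignite API). */
public class LossPolicyModel {
    /** Names mirror Ignite's PartitionLossPolicy; the permission flags are a simplification. */
    enum Policy {
        READ_ONLY_SAFE(false, false, false),
        READ_ONLY_ALL(true, false, false),
        READ_WRITE_SAFE(false, true, false),
        READ_WRITE_ALL(true, true, true),
        IGNORE(true, true, true);

        final boolean readLost, writeAny, writeLost;

        Policy(boolean readLost, boolean writeAny, boolean writeLost) {
            this.readLost = readLost;
            this.writeAny = writeAny;
            this.writeLost = writeLost;
        }

        boolean canRead(boolean lost) { return !lost || readLost; }

        boolean canWrite(boolean lost) { return writeAny && (!lost || writeLost); }
    }

    private final Policy policy;
    private final int partitions;
    private final Set<Integer> lost = new HashSet<>();

    LossPolicyModel(Policy policy, int partitions) {
        this.policy = policy;
        this.partitions = partitions;
    }

    /** Stand-in affinity: key hash modulo partition count (Ignite's real function differs). */
    int partition(Object key) { return Math.floorMod(key.hashCode(), partitions); }

    void markLost(int partition) { lost.add(partition); }

    boolean canRead(Object key) { return policy.canRead(lost.contains(partition(key))); }

    boolean canWrite(Object key) { return policy.canWrite(lost.contains(partition(key))); }

    /** Conceptual counterpart of resetLostPartitions(): clears the lost state once data is restored. */
    void resetLostPartitions() { lost.clear(); }

    public static void main(String[] args) {
        LossPolicyModel cache = new LossPolicyModel(Policy.READ_WRITE_SAFE, 8);
        int key = 3;  // Integer.hashCode() is the value itself, so this is partition 3
        cache.markLost(cache.partition(key));
        System.out.println("read=" + cache.canRead(key) + " write=" + cache.canWrite(key));
        cache.resetLostPartitions();
        System.out.println("after reset: read=" + cache.canRead(key) + " write=" + cache.canWrite(key));
    }
}
```

Under `READ_WRITE_SAFE`, both reads and writes to key 3 are rejected while its partition is marked lost, and both are allowed again after the reset.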


Re: Partition loss policy - how to use?

2018-03-13 Thread Gaurav Bajaj
Hi Denis,
Thanks. The document certainly looks useful. Do we have a ticket for an
improvement in Web Console/Visor for invoking resetLostPartitions()?


Regards,
Gaurav

On 13-Mar-2018 7:42 PM, "Denis Magda"  wrote:

For those interested, here is a doc we put together on the partition loss
policies, which covers the extra improvements released in 2.4:
https://apacheignite.readme.io/v2.4/docs/partition-loss-policies

--
Denis

On Tue, Mar 6, 2018 at 11:19 AM, Denis Magda  wrote:

> Hi,
>
> Here is the documentation we prepared for the 2.4 release:
> https://apacheignite.readme.io/v2.3/docs/cache-modes-24#partition-loss-policies
>
> It's hidden for now and will become visible to everyone once the Ignite 2.4
> vote passes (in progress).
>
> --
> Denis
>
> On Tue, Mar 6, 2018 at 6:59 AM, gauravhb  wrote:
>
>> Hi,
>>
>> Is there any update on this topic?
>> Have any tickets been created for the points mentioned by Valentin?
>>
>> Thanks.
>>
>>
>
>