Re: Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))

2018-03-17 Thread Arseny Kovalchuk
Hi Gaurav.

Could you please share your environment and some details?
1. Data piece size (e.g. event or entity size in bytes)
2. Write rate (e.g. entities per second)
3. How you evict (delete) data from the cache
4. How many caches (distinct Ignite cache names) you have
5. What kind of storage you have (network, HDD, SSD, etc.)
6. If you can provide a solid reproducer, I'd like to investigate it.

Sincerely,

Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16
LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>

On 16 March 2018 at 22:40, Gaurav Bajaj <gauravhba...@gmail.com> wrote:

> Hi,
>
> We also got the exact same error. Our setup does not use Kubernetes. We are
> using the Ignite data streamer to put data into caches. After streaming around
> 500k records, the streamer failed with the exception mentioned in the original email.
>
> Thanks,
> Gaurav
>

Re: Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))

2018-03-16 Thread Arseny Kovalchuk
Hi Dmitry.

Thanks for your attention to this issue.

I changed the repository to jcenter and set the Ignite version to 2.4.
Unfortunately, the reproducer starts with the same error message in the log
(see attached).

I cannot say whether the behavior of the whole cluster will change on 2.4,
that is, whether the cluster can start on corrupted data on 2.4, because we
have wiped the data and restarted the cluster where the problem occurred.
We'll move to 2.4 next week and continue testing our software. We are
moving to production in April/May, and it would be good to get some clue
about how to deal with such a data situation in the future.



​
Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16
LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>

On 16 March 2018 at 17:03, Dmitry Pavlov <dpavlov@gmail.com> wrote:

> Hi Arseny,
>
> I noticed that the reproducer uses
> ignite_version=2.3.0
>
> Could you check whether it is reproducible in our latest release, 2.4.0?
>
> I'm not sure about the ticket number, but it is quite possible that the
> issue is already fixed.
>
> Sincerely,
> Dmitriy Pavlov
>
> On Thu, 15 Mar 2018 at 19:34, Dmitry Pavlov <dpavlov@gmail.com> wrote:
>
>> Hi Alexey,
>>
>> It may be a serious issue. Could you recommend an expert here who can pick
>> this up?
>>
>> Sincerely,
>> Dmitriy Pavlov
>>

Re: Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))

2018-03-15 Thread Arseny Kovalchuk
Hi, guys.

I've got a reproducer for a problem that is generally reported as "Caused
by: java.lang.IllegalStateException: Failed to get page IO instance (page
content is corrupted)". Strictly speaking, it reproduces the result, not the
cause: I don't know how the data got corrupted, but the cluster node fails to
start with this data.

We hit the issue again when some of the server nodes were restarted several
times by Kubernetes. I suspect that the data got corrupted during those
restarts. But what we really need is that the cluster DOESN'T HANG on the
next restart even if the data is corrupted! In any case, there is no tool that
can help repair such data, so we end up wiping all the data manually to start
the cluster. So the expected behavior is warnings about corrupted data in the
logs and a cluster that keeps working.

How to reproduce:
1. Download the data from
https://storage.googleapis.com/pub-data-0/data5.tar.gz (~200 MB)
2. Download and import the Gradle project
https://storage.googleapis.com/pub-data-0/project.tar.gz (~100 KB)
3. Unpack the data into the home folder, say /home/user1. You should get a
path like */home/user1/data5*. Inside data5 you should have binary_meta,
db, marshaller.
4. Open *src/main/resources/data-test.xml* and put the absolute path of the
unpacked data into the *workDirectory* property of the *igniteCfg5* bean. In
this example it should be */home/user1/data5*. Do not edit consistentId!
The consistentId is ignite-instance-5, so the real data is in
the data5/db/ignite_instance_5 folder.
5. Start the application from ru.synesis.kipod.DataTestBootApp
6. Enjoy
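
Step 4 implies that the storage folder name is derived from the consistentId. Below is a minimal Java sketch of that mapping. It assumes every character other than a letter or digit is replaced with an underscore; that rule is inferred only from the ignite-instance-5 / ignite_instance_5 pair above and may not match Ignite's actual masking logic. The class and method names are hypothetical and not part of Ignite.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Ignite: locates the persistence folder
// of a node from the work directory and the consistentId.
class StorageFolderSketch {
    // Assumption: every character that is not a letter or digit becomes '_'.
    static String folderName(String consistentId) {
        return consistentId.replaceAll("[^A-Za-z0-9]", "_");
    }

    // Builds <workDirectory>/db/<masked consistentId>.
    static Path storagePath(String workDirectory, String consistentId) {
        return Paths.get(workDirectory, "db", folderName(consistentId));
    }

    public static void main(String[] args) {
        // Values taken from steps 3 and 4 of the reproducer.
        System.out.println(storagePath("/home/user1/data5", "ignite-instance-5"));
    }
}
```

This is also why editing consistentId breaks the reproducer: a different id would point the node at a different (empty) folder under db.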

Hope it will help.


​
Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16
LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>

On 26 December 2017 at 21:15, Denis Magda <dma...@apache.org> wrote:

> Cross-posting to the dev list.
>
> Ignite persistence maintainers please chime in.
>
> —
> Denis
>
> On Dec 26, 2017, at 2:17 AM, Arseny Kovalchuk <arseny.kovalc...@synesis.ru>
> wrote:
>
> Hi guys.
>
> Another issue when using Ignite 2.3 with native persistence enabled. See
> details below.
>
> We deploy Ignite along with our services in Kubernetes (v 1.8) on
> premises. The Ignite cluster is a StatefulSet of 5 Pods (5 instances) of
> Ignite version 2.3. Each Pod mounts a PersistentVolume backed by CEPH RBD.
>
> We put about 230 events/second into Ignite; 70% of events are ~200KB in
> size and 30% are 5000KB. Smaller events have indexed fields and we query
> them via SQL.
>
> The cluster is activated from a client node which also streams events into
> Ignite from Kafka. We use a custom streamer implementation which uses the
> cache.putAll() API.
>
> We started the cluster from scratch without any persistent data. After a
> while we got corrupted data with the following error message.
>
> [2017-12-26 07:44:14,251] ERROR [sys-#127%ignite-instance-2%]
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader:
> - Partition eviction failed, this can cause grid hang.
> class org.apache.ignite.IgniteException: Runtime failure on search row:
> Row@5b1479d6[ key: 171:1513946618964:3008806055072854, val:
> ru.synesis.kipod.event.KipodEvent [idHash=510912646, hash=-387621419,
> face_last_name=null, face_list_id=null, channel=171, source=,
> face_similarity=null, license_plate_number=null, descriptors=null,
> cacheName=kipod_events, cacheKey=171:1513946618964:3008806055072854,
> stream=171, alarm=false, processed_at=0, face_id=null, id=3008806055072854,
> persistent=false, face_first_name=null, license_plate_first_name=null,
> face_full_name=null, level=0, module=Kpx.Synesis.Outdoor,
> end_time=1513946624379, params=null, commented_at=0, tags=[vehicle, 0,
> human, 0, truck, 0, start_time=1513946618964, processed=false,
> kafka_offset=111259, license_plate_last_name=null, armed=false,
> license_plate_country=null, topic=MovingObject, comment=,
> expiration=1514033024000, original_id=null, license_plate_lists=null], ver:
> GridCacheVersion [topVer=125430590, order=1513955001926, nodeOrder=3] ][
> 3008806055072854, MovingObject, Kpx.Synesis.Outdoor, 0, , 1513946618964,
> 1513946624379, 171, 171, FALSE, FALSE, , FALSE, FALSE, 0, 0, 111259,
> 1514033024000, (vehicle, 0, human, 0, truck, 0), null, null, null, null,
> null, null, null, null, null, null, null, null ]
> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:1787)
> at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.remove(BPlusTree.java:1578)
> at org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.remove(H2TreeIndex.java:216)
> at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.d

Re: Runtime.availableProcessors() returns hardware's CPU count which is the issue with Ignite in Kubernetes

2017-12-26 Thread Arseny Kovalchuk
Hi Stanislav.

We use OpenJDK on Alpine Linux based images; see the Java version below.
In our environment, availableProcessors() returns the host's CPU count.

Did you mean we should try Oracle's JDK 8u151?

arseny@kovalchuka-ubuntu:~/kipod-x$ ku exec ignite-instance-0 -ti bash
bash-4.4# java -version
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (IcedTea 3.6.0) (Alpine 8.151.12-r0)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
bash-4.4# jjs
jjs> print(java.lang.Runtime.runtime.availableProcessors());
40
jjs>
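
The transcript above shows the JVM reporting all 40 host CPUs from inside the container. Below is a minimal Java sketch of one possible workaround: read a CPU count override from an environment variable and fall back to Runtime.availableProcessors() when it is absent or invalid. The variable name IGNITE_AVAILABLE_CPU is only one of the names floated in this thread and is treated here as hypothetical, not as an existing Ignite setting; the class and method names are also made up for illustration.

```java
import java.util.Map;

// Hypothetical helper, not part of Ignite.
class CpuCountSketch {
    // Assumed variable name borrowed from the discussion; purely illustrative.
    static final String ENV_NAME = "IGNITE_AVAILABLE_CPU";

    // Returns the override if it is a positive integer, otherwise the fallback.
    static int effectiveCpus(Map<String, String> env, int fallback) {
        String raw = env.get(ENV_NAME);
        if (raw == null)
            return fallback;
        try {
            int n = Integer.parseInt(raw.trim());
            return n > 0 ? n : fallback;
        } catch (NumberFormatException e) {
            // Unparseable value: ignore the override.
            return fallback;
        }
    }

    public static void main(String[] args) {
        int cpus = effectiveCpus(System.getenv(),
            Runtime.getRuntime().availableProcessors());
        System.out.println("Sizing thread pools for " + cpus + " CPUs");
    }
}
```

Passing the environment in as a Map keeps the lookup easy to test; an application could then feed the result into whatever pool-size properties it already sets by hand.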


​
Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16
LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>

On 26 December 2017 at 16:56, Yakov Zhdanov <yzhda...@gridgain.com> wrote:

> Ilya, agree. I like IGNITE_AVAILABLE_CPU more.
>
> Yakov Zhdanov,
> www.gridgain.com
>
> 2017-12-26 16:36 GMT+03:00 Ilya Lantukh <ilant...@gridgain.com>:
>
>> Hi Yakov,
>>
>> I think that the property IGNITE_NODES_PER_HOST, as you suggested, would be
>> confusing, because users might want to reduce the amount of resources
>> available to an Ignite node not only because they run multiple nodes per
>> host, but also because they run other software. Also, in my opinion, the
>> different types of system resources (CPU, memory, network) shouldn't all be
>> scaled using the same value.
>>
>> So I'd prefer to have IGNITE_CONCURRENCY_LEVEL or
>> IGNITE_AVAILABLE_PROCESSORS, as was originally suggested.
>>
>> On Tue, Dec 26, 2017 at 4:05 PM, Yakov Zhdanov <yzhda...@apache.org>
>> wrote:
>>
>>> Cross-posting to dev list.
>>>
>>> Guys,
>>>
>>> The suggestion below makes sense to me. I filed a ticket:
>>> https://issues.apache.org/jira/browse/IGNITE-7310
>>>
>>> Perhaps Arseny would like to provide a PR himself ;)
>>>
>>> --Yakov
>>>
>>> 2017-12-26 14:32 GMT+03:00 Arseny Kovalchuk <arseny.kovalc...@synesis.ru>:
>>>
>>> > Hi guys.
>>> >
>>> > Ignite configures all thread pools, selectors, etc. based on
>>> > Runtime.availableProcessors(), which seems incorrect in a containerized
>>> > environment. In Kubernetes with Docker, that method returns the CPU count
>>> > of the Node/machine, which is 64 in our particular case. But those 64 CPUs
>>> > and their time are shared with other workloads on the node, such as other
>>> > Pods and services. The appropriate number of cores for a Pod is usually
>>> > configured as a CPU Resource and estimated from various factors, taking
>>> > performance into account. The general idea is that if you want to run
>>> > several Pods on the same node, together they should request fewer
>>> > resources than the node provides. So we give an Ignite instance 4-8 cores
>>> > in Kubernetes, but Ignite's thread pools are configured as if they had all
>>> > 64 CPUs, and in turn we get a lot of threads for a Pod with only 4-8 cores
>>> > available.
>>> >
>>> > Now we manually set appropriate values for all available properties that
>>> > relate to thread pools.
>>> >
>>> > Would it be correct to have one environment variable, say
>>> > IGNITE_CONCURRENCY_LEVEL, which would be used as a reference value for
>>> > those configurations and default to Runtime.availableProcessors()?
>>> >
>>> > Thanks.
>>> >
>>> > ​
>>> > Arseny Kovalchuk
>>> >
>>> > Senior Software Engineer at Synesis
>>> > skype: arseny.kovalchuk
>>> > mobile: +375 (29) 666-16-16
>>> > LinkedIn Profile <http://www.linkedin.com/in/arsenykovalchuk/en>
>>> >
>>>
>>
>>
>>
>> --
>> Best regards,
>> Ilya
>>
>
>