[jira] [Created] (IGNITE-12325) GridCacheMapEntry reservation mechanism is broken with enabled cache store
Pavel Kovalenko created IGNITE-12325:

Summary: GridCacheMapEntry reservation mechanism is broken with enabled cache store
Key: IGNITE-12325
URL: https://issues.apache.org/jira/browse/IGNITE-12325
Project: Ignite
Issue Type: Bug
Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
Fix For: 2.8

Entry deferred deletion was disabled for transactional caches after https://issues.apache.org/jira/browse/IGNITE-11704. However, if a cache store is enabled, there is a race between the cache entry reservation made after a transactional remove and clearing that reservation after a cache load:

{noformat}
java.lang.AssertionError: GridDhtCacheEntry [rdrs=ReaderId[] [ReaderId [nodeId=96c87c98-2524-4f9e-8a2f-6cfceda5, msgId=22663371, txFut=null], ReaderId [nodeId=68130805-0dc8-4ef4-abf7-7e7cde86, msgId=22663375, txFut=null], ReaderId [nodeId=b4a8abce-8d0e-4459-b93a-a734ad64, msgId=22663370, txFut=null]], part=8, super=GridDistributedCacheEntry [super=GridCacheMapEntry [key=KeyCacheObjectImpl [part=8, val=8, hasValBytes=true], val=null, ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=8, extras=null, flags=2]]]
    at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.clearReserveForLoad(GridCacheMapEntry.java:3616)
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.clearReservationsIfNeeded(GridCacheAdapter.java:2429)
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.access$400(GridCacheAdapter.java:179)
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter$18.call(GridCacheAdapter.java:2309)
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter$18.call(GridCacheAdapter.java:2217)
    at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6963)
    at org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:844)
{noformat}

The issue can be resolved by re-enabling deferred delete when a cache store is configured.
[jira] [Created] (IGNITE-12299) Store tombstone links in a separate BPlus tree to avoid a partition full scan during tombstone removal
Pavel Kovalenko created IGNITE-12299:

Summary: Store tombstone links in a separate BPlus tree to avoid a partition full scan during tombstone removal
Key: IGNITE-12299
URL: https://issues.apache.org/jira/browse/IGNITE-12299
Project: Ignite
Issue Type: Improvement
Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
Fix For: 2.9

Currently, there is no fast way to identify which keys in a partition are tombstones: collecting them requires a full scan of the partition BPlus tree. This can slow down the node when rebalance finishes and tombstone cleanup is needed.

We can introduce a separate BPlus tree inside the partition (similar to the TTL tree) that stores links to tombstone keys. When cleanup is needed, we can then quickly visit only the subset of keys recorded in this tree, as sketched below.
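A minimal sketch of the idea, using an in-memory java.util.TreeSet as a stand-in for the on-disk per-partition BPlusTree; the TombstoneTree name and its callbacks are illustrative, not the final API:

{code:java}
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.function.LongConsumer;

/**
 * Stand-in for a per-partition tombstone tree: instead of full-scanning the
 * main partition BPlus tree, tombstone links are registered here on remove
 * and iterated directly on cleanup.
 */
public class TombstoneTree {
    /** Sorted links to tombstone entries (a real implementation would be an on-disk BPlus tree). */
    private final NavigableSet<Long> links = new TreeSet<>();

    /** Called when a remove writes a tombstone for a key. */
    public void onTombstoneWritten(long link) {
        links.add(link);
    }

    /** Cleanup visits only the tombstone links, not the whole partition. */
    public void cleanup(LongConsumer remover) {
        for (Long link : links)
            remover.accept(link);

        links.clear();
    }
}
{code}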
[jira] [Created] (IGNITE-12298) Write tombstones on incomplete baseline to get rid of partition cleanup
Pavel Kovalenko created IGNITE-12298:

Summary: Write tombstones on incomplete baseline to get rid of partition cleanup
Key: IGNITE-12298
URL: https://issues.apache.org/jira/browse/IGNITE-12298
Project: Ignite
Issue Type: Improvement
Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
Fix For: 2.9

After tombstone objects were introduced in https://issues.apache.org/jira/browse/IGNITE-11704, we can write tombstones on OWNING nodes while the baseline is incomplete (some of the backup nodes have left). When the baseline becomes complete again and the old nodes return, we can avoid partition cleanup on those nodes before rebalance: rebalancing the whole OWNING partition state, including tombstones, will clear the data that was removed while the node was offline.
[jira] [Created] (IGNITE-12297) Lost partitions detection does not happen during cluster activation
Pavel Kovalenko created IGNITE-12297:

Summary: Lost partitions detection does not happen during cluster activation
Key: IGNITE-12297
URL: https://issues.apache.org/jira/browse/IGNITE-12297
Project: Ignite
Issue Type: Bug
Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
Fix For: 2.8

We invoke `detectLostPartitions` during PME only when a server node joins or leaves. However, a persistent cluster can be activated while a partition is in the MOVING state on all nodes. In this case, the partition may stay in MOVING state indefinitely, until some other topology event occurs.
[jira] [Created] (IGNITE-12255) Cache affinity fetching and calculation on client nodes may be broken in some cases
Pavel Kovalenko created IGNITE-12255:

Summary: Cache affinity fetching and calculation on client nodes may be broken in some cases
Key: IGNITE-12255
URL: https://issues.apache.org/jira/browse/IGNITE-12255
Project: Ignite
Issue Type: Bug
Components: cache
Affects Versions: 2.7, 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
Fix For: 2.8

We have a cluster with server and client nodes and dynamically start several caches on it. Periodically, we create and destroy a temporary cache to move the cluster topology version up. At the same time, a random client node chooses a random existing cache and performs operations on it. This leads to an exception on the client node saying that affinity is not initialized for the cache during a cache operation:

Affinity for topology version is not initialized [topVer = 8:10, head = 8:2]

This exception means that the last affinity for the cache was calculated on version [8,2], which is the cache start version. It happens because, when we create or destroy a temporary cache, we don't recalculate affinity on client nodes for caches that exist but have not yet been accessed. Recalculation in this case is cheap: we just copy the affinity assignment and increment the topology version (see the sketch below).

As a solution, we need to fetch affinity for all caches on client node join. We also need to recalculate affinity for all affinity holders (not only started or configured caches) on every topology event that happens in the cluster, on every client node.

This solution exposed an existing race between a client node join and a concurrent cache destroy. The race is the following: a client node (with some configured caches) joins the cluster, sending a SingleMessage to the coordinator during client PME. This SingleMessage contains affinity fetch requests for all cluster caches. While the SingleMessage is in flight, server nodes finish the client PME and also process and finish a cache destroy PME. When a cache is destroyed, its affinity is cleared. When the SingleMessage is delivered to the coordinator, the coordinator no longer has affinity for a requested cache because that cache is already destroyed. This leads to an assertion error on the coordinator and unpredictable behavior on the client node.

The race may be fixed with the following change: if the coordinator doesn't have affinity for a cache requested by the client node, it doesn't break PME with an assertion error, it just doesn't send affinity for that cache to the client node. When the client node receives the FullMessage and sees that affinity for some requested cache doesn't exist, it closes the cache proxy for user interactions, which throws a CacheStopped exception on every attempt to use that cache. This is safe behavior, because the cache destroy event should arrive on the client node soon and destroy that cache completely.
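A minimal sketch of the cheap client-side recalculation described above, assuming the distribution itself does not change on such topology events; the AffinityHistory class and the list-of-lists assignment shape are illustrative, not Ignite's internal API:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.UUID;

/** Illustrative per-cache affinity history keyed by topology version. */
public class AffinityHistory {
    /** Topology version -> per-partition ordered lists of owner node ids. */
    private final NavigableMap<Long, List<List<UUID>>> hist = new TreeMap<>();

    public void put(long topVer, List<List<UUID>> assignment) {
        hist.put(topVer, assignment);
    }

    /**
     * "Cheap" recalculation for a topology event that does not change the
     * distribution: copy the latest assignment under the new version so that
     * lookups for that version no longer fail. Assumes at least one version
     * has already been recorded.
     */
    public void copyToNewVersion(long newTopVer) {
        List<List<UUID>> last = hist.lastEntry().getValue();

        hist.put(newTopVer, new ArrayList<>(last));
    }
}
{code}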
[jira] [Created] (IGNITE-12088) Cache or template name should be validated before an attempt to start
Pavel Kovalenko created IGNITE-12088:

Summary: Cache or template name should be validated before an attempt to start
Key: IGNITE-12088
URL: https://issues.apache.org/jira/browse/IGNITE-12088
Project: Ignite
Issue Type: Bug
Components: cache
Reporter: Pavel Kovalenko
Fix For: 2.8

If a cache name is too long, it can make it impossible to create the work directory for that cache. The name should be validated up front instead; a validation sketch follows the log below.

{noformat}
[2019-08-20 19:35:42,139][ERROR][exchange-worker-#172%node1%][IgniteTestResources] Critical system error detected. Will be handled accordingly to configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.IgniteCheckedException: Failed to initialize cache working directory (failed to create, make sure the work folder has correct permissions): /home/gridgain/projects/incubator-ignite/work/db/node1/cache-CacheConfiguration [name=ccfg3staticTemplate*, grpName=null, memPlcName=null, storeConcurrentLoadAllThreshold=5, rebalancePoolSize=1, rebalanceTimeout=1, evictPlc=null, evictPlcFactory=null, onheapCache=false, sqlOnheapCache=false, sqlOnheapCacheMaxSize=0, evictFilter=null, eagerTtl=true, dfltLockTimeout=0, nearCfg=null, writeSync=null, storeFactory=null, storeKeepBinary=false, loadPrevVal=false, aff=null, cacheMode=PARTITIONED, atomicityMode=null, backups=6, invalidate=false, tmLookupClsName=null, rebalanceMode=ASYNC, rebalanceOrder=0, rebalanceBatchSize=524288, rebalanceBatchesPrefetchCnt=2, maxConcurrentAsyncOps=500, sqlIdxMaxInlineSize=-1, writeBehindEnabled=false, writeBehindFlushSize=10240, writeBehindFlushFreq=5000, writeBehindFlushThreadCnt=1, writeBehindBatchSize=512, writeBehindCoalescing=true, maxQryIterCnt=1024, affMapper=null, rebalanceDelay=0, rebalanceThrottle=0, interceptor=null, longQryWarnTimeout=3000, qryDetailMetricsSz=0, readFromBackup=true, nodeFilter=null, sqlSchema=null, sqlEscapeAll=false, cpOnRead=true, topValidator=null, partLossPlc=IGNORE, qryParallelism=1, evtsDisabled=false, encryptionEnabled=false, diskPageCompression=null, diskPageCompressionLevel=null]0]]
class org.apache.ignite.IgniteCheckedException: Failed to initialize cache working directory (failed to create, make sure the work folder has correct permissions): /home/gridgain/projects/incubator-ignite/work/db/node1/cache-CacheConfiguration [name=ccfg3staticTemplate*, grpName=null, memPlcName=null, storeConcurrentLoadAllThreshold=5, rebalancePoolSize=1, rebalanceTimeout=1, evictPlc=null, evictPlcFactory=null, onheapCache=false, sqlOnheapCache=false, sqlOnheapCacheMaxSize=0, evictFilter=null, eagerTtl=true, dfltLockTimeout=0, nearCfg=null, writeSync=null, storeFactory=null, storeKeepBinary=false, loadPrevVal=false, aff=null, cacheMode=PARTITIONED, atomicityMode=null, backups=6, invalidate=false, tmLookupClsName=null, rebalanceMode=ASYNC, rebalanceOrder=0, rebalanceBatchSize=524288, rebalanceBatchesPrefetchCnt=2, maxConcurrentAsyncOps=500, sqlIdxMaxInlineSize=-1, writeBehindEnabled=false, writeBehindFlushSize=10240, writeBehindFlushFreq=5000, writeBehindFlushThreadCnt=1, writeBehindBatchSize=512, writeBehindCoalescing=true, maxQryIterCnt=1024, affMapper=null, rebalanceDelay=0, rebalanceThrottle=0, interceptor=null, longQryWarnTimeout=3000, qryDetailMetricsSz=0, readFromBackup=true, nodeFilter=null, sqlSchema=null, sqlEscapeAll=false, cpOnRead=true, topValidator=null, partLossPlc=IGNORE, qryParallelism=1, evtsDisabled=false, encryptionEnabled=false,
diskPageCompression=null, diskPageCompressionLevel=null]0
    at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.checkAndInitCacheWorkDir(FilePageStoreManager.java:769)
    at org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.checkAndInitCacheWorkDir(FilePageStoreManager.java:748)
    at org.apache.ignite.internal.processors.cache.CachesRegistry.persistCacheConfigurations(CachesRegistry.java:289)
    at org.apache.ignite.internal.processors.cache.CachesRegistry.registerAllCachesAndGroups(CachesRegistry.java:264)
    at org.apache.ignite.internal.processors.cache.CachesRegistry.update(CachesRegistry.java:202)
    at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onCacheChangeRequest(CacheAffinitySharedManager.java:850)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onCacheChangeRequest(GridDhtPartitionsExchangeFuture.java:1306)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:846) at
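A minimal sketch of the proposed up-front validation; the 255-character limit is an assumed common filesystem constraint on file name length, and the CacheNameValidator helper is hypothetical:

{code:java}
/** Hypothetical up-front cache/template name validation. */
public final class CacheNameValidator {
    /** Assumed filesystem limit on a single file name component. */
    private static final int MAX_NAME_LEN = 255;

    private CacheNameValidator() {}

    /** Fails fast, before any work directory creation is attempted. */
    public static void validateCacheName(String name) {
        if (name == null || name.isEmpty())
            throw new IllegalArgumentException("Cache name must not be empty.");

        if (name.length() > MAX_NAME_LEN)
            throw new IllegalArgumentException(
                "Cache name is too long to create a work directory (max " + MAX_NAME_LEN + " chars): " + name);
    }
}
{code}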
[jira] [Created] (IGNITE-11852) Assertion errors when changing PME coordinator to locally joining node
Pavel Kovalenko created IGNITE-11852:

Summary: Assertion errors when changing PME coordinator to locally joining node
Key: IGNITE-11852
URL: https://issues.apache.org/jira/browse/IGNITE-11852
Project: Ignite
Issue Type: Bug
Components: cache
Affects Versions: 2.7, 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
Fix For: 2.8

When the PME coordinator changes to a locally joining node, several assertion errors may occur:

1. When some other joining nodes have finished PME:

{noformat}
[13:49:58] (err) Failed to notify listener: o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1$1...@27296181
java.lang.AssertionError: AffinityTopologyVersion [topVer=2, minorTopVer=0]
    at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$11.applyx(CacheAffinitySharedManager.java:1546)
    at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$11.applyx(CacheAffinitySharedManager.java:1535)
    at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.lambda$forAllRegisteredCacheGroups$e0a6939d$1(CacheAffinitySharedManager.java:1281)
    at org.apache.ignite.internal.util.IgniteUtils.doInParallel(IgniteUtils.java:10929)
    at org.apache.ignite.internal.util.IgniteUtils.doInParallel(IgniteUtils.java:10831)
    at org.apache.ignite.internal.util.IgniteUtils.doInParallel(IgniteUtils.java:10811)
    at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.forAllRegisteredCacheGroups(CacheAffinitySharedManager.java:1280)
    at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onLocalJoin(CacheAffinitySharedManager.java:1535)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.processFullMessage(GridDhtPartitionsExchangeFuture.java:4189)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onBecomeCoordinator(GridDhtPartitionsExchangeFuture.java:4731)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$3400(GridDhtPartitionsExchangeFuture.java:145)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1$1.apply(GridDhtPartitionsExchangeFuture.java:4622)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$8$1$1.apply(GridDhtPartitionsExchangeFuture.java:4611)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:398)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:346)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:334)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:510)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:489)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:466)
    at org.apache.ignite.internal.util.future.GridCompoundFuture.checkComplete(GridCompoundFuture.java:281)
    at org.apache.ignite.internal.util.future.GridCompoundFuture.apply(GridCompoundFuture.java:143)
    at org.apache.ignite.internal.util.future.GridCompoundFuture.apply(GridCompoundFuture.java:44)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:398)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:346) at
org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:334)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:510)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:489)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:455)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.InitNewCoordinatorFuture.onMessage(InitNewCoordinatorFuture.java:253)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onReceiveSingleMessage(GridDhtPartitionsExchangeFuture.java:2731)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.processSinglePartitionUpdate(GridCachePartitionExchangeManager.java:1917)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.access$1300(GridCachePartitionExchangeManager.java:162) at
[jira] [Created] (IGNITE-11773) JDBC suite hangs due to cleared non-serializable proxy objects
Pavel Kovalenko created IGNITE-11773:

Summary: JDBC suite hangs due to cleared non-serializable proxy objects
Key: IGNITE-11773
URL: https://issues.apache.org/jira/browse/IGNITE-11773
Project: Ignite
Issue Type: Bug
Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
Fix For: 2.8

{noformat}
[01:53:02]W: [org.apache.ignite:ignite-clients] java.lang.AssertionError
[01:53:02]W: [org.apache.ignite:ignite-clients] at org.apache.ignite.testframework.junits.GridAbstractTest$SerializableProxy.readResolve(GridAbstractTest.java:2419)
[01:53:02]W: [org.apache.ignite:ignite-clients] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[01:53:02]W: [org.apache.ignite:ignite-clients] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[01:53:02]W: [org.apache.ignite:ignite-clients] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[01:53:02]W: [org.apache.ignite:ignite-clients] at java.lang.reflect.Method.invoke(Method.java:498)
[01:53:02]W: [org.apache.ignite:ignite-clients] at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1260)
[01:53:02]W: [org.apache.ignite:ignite-clients] at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2078)
[01:53:02]W: [org.apache.ignite:ignite-clients] at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
[01:53:02]W: [org.apache.ignite:ignite-clients] at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
[01:53:02]W: [org.apache.ignite:ignite-clients] at org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:141)
[01:53:02]W: [org.apache.ignite:ignite-clients] at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:93)
[01:53:02]W: [org.apache.ignite:ignite-clients] at org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:163)
[01:53:02]W: [org.apache.ignite:ignite-clients] at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:81)
[01:53:02]W: [org.apache.ignite:ignite-clients] at org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:10039)
[01:53:02]W: [org.apache.ignite:ignite-clients] at org.apache.ignite.internal.processors.cache.CacheConfigurationEnricher.deserialize(CacheConfigurationEnricher.java:151)
[01:53:02]W: [org.apache.ignite:ignite-clients] at org.apache.ignite.internal.processors.cache.CacheConfigurationEnricher.enrich(CacheConfigurationEnricher.java:122)
[01:53:02]W: [org.apache.ignite:ignite-clients] at org.apache.ignite.internal.processors.cache.CacheConfigurationEnricher.enrichFully(CacheConfigurationEnricher.java:143)
[01:53:02]W: [org.apache.ignite:ignite-clients] at org.apache.ignite.internal.processors.cache.GridCacheProcessor.getConfigFromTemplate(GridCacheProcessor.java:3776)
[01:53:02]W: [org.apache.ignite:ignite-clients] at org.apache.ignite.internal.processors.query.GridQueryProcessor.dynamicTableCreate(GridQueryProcessor.java:1549)
[01:53:02]W: [org.apache.ignite:ignite-clients] at org.apache.ignite.internal.processors.query.h2.CommandProcessor.runCommandH2(CommandProcessor.java:437)
[01:53:02]W: [org.apache.ignite:ignite-clients] at org.apache.ignite.internal.processors.query.h2.CommandProcessor.runCommand(CommandProcessor.java:195)
[01:53:02]W: [org.apache.ignite:ignite-clients] at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.executeCommand(IgniteH2Indexing.java:954)
[01:53:02]W: [org.apache.ignite:ignite-clients] at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.querySqlFields(IgniteH2Indexing.java:1038)
[01:53:02]W: [org.apache.ignite:ignite-clients] at org.apache.ignite.internal.processors.query.GridQueryProcessor$3.applyx(GridQueryProcessor.java:2292)
[01:53:02]W: [org.apache.ignite:ignite-clients] at org.apache.ignite.internal.processors.query.GridQueryProcessor$3.applyx(GridQueryProcessor.java:2288)
[01:53:02]W: [org.apache.ignite:ignite-clients] at org.apache.ignite.internal.util.lang.IgniteOutClosureX.apply(IgniteOutClosureX.java:36)
[01:53:02]W: [org.apache.ignite:ignite-clients] at org.apache.ignite.internal.processors.query.GridQueryProcessor.executeQuery(GridQueryProcessor.java:2804)
[01:53:02]W:
[jira] [Created] (IGNITE-11455) Introduce free lists rebuild mechanism
Pavel Kovalenko created IGNITE-11455:

Summary: Introduce free lists rebuild mechanism
Key: IGNITE-11455
URL: https://issues.apache.org/jira/browse/IGNITE-11455
Project: Ignite
Issue Type: Improvement
Components: cache
Affects Versions: 2.0
Reporter: Pavel Kovalenko
Fix For: 2.8

Sometimes the state of free lists becomes invalid, as in https://issues.apache.org/jira/browse/IGNITE-10669. This leaves the node in an unrecoverable state. At the same time, free lists don't hold any critical information and can be rebuilt from scratch using the existing data pages. It may be useful to introduce a mechanism that rebuilds free lists using an optimal scan of partition data pages, as sketched below.
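A sketch of the rebuild idea, under the assumption that a free list is, in essence, data pages bucketed by remaining free space; the DataPage view and bucket scheme are simplified stand-ins for the real page structures:

{code:java}
import java.util.ArrayList;
import java.util.List;

/** Rebuilds free-list buckets from scratch with a single scan over data pages. */
public class FreeListRebuild {
    /** Hypothetical view of a data page: its id and remaining free space. */
    public interface DataPage {
        long pageId();
        int freeSpace();
    }

    /**
     * @param pages All data pages of the partition (one sequential scan).
     * @param bucketSize Free-space granularity of one bucket, in bytes.
     * @param buckets Number of buckets in the free list.
     * @return Page ids grouped into free-list buckets by available space.
     */
    public static List<List<Long>> rebuild(Iterable<DataPage> pages, int bucketSize, int buckets) {
        List<List<Long>> freeList = new ArrayList<>(buckets);

        for (int i = 0; i < buckets; i++)
            freeList.add(new ArrayList<>());

        for (DataPage p : pages) {
            // A page lands in the bucket matching its remaining free space.
            int bucket = Math.min(p.freeSpace() / bucketSize, buckets - 1);

            freeList.get(bucket).add(p.pageId());
        }

        return freeList;
    }
}
{code}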
[jira] [Created] (IGNITE-10821) Caching affinity with affinity similarity key is broken
Pavel Kovalenko created IGNITE-10821:

Summary: Caching affinity with affinity similarity key is broken
Key: IGNITE-10821
URL: https://issues.apache.org/jira/browse/IGNITE-10821
Project: Ignite
Issue Type: Bug
Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
Fix For: 2.8

When some cache groups have the same affinity function, number of partitions, backups, and node filter, they can use the same affinity distribution without explicit recalculation. These parameters are called the "affinity similarity key". Caching affinity by this key may speed up affinity recalculation. However, after the https://issues.apache.org/jira/browse/IGNITE-9561 merge this mechanism became broken, because parallel execution of affinity recalculation for similar affinity groups leads to affinity cache misses. To fix it, we should couple similar affinity groups together and run affinity recalculation for them in one thread, caching the previous results.
[jira] [Created] (IGNITE-10799) Optimize affinity initialization/re-calculation
Pavel Kovalenko created IGNITE-10799:

Summary: Optimize affinity initialization/re-calculation
Key: IGNITE-10799
URL: https://issues.apache.org/jira/browse/IGNITE-10799
Project: Ignite
Issue Type: Improvement
Components: cache
Affects Versions: 2.1
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
Fix For: 2.8

When persistence is enabled and a baseline is set, we have two main entry points for recalculating affinity:

{noformat}
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager#onServerJoinWithExchangeMergeProtocol
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager#onServerLeftWithExchangeMergeProtocol
{noformat}

Both follow the same recalculation approach (a simplified sketch follows below):

1) Take the current baseline (ideal assignment).
2) Filter out offline nodes from it.
3) Choose new primary nodes where the previous primaries went away.
4) Put the temporary primaries into the late affinity assignment set.

Looking at the implementation details, we may notice that we do a lot of unnecessary lookups into the online-nodes cache and a lot of array list copies. Performance becomes too slow when we recalculate affinity for replicated caches (it takes O(P * N) on each node, where P is the partitions count and N is the number of nodes in the cluster). With a large partitions count or a large cluster, it may take a few seconds, which is unacceptable, because this process happens during PME and freezes ongoing cluster operations. We should investigate possible bottlenecks and improve the performance of affinity recalculation.
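A simplified sketch of steps 1-4 above, using plain java.util collections instead of Ignite's internal structures; the single pass per partition illustrates where redundant lookups and copies can be avoided:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.UUID;

/** Illustrative recalculation: filter the ideal (baseline) assignment by alive nodes. */
public class AffinityFilter {
    /**
     * @param ideal Ideal per-partition owner lists from the baseline.
     * @param alive Currently alive baseline nodes.
     * @param latePrimaries Output: partitions whose primary is temporary.
     * @return Filtered assignment.
     */
    public static List<List<UUID>> filterByAlive(
        List<List<UUID>> ideal,
        Set<UUID> alive,
        Set<Integer> latePrimaries
    ) {
        List<List<UUID>> res = new ArrayList<>(ideal.size());

        for (int p = 0; p < ideal.size(); p++) {
            List<UUID> owners = ideal.get(p);

            List<UUID> filtered = new ArrayList<>(owners.size());

            for (UUID node : owners) {
                if (alive.contains(node))
                    filtered.add(node);
            }

            // If the ideal primary went away, the first alive owner becomes a
            // temporary primary until late affinity assignment switches back.
            if (!filtered.isEmpty() && !filtered.get(0).equals(owners.get(0)))
                latePrimaries.add(p);

            res.add(filtered);
        }

        return res;
    }
}
{code}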
[jira] [Created] (IGNITE-10771) Print troubleshooting hint when exchange latch gets stuck
Pavel Kovalenko created IGNITE-10771:

Summary: Print troubleshooting hint when exchange latch gets stuck
Key: IGNITE-10771
URL: https://issues.apache.org/jira/browse/IGNITE-10771
Project: Ignite
Issue Type: Improvement
Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Fix For: 2.8

Sometimes users face a problem where the exchange latch can't be completed:

{noformat}
2018-12-12 07:07:57:563 [exchange-worker-#42] WARN o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture:488 - Unable to await partitions release latch within timeout: ClientLatch [coordinator=ZookeeperClusterNode [id=6b9fc6e4-5b6a-4a98-be4d-6bc1aa5c014c, addrs=[172.17.0.1, 10.0.230.117, 0:0:0:0:0:0:0:1%lo, 127.0.0.1], order=3, loc=false, client=false], ackSent=true, super=CompletableLatch [id=exchange, topVer=AffinityTopologyVersion [topVer=45, minorTopVer=1]]]
{noformat}

This may indicate that some node in the cluster can't finish partitions release (i.e., finish all ongoing operations on the previous topology version), or it can be a silent network problem. We should print a troubleshooting hint to the log to reduce the number of questions about this problem.
[jira] [Created] (IGNITE-10749) Improve speed of checkpoint finalization on binary memory recovery
Pavel Kovalenko created IGNITE-10749:

Summary: Improve speed of checkpoint finalization on binary memory recovery
Key: IGNITE-10749
URL: https://issues.apache.org/jira/browse/IGNITE-10749
Project: Ignite
Issue Type: Improvement
Components: cache
Affects Versions: 2.0
Reporter: Pavel Kovalenko
Fix For: 2.8

Stopping a node during a checkpoint leads to binary memory recovery after the node starts. Once binary memory is restored, the node performs a checkpoint that fixes the consistent state of the page memory. It happens here:

{noformat}
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager#finalizeCheckpointOnRecovery
{noformat}

Looking at the implementation of this method, we can notice that it performs finalization in a single thread, which is not optimal. This process can be sped up by parallelizing the collection of checkpoint pages, as in regular checkpoints; see the sketch below.
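A sketch of the proposed parallelization, with an ExecutorService standing in for the checkpoint thread pool; collectRegion is a hypothetical stand-in for collecting one memory region's dirty checkpoint pages:

{code:java}
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Illustrative parallel collection of checkpoint pages across memory regions. */
public class ParallelCheckpointCollect {
    public static <R, P> List<P> collectPages(
        Collection<R> regions,
        Function<R, List<P>> collectRegion, // stand-in for per-region dirty page collection
        int parallelism
    ) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);

        try {
            // Submit one collection task per region instead of iterating in a single thread.
            List<Future<List<P>>> futs = new ArrayList<>();

            for (R region : regions)
                futs.add(pool.submit(() -> collectRegion.apply(region)));

            List<P> res = new ArrayList<>();

            for (Future<List<P>> f : futs)
                res.addAll(f.get()); // propagates collection failures

            return res;
        }
        finally {
            pool.shutdown();
        }
    }
}
{code}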
[jira] [Created] (IGNITE-10625) Do first checkpoint on node start before joining topology
Pavel Kovalenko created IGNITE-10625:

Summary: Do first checkpoint on node start before joining topology
Key: IGNITE-10625
URL: https://issues.apache.org/jira/browse/IGNITE-10625
Project: Ignite
Issue Type: Improvement
Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
Fix For: 2.8

If a node joins an active cluster, we do the first checkpoint during PME, once partition states have been restored, here:

{code:java}
org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopology#afterStateRestored
{code}

In IGNITE-9420 we moved the logical recovery phase before joining the topology, so currently, when a node joins an active cluster, it already has all partitions recovered. This means we can safely do the first checkpoint as soon as all logical updates are applied. This change will accelerate the PME process when a lot of updates were applied during recovery.
[jira] [Created] (IGNITE-10624) Cache deployment id may be different from the cluster-wide one after recovery
Pavel Kovalenko created IGNITE-10624:

Summary: Cache deployment id may be different from the cluster-wide one after recovery
Key: IGNITE-10624
URL: https://issues.apache.org/jira/browse/IGNITE-10624
Project: Ignite
Issue Type: Bug
Components: cache, sql
Affects Versions: 2.8
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
Fix For: 2.8

When the schema for a cache is changing (GridQueryProcessor#processSchemaOperationLocal), it may produce a false-negative "CACHE_NOT_FOUND" error if the cache was started during recovery while the cluster-wide descriptor was changed:

{noformat}
if (cacheInfo == null || !F.eq(depId, cacheInfo.dynamicDeploymentId()))
    throw new SchemaOperationException(SchemaOperationException.CODE_CACHE_NOT_FOUND, cacheName);
{noformat}
[jira] [Created] (IGNITE-10556) Attempt to decrypt data records during read-only metastorage recovery leads to NPE
Pavel Kovalenko created IGNITE-10556:

Summary: Attempt to decrypt data records during read-only metastorage recovery leads to NPE
Key: IGNITE-10556
URL: https://issues.apache.org/jira/browse/IGNITE-10556
Project: Ignite
Issue Type: Bug
Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
Fix For: 2.8

Stacktrace:

{noformat}
Caused by: java.lang.NullPointerException
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreStateContext.lambda$next$0(GridCacheDatabaseSharedManager.java:4795)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreStateContext.next(GridCacheDatabaseSharedManager.java:4799)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$RestoreLogicalState.next(GridCacheDatabaseSharedManager.java:4926)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.applyLogicalUpdates(GridCacheDatabaseSharedManager.java:2370)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readMetastore(GridCacheDatabaseSharedManager.java:733)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetaStorageSubscribersOnReadyForRead(GridCacheDatabaseSharedManager.java:4493)
    at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1048)
    ... 20 more
{noformat}

It happens because there is no encryption key for that cache group: encryption keys are initialized only after the read-only metastorage is ready. There is a bug in RestoreStateContext, which tries to filter DataEntries in a DataRecord by group id during read-only metastorage recovery. We should explicitly skip such records before filtering. As a possible solution, we should provide a more flexible record filter to RestoreStateContext when recovering the read-only metastorage. We should also return something more meaningful than null when no encryption key is found for a DataRecord, as this can be a silent problem for components iterating over the WAL.
[jira] [Created] (IGNITE-10493) Refactor exchange stages time measurements
Pavel Kovalenko created IGNITE-10493:

Summary: Refactor exchange stages time measurements
Key: IGNITE-10493
URL: https://issues.apache.org/jira/browse/IGNITE-10493
Project: Ignite
Issue Type: Improvement
Components: cache
Affects Versions: 2.7
Reporter: Pavel Kovalenko
Fix For: 2.8

In the current implementation, we don't cover and measure all code paths that influence PME time. Instead, we just measure the hottest individual parts with the following hardcoded pattern:

{noformat}
long time = currentTime();

... // some code block

print("Stage name performed in " + (currentTime() - time));
{noformat}

This approach can be improved. Instead of declaring a time variable and printing the message to the log immediately, we can introduce a utility class (TimesBag) that holds all stages and their times. The contents of the TimesBag can be printed once the exchange future is done. As exchange is a linear process whose init stage is executed by the exchange worker and whose finish stage is executed by one of the sys threads, we can easily cover the whole exchange code base with time cutoffs.
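A minimal sketch of such a TimesBag, assuming stages may be recorded from the exchange worker and a sys thread, so a synchronized list suffices:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Accumulates named stage durations and dumps them once the exchange future completes. */
public class TimesBag {
    private static class Stage {
        final String name;
        final long durationMs;

        Stage(String name, long durationMs) {
            this.name = name;
            this.durationMs = durationMs;
        }
    }

    private final List<Stage> stages = Collections.synchronizedList(new ArrayList<>());

    /** Records a finished stage instead of logging it immediately. */
    public void addStage(String name, long startNanos) {
        stages.add(new Stage(name, (System.nanoTime() - startNanos) / 1_000_000));
    }

    /** Renders all stages at once, to be printed when the exchange future is done. */
    public String dump() {
        StringBuilder sb = new StringBuilder("Exchange stages:\n");

        synchronized (stages) {
            for (Stage s : stages)
                sb.append("  ").append(s.name).append(": ").append(s.durationMs).append(" ms\n");
        }

        return sb.toString();
    }
}
{code}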
[jira] [Created] (IGNITE-10485) Ability to learn more about cluster state before NODE_JOINED event is fired cluster-wide
Pavel Kovalenko created IGNITE-10485:

Summary: Ability to learn more about cluster state before NODE_JOINED event is fired cluster-wide
Key: IGNITE-10485
URL: https://issues.apache.org/jira/browse/IGNITE-10485
Project: Ignite
Issue Type: Improvement
Components: cache
Reporter: Pavel Kovalenko
Fix For: 2.8

Currently, there is no good way to learn more about the cluster before PME starts on node join. It might be useful to do some pre-work (activate components if the cluster is active, calculate baseline affinity, clean up PDS if the baseline changed, etc.) before the actual NODE_JOINED event is triggered cluster-wide and PME starts. Such pre-work would significantly speed up PME on node join.

Currently, the only place where it can be done is while processing the NodeAdded message on the local joining node. But that is not a good idea, because it would freeze processing of new discovery messages cluster-wide.

I see two ways to implement it:

1) Introduce a new intermediate node state: discovered, but with the node-join discovery event not yet triggered. This is the right but complicated change, because it requires revisiting the joining process in both the TCP and ZooKeeper discovery protocols, with extra failover scenarios.

2) Try to get this information and do the pre-work before the discovery manager starts, using e.g. GridRestProcessor. This looks much simpler, but we can get races there, when the cluster state changes during the pre-work (deactivation, baseline change). In that case we should roll the pre-work back, or just stop/restart the node to avoid cluster instability. However, these are rare scenarios in the real world (e.g. starting a baseline node and starting the deactivation process right after node recovery finishes).

For starters, we can expose the baseline and cluster state in our REST endpoint and try to move the pre-work mentioned above out of PME.
[jira] [Created] (IGNITE-10397) SQL Schema may be lost after cluster activation and simple query run
Pavel Kovalenko created IGNITE-10397:

Summary: SQL Schema may be lost after cluster activation and simple query run
Key: IGNITE-10397
URL: https://issues.apache.org/jira/browse/IGNITE-10397
Project: Ignite
Issue Type: Bug
Affects Versions: 2.8
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
Fix For: 2.8

Scenario:

1) Start 3 grids in multithreaded mode with auto-activation.
2) Start the client.
3) Run a simple query like this:

{noformat}
cache(DEFAULT_CACHE_NAME + 0).query(new SqlQuery<>(Integer.class, "1=1")).getAll();
{noformat}

An exception saying the schema was not found will be thrown:

{noformat}
[2018-11-23 19:56:57,284][ERROR][query-#223%distributed.CacheMessageStatsIndexingTest2%][GridMapQueryExecutor] Failed to execute local query.
class org.apache.ignite.internal.processors.query.IgniteSQLException: Failed to set schema for DB connection for thread [schema=default0]
    at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.connectionForThread(IgniteH2Indexing.java:549)
    at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.connectionForSchema(IgniteH2Indexing.java:392)
    at org.apache.ignite.internal.processors.query.h2.twostep.GridMapQueryExecutor.onQueryRequest0(GridMapQueryExecutor.java:767)
    at org.apache.ignite.internal.processors.query.h2.twostep.GridMapQueryExecutor.onQueryRequest(GridMapQueryExecutor.java:637)
    at org.apache.ignite.internal.processors.query.h2.twostep.GridMapQueryExecutor.onMessage(GridMapQueryExecutor.java:224)
    at org.apache.ignite.internal.processors.query.h2.twostep.GridMapQueryExecutor$2.onMessage(GridMapQueryExecutor.java:184)
    at org.apache.ignite.internal.managers.communication.GridIoManager$ArrayListener.onMessage(GridIoManager.java:2333)
    at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
    at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1184)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:125)
    at org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1091)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.h2.jdbc.JdbcSQLException: Schema "default0" not found; SQL statement: SET SCHEMA "default0" [90079-195]
    at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
    at org.h2.message.DbException.get(DbException.java:179)
    at org.h2.message.DbException.get(DbException.java:155)
    at org.h2.engine.Database.getSchema(Database.java:1755)
    at org.h2.command.dml.Set.update(Set.java:408)
    at org.h2.command.CommandContainer.update(CommandContainer.java:101)
    at org.h2.command.Command.executeUpdate(Command.java:260)
    at org.h2.jdbc.JdbcStatement.executeUpdateInternal(JdbcStatement.java:137)
    at org.h2.jdbc.JdbcStatement.executeUpdate(JdbcStatement.java:122)
    at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.connectionForThread(IgniteH2Indexing.java:541)
    ... 13 more
{noformat}
[jira] [Created] (IGNITE-10298) Possible deadlock between restore partition states and checkpoint begin
Pavel Kovalenko created IGNITE-10298:

Summary: Possible deadlock between restore partition states and checkpoint begin
Key: IGNITE-10298
URL: https://issues.apache.org/jira/browse/IGNITE-10298
Project: Ignite
Issue Type: Bug
Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
Fix For: 2.8

There is a possible deadlock between the "restorePartitionStates" phase during cache start and a concurrently running checkpoint:

{noformat}
"db-checkpoint-thread-#50" #89 prio=5 os_prio=0 tid=0x1ad57800 nid=0x2b58 waiting on condition [0x7e42e000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0xddabfcc8> (a java.util.concurrent.CountDownLatch$Sync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
    at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7515)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.init0(GridCacheOffheapManager.java:1331)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.fullSize(GridCacheOffheapManager.java:1459)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.markCheckpointBegin(GridCacheDatabaseSharedManager.java:3397)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.doCheckpoint(GridCacheDatabaseSharedManager.java:3009)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.body(GridCacheDatabaseSharedManager.java:2934)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
    at java.lang.Thread.run(Thread.java:748)

"exchange-worker-#42" #69 prio=5 os_prio=0 tid=0x1c1cd800 nid=0x259c waiting on condition [0x249ae000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x80b634a0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
    at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1328)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.init0(GridCacheOffheapManager.java:1212)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.initialUpdateCounter(GridCacheOffheapManager.java:1537) at
org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.onPartitionInitialCounterUpdated(GridCacheOffheapManager.java:633)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restorePartitionStates(GridCacheDatabaseSharedManager.java:2365)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.beforeExchange(GridCacheDatabaseSharedManager.java:1174)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1119)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:703)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2364)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
    at java.lang.Thread.run(Thread.java:748)
{noformat}

Possible solution is performing
[jira] [Created] (IGNITE-10235) Cache is registered in QueryManager twice if parallel cache start is disabled
Pavel Kovalenko created IGNITE-10235:

Summary: Cache is registered in QueryManager twice if parallel cache start is disabled
Key: IGNITE-10235
URL: https://issues.apache.org/jira/browse/IGNITE-10235
Project: Ignite
Issue Type: Bug
Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
Fix For: 2.8

When the IGNITE_ALLOW_START_CACHES_IN_PARALLEL property is disabled, the callback that registers a cache in QueryManager is invoked twice, which makes it impossible to start the cache if it was recovered before joining the topology.
[jira] [Created] (IGNITE-10226) Partition may restore wrong MOVING state during crash recovery
Pavel Kovalenko created IGNITE-10226:

Summary: Partition may restore wrong MOVING state during crash recovery
Key: IGNITE-10226
URL: https://issues.apache.org/jira/browse/IGNITE-10226
Project: Ignite
Issue Type: Bug
Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
Fix For: 2.8

This can be reproduced only in versions that don't have IGNITE-9420:

1) Start a cache, upload some data to partitions, forceCheckpoint.
2) Start uploading additional data, then kill the node. The node should be killed so that the last checkpoint is skipped, or during the checkpoint mark phase.
3) Restart the node; crash recovery for partitions starts.

When we create a partition during crash recovery (topology().forceCreatePartition()), we log its initial state to the WAL. If there is any logical update related to the partition, we log a wrong MOVING state to the end of the current WAL. This state is then considered the last valid one when we process PartitionMetaStateRecord records during logical recovery. In the "restorePartitionsState" phase this state is chosen as final, and the partition changes to MOVING, even if in page memory it is OWNING or something else.

To fix this problem in versions 2.4 - 2.7, the additional logging of partition state changes to the WAL during crash recovery (logical recovery) should be removed.
[jira] [Created] (IGNITE-10035) Fix IgniteWalFormatFileFailoverTest tests
Pavel Kovalenko created IGNITE-10035:

Summary: Fix IgniteWalFormatFileFailoverTest tests
Key: IGNITE-10035
URL: https://issues.apache.org/jira/browse/IGNITE-10035
Project: Ignite
Issue Type: New Feature
Components: cache
Affects Versions: 2.8
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
Fix For: 2.8

After IGNITE-9420 was introduced, the WAL archiver component is started together with the WAL manager. The tests assume that the WAL archiver is started after the first activation, with the proper file IO factory injected into it. We need to find out how to inject the file IO factory before the file archiver is started.
[jira] [Created] (IGNITE-9725) Introduce affinity distribution prototype for equal cache group configurations
Pavel Kovalenko created IGNITE-9725:

Summary: Introduce affinity distribution prototype for equal cache group configurations
Key: IGNITE-9725
URL: https://issues.apache.org/jira/browse/IGNITE-9725
Project: Ignite
Issue Type: New Feature
Components: cache
Affects Versions: 2.0
Reporter: Pavel Kovalenko
Fix For: 2.8

Currently, we perform affinity recalculation for each cache group, even if their configurations (cache mode, number of backups, affinity function) are equal. If two cache groups have the same affinity function and number of backups, we can calculate an affinity prototype for such groups once and reuse it in every cache group, as sketched below. This change will save time on affinity recalculation when a cluster has a lot of cache groups with the same affinity function.
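A sketch of reusing one computed distribution per group of equal configurations; the SimilarityKey of (cache mode, backups, affinity function class) follows the description above, and the cache itself is illustrative:

{code:java}
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Caches one affinity distribution per "similarity key" and reuses it across groups. */
public class AffinityPrototypeCache {
    /** Groups with equal mode, backups and affinity function share a distribution. */
    public static final class SimilarityKey {
        final String cacheMode;
        final int backups;
        final Class<?> affFuncCls;

        public SimilarityKey(String cacheMode, int backups, Class<?> affFuncCls) {
            this.cacheMode = cacheMode;
            this.backups = backups;
            this.affFuncCls = affFuncCls;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof SimilarityKey))
                return false;

            SimilarityKey k = (SimilarityKey)o;

            return backups == k.backups && cacheMode.equals(k.cacheMode) && affFuncCls.equals(k.affFuncCls);
        }

        @Override public int hashCode() {
            return Objects.hash(cacheMode, backups, affFuncCls);
        }
    }

    private final Map<SimilarityKey, List<List<UUID>>> prototypes = new ConcurrentHashMap<>();

    /** Computes the distribution once per key; similar groups get the cached prototype. */
    public List<List<UUID>> assignment(SimilarityKey key, Supplier<List<List<UUID>>> calc) {
        return prototypes.computeIfAbsent(key, k -> calc.get());
    }
}
{code}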
[jira] [Created] (IGNITE-9683) Create manual pinger for ZK client
Pavel Kovalenko created IGNITE-9683:

Summary: Create manual pinger for ZK client
Key: IGNITE-9683
URL: https://issues.apache.org/jira/browse/IGNITE-9683
Project: Ignite
Issue Type: Improvement
Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Fix For: 2.8

Losing the connection to ZooKeeper for longer than the ZK session timeout is unacceptable for server nodes. To improve connection durability, we need to keep the ZK session alive as long as possible. We should introduce a manual pinger in addition to the ZK client, which pings the ZK server with a simple request on each tick, as sketched below.
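A minimal sketch of such a pinger over the plain ZooKeeper client, using exists("/") as the simple request; the scheduling details are illustrative:

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.ZooKeeper;

/** Periodically touches the ZK server to keep the session alive. */
public class ZkManualPinger {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start(ZooKeeper zk, long tickMs) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                zk.exists("/", false); // cheap read renews session activity
            }
            catch (Exception e) {
                // Connection problems are handled by the client's own watcher.
            }
        }, tickMs, tickMs, TimeUnit.MILLISECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
{code}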
[jira] [Created] (IGNITE-9661) Improve partition states validation
Pavel Kovalenko created IGNITE-9661:

Summary: Improve partition states validation
Key: IGNITE-9661
URL: https://issues.apache.org/jira/browse/IGNITE-9661
Project: Ignite
Issue Type: Improvement
Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
Fix For: 2.7

Currently, we validate partition states one by one, and the whole algorithm has complexity O(G * P * N * log P), where G is the number of cache groups, P is the number of partitions in each cache group, and N is the number of nodes. The overall complexity can be optimized (the log P factor can be removed), as sketched below. We should also consider parallelizing the algorithm.
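A sketch of the optimized per-group validation, assuming each node reports its partition update counters as an array indexed by partition (array indexing removes the log P factor of sorted-map lookups); the counter representation is an assumption:

{code:java}
import java.util.Map;

/** Validates that all nodes report equal per-partition update counters: O(N * P) per group. */
public class PartitionStateValidator {
    /**
     * @param counters node id -> update counters indexed by partition;
     *                 all arrays are assumed to have the same length.
     * @return index of the first mismatching partition, or -1 if consistent.
     */
    public static int validate(Map<String, long[]> counters) {
        long[] ref = null;

        for (long[] nodeCntrs : counters.values()) {
            if (ref == null) {
                ref = nodeCntrs; // first node's counters become the reference
                continue;
            }

            for (int p = 0; p < ref.length; p++) {
                if (nodeCntrs[p] != ref[p])
                    return p;
            }
        }

        return -1;
    }
}
{code}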
[jira] [Created] (IGNITE-9649) Rework logging in important places
Pavel Kovalenko created IGNITE-9649:

Summary: Rework logging in important places
Key: IGNITE-9649
URL: https://issues.apache.org/jira/browse/IGNITE-9649
Project: Ignite
Issue Type: Improvement
Components: cache
Affects Versions: 2.0
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
Fix For: 2.8

Currently, we have insufficient, incomplete, or excessive logs at DEBUG and TRACE levels. We should revisit and rework logging in important places of the product:

1) Partitions Map Exchange
2) Rebalance
3) Partitions workflow
4) Time logging for critical places
[jira] [Created] (IGNITE-9562) Destroyed cache that resurrected on an old offline node breaks PME
Pavel Kovalenko created IGNITE-9562:

Summary: Destroyed cache that resurrected on an old offline node breaks PME
Key: IGNITE-9562
URL: https://issues.apache.org/jira/browse/IGNITE-9562
Project: Ignite
Issue Type: Bug
Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Fix For: 2.8

Given: 2 nodes, persistence enabled.

1) Stop 1 node.
2) Destroy the cache through a client.
3) Start the stopped node.

When the stopped node joins the cluster, it starts all caches that it had seen before stopping. If such a cache was destroyed cluster-wide, this breaks the crash recovery process or PME. Root cause: we don't start/collect caches from the stopped node on the rest of the cluster.

In the case of PARTITIONED cache mode, this scenario breaks crash recovery:

{noformat}
java.lang.AssertionError: AffinityTopologyVersion [topVer=-1, minorTopVer=0]
    at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:696)
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.updateLocal(GridDhtPartitionTopologyImpl.java:2449)
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.afterStateRestored(GridDhtPartitionTopologyImpl.java:679)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restorePartitionStates(GridCacheDatabaseSharedManager.java:2445)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.applyLastUpdates(GridCacheDatabaseSharedManager.java:2321)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restoreState(GridCacheDatabaseSharedManager.java:1568)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.beforeExchange(GridCacheDatabaseSharedManager.java:1308)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1255)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:766)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2577)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2457)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
    at java.lang.Thread.run(Thread.java:748)
{noformat}

In the case of REPLICATED cache mode, this scenario breaks PME on the coordinator:

{noformat}
[2018-09-12 18:50:36,407][ERROR][sys-#148%distributed.CacheStopAndRessurectOnOldNodeTest0%][GridCacheIoManager] Failed to process message [senderId=4b6fd0d4-b756-4a9f-90ca-f0ee2511, messageType=class o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionsSingleMessage]
java.lang.AssertionError: 3080586
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.clientTopology(GridCachePartitionExchangeManager.java:815)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.updatePartitionSingleMap(GridDhtPartitionsExchangeFuture.java:3621)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.processSingleMessage(GridDhtPartitionsExchangeFuture.java:2439) at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$100(GridDhtPartitionsExchangeFuture.java:137)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2.apply(GridDhtPartitionsExchangeFuture.java:2261)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2.apply(GridDhtPartitionsExchangeFuture.java:2249)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:383)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:353)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onReceiveSingleMessage(GridDhtPartitionsExchangeFuture.java:2249)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.processSinglePartitionUpdate(GridCachePartitionExchangeManager.java:1628) at
[jira] [Created] (IGNITE-9561) Optimize affinity initialization for started cache groups
Pavel Kovalenko created IGNITE-9561:

Summary: Optimize affinity initialization for started cache groups
Key: IGNITE-9561
URL: https://issues.apache.org/jira/browse/IGNITE-9561
Project: Ignite
Issue Type: Improvement
Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
Fix For: 2.7

At the end of the

{noformat}
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager#processCacheStartRequests
{noformat}

method, we initialize affinity for the cache groups starting on the current exchange. We do it one by one, synchronously waiting for an AffinityFetchResponse for each starting group. This is inefficient; we can parallelize this process and speed up cache start, as sketched below.
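A sketch of issuing all fetches first and waiting afterwards, instead of fetch-and-wait per group; fetchAffinity is a hypothetical async stand-in for the per-group AffinityFetchResponse round trip:

{code:java}
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Illustrative parallel affinity fetch for all cache groups starting on an exchange. */
public class ParallelAffinityInit {
    public static <G, A> List<A> initAll(List<G> startingGroups, Function<G, CompletableFuture<A>> fetchAffinity) {
        // Fire all fetch requests first...
        List<CompletableFuture<A>> futs = startingGroups.stream()
            .map(fetchAffinity)
            .collect(Collectors.toList());

        // ...then wait for all responses, so round trips overlap.
        return futs.stream().map(CompletableFuture::join).collect(Collectors.toList());
    }
}
{code}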
[jira] [Created] (IGNITE-9501) Exclude newly joining nodes from exchange latch
Pavel Kovalenko created IGNITE-9501:

Summary: Exclude newly joining nodes from exchange latch
Key: IGNITE-9501
URL: https://issues.apache.org/jira/browse/IGNITE-9501
Project: Ignite
Issue Type: Improvement
Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko
Fix For: 2.7

Currently, we wait for latch completion from newly joining nodes. However, such nodes don't have any updates to be synced during partitions release. Newly joining nodes may start their caches before the exchange latch is created, and this can delay the exchange process. We should explicitly ignore such nodes and not include them in the latch participants (see the sketch below).
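A minimal sketch of the proposed filtering; the joinedOnThisExchange predicate is a stand-in for however a newly joining node is detected:

{code:java}
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Latch participants: only nodes that were in the topology before this exchange. */
public class LatchParticipants {
    public static <N> List<N> participants(List<N> srvNodes, Predicate<N> joinedOnThisExchange) {
        // Newly joining nodes have no ongoing operations to await, so exclude them.
        return srvNodes.stream().filter(joinedOnThisExchange.negate()).collect(Collectors.toList());
    }
}
{code}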
[jira] [Created] (IGNITE-9496) Add listenAsync method to GridFutureAdapter
Pavel Kovalenko created IGNITE-9496:

Summary: Add listenAsync method to GridFutureAdapter
Key: IGNITE-9496
URL: https://issues.apache.org/jira/browse/IGNITE-9496
Project: Ignite
Issue Type: Improvement
Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Fix For: 2.7

Currently, there is no way to add an async listener to an internal future while choosing an appropriate executor for that listener. This would be useful for changing the thread that executes a future listener. We should add a listenAsync method to GridFutureAdapter with the ability to supply an arbitrary submitter/executor for such listeners, as sketched below.
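A sketch of the intended listenAsync contract on top of java.util.concurrent (the listener runs on the supplied executor rather than on the thread completing the future); the final GridFutureAdapter signature may differ:

{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

/** Illustrative contract for the proposed listenAsync. */
public class ListenAsyncExample {
    /** Notifies the listener on the given executor instead of the completing thread. */
    public static <T> void listenAsync(CompletableFuture<T> fut, Executor exec, Consumer<T> lsnr) {
        fut.thenAcceptAsync(lsnr, exec);
    }

    public static void main(String[] args) {
        CompletableFuture<String> fut = new CompletableFuture<>();

        // Runnable::run executes the listener inline, standing in for a real pool.
        listenAsync(fut, Runnable::run, res -> System.out.println("Got: " + res));

        fut.complete("done"); // listener runs via the supplied executor
    }
}
{code}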
[jira] [Created] (IGNITE-9494) Communication error resolver may be invoked when topology is under construction
Pavel Kovalenko created IGNITE-9494:

Summary: Communication error resolver may be invoked when topology is under construction
Key: IGNITE-9494
URL: https://issues.apache.org/jira/browse/IGNITE-9494
Project: Ignite
Issue Type: Bug
Components: cache
Affects Versions: 2.5
Reporter: Pavel Kovalenko
Fix For: 2.7

ZooKeeper discovery: during a massive node start and join, communication errors can occur, which can lead to invoking the communication error resolver. The communication error resolver initiates a peer-to-peer ping process on all alive nodes. The youngest nodes in the cluster may have an incomplete picture of the alive nodes. This can lead to a situation where the youngest node does not ping all available nodes, so the coordinator may decide that those nodes have an unstable network and unexpectedly kill them. We should throttle the communication error resolver during massive node joins and give nodes time to get a complete picture of the topology.
[jira] [Created] (IGNITE-9493) Communication error resolver shouldn't be invoked if connection with client breaks unexpectedly
Pavel Kovalenko created IGNITE-9493: --- Summary: Communication error resolver shouldn't be invoked if connection with client breaks unexpectedly Key: IGNITE-9493 URL: https://issues.apache.org/jira/browse/IGNITE-9493 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Fix For: 2.7 Currently, we initiate the communication error resolving process even if a connection between a server and a client breaks unexpectedly. This is an unnecessary action, because client nodes are not important for cluster stability. We should ignore communication errors for client and daemon nodes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
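A minimal sketch of the proposed check; the surrounding method and startCommunicationErrorResolve(...) are hypothetical, while isClient()/isDaemon() are the standard ClusterNode flags:
{code:java}
import org.apache.ignite.cluster.ClusterNode;

// On a communication failure, skip resolving for nodes that do not
// affect cluster stability.
void onCommunicationFailure(ClusterNode node) {
    if (node.isClient() || node.isDaemon())
        return; // Never start the resolver because of a client or daemon node.

    startCommunicationErrorResolve(node); // hypothetical existing call
}
{code}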
[jira] [Created] (IGNITE-9492) Limit number of threads which process SingleMessage with exchangeId==null
Pavel Kovalenko created IGNITE-9492: --- Summary: Limit number of threads which process SingleMessage with exchangeId==null Key: IGNITE-9492 URL: https://issues.apache.org/jira/browse/IGNITE-9492 Project: Ignite Issue Type: Improvement Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.7 Currently, after each PME the coordinator spends a lot of time processing correcting single messages (with exchange id == null). This leads to a growing inbound/outbound message queue and delays other coordinator-aware processes. Processing single messages with exchange id == null is not so important that it should get all available resources. We should limit the number of sys-threads that are able to process such single messages. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
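One possible shape of the fix, sketched with a small dedicated pool so such messages cannot occupy the whole sys pool (the pool size and handler names are assumptions):
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Dedicated pool of limited size for single messages with exchangeId == null
// (size 2 is just an example).
private final ExecutorService nullExchIdPool = Executors.newFixedThreadPool(2);

void onSingleMessage(GridDhtPartitionsSingleMessage msg) {
    if (msg.exchangeId() == null)
        nullExchIdPool.execute(() -> processCorrectingMessage(msg)); // hypothetical handler
    else
        processExchangeMessage(msg); // hypothetical handler
}
{code}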
[jira] [Created] (IGNITE-9491) Exchange latch coordinator shouldn't be oldest node in a cluster
Pavel Kovalenko created IGNITE-9491: --- Summary: Exchange latch coordinator shouldn't be oldest node in a cluster Key: IGNITE-9491 URL: https://issues.apache.org/jira/browse/IGNITE-9491 Project: Ignite Issue Type: Improvement Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.7 Currently, we have a lot of components with coordinator election ability. Each of these components elects the oldest node as coordinator. This overloads the oldest node and may delay some processes. In large topologies the oldest node can have a large inbound/outbound message queue, which delays the processing of Exchange Latch Ack messages. We should choose the second-oldest node as the latch coordinator to unload the oldest node. This change should significantly accelerate the exchange latch waiting process. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
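A minimal sketch of the proposed election rule (standard ClusterNode API; the surrounding latch code is assumed):
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.ignite.cluster.ClusterNode;

// Pick the second-oldest alive server node as the latch coordinator,
// falling back to the oldest one in a single-node topology.
ClusterNode latchCoordinator(List<ClusterNode> aliveSrvNodes) {
    List<ClusterNode> sorted = new ArrayList<>(aliveSrvNodes);

    sorted.sort(Comparator.comparingLong(ClusterNode::order));

    return sorted.size() > 1 ? sorted.get(1) : sorted.get(0);
}
{code}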
[jira] [Created] (IGNITE-9449) Lazy unmarshalling of discovery events in TcpDiscovery
Pavel Kovalenko created IGNITE-9449: --- Summary: Lazy unmarshalling of discovery events in TcpDiscovery Key: IGNITE-9449 URL: https://issues.apache.org/jira/browse/IGNITE-9449 Project: Ignite Issue Type: Improvement Components: cache Affects Versions: 2.6, 2.5, 2.4 Reporter: Pavel Kovalenko Fix For: 2.7 Currently the disco-msg-worker thread spends the major part of its time unmarshalling a discovery message before sending it to the next node. In most cases this is unnecessary: the message can be sent immediately after it is received and discovery-event-worker is notified. The responsibility for unmarshalling should be moved to the discovery-event-worker thread; this improvement will significantly reduce the latency of sending custom messages across the ring. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-9420) Move logical recovery phase outside of PME
Pavel Kovalenko created IGNITE-9420: --- Summary: Move logical recovery phase outside of PME Key: IGNITE-9420 URL: https://issues.apache.org/jira/browse/IGNITE-9420 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Fix For: 2.7 Currently, we perform logical recovery inside PME, here: org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager#restoreState We should move logical recovery to before the discovery manager starts. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-9419) Avoid saving cache configuration synchronously during PME
Pavel Kovalenko created IGNITE-9419: --- Summary: Avoid saving cache configuration synchronously during PME Key: IGNITE-9419 URL: https://issues.apache.org/jira/browse/IGNITE-9419 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Fix For: 2.7 Currently, we save the cache configuration during PME at the activation phase, here: org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.CachesInfo#updateCachesInfo. We should avoid this, as it performs disk operations. We should save the configuration asynchronously or lazily. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-9418) Avoid initialize file page store manager for caches during PME synchronously
Pavel Kovalenko created IGNITE-9418: --- Summary: Avoid initialize file page store manager for caches during PME synchronously Key: IGNITE-9418 URL: https://issues.apache.org/jira/browse/IGNITE-9418 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.7 Currently, we create partition and index files during PME for starting caches at the beginning of the org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager#readCheckpointAndRestoreMemory method. We should avoid this because it sometimes takes a long time, as we perform writes to disk. If a cache was registered during PME, we should initialize its page store lazily or asynchronously. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-9398) Reduce time on processing CustomDiscoveryMessage by discovery worker
Pavel Kovalenko created IGNITE-9398: --- Summary: Reduce time on processing CustomDiscoveryMessage by discovery worker Key: IGNITE-9398 URL: https://issues.apache.org/jira/browse/IGNITE-9398 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.6, 2.5, 2.4 Reporter: Pavel Kovalenko Fix For: 2.7 Processing a discovery CustomMessage may take a significant amount of time (0.5-0.7 seconds) before the message is sent to the next node in the topology. This adds up significantly to the total PME time if the topology has multiple nodes. Let X = the time each node's discovery-msg-worker spends processing a discovery message before sending it to the next node, and let N = the number of nodes in the topology. Then the minimal total time of the exchange is T = N * X. We shouldn't perform heavy actions while processing a discovery message. The best solution is a separate thread that does the heavy processing, while discovery-msg-worker just passes the message to that thread and immediately sends it to the next node in the topology (see the sketch after the log below). This affects both TcpDiscoverySpi and ZkDiscoverySpi. {noformat} [11:59:33,134][INFO][tcp-disco-msg-worker-#2][TcpDiscoverySpi] Enqueued message type = TcpDiscoveryCustomEventMessage id = e4b542b6561-a38dfe31-dcfd-430b-acb3-5a531db4197e time = 0 [11:59:33,537][INFO][tcp-disco-msg-worker-#2][GridSnapshotAwareClusterStateProcessorImpl] Received activate request with BaselineTopology[id=0] [11:59:33,549][INFO][tcp-disco-msg-worker-#2][GridSnapshotAwareClusterStateProcessorImpl] Started state transition: true [11:59:33,752][INFO][exchange-worker-#62][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=110, minorTopVer=1], crd=true, evt=DISCOVERY_CUSTOM_EVT, evtNode=a38dfe31-dcfd-430b-acb3-5a531db4197e, customEvt=ChangeGlobalStateMessage [id=cea542b6561-47395de6-c204-4576-a0a3-99ec53d41ac3, reqId=5b651439-7a6a-43fc-9cb0-d646c3380576, initiatingNodeId=a38dfe31-dcfd-430b-acb3-5a531db4197e, activate=true, baselineTopology=BaselineTopology [id=0, branchingHash=-69412111965, branchingType='New BaselineTopology', baselineNodes=[node42, node43, node44, node45, node46, node47, node48, node49, node50, node51, node52, node53, node54, node55, node56, node57, node58, node59, node1, node4, node5, node2, node3, node8, node9, node6, node7, node60, node61, node62, node63, node64, node65, node66, node67, node68, node69, node70, node71, node72, node73, node74, node75, node76, node77, node78, node79, node80, node81, node82, node83, node84, node85, node86, node87, node88, node89, node90, node91, node92, node93, node94, node95, node96, node97, node10, node98, node11, node99, node12, node13, node14, node15, node16, node100, node17, node18, node19, node108, node107, node106, node105, node104, node103, node102, node101, node109, node20, node21, node22, node23, node24, node25, node26, node27, node28, node29, node110, node30, node31, node32, node33, node34, node35, node36, node37, node38, node39, node40, node41]], forceChangeBaselineTopology=false, timestamp=1535101173015], allowMerge=false] [11:59:33,753][INFO][exchange-worker-#62][GridDhtPartitionsExchangeFuture] Start activation process [nodeId=1906b9c3-73f4-4c30-85cc-cf6b99c3bab9, client=false, topVer=AffinityTopologyVersion [topVer=110, minorTopVer=1]] [11:59:33,756][INFO][exchange-worker-#62][FilePageStoreManager] Resolved page store work directory: /storage/ssd/avolkov/tiden/snapshots-180824-114937/test_pitr/ignite.server.1/work/db/node1 [11:59:33,756][INFO][exchange-worker-#62][FileWriteAheadLogManager] Resolved write ahead log work directory: 
/storage/ssd/avolkov/tiden/snapshots-180824-114937/test_pitr/ignite.server.1/work/db/wal/node1 [11:59:33,756][INFO][exchange-worker-#62][FileWriteAheadLogManager] Resolved write ahead log archive directory: /storage/ssd/avolkov/tiden/snapshots-180824-114937/test_pitr/ignite.server.1/work/db/wal/archive/node1 [11:59:33,757][INFO][exchange-worker-#62][FileWriteAheadLogManager] Started write-ahead log manager [mode=LOG_ONLY] [11:59:33,763][INFO][tcp-disco-msg-worker-#2][TcpDiscoverySpi] Processed message type = TcpDiscoveryCustomEventMessage id = e4b542b6561-a38dfe31-dcfd-430b-acb3-5a531db4197e time = 629 {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
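A minimal sketch of the proposed hand-off; sendToNextNode(...) and notifyListeners(...) are hypothetical stand-ins for the ring forwarding and the heavy processing:
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Heavy custom-message processing is moved off the ring worker.
private final BlockingQueue<TcpDiscoveryCustomEventMessage> heavyQueue =
    new LinkedBlockingQueue<>();

// Ring worker: forward first, process later.
void onCustomMessage(TcpDiscoveryCustomEventMessage msg) {
    sendToNextNode(msg);   // hypothetical: pass the message along the ring immediately

    heavyQueue.offer(msg); // the processing thread will pick it up
}

// Dedicated processing thread.
void processingLoop() throws InterruptedException {
    while (!Thread.currentThread().isInterrupted())
        notifyListeners(heavyQueue.take()); // hypothetical heavy processing
}
{code}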
[jira] [Created] (IGNITE-9271) Implement transaction commit using thread per partition model
Pavel Kovalenko created IGNITE-9271: --- Summary: Implement transaction commit using thread per partition model Key: IGNITE-9271 URL: https://issues.apache.org/jira/browse/IGNITE-9271 Project: Ignite Issue Type: Sub-task Components: cache Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.7 Currently, we perform the commit of a transaction from a sys thread and do write operations on multiple partitions. We should delegate such operations to the appropriate per-partition threads and wait for the results. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
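A minimal sketch of the delegation idea under the thread-per-partition model; TxEntry, writeEntry(...) and the stripe array are hypothetical stand-ins:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

// One single-threaded stripe per partition; writes for the same partition
// always land on the same stripe, as the model requires.
CompletableFuture<Void> commitAsync(List<TxEntry> entries, ExecutorService[] stripes) {
    List<CompletableFuture<Void>> futs = new ArrayList<>();

    for (TxEntry e : entries) {
        ExecutorService stripe = stripes[e.partition() % stripes.length];

        futs.add(CompletableFuture.runAsync(() -> writeEntry(e), stripe)); // hypothetical write
    }

    // The committing thread only waits for the per-partition writes.
    return CompletableFuture.allOf(futs.toArray(new CompletableFuture[0]));
}
{code}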
[jira] [Created] (IGNITE-9270) Design thread per partition model
Pavel Kovalenko created IGNITE-9270: --- Summary: Design thread per partition model Key: IGNITE-9270 URL: https://issues.apache.org/jira/browse/IGNITE-9270 Project: Ignite Issue Type: Sub-task Components: cache Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.7 A new model of executing cache partition operations (READ, CREATE, UPDATE, DELETE) should satisfy the following conditions: 1) All modifying operations (CREATE, UPDATE, DELETE) on a given partition must be performed by the same thread. 2) Read operations can be executed by any thread. We should investigate the performance of a dedicated executor service for such operations; alternatively, we can use a messaging model and let network threads perform them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-9206) Node can't join to ring if all existing nodes have stopped and another new node joined ahead
Pavel Kovalenko created IGNITE-9206: --- Summary: Node can't join to ring if all existing nodes have stopped and another new node joined ahead Key: IGNITE-9206 URL: https://issues.apache.org/jira/browse/IGNITE-9206 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Fix For: 2.7 TcpDiscovery SPI problem. Situation: an existing cluster with nodes 1 and 2; both nodes are stopping. 1) Node 3 joins the cluster and sends a JoinMessage to node 2. 2) Node 2 is stopping and unable to handle the JoinMessage from node 3. Node 3 chooses node 1 as the next node to send the message to. 3) Node 3 sends the JoinMessage to node 1. 4) Node 4 joins the cluster. 5) Node 1 is stopping and unable to handle the JoinMessage from node 3. 6) Node 4 sees that there are no alive nodes in the ring at that time and becomes the first node in the topology. 7) Node 3 sends the JoinMessage to node 4, and this process repeats again and again without any success. In node 4's logs we can see that the remote connection from node 3 is established, but no active actions have been performed. Node 3 stays in the CONNECTING state forever. At the same time node 4 thinks that node 3 is already in the ring. Failed test: GridCacheReplicatedDataStructuresFailoverSelfTest#testAtomicSequenceConstantTopologyChange Link to TC: https://ci.ignite.apache.org/viewLog.html?buildId=1594376&tab=buildResultsDiv&buildTypeId=IgniteTests24Java8_DataStructures Shrunk log: {code:java} [00:09:13] : [Step 3/4] [2018-08-04 21:09:13,733][INFO ][main][root] >>> Stopping grid [name=replicated.GridCacheReplicatedDataStructuresFailoverSelfTest0, id=3e2c94bd-8e98-4dd9-8d1a-befbfe00] [00:09:13] : [Step 3/4] [2018-08-04 21:09:13,739][INFO ][thread-replicated.GridCacheReplicatedDataStructuresFailoverSelfTest7][root] Start node: replicated.GridCacheReplicatedDataStructuresFailoverSelfTest7 [00:09:13] : [Step 3/4] [2018-08-04 21:09:13,740][INFO ][tcp-disco-msg-worker-#2146%replicated.GridCacheReplicatedDataStructuresFailoverSelfTest6%][TcpDiscoverySpi] New next node [newNext=TcpDiscoveryNode [id=3e2c94bd-8e98-4dd9-8d1a-befbfe00, addrs=ArrayList [127.0.0.1], sockAddrs=HashSet [/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1533416953738, loc=false, ver=2.7.0#20180803-sha1:3ab8bbad, isClient=false]] [00:09:13] : [Step 3/4] [2018-08-04 21:09:13,741][INFO ][tcp-disco-srvr-#2100%replicated.GridCacheReplicatedDataStructuresFailoverSelfTest0%][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/127.0.0.1, rmtPort=50099] [00:09:13] : [Step 3/4] [2018-08-04 21:09:13,741][INFO ][tcp-disco-srvr-#2100%replicated.GridCacheReplicatedDataStructuresFailoverSelfTest0%][TcpDiscoverySpi] TCP discovery spawning a new thread for connection [rmtAddr=/127.0.0.1, rmtPort=50099] [00:09:13] : [Step 3/4] [2018-08-04 21:09:13,743][INFO ][tcp-disco-sock-reader-#2151%replicated.GridCacheReplicatedDataStructuresFailoverSelfTest0%][TcpDiscoverySpi] Started serving remote node connection [rmtAddr=/127.0.0.1:50099, rmtPort=50099] [00:09:13] : [Step 3/4] [2018-08-04 21:09:13,746][INFO ][thread-replicated.GridCacheReplicatedDataStructuresFailoverSelfTest7][GridCacheReplicatedDataStructuresFailoverSelfTest7] [00:09:13] : [Step 3/4] [00:09:13] : [Step 3/4] >>>__ [00:09:13] : [Step 3/4] >>> / _/ ___/ |/ / _/_ __/ __/ [00:09:13] : [Step 3/4] >>> _/ // (7 7// / / / / _/ [00:09:13] : [Step 3/4] >>> /___/\___/_/|_/___/ /_/ /___/ [00:09:13] : [Step 3/4] >>> [00:09:13] : [Step 3/4] >>> ver. 
2.7.0-SNAPSHOT#20180803-sha1:3ab8bbad [00:09:13] : [Step 3/4] >>> 2018 Copyright(C) Apache Software Foundation [00:09:13] : [Step 3/4] >>> [00:09:13] : [Step 3/4] >>> Ignite documentation: http://ignite.apache.org [00:09:13] : [Step 3/4] [00:09:13] : [Step 3/4] [2018-08-04 21:09:13,746][INFO ][thread-replicated.GridCacheReplicatedDataStructuresFailoverSelfTest7][GridCacheReplicatedDataStructuresFailoverSelfTest7] Config URL: n/a [00:09:13] : [Step 3/4] [2018-08-04 21:09:13,747][INFO ][thread-replicated.GridCacheReplicatedDataStructuresFailoverSelfTest7][GridCacheReplicatedDataStructuresFailoverSelfTest7] IgniteConfiguration [igniteInstanceName=replicated.GridCacheReplicatedDataStructuresFailoverSelfTest7, pubPoolSize=8, svcPoolSize=8, callbackPoolSize=8, stripedPoolSize=8, sysPoolSize=8, mgmtPoolSize=4, igfsPoolSize=5, dataStreamerPoolSize=8, utilityCachePoolSize=8, utilityCacheKeepAliveTime=6, p2pPoolSize=2, qryPoolSize=8, igniteHome=/data/teamcity/work/9198da4c51c3e112, igniteWorkDir=/data/teamcity/work/9198da4c51c3e112/work, mbeanSrv=com.sun.jmx.mbeanserver.JmxMBeanServer@13fed1ec, nodeId=fe9e7ca7-c0fa-4b51-8a87-1255f8c7, marsh=BinaryMarshaller [], {code}
[jira] [Created] (IGNITE-9185) Collect and check update counters visited during WAL rebalance
Pavel Kovalenko created IGNITE-9185: --- Summary: Collect and check update counters visited during WAL rebalance Key: IGNITE-9185 URL: https://issues.apache.org/jira/browse/IGNITE-9185 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.7 Currently we don't check which update counters we visit during WAL iteration and what data we send to the demander node. There can be a situation where we meet the last requested update counter in the WAL and stop the rebalance process, while due to possible DataRecord reordering we miss some updates that come after it. If the rebalance process breaks due to the end of the WAL but not all data records were visited, we can easily check which records are missing, cancel the rebalance and print useful information to the log for further debugging. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-9157) Optimize memory usage of data regions in tests
Pavel Kovalenko created IGNITE-9157: --- Summary: Optimize memory usage of data regions in tests Key: IGNITE-9157 URL: https://issues.apache.org/jira/browse/IGNITE-9157 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.6 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.7 If we use persistence in tests and do not explicitly set the max size of a data region, by default it will be 20% of the available RAM on the host. This can lead to memory over-usage, and sometimes the JVMs running such tests will be killed by the Linux OOM killer. We should find all tests where the data region max size was forgotten and set this value explicitly to the minimal possible value. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
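For illustration, a test configuration that caps the default data region explicitly via the public API (the 100 MB value is just an example):
{code:java}
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

IgniteConfiguration cfg = new IgniteConfiguration();

// Cap the default region instead of relying on the 20%-of-RAM default.
cfg.setDataStorageConfiguration(new DataStorageConfiguration()
    .setDefaultDataRegionConfiguration(new DataRegionConfiguration()
        .setPersistenceEnabled(true)
        .setMaxSize(100L * 1024 * 1024))); // 100 MB
{code}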
[jira] [Created] (IGNITE-9129) P2P class deployment is failed when using ZK discovery
Pavel Kovalenko created IGNITE-9129: --- Summary: P2P class deployment is failed when using ZK discovery Key: IGNITE-9129 URL: https://issues.apache.org/jira/browse/IGNITE-9129 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.6, 2.5 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.7 In the case of Zookeeper Discovery, a node joining a cluster receives information that some user classes have already been deployed but do not exist in its local classpath. In this case, the node tries to request these classes from the nodes that contain them, but does it synchronously during Zookeeper Discovery start and gets a NullPointerException because the first topology snapshot has not been initialized yet. We should request user classes asynchronously and only after the first topology has been initialized. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-9121) Revisit future.get() usages when process message from Communication SPI
Pavel Kovalenko created IGNITE-9121: --- Summary: Revisit future.get() usages when process message from Communication SPI Key: IGNITE-9121 URL: https://issues.apache.org/jira/browse/IGNITE-9121 Project: Ignite Issue Type: Improvement Components: cache Affects Versions: 2.6, 2.5 Reporter: Pavel Kovalenko Currently, we use an explicit synchronous future.get() when processing messages from the Communication SPI. This may potentially lead to deadlocks due to thread-pool exhaustion, as was shown in IGNITE-9111, for example. To fix the problem we should find all places in the code where we synchronously wait for some future and either refactor these places or introduce a special exception (carrying such a future) with a subsequent retry of the runnable in low-level Communication SPI processing once the future completes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
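A minimal sketch of the non-blocking alternative, assuming the future exposes the listen(...) callback that internal Ignite futures provide; pool, process(...) and handleError(...) are hypothetical:
{code:java}
// Blocking variant that can exhaust the pool:
// Object res = fut.get();
// process(res);

// Non-blocking variant: register a continuation instead of waiting.
fut.listen(f -> pool.execute(() -> {
    try {
        process(f.get()); // the future is already completed here, so get() doesn't block
    }
    catch (Exception e) {
        handleError(e); // hypothetical error handling
    }
}));
{code}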
[jira] [Created] (IGNITE-9111) Do not wait for deactivation in GridClusterStateProcessor#publicApiActiveState
Pavel Kovalenko created IGNITE-9111: --- Summary: Do not wait for deactivation in GridClusterStateProcessor#publicApiActiveState Key: IGNITE-9111 URL: https://issues.apache.org/jira/browse/IGNITE-9111 Project: Ignite Issue Type: Improvement Components: cache Affects Versions: 2.5, 2.4 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.7 Currently, we wait for the activation/deactivation future when checking the state of the cluster. But when deactivation is in progress it doesn't make sense to wait for it, because after a successful wait we will throw an exception that the cluster is not active anyway. Synchronous waiting for a deactivation future may lead to deadlocks if the operation obtains some locks before checking the cluster state. As a solution, we should check and wait only for activation futures. In the case of an in-progress deactivation, we should fail fast and return "false" from the publicApiActiveState method. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
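A rough sketch of the fail-fast shape; all names here are hypothetical and the real state-processor internals differ:
{code:java}
boolean publicApiActiveState(boolean waitForTransition) {
    StateChangeFuture fut = inProgressStateChange(); // hypothetical accessor

    if (fut == null)
        return clusterIsActive(); // hypothetical: no transition in progress

    if (!fut.activate())
        return false; // deactivation in progress: fail fast, never wait

    if (waitForTransition)
        fut.await(); // wait only for activation

    return clusterIsActive();
}
{code}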
[jira] [Created] (IGNITE-9088) Add ability to dump persistence after particular test
Pavel Kovalenko created IGNITE-9088: --- Summary: Add ability to dump persistence after particular test Key: IGNITE-9088 URL: https://issues.apache.org/jira/browse/IGNITE-9088 Project: Ignite Issue Type: Improvement Components: persistence Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.7 Sometimes it's necessary to analyze persistence after a particular test finishes on TeamCity. We should add the ability to dump persistence dirs/files to a specified directory on the test host for further analysis. This should be managed by a property. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
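A minimal sketch of such a test hook in plain Java; the property names and paths are hypothetical:
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

// Copy the persistence work directory aside after a test, if enabled.
static void dumpPersistence(Path workDbDir, String testName) throws IOException {
    if (!Boolean.getBoolean("IGNITE_TEST_DUMP_PERSISTENCE")) // hypothetical flag
        return;

    Path dst = Paths.get(System.getProperty("IGNITE_TEST_DUMP_DIR", "/tmp/ignite-dumps"), testName);

    try (Stream<Path> paths = Files.walk(workDbDir)) {
        for (Path p : (Iterable<Path>) paths::iterator) {
            Path target = dst.resolve(workDbDir.relativize(p).toString());

            if (Files.isDirectory(p))
                Files.createDirectories(target); // directories are visited before their contents
            else
                Files.copy(p, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
{code}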
[jira] [Created] (IGNITE-9086) Error during commit transaction on primary node may lead to breaking transaction data integrity
Pavel Kovalenko created IGNITE-9086: --- Summary: Error during commit transaction on primary node may lead to breaking transaction data integrity Key: IGNITE-9086 URL: https://issues.apache.org/jira/browse/IGNITE-9086 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.6, 2.5, 2.4 Reporter: Pavel Kovalenko Fix For: 2.7 Transaction properties are PESSIMISTIC, REPEATABLE READ. If the primary partitions participating in the transaction are spread across several nodes and the commit fails on some of the primary nodes while the other primary nodes have committed the transaction, transaction data integrity may be broken. The data remains inconsistent even after rebalance, when the node with the failed commit returns to the cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-9084) Trash in WAL after node stop may affect WAL rebalance
Pavel Kovalenko created IGNITE-9084: --- Summary: Trash in WAL after node stop may affect WAL rebalance Key: IGNITE-9084 URL: https://issues.apache.org/jira/browse/IGNITE-9084 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.6 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.7 During iteration over the WAL we can encounter garbage in a WAL segment that may remain there after a node restart. We should handle this situation in the WAL rebalance iterator and gracefully stop the iteration process. {noformat} [2018-07-25 17:18:21,152][ERROR][sys-#25385%persistence.IgnitePdsTxHistoricalRebalancingTest0%][GridCacheIoManager] Failed to process message [senderId=f0d35df7-ff93-4b6c-b699-45f3e7c3, messageType=class o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionDemandMessage] class org.apache.ignite.IgniteException: Failed to read WAL record at position: 19346739 size: 67108864 at org.apache.ignite.internal.util.lang.GridIteratorAdapter.next(GridIteratorAdapter.java:38) at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$WALHistoricalIterator.advance(GridCacheOffheapManager.java:1033) at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$WALHistoricalIterator.next(GridCacheOffheapManager.java:948) at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$WALHistoricalIterator.nextX(GridCacheOffheapManager.java:917) at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$WALHistoricalIterator.nextX(GridCacheOffheapManager.java:842) at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.IgniteRebalanceIteratorImpl.nextX(IgniteRebalanceIteratorImpl.java:130) at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.IgniteRebalanceIteratorImpl.next(IgniteRebalanceIteratorImpl.java:185) at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.IgniteRebalanceIteratorImpl.next(IgniteRebalanceIteratorImpl.java:37) at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:348) at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:370) at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:380) at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:365) at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056) at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581) at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:101) at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1613) at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556) at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125) at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752) at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516) at 
org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125) at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: class org.apache.ignite.IgniteCheckedException: Failed to read WAL record at position: 19346739 size: 67108864 at org.apache.ignite.internal.processors.cache.persistence.wal.AbstractWalRecordsIterator.handleRecordException(AbstractWalRecordsIterator.java:263) at org.apache.ignite.internal.processors.cache.persistence.wal.AbstractWalRecordsIterator.advanceRecord(AbstractWalRecordsIterator.java:229) at org.apache.ignite.internal.processors.cache.persistence.wal.AbstractWalRecordsIterator.advance(AbstractWalRecordsIterator.java:149) at org.apache.ignite.internal.processors.cache.persistence.wal.AbstractWalRecordsIterator.onNext(AbstractWalRecordsIterator.java:115) {noformat}
[jira] [Created] (IGNITE-9082) Throwing checked exception during tx commit without node stopping leads to data corruption
Pavel Kovalenko created IGNITE-9082: --- Summary: Throwing checked exception during tx commit without node stopping leads to data corruption Key: IGNITE-9082 URL: https://issues.apache.org/jira/browse/IGNITE-9082 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.6, 2.5, 2.4 Reporter: Pavel Kovalenko Fix For: 2.7 If we get a checked exception during a tx commit on a primary node, and this exception is neither handled as NodeStopping nor leads to a node stop via the Failure Handler, we may get data loss on a node that is a backup for this tx. Possible solution: if we get any checked or unchecked exception during a tx commit, we should stop the node afterwards to prevent further data loss. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-8904) Add rebalanceThreadPoolSize to nodes configuration consistency check
Pavel Kovalenko created IGNITE-8904: --- Summary: Add rebalanceThreadPoolSize to nodes configuration consistency check Key: IGNITE-8904 URL: https://issues.apache.org/jira/browse/IGNITE-8904 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.5, 2.4 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.7 If the supplier node has a smaller rebalance thread-pool size than the demander node, the rebalance process between them will hang indefinitely. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
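For reference, the pool size in question is set via the public configuration API and must currently be kept consistent across nodes by hand (4 is just an example):
{code:java}
import org.apache.ignite.configuration.IgniteConfiguration;

IgniteConfiguration cfg = new IgniteConfiguration();

// Must match on supplier and demander nodes until it is covered
// by the configuration consistency check.
cfg.setRebalanceThreadPoolSize(4);
{code}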
[jira] [Created] (IGNITE-8848) Introduce new split-brain tests when topology is under load
Pavel Kovalenko created IGNITE-8848: --- Summary: Introduce new split-brain tests when topology is under load Key: IGNITE-8848 URL: https://issues.apache.org/jira/browse/IGNITE-8848 Project: Ignite Issue Type: Improvement Components: cache, zookeeper Affects Versions: 2.5 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.6 We should check the following cases: 1) The primary node of a transaction is located in the part of the cluster that will survive, while the backup isn't. 2) The backup node of a transaction is located in the part of the cluster that will survive, while the primary isn't. 3) A client has a connection to both split-brain parts. 4) A client has a connection to only one part of the split cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-8844) Provide example how to implement auto-activation policy when cluster is activated first time
Pavel Kovalenko created IGNITE-8844: --- Summary: Provide example how to implement auto-activation policy when cluster is activated first time Key: IGNITE-8844 URL: https://issues.apache.org/jira/browse/IGNITE-8844 Project: Ignite Issue Type: Improvement Components: cache Affects Versions: 2.5, 2.4 Reporter: Pavel Kovalenko Fix For: 2.6 Some of our users who embed Ignite face the problem of how to activate the cluster for the first time, when no initial baseline has been established. We should provide an example of such a policy, as we did with BaselineWatcher. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
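A minimal sketch of what such a policy could look like; expectedServers is an assumption, and EVT_NODE_JOINED must be enabled via includeEventTypes for the listener to fire:
{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.events.EventType;

static void installAutoActivation(Ignite ignite, int expectedServers) {
    ignite.events().localListen(evt -> {
        // Activate once the expected number of server nodes has joined.
        if (!ignite.cluster().active()
            && ignite.cluster().forServers().nodes().size() >= expectedServers)
            ignite.cluster().active(true);

        return true; // keep listening
    }, EventType.EVT_NODE_JOINED);
}
{code}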
[jira] [Created] (IGNITE-8835) Do not skip distributed phase of 2-phase partition release if there are some caches to stop / modify
Pavel Kovalenko created IGNITE-8835: --- Summary: Do not skip distributed phase of 2-phase partition release if there are some caches to stop / modify Key: IGNITE-8835 URL: https://issues.apache.org/jira/browse/IGNITE-8835 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.6 If we don't perform the distributed 2-phase partition release in the case of a cache stop, we can lose some transactional updates between primary and backup. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-8793) Introduce metrics for File I/O operations to monitor disk performance
Pavel Kovalenko created IGNITE-8793: --- Summary: Introduce metrics for File I/O operations to monitor disk performance Key: IGNITE-8793 URL: https://issues.apache.org/jira/browse/IGNITE-8793 Project: Ignite Issue Type: Improvement Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Fix For: 2.6 It would be good to introduce some kind of wrapper for File I/O that measures read/write times, to better understand what is happening with persistence. The measurements should be exposed as JMX metrics. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
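A minimal sketch of such a wrapper over a plain FileChannel; the JMX exposure is omitted and the counters are simple LongAdders:
{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.atomic.LongAdder;

// Times every read/write; the adders can back JMX gauges.
class TimedFileIO {
    private final FileChannel ch;

    final LongAdder readNanos = new LongAdder();
    final LongAdder writeNanos = new LongAdder();

    TimedFileIO(FileChannel ch) { this.ch = ch; }

    int read(ByteBuffer buf) throws IOException {
        long start = System.nanoTime();

        try { return ch.read(buf); }
        finally { readNanos.add(System.nanoTime() - start); }
    }

    int write(ByteBuffer buf) throws IOException {
        long start = System.nanoTime();

        try { return ch.write(buf); }
        finally { writeNanos.add(System.nanoTime() - start); }
    }
}
{code}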
[jira] [Created] (IGNITE-8791) IgnitePdsTxCacheRebalancingTest.testTopologyChangesWithConstantLoad fails on TC
Pavel Kovalenko created IGNITE-8791: --- Summary: IgnitePdsTxCacheRebalancingTest.testTopologyChangesWithConstantLoad fails on TC Key: IGNITE-8791 URL: https://issues.apache.org/jira/browse/IGNITE-8791 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.6 {noformat} junit.framework.AssertionFailedError: 46 8204 expected: but was: {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-8785) Node may hang indefinitely in CONNECTING state during cluster segmentation
Pavel Kovalenko created IGNITE-8785: --- Summary: Node may hang indefinitely in CONNECTING state during cluster segmentation Key: IGNITE-8785 URL: https://issues.apache.org/jira/browse/IGNITE-8785 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Fix For: 2.6 Affected test: org.apache.ignite.internal.processors.cache.IgniteTopologyValidatorGridSplitCacheTest#testTopologyValidatorWithCacheGroup The node hangs with the following stacktrace: {noformat} "grid-starter-testTopologyValidatorWithCacheGroup-22" #117619 prio=5 os_prio=0 tid=0x7f17dd19b800 nid=0x304a in Object.wait() [0x7f16b19df000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:931) - locked <0x000705ee4a60> (a java.lang.Object) at org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:373) at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:1948) at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:297) at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.start(GridDiscoveryManager.java:915) at org.apache.ignite.internal.IgniteKernal.startManager(IgniteKernal.java:1739) at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1046) at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2014) at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1723) - locked <0x000705995ec0> (a org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance) at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1151) at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:649) at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:882) at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:845) at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:833) at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:799) at org.apache.ignite.testframework.junits.GridAbstractTest$3.call(GridAbstractTest.java:742) at org.apache.ignite.testframework.GridTestThread.run(GridTestThread.java:86) {noformat} It seems that the node never receives an acknowledgment from the coordinator. There was a failure before: {noformat} [org.apache.ignite:ignite-core] [2018-06-10 04:59:18,876][WARN ][grid-starter-testTopologyValidatorWithCacheGroup-22][IgniteCacheTopologySplitAbstractTest$SplitTcpDiscoverySpi] Node has not been connected to topology and will repeat join process. Check remote nodes logs for possible error messages. Note that large topology may require significant time to start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if getting this message on the starting nodes [networkTimeout=5000] {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-8784) Deadlock during simultaneous client reconnect and node stop
Pavel Kovalenko created IGNITE-8784: --- Summary: Deadlock during simultaneous client reconnect and node stop Key: IGNITE-8784 URL: https://issues.apache.org/jira/browse/IGNITE-8784 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Fix For: 2.6 {noformat} [18:48:22,665][ERROR][tcp-client-disco-msg-worker-#467%client%][IgniteKernal%client] Failed to reconnect, will stop node class org.apache.ignite.IgniteException: Failed to wait for local node joined event (grid is stopping). at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.localJoin(GridDiscoveryManager.java:2193) at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.onKernalStart(GridCachePartitionExchangeManager.java:583) at org.apache.ignite.internal.processors.cache.GridCacheSharedContext.onReconnected(GridCacheSharedContext.java:396) at org.apache.ignite.internal.processors.cache.GridCacheProcessor.onReconnected(GridCacheProcessor.java:1159) at org.apache.ignite.internal.IgniteKernal.onReconnected(IgniteKernal.java:3915) at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$4.onDiscovery0(GridDiscoveryManager.java:830) at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$4.onDiscovery(GridDiscoveryManager.java:589) at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.notifyDiscovery(ClientImpl.java:2423) at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.notifyDiscovery(ClientImpl.java:2402) at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.processNodeAddFinishedMessage(ClientImpl.java:2047) at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.processDiscoveryMessage(ClientImpl.java:1896) at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1788) at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) Caused by: class org.apache.ignite.IgniteCheckedException: Failed to wait for local node joined event (grid is stopping). 
at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.onKernalStop0(GridDiscoveryManager.java:1657) at org.apache.ignite.internal.managers.GridManagerAdapter.onKernalStop(GridManagerAdapter.java:652) at org.apache.ignite.internal.IgniteKernal.stop0(IgniteKernal.java:2218) at org.apache.ignite.internal.IgniteKernal.stop(IgniteKernal.java:2166) at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.stop0(IgnitionEx.java:2588) at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.stop(IgnitionEx.java:2551) at org.apache.ignite.internal.IgnitionEx.stop(IgnitionEx.java:372) at org.apache.ignite.Ignition.stop(Ignition.java:229) at org.apache.ignite.testframework.junits.GridAbstractTest.stopGrid(GridAbstractTest.java:1088) at org.apache.ignite.testframework.junits.GridAbstractTest.stopAllGrids(GridAbstractTest.java:1128) at org.apache.ignite.testframework.junits.GridAbstractTest.stopAllGrids(GridAbstractTest.java:1109) at org.gridgain.grid.internal.processors.cache.database.IgniteDbSnapshotNotStableTopologiesTest.afterTest(IgniteDbSnapshotNotStableTopologiesTest.java:250) at org.apache.ignite.testframework.junits.GridAbstractTest.tearDown(GridAbstractTest.java:1694) at org.apache.ignite.testframework.junits.common.GridCommonAbstractTest.tearDown(GridCommonAbstractTest.java:492) at junit.framework.TestCase.runBare(TestCase.java:146) at junit.framework.TestResult$1.protect(TestResult.java:122) at junit.framework.TestResult.runProtected(TestResult.java:142) at junit.framework.TestResult.run(TestResult.java:125) at junit.framework.TestCase.run(TestCase.java:129) at junit.framework.TestSuite.runTest(TestSuite.java:255) at junit.framework.TestSuite.run(TestSuite.java:250) at junit.framework.TestSuite.runTest(TestSuite.java:255) at junit.framework.TestSuite.run(TestSuite.java:250) at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:84) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:369) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:275) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:239) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:160) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at {noformat}
[jira] [Created] (IGNITE-8780) File I/O operations must be retried if buffer hasn't read/written completely
Pavel Kovalenko created IGNITE-8780: --- Summary: File I/O operations must be retried if buffer hasn't read/written completely Key: IGNITE-8780 URL: https://issues.apache.org/jira/browse/IGNITE-8780 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Fix For: 2.6 Currently we don't actually ensure that we write or read a buffer completely: org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager#writeCheckpointEntry org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager#nodeStart As a result we may not write the actual data to disk, and after a node restart we can get a BufferUnderflowException like this: {noformat} java.nio.BufferUnderflowException at java.nio.Buffer.nextGetIndex(Buffer.java:506) at java.nio.HeapByteBuffer.getLong(HeapByteBuffer.java:412) at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readPointer(GridCacheDatabaseSharedManager.java:1915) at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readCheckpointStatus(GridCacheDatabaseSharedManager.java:1892) at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readMetastore(GridCacheDatabaseSharedManager.java:565) at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.start0(GridCacheDatabaseSharedManager.java:525) at org.apache.ignite.internal.processors.cache.GridCacheSharedManagerAdapter.start(GridCacheSharedManagerAdapter.java:61) at org.apache.ignite.internal.processors.cache.GridCacheProcessor.start(GridCacheProcessor.java:700) at org.apache.ignite.internal.IgniteKernal.startProcessor(IgniteKernal.java:1738) at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:985) at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2014) at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1723) at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1151) at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:671) at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:596) at org.apache.ignite.Ignition.start(Ignition.java:327) at org.apache.ignite.ci.db.TcHelperDb.start(TcHelperDb.java:67) at org.apache.ignite.ci.web.CtxListener.contextInitialized(CtxListener.java:37) at org.eclipse.jetty.server.handler.ContextHandler.callContextInitialized(ContextHandler.java:890) at org.eclipse.jetty.servlet.ServletContextHandler.callContextInitialized(ServletContextHandler.java:532) at org.eclipse.jetty.server.handler.ContextHandler.startContext(ContextHandler.java:853) at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:344) at org.eclipse.jetty.webapp.WebAppContext.startWebapp(WebAppContext.java:1501) at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1463) at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:785) at org.eclipse.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:261) at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:545) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:131) at org.eclipse.jetty.server.Server.start(Server.java:452) at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:105) at 
org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:113) at org.eclipse.jetty.server.Server.doStart(Server.java:419) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) at org.apache.ignite.ci.web.Launcher.runServer(Launcher.java:68) at org.apache.ignite.ci.TcHelperJettyLauncher.main(TcHelperJettyLauncher.java:10) {noformat} and the node ends up in an unrecoverable state. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
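A minimal sketch of the retry loops for a plain FileChannel (positional read/write variants; Ignite's own FileIO abstraction would look analogous):
{code:java}
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Keep writing until the whole buffer is flushed; a single write()
// is allowed to transfer fewer bytes than remaining().
static void writeFully(FileChannel ch, ByteBuffer buf, long pos) throws IOException {
    while (buf.hasRemaining())
        pos += ch.write(buf, pos);
}

// Keep reading until the buffer is full or EOF is hit.
static void readFully(FileChannel ch, ByteBuffer buf, long pos) throws IOException {
    while (buf.hasRemaining()) {
        int read = ch.read(buf, pos);

        if (read < 0)
            throw new EOFException("Unexpected end of file at position: " + pos);

        pos += read;
    }
}
{code}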
[jira] [Created] (IGNITE-8750) IgniteWalFlushDefaultSelfTest.testFailAfterStart fails on TC
Pavel Kovalenko created IGNITE-8750: --- Summary: IgniteWalFlushDefaultSelfTest.testFailAfterStart fails on TC Key: IGNITE-8750 URL: https://issues.apache.org/jira/browse/IGNITE-8750 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.6 {noformat} org.apache.ignite.IgniteException: Failed to get object field [obj=GridCacheSharedManagerAdapter [starting=true, stop=false], fieldNames=[mmap]] Caused by: java.lang.NoSuchFieldException: mmap {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-8691) Get rid of tests jar artifact in ignite-zookeeper module
Pavel Kovalenko created IGNITE-8691: --- Summary: Get rid of tests jar artifact in ignite-zookeeper module Key: IGNITE-8691 URL: https://issues.apache.org/jira/browse/IGNITE-8691 Project: Ignite Issue Type: Bug Components: zookeeper Affects Versions: 2.5 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.6 Currently the Ignite build process produces the {noformat} org/apache/ignite/ignite-zookeeper/2.X.X/ignite-zookeeper-2.X.X-tests.jar {noformat} artifact, which seems to be useless and should be excluded from the packaging output. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-8690) Missed package-info for some packages
Pavel Kovalenko created IGNITE-8690: --- Summary: Missed package-info for some packages Key: IGNITE-8690 URL: https://issues.apache.org/jira/browse/IGNITE-8690 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.6 List of affected packages: {noformat} org.apache.ignite.spi.communication.tcp.internal org.apache.ignite.spi.discovery.zk org.apache.ignite.spi.discovery.zk.internal org.apache.ignite.ml.structures.partition {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-8688) Pending tree is initialized outside of checkpoint lock
Pavel Kovalenko created IGNITE-8688: --- Summary: Pending tree is initialized outside of checkpoint lock Key: IGNITE-8688 URL: https://issues.apache.org/jira/browse/IGNITE-8688 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Assignee: Andrew Mashenkov Fix For: 2.6 This may lead to possible page corruption. {noformat} handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.AssertionError]] [00:11:56]W: [org.gridgain:gridgain-compatibility] java.lang.AssertionError [00:11:56]W: [org.gridgain:gridgain-compatibility] at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.allocatePage(PageMemoryImpl.java:463) [00:11:56]W: [org.gridgain:gridgain-compatibility] at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.allocateForTree(IgniteCacheOffheapManagerImpl.java:818) [00:11:56]W: [org.gridgain:gridgain-compatibility] at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.initPendingTree(IgniteCacheOffheapManagerImpl.java:164) [00:11:56]W: [org.gridgain:gridgain-compatibility] at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.onCacheStarted(IgniteCacheOffheapManagerImpl.java:151) [00:11:56]W: [org.gridgain:gridgain-compatibility] at org.apache.ignite.internal.processors.cache.CacheGroupContext.onCacheStarted(CacheGroupContext.java:283) [00:11:56]W: [org.gridgain:gridgain-compatibility] at org.apache.ignite.internal.processors.cache.GridCacheProcessor.prepareCacheStart(GridCacheProcessor.java:1965) [00:11:56]W: [org.gridgain:gridgain-compatibility] at org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onCacheChangeRequest(CacheAffinitySharedManager.java:791) [00:11:56]W: [org.gridgain:gridgain-compatibility] at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onClusterStateChangeRequest(GridDhtPartitionsExchangeFuture.java:946) [00:11:56]W: [org.gridgain:gridgain-compatibility] at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:651) [00:11:56]W: [org.gridgain:gridgain-compatibility] at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2458) [00:11:56]W: [org.gridgain:gridgain-compatibility] at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2338) [00:11:56]W: [org.gridgain:gridgain-compatibility] at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) [00:11:56]W: [org.gridgain:gridgain-compatibility] at java.lang.Thread.run(Thread.java:748) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
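A minimal sketch of the intended locking discipline; dbMgr stands for the shared database manager, and this is an assumption about the fix rather than the actual patch:
{code:java}
// Pending-tree initialization allocates pages, so it must run under
// the checkpoint read lock to avoid racing with a concurrent checkpoint.
dbMgr.checkpointReadLock();

try {
    initPendingTree(cctx); // allocates tree meta pages
}
finally {
    dbMgr.checkpointReadUnlock();
}
{code}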
[jira] [Created] (IGNITE-8610) Searching checkpoint / WAL history for rebalancing is not properly working in case of local/global WAL disabling
Pavel Kovalenko created IGNITE-8610: --- Summary: Searching checkpoint / WAL history for rebalancing is not properly working in case of local/global WAL disabling Key: IGNITE-8610 URL: https://issues.apache.org/jira/browse/IGNITE-8610 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.6 After the implementation of IGNITE-6411 and IGNITE-8087 we can face a situation where, after some checkpoint, WAL was temporarily disabled and then enabled again. In this case we can't treat such a checkpoint as a start point for rebalance, because the WAL history after it may contain gaps. We should rework our checkpoint / WAL history search mechanism to ignore such checkpoints. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-8544) WAL disabling during rebalance mechanism uses wrong topology version in case of exchanges merge
Pavel Kovalenko created IGNITE-8544: --- Summary: WAL disabling during rebalance mechanism uses wrong topology version in case of exchanges merge Key: IGNITE-8544 URL: https://issues.apache.org/jira/browse/IGNITE-8544 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.6 After an exchange is done, we use the initial exchange version to determine the topology version on which rebalance should finish, and we save it. After rebalance finishes we compare the current topology version with the saved one; if they are equal, we enable WAL, own the partitions and do a checkpoint. Otherwise we do nothing, because the topology has changed. In the case of exchange merges we save the old topology version (from before the merge), so the topology version check never passes and the WAL-enabling logic is always skipped. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-8527) Show actual rebalance starting in logs
Pavel Kovalenko created IGNITE-8527: --- Summary: Show actual rebalance starting in logs Key: IGNITE-8527 URL: https://issues.apache.org/jira/browse/IGNITE-8527 Project: Ignite Issue Type: Improvement Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko We should increase the level of logging from DEBUG to INFO for the message: {noformat}
if (log.isDebugEnabled())
    log.debug("Requested rebalancing [from node=" + node.id() + ", listener index=" +
        topicId + " " + demandMsg.rebalanceId() + ", partitions count=" +
        stripePartitions.get(topicId).size() +
        " (" + stripePartitions.get(topicId).partitionsList() + ")]");
{noformat} to have the actual rebalancing start time in the logs. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-8482) Skip 2-phase partition release wait in case of activation or dynamic caches start
Pavel Kovalenko created IGNITE-8482: --- Summary: Skip 2-phase partition release wait in case of activation or dynamic caches start Key: IGNITE-8482 URL: https://issues.apache.org/jira/browse/IGNITE-8482 Project: Ignite Issue Type: Improvement Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.6 Currently we perform 2-phase partition-release waiting on every type of distributed exchange. We can optimize this behaviour by skipping such waiting on cluster activation (if we are activating the cluster, no caches were running before activation, so there are no in-flight operations to wait for) and on dynamic cache start. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-8459) Searching checkpoint history for WAL rebalance is broken
Pavel Kovalenko created IGNITE-8459: --- Summary: Searching checkpoint history for WAL rebalance is broken Key: IGNITE-8459 URL: https://issues.apache.org/jira/browse/IGNITE-8459 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Currently the mechanism that searches for available checkpoint records in the WAL to build a history for WAL rebalance is broken. It means that WAL (historical) rebalance will never find a history for rebalance, and full rebalance will always be used. This mechanism was broken in https://github.com/apache/ignite/commit/ec04cd174ed5476fba83e8682214390736321b37 for unclear reasons. If we swap the following two code blocks (the database().beforeExchange() call and the exchCtx if-block): {noformat}
/* It is necessary to run database callback before all topology callbacks.
   In case of persistent store is enabled we first restore partitions presented on disk.
   We need to guarantee that there are no partition state changes logged to WAL before this
   callback to make sure that we correctly restored last actual states. */
cctx.database().beforeExchange(this);

if (!exchCtx.mergeExchanges()) {
    for (CacheGroupContext grp : cctx.cache().cacheGroups()) {
        if (grp.isLocal() || cacheGroupStopping(grp.groupId()))
            continue;

        // It is possible affinity is not initialized yet if node joins to cluster.
        if (grp.affinity().lastVersion().topologyVersion() > 0)
            grp.topology().beforeExchange(this, !centralizedAff && !forceAffReassignment, false);
    }
}
{noformat} the search mechanism starts to work correctly. It is currently unclear why this happens. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-8422) Zookeeper discovery split brain detection shouldn't consider client nodes
Pavel Kovalenko created IGNITE-8422: --- Summary: Zookeeper discovery split brain detection shouldn't consider client nodes Key: IGNITE-8422 URL: https://issues.apache.org/jira/browse/IGNITE-8422 Project: Ignite Issue Type: Bug Components: zookeeper Affects Versions: 2.5 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.6 Currently Zookeeper discovery checks each part of a split cluster for full connectivity, taking client nodes into account. This is not correct, because server and client nodes may use different networks to connect to each other. It means that there can be a client that sees both parts of the split cluster and breaks split-brain recovery: a fully connected part of the server nodes will never be found. We should exclude client nodes from the split-brain analysis and improve the split-brain tests to make them truly fair. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
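A minimal sketch of the proposed filtering before the connectivity analysis (standard ClusterNode API; the surrounding resolver logic is assumed):
{code:java}
import java.util.List;
import java.util.stream.Collectors;

import org.apache.ignite.cluster.ClusterNode;

// Split-brain connectivity must be evaluated between server nodes only;
// clients may be reachable from both parts over a different network.
List<ClusterNode> nodesForSplitBrainCheck(List<ClusterNode> allNodes) {
    return allNodes.stream()
        .filter(n -> !n.isClient())
        .collect(Collectors.toList());
}
{code}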
[jira] [Created] (IGNITE-8415) Manual cache().rebalance() invocation may cancel currently running rebalance
Pavel Kovalenko created IGNITE-8415: --- Summary: Manual cache().rebalance() invocation may cancel currently running rebalance Key: IGNITE-8415 URL: https://issues.apache.org/jira/browse/IGNITE-8415 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.4 Reporter: Pavel Kovalenko Fix For: 2.6 If a historical rebalance is running and during this process we manually invoke {noformat} Ignite.cache(CACHE_NAME).rebalance().get(); {noformat} then the currently running rebalance will be cancelled and a new one started, which does not seem right. Moreover, after the new rebalance finishes we can lose some data if entry removes were being rebalanced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-8405) Sql query may see intermediate results of topology changes and do mapping incorrectly
Pavel Kovalenko created IGNITE-8405: --- Summary: SQL query may see intermediate results of topology changes and do mapping incorrectly Key: IGNITE-8405 URL: https://issues.apache.org/jira/browse/IGNITE-8405 Project: Ignite Issue Type: Bug Components: cache, sql Affects Versions: 2.4 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko
Affected test: IgniteStableBaselineCacheQueryNodeRestartsSelfTest
An SQL query performs mapping in the following way:
1) If there is at least one MOVING partition, the query is mapped to the current partition owners.
2) Otherwise, affinity mapping is used.
With the first approach the query may see a non-final partition state if the mapping happens during PME. The "setOwners()" method performs partition movement one by one, obtaining the topology write lock each time. If query mapping happens at that moment, it may see that some partition is MOVING and map to an OWNING partition that will be moved to MOVING on the next "setOwners()" invocation. As a result we may query invalid partitions. As an intermediate solution, "setOwners()" should be refactored to perform ALL partition state changes to MOVING in one batch operation (see the sketch below). As a general solution, query mapping should be revisited, especially the "isPreloadingActive" method, to take the given topology version into account.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
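A minimal sketch of the batch refactoring idea, assuming a hypothetical topology abstraction; the real GridDhtPartitionTopologyImpl API is different:
{code:java}
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.locks.ReadWriteLock;

// Hypothetical topology abstraction, for illustration only.
interface PartitionTopology {
    ReadWriteLock lock();

    /** Moves the partition to MOVING on all nodes that are no longer owners. */
    void applyOwners(int partId, Set<UUID> owners);
}

final class BatchSetOwners {
    /**
     * Applies ALL ownership changes under a single write-lock acquisition,
     * so a concurrent query mapping can never observe an intermediate mix of
     * old and new partition states.
     */
    static void setOwners(PartitionTopology top, Map<Integer, Set<UUID>> newOwners) {
        top.lock().writeLock().lock();

        try {
            for (Map.Entry<Integer, Set<UUID>> e : newOwners.entrySet())
                top.applyOwners(e.getKey(), e.getValue());
        }
        finally {
            top.lock().writeLock().unlock();
        }
    }
}
{code}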
[jira] [Created] (IGNITE-8392) Removing WAL history directory leads to JVM crash on that node.
Pavel Kovalenko created IGNITE-8392: --- Summary: Removing WAL history directory leads to JVM crash on that node. Key: IGNITE-8392 URL: https://issues.apache.org/jira/browse/IGNITE-8392 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.4 Environment: Ubuntu 17.10 Oracle JVM Server (1.8.0_151-b12) Reporter: Pavel Kovalenko Fix For: 2.6
Problem:
1) Start a node, load some data, deactivate the cluster.
2) Remove the WAL history directory.
3) Activate the cluster.
Cluster activation will fail due to a JVM crash like this:
{noformat}
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGBUS (0x7) at pc=0x7feda1052526, pid=29331, tid=0x7fed193d7700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_151-b12) (build 1.8.0_151-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.151-b12 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# v ~StubRoutines::jshort_disjoint_arraycopy
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#

--- T H R E A D ---

Current thread (0x7fec8b202800): JavaThread "db-checkpoint-thread-#243%wal.IgniteWalRebalanceTest0%" [_thread_in_Java, id=29655, stack(0x7fed192d7000,0x7fed193d8000)]

siginfo: si_signo: 7 (SIGBUS), si_code: 2 (BUS_ADRERR), si_addr: 0x7fed198ee0b2

Registers:
RAX=0x0007710a9f28, RBX=0x000120b2, RCX=0x0800, RDX=0xfe08
RSP=0x7fed193d5c60, RBP=0x7fed193d5c60, RSI=0x7fed198ef0aa, RDI=0x0007710a9f20
R8 =0x1000, R9 =0x000120b2, R10=0x7feda1052da0, R11=0x1004
R12=0x, R13=0x0007710a9f28, R14=0x1000, R15=0x7fec8b202800
RIP=0x7feda1052526, EFLAGS=0x00010282, CSGSFS=0x002b0033, ERR=0x0006
TRAPNO=0x000e

Top of Stack: (sp=0x7fed193d5c60)
0x7fed193d5c60: 0007710a9f28 7feda1be314f
0x7fed193d5c70: 00010002 7feda17747fd
0x7fed193d5c80: a8008c96 7feda11cfb3e
0x7fed193d5c90:
0x7fed193d5ca0:
0x7fed193d5cb0:
0x7fed193d5cc0: 0007710a9f28 7feda1fb37e0
0x7fed193d5cd0: 0007710a8ef0 00076fa5f5c0
0x7fed193d5ce0: 0007710a9f28 0007710a8ef0
0x7fed193d5cf0: 0007710a8ef0 7fed193d5d18
0x7fed193d5d00: 7fedb8428c76
0x7fed193d5d10: 1014 00076fa5f650
0x7fed193d5d20: f8043261 7feda1ee597c
0x7fed193d5d30: 00076fa5f5a8 0007710a9f28
0x7fed193d5d40: 0007710a8ef0 000120a2
0x7fed193d5d50: 00012095 1021
0x7fed193d5d60: edf4bec3 0001209e
0x7fed193d5d70: 0007710a9f28 00076fa5f650
0x7fed193d5d80: 7fed193d5da8 1014
0x7fed193d5d90: 0007710a8ef0 7fed198dc000
0x7fed193d5da0: 00076fa5f650 7feda1b7a040
0x7fed193d5db0: 0007710a9f28 00076fa700d0
0x7fed193d5dc0: 0007710a9f68 ee2153e5f8043261
0x7fed193d5dd0: 0007710a8ef0 0007710a9f98
0x7fed193d5de0: 00012095 0007710a9f28
0x7fed193d5df0: 1fa0
0x7fed193d5e00:
0x7fed193d5e10: 0007710a8ef0 7feda2001530
0x7fed193d5e20: 0007710a8ef0 00076f7c05e8
0x7fed193d5e30: edef80bd
0x7fed193d5e40:
0x7fed193d5e50: 7fedb2266000 7feda1cb1f8c

Instructions: (pc=0x7feda1052526)
0x7feda1052506: 00 00 74 08 66 8b 47 08 66 89 46 08 48 33 c0 c9
0x7feda1052516: c3 66 0f 1f 84 00 00 00 00 00 c5 fe 6f 44 d7 c8
0x7feda1052526: c5 fe 7f 44 d6 c8 c5 fe 6f 4c d7 e8 c5 fe 7f 4c
0x7feda1052536: d6 e8 48 83 c2 08 7e e2 48 83 ea 04 7f 10 c5 fe

Register to memory mapping:
RAX=0x0007710a9f28 is an oop java.nio.DirectByteBuffer - klass: 'java/nio/DirectByteBuffer'
RBX=0x000120b2 is an unknown value
RCX=0x0800 is an unknown value
RDX=0xfe08 is an unknown value
RSP=0x7fed193d5c60 is pointing into the stack for thread: 0x7fec8b202800
RBP=0x7fed193d5c60 is pointing into the stack for thread: 0x7fec8b202800
RSI=0x7fed198ef0aa is an unknown value
RDI=0x0007710a9f20 is an oop
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-8391) Removing some WAL history segments leads to WAL rebalance hanging
Pavel Kovalenko created IGNITE-8391: --- Summary: Removing some WAL history segments leads to WAL rebalance hanging Key: IGNITE-8391 URL: https://issues.apache.org/jira/browse/IGNITE-8391 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.4 Reporter: Pavel Kovalenko Fix For: 2.6
Problem:
1) Start 2 nodes, load some data into them.
2) Stop node 2, load some more data into the cache.
3) Remove the archived WAL segment that does not contain the checkpoint record needed to find the start point for WAL rebalance, but does contain data necessary for rebalancing.
4) Start node 2; it will start to rebalance data from node 1 using WAL. The rebalance hangs with the following assertion:
{noformat}
java.lang.AssertionError: Partitions after rebalance should be either done or missing: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:417)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}
This happens because we can never reach the necessary data and update counters contained in the removed WAL segment. To resolve such problems we should introduce a fallback strategy for the case when rebalance by WAL fails. An example of such a fallback strategy: re-run full rebalance for the partitions that could not be properly rebalanced using WAL (see the sketch below).
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
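A sketch of the fallback strategy described above, using entirely hypothetical interfaces; the real demander/supplier protocol is far more elaborate:
{code:java}
import java.util.Set;

// Hypothetical rebalance abstraction, for illustration only.
interface Rebalancer {
    /** Attempts WAL (historical) rebalance; returns the partitions that failed. */
    Set<Integer> rebalanceHistorical(Set<Integer> parts);

    /** Full rebalance: streams all entries of the given partitions. */
    void rebalanceFull(Set<Integer> parts);
}

final class FallbackRebalance {
    static void rebalance(Rebalancer rebalancer, Set<Integer> parts) {
        Set<Integer> failed = rebalancer.rebalanceHistorical(parts);

        // Fallback from the ticket: partitions that could not be rebalanced
        // from WAL history (e.g. a required segment was removed) are
        // re-rebalanced in full instead of hanging.
        if (!failed.isEmpty())
            rebalancer.rebalanceFull(failed);
    }
}
{code}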
[jira] [Created] (IGNITE-8390) WAL historical rebalance is not able to process cache.remove() updates
Pavel Kovalenko created IGNITE-8390: --- Summary: WAL historical rebalance is not able to process cache.remove() updates Key: IGNITE-8390 URL: https://issues.apache.org/jira/browse/IGNITE-8390 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.4 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko
WAL historical rebalance fails on the supplier when processing an entry remove, with the following assertion:
{noformat}
java.lang.AssertionError: GridCacheEntryInfo [key=KeyCacheObjectImpl [part=-1, val=2, hasValBytes=true], cacheId=94416770, val=null, ttl=0, expireTime=0, ver=GridCacheVersion [topVer=136155335, order=1524675346187, nodeOrder=1], isNew=false, deleted=false]
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplyMessage.addEntry0(GridDhtPartitionSupplyMessage.java:220)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:381)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}
Obviously, this assertion is correct only for full rebalance. We should either soften the assertion for the historical rebalance case or disable it (see the sketch below). With the assertion disabled everything works well and the rebalance finishes properly.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
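A minimal sketch of the softened assertion, with hypothetical simplified types; in the real code the check lives in GridDhtPartitionSupplyMessage.addEntry0:
{code:java}
// Simplified stand-in for GridCacheEntryInfo, for illustration only.
final class EntryInfo {
    final Object key;
    final Object val; // null for a remove operation

    EntryInfo(Object key, Object val) {
        this.key = key;
        this.val = val;
    }
}

final class SupplyMessageSketch {
    /**
     * Softened assertion: a null value is illegal for full rebalance, but
     * historical (WAL) rebalance legitimately ships remove operations that
     * carry no value.
     */
    void addEntry(EntryInfo info, boolean historical) {
        assert historical || info.val != null : "Null value in full rebalance: " + info.key;

        // ... serialize the entry into the supply message ...
    }
}
{code}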
[jira] [Created] (IGNITE-8339) After cluster activation actual partition state restored from WAL may be lost
Pavel Kovalenko created IGNITE-8339: --- Summary: After cluster activation actual partition state restored from WAL may be lost Key: IGNITE-8339 URL: https://issues.apache.org/jira/browse/IGNITE-8339 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.5
On cluster activation we restore partition states from the checkpoint and WAL. But before that we pre-create partitions according to the ideal assignment during the "beforeExchange" phase and own them in case of the first or a subsequent activation. This partition state change is logged to WAL and overrides the actual last state of the partition during restore. Possible solutions:
1) Pre-create partitions after the actual restore.
2) Do not log the partition own to WAL during the pre-create phase.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-8338) Cache operations hang after cluster deactivation and activation again
Pavel Kovalenko created IGNITE-8338: --- Summary: Cache operations hang after cluster deactivation and activation again Key: IGNITE-8338 URL: https://issues.apache.org/jira/browse/IGNITE-8338 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.4 Reporter: Pavel Kovalenko Fix For: 2.6
Problem:
1) Start several nodes.
2) Activate the cluster.
3) Run cache load.
4) Deactivate the cluster.
5) Activate it again.
After the second activation, cache operations hang with the following stack trace (a reproduction sketch follows below):
{noformat}
"cache-load-2" #210 prio=5 os_prio=0 tid=0x7efbb401b800 nid=0x602b waiting on condition [0x7efb809b3000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:177)
at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:140)
at org.apache.ignite.internal.processors.cache.GridCacheProcessor.publicJCache(GridCacheProcessor.java:3782)
at org.apache.ignite.internal.processors.cache.GridCacheProcessor.publicJCache(GridCacheProcessor.java:3753)
at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.checkProxyIsValid(GatewayProtectedCacheProxy.java:1486)
at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.onEnter(GatewayProtectedCacheProxy.java:1508)
at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.put(GatewayProtectedCacheProxy.java:785)
at org.apache.ignite.internal.processors.cache.IgniteClusterActivateDeactivateTestWithPersistence.lambda$testDeactivateDuringEviction$0(IgniteClusterActivateDeactivateTestWithPersistence.java:316)
at org.apache.ignite.internal.processors.cache.IgniteClusterActivateDeactivateTestWithPersistence$$Lambda$39/832408842.run(Unknown Source)
at org.apache.ignite.testframework.GridTestUtils$6.call(GridTestUtils.java:1254)
at org.apache.ignite.testframework.GridTestThread.run(GridTestThread.java:86)
{noformat}
It seems the dynamicStartCache future never completes after the second activation.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
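The steps above, condensed into a single-process sketch using the public 2.x API; the configuration file name is a placeholder, and the ticket's scenario involves several nodes:
{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class ActivateDeactivateRepro {
    public static void main(String[] args) {
        // "config.xml" is a placeholder for a configuration with persistence enabled.
        Ignite ignite = Ignition.start("config.xml");

        ignite.cluster().active(true); // First activation.

        IgniteCache<Integer, Integer> cache = ignite.getOrCreateCache("cache");

        for (int i = 0; i < 10_000; i++) // Run cache load.
            cache.put(i, i);

        ignite.cluster().active(false); // Deactivate.
        ignite.cluster().active(true);  // Activate again.

        // Hangs: checkProxyIsValid() blocks on the never-completing
        // dynamicStartCache future, as in the stack trace above.
        cache.put(-1, -1);
    }
}
{code}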
[jira] [Created] (IGNITE-8324) Ignite Cache Restarts 1 suite hangs with assertion error
Pavel Kovalenko created IGNITE-8324: --- Summary: Ignite Cache Restarts 1 suite hangs with assertion error Key: IGNITE-8324 URL: https://issues.apache.org/jira/browse/IGNITE-8324 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.4 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.5 {noformat} [ERROR][exchange-worker-#620749%replicated.GridCacheReplicatedNodeRestartSelfTest0%][GridDhtPartitionsExchangeFuture] Failed to notify listener: o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2@6dd7cc93 java.lang.AssertionError: Invalid topology version [grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=323, minorTopVer=0], exchTopVer=AffinityTopologyVersion [topVer=322, minorTopVer=0], discoCacheVer=AffinityTopologyVersion [topVer=322, minorTopVer=0], exchDiscoCacheVer=AffinityTopologyVersion [topVer=323, minorTopVer=0], fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=48a5d243-7f63-4069-aba1-868c6895, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47503], discPort=47503, order=322, intOrder=163, lastExchangeTime=1524043684082, loc=false, ver=2.5.0#20180417-sha1:56be24b9, isClient=false], topVer=322, nodeId8=b51b3893, msg=Node joined: TcpDiscoveryNode [id=48a5d243-7f63-4069-aba1-868c6895, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47503], discPort=47503, order=322, intOrder=163, lastExchangeTime=1524043684082, loc=false, ver=2.5.0#20180417-sha1:56be24b9, isClient=false], type=NODE_JOINED, tstamp=1524043684166], crd=TcpDiscoveryNode [id=b51b3893-377a-465f-88ea-316a6560, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1524043633288, loc=true, ver=2.5.0#20180417-sha1:56be24b9, isClient=false], exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=322, minorTopVer=0], discoEvt=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=48a5d243-7f63-4069-aba1-868c6895, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47503], discPort=47503, order=322, intOrder=163, lastExchangeTime=1524043684082, loc=false, ver=2.5.0#20180417-sha1:56be24b9, isClient=false], topVer=322, nodeId8=b51b3893, msg=Node joined: TcpDiscoveryNode [id=48a5d243-7f63-4069-aba1-868c6895, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47503], discPort=47503, order=322, intOrder=163, lastExchangeTime=1524043684082, loc=false, ver=2.5.0#20180417-sha1:56be24b9, isClient=false], type=NODE_JOINED, tstamp=1524043684166], nodeId=48a5d243, evt=NODE_JOINED], added=true, initFut=GridFutureAdapter [ignoreInterrupts=false, state=DONE, res=true, hash=527135060], init=true, lastVer=GridCacheVersion [topVer=135523955, order=1524043694535, nodeOrder=3], partReleaseFut=PartitionReleaseFuture [topVer=AffinityTopologyVersion [topVer=322, minorTopVer=0], futures=[ExplicitLockReleaseFuture [topVer=AffinityTopologyVersion [topVer=322, minorTopVer=0], futures=[]], AtomicUpdateReleaseFuture [topVer=AffinityTopologyVersion [topVer=322, minorTopVer=0], futures=[]], DataStreamerReleaseFuture [topVer=AffinityTopologyVersion [topVer=322, minorTopVer=0], futures=[]], LocalTxReleaseFuture [topVer=AffinityTopologyVersion [topVer=322, minorTopVer=0], futures=[]], AllTxReleaseFuture [topVer=AffinityTopologyVersion [topVer=322, minorTopVer=0], futures=[RemoteTxReleaseFuture [topVer=AffinityTopologyVersion [topVer=322, minorTopVer=0], futures=[]], exchActions=null, affChangeMsg=null, initTs=1524043684166, centralizedAff=false, forceAffReassignment=false, changeGlobalStateE=null, done=false, state=CRD, 
evtLatch=0, remaining=[], super=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, hash=1570781250]]] at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.updateTopologyVersion(GridDhtPartitionTopologyImpl.java:257) at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.updateTopologies(GridDhtPartitionsExchangeFuture.java:845) at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onAllReceived(GridDhtPartitionsExchangeFuture.java:2461) at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.processSingleMessage(GridDhtPartitionsExchangeFuture.java:2200) at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$100(GridDhtPartitionsExchangeFuture.java:127) at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2.apply(GridDhtPartitionsExchangeFuture.java:2057) at
[jira] [Created] (IGNITE-8313) Trace logs enhancement for exchange and affinity calculation
Pavel Kovalenko created IGNITE-8313: --- Summary: Trace logs enhancement for exchange and affinity calculation Key: IGNITE-8313 URL: https://issues.apache.org/jira/browse/IGNITE-8313 Project: Ignite Issue Type: Improvement Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.6
For better debugging of problems, we should add more trace logging in the following places (see the sketch below):
1) Partition states before and after exchange.
2) Affinity distribution for each topology version.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
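A minimal illustration of the intended trace points, assuming the usual IgniteLogger pattern; the variable names are hypothetical:
{code:java}
import org.apache.ignite.IgniteLogger;

final class ExchangeTraceSketch {
    /** Hypothetical fragment showing the two proposed trace points. */
    static void traceExchangeState(IgniteLogger log, Object exchId, Object partStates,
        Object topVer, Object assignment) {
        if (log.isTraceEnabled()) {
            // 1) Partition states before/after exchange.
            log.trace("Partition states [exchId=" + exchId + ", states=" + partStates + ']');

            // 2) Affinity distribution for the topology version.
            log.trace("Affinity distribution [topVer=" + topVer + ", assignment=" + assignment + ']');
        }
    }
}
{code}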
[jira] [Created] (IGNITE-8218) Add exchange latch state to diagnostic messages
Pavel Kovalenko created IGNITE-8218: --- Summary: Add exchange latch state to diagnostic messages Key: IGNITE-8218 URL: https://issues.apache.org/jira/browse/IGNITE-8218 Project: Ignite Issue Type: Improvement Components: cache Affects Versions: 2.5 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.5 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-8122) Partition state restored from WAL may be lost if no checkpoints are done
Pavel Kovalenko created IGNITE-8122: --- Summary: Partition state restored from WAL may be lost if no checkpoints are done Key: IGNITE-8122 URL: https://issues.apache.org/jira/browse/IGNITE-8122 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.4 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.5
Problem:
1) Start several nodes with persistence enabled.
2) Make sure that all partitions for 'ignite-sys-cache' have status OWN on all nodes and the appropriate PartitionMetaStateRecord is logged to WAL.
3) Stop all nodes, start them again, and activate the cluster. The checkpoint for 'ignite-sys-cache' is empty, because there was no data in the cache.
4) The state of all partitions is restored to OWN (GridCacheDatabaseSharedManager#restoreState) from WAL, but not recorded to page memory, because there were no checkpoints and no data in the cache. The store manager is not properly initialized for such partitions.
5) On exchange done we try to restore partition states (initPartitionsWhenAffinityReady) on all nodes. Because page memory is empty, the states of all partitions are restored to MOVING by default.
6) All nodes start to rebalance partitions from each other, and this process becomes unpredictable because we are trying to rebalance from MOVING partitions.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-8063) Transaction rollback is unmanaged in case when commit produced Runtime exception
Pavel Kovalenko created IGNITE-8063: --- Summary: Transaction rollback is unmanaged in case when commit produced Runtime exception Key: IGNITE-8063 URL: https://issues.apache.org/jira/browse/IGNITE-8063 Project: Ignite Issue Type: Improvement Components: cache Affects Versions: 2.4 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.5
When 'userCommit' produces a runtime exception, the transaction state is moved to UNKNOWN and tx.finishFuture() completes; after that the rollback process runs asynchronously, and there is no simple way to await rollback completion for such transactions.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-8062) Add ability to properly wait for transaction finish in case of PRIMARY_SYNC cache mode
Pavel Kovalenko created IGNITE-8062: --- Summary: Add ability to properly wait for transaction finish in case of PRIMARY_SYNC cache mode Key: IGNITE-8062 URL: https://issues.apache.org/jira/browse/IGNITE-8062 Project: Ignite Issue Type: Improvement Components: cache Affects Versions: 2.4 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.5
Currently GridDhtTxFinishFuture may finish ahead of time in PRIMARY_SYNC mode, and there is no way to properly wait for such futures to finish on remote nodes. We should introduce the ability to wait for full transaction completion in such cases (the affected mode is shown below).
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
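For context, this is how a cache is configured with the affected mode using the public API; with PRIMARY_SYNC a write is acknowledged as soon as the primary node completes it, so backups may still be applying the update when control returns to the caller:
{code:java}
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.cache.CacheWriteSynchronizationMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class PrimarySyncConfig {
    /** Builds a transactional cache configuration with the mode affected by this ticket. */
    public static CacheConfiguration<Integer, Integer> cacheConfig() {
        CacheConfiguration<Integer, Integer> ccfg = new CacheConfiguration<>("txCache");

        ccfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);
        ccfg.setBackups(1);

        // Only the primary node is awaited on write; backup completion is asynchronous.
        ccfg.setWriteSynchronizationMode(CacheWriteSynchronizationMode.PRIMARY_SYNC);

        return ccfg;
    }
}
{code}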
[jira] [Created] (IGNITE-7987) Affinity may be not calculated properly in case of merged exchanges with client nodes
Pavel Kovalenko created IGNITE-7987: --- Summary: Affinity may be not calculated properly in case of merged exchanges with client nodes Key: IGNITE-7987 URL: https://issues.apache.org/jira/browse/IGNITE-7987 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.4 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.5
Currently we pass only the last (or in some cases the first) discovery event for affinity calculation in GridAffinityAssignmentCache. As an optimization, the affinity calculation can be skipped if that discovery event belongs to a client node or to a node excluded by the node filter (because affinity will not change in that case). Since we have exchange merging, several discovery events may correspond to one exchange. Passing only the first or last event for affinity calculation is wrong, because the calculation can be skipped while the exchange actually contains events that change affinity. Instead of the first/last event we should pass the whole collection of discovery events (ExchangeDiscoveryEvents) and skip the affinity calculation for a group only when NONE of the events changes affinity for that group (see the sketch below).
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
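A minimal sketch of the proposed check, with a hypothetical group-side predicate; the real code operates on ExchangeDiscoveryEvents and CacheGroupContext:
{code:java}
import java.util.Collection;

import org.apache.ignite.events.DiscoveryEvent;

final class AffinitySkipCheck {
    /** Hypothetical group-side predicate, for illustration only. */
    interface GroupFilter {
        /** Whether the event can change this group's affinity (server join/leave passing the node filter). */
        boolean affectsAffinity(DiscoveryEvent evt);
    }

    /**
     * Affinity recalculation for a group may be skipped only if NONE of the
     * merged discovery events can change that group's affinity.
     */
    static boolean canSkipAffinityRecalculation(Collection<DiscoveryEvent> evts, GroupFilter grp) {
        for (DiscoveryEvent evt : evts) {
            if (grp.affectsAffinity(evt))
                return false; // At least one event changes affinity: recalculate.
        }

        return true;
    }
}
{code}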
[jira] [Created] (IGNITE-7946) IgniteCacheClientQueryReplicatedNodeRestartSelfTest#testRestarts can hang on TC
Pavel Kovalenko created IGNITE-7946: --- Summary: IgniteCacheClientQueryReplicatedNodeRestartSelfTest#testRestarts can hang on TC Key: IGNITE-7946 URL: https://issues.apache.org/jira/browse/IGNITE-7946 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.4 Reporter: Pavel Kovalenko
According to the test logs, there can be an unfinished rebalance:
{noformat}
[01:21:23] : [Step 4/5] [2018-03-13 22:21:23,327][INFO ][exchange-worker-#456665%near.IgniteCacheClientQueryReplicatedNodeRestartSelfTest3%][GridDhtPartitionDemander] Cancelled rebalancing from all nodes [topology=AffinityTopologyVersion [topVer=103, minorTopVer=0]]
[01:21:23] : [Step 4/5] [2018-03-13 22:21:23,327][INFO ][exchange-worker-#456665%near.IgniteCacheClientQueryReplicatedNodeRestartSelfTest3%][GridDhtPartitionDemander] Completed rebalance future: RebalanceFuture [grp=CacheGroupContext [grp=pr], topVer=AffinityTopologyVersion [topVer=103, minorTopVer=0], rebalanceId=1]
[01:21:23] : [Step 4/5] [2018-03-13 22:21:23,328][INFO ][exchange-worker-#456665%near.IgniteCacheClientQueryReplicatedNodeRestartSelfTest3%][GridCachePartitionExchangeManager] Rebalancing scheduled [order=[pe, pr]]
[01:21:23] : [Step 4/5] [2018-03-13 22:21:23,328][INFO ][exchange-worker-#456665%near.IgniteCacheClientQueryReplicatedNodeRestartSelfTest3%][GridCachePartitionExchangeManager] Rebalancing started [top=AffinityTopologyVersion [topVer=104, minorTopVer=0], evt=NODE_LEFT, node=04d02ea1-286c-4d8c-8870-e147c552]
[01:21:23] : [Step 4/5] [2018-03-13 22:21:23,328][INFO ][exchange-worker-#456665%near.IgniteCacheClientQueryReplicatedNodeRestartSelfTest3%][GridDhtPartitionDemander] Starting rebalancing [grp=pe, mode=SYNC, fromNode=31193890-bf8f-4c85-af76-342efb31, partitionsCount=15, topology=AffinityTopologyVersion [topVer=104, minorTopVer=0], rebalanceId=2]
[01:21:23] : [Step 4/5] [2018-03-13 22:21:23,328][INFO ][exchange-worker-#456665%near.IgniteCacheClientQueryReplicatedNodeRestartSelfTest3%][GridDhtPartitionDemander] Starting rebalancing [grp=pe, mode=SYNC, fromNode=517f4efb-4433-489a-8c8e-e91f9e70, partitionsCount=16, topology=AffinityTopologyVersion [topVer=104, minorTopVer=0], rebalanceId=2]
[01:21:23] : [Step 4/5] [2018-03-13 22:21:23,328][INFO ][exchange-worker-#455983%near.IgniteCacheClientQueryReplicatedNodeRestartSelfTest0%][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=104, minorTopVer=0], evt=NODE_LEFT, node=04d02ea1-286c-4d8c-8870-e147c552]
[01:21:23] : [Step 4/5] [2018-03-13 22:21:23,332][INFO ][sys-#456730%near.IgniteCacheClientQueryReplicatedNodeRestartSelfTest3%][GridDhtPartitionDemander] Completed rebalancing [fromNode=517f4efb-4433-489a-8c8e-e91f9e70, cacheOrGroup=pe, topology=AffinityTopologyVersion [topVer=104, minorTopVer=0], time=0 ms]
{noformat}
This may be the cause of the test hanging.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-7898) IgniteCachePartitionLossPolicySelfTest is flaky on TC
Pavel Kovalenko created IGNITE-7898: --- Summary: IgniteCachePartitionLossPolicySelfTest is flaky on TC Key: IGNITE-7898 URL: https://issues.apache.org/jira/browse/IGNITE-7898 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.4 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko
Affected tests: testReadOnlyAll, testReadWriteSafe
Exception:
{code:java}
junit.framework.AssertionFailedError: Failed to find expected lost partition [exp=0, lost=[]]
at org.apache.ignite.internal.processors.cache.distributed.IgniteCachePartitionLossPolicySelfTest.verifyCacheOps(IgniteCachePartitionLossPolicySelfTest.java:219)
at org.apache.ignite.internal.processors.cache.distributed.IgniteCachePartitionLossPolicySelfTest.checkLostPartition(IgniteCachePartitionLossPolicySelfTest.java:166)
at org.apache.ignite.internal.processors.cache.distributed.IgniteCachePartitionLossPolicySelfTest.testReadWriteSafe(IgniteCachePartitionLossPolicySelfTest.java:114)
{code}
The problem behind the failure: after we prepare the topology and shut down the node containing the to-be-lost partition, we immediately start checking for it on all nodes (via the cache.lostPartitions() method). Sometimes we invoke this method on a client node where the last PME has not even started, and we get an empty list of lost partitions because it has not yet been received during PME.
Possible solution: wait for PME to finish on all nodes (including clients) before starting to check for lost partitions (see the sketch below).
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
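A sketch of the proposed fix inside the test, assuming the test framework's awaitPartitionMapExchange() helper from GridCommonAbstractTest and the public Ignition.allGrids() API; CACHE_NAME is a hypothetical constant, and whether the helper covers client nodes depends on its parameters:
{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

// Inside the test method (subclass of GridCommonAbstractTest):
stopGrid(3); // Stop the node that holds the to-be-lost partition.

// Wait for PME to complete on all nodes, clients included, before checking.
awaitPartitionMapExchange();

for (Ignite node : Ignition.allGrids())
    assertFalse("Expected lost partitions on " + node.name(),
        node.cache(CACHE_NAME).lostPartitions().isEmpty());
{code}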
[jira] [Created] (IGNITE-7882) Atomic update requests should always use topology mappings instead of affinity
Pavel Kovalenko created IGNITE-7882: --- Summary: Atomic update requests should always use topology mappings instead of affinity Key: IGNITE-7882 URL: https://issues.apache.org/jira/browse/IGNITE-7882 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.4 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko
Currently there are two ways to map cache atomic updates:
1) Use the nodes reporting status OWNING for the partition to which we send the update.
2) Use only the affinity node mapping, if rebalance is finished.
With the second way we may route the update request only to the affinity node, while there is also a node that is still an owner and can process read requests. This can lead to reading null values for some key while an update of that key was successful a moment ago.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-7873) Partition update counters and sizes may be different if cache is using readThrough
Pavel Kovalenko created IGNITE-7873: --- Summary: Partition update counters and sizes may be different if cache is using readThrough Key: IGNITE-7873 URL: https://issues.apache.org/jira/browse/IGNITE-7873 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.4 Reporter: Pavel Kovalenko
Tracking of partition update counters and cache sizes may not work properly if the cache uses readThrough behavior. If the data in the underlying storage has changed, read requests to such a cache can increment update counters or cache sizes on only some of the nodes serving the cache. This means the update counter or cache size is incremented only on the partition copy that served the request (the primary or any random node). BackupPostProcessingClosure should use preload=false for the entry; otherwise it can increment the update counter for a read request even though the data has not changed.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-7871) Partition update counters may be different during exchange
Pavel Kovalenko created IGNITE-7871: --- Summary: Partition update counters may be different during exchange Key: IGNITE-7871 URL: https://issues.apache.org/jira/browse/IGNITE-7871 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.4 Reporter: Pavel Kovalenko
Using the validation implemented in IGNITE-7467 we can observe the following situation. Suppose we have some partition owned by nodes N1 (primary) and N2 (backup):
1) An exchange is started.
2) N2 finishes waiting for partitions release and starts to create its Single message (with update counters).
3) N1 is still waiting for partitions release.
4) There is a pending cache update N1 -> N2. This update is performed after step 2.
5) The update increments the update counters on both N1 and N2.
6) N1 finishes waiting for partitions release, while N2 has already sent its Single message to the coordinator with an outdated update counter.
7) The coordinator sees different partition update counters for N1 and N2. Validation fails, while the data is actually equal.
Possible solutions:
1) Cancel transactions and atomic updates on backups if the topology version on them has already changed (or waiting for partitions release has finished).
2) Each node participating in the exchange should wait for partitions release on the other nodes, not only on itself (like a distributed countdown latch right after waiting for partitions release; see the sketch below).
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
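A sketch of solution (2) using Ignite's public distributed latch API purely for illustration; the real exchange code would use an internal latch mechanism, and the helper names here are hypothetical:
{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCountDownLatch;

final class ExchangeReleaseSync {
    /**
     * After the node-local partitions-release wait, synchronize with all other
     * server nodes before sending the Single message, so no node can report an
     * outdated update counter.
     */
    static void releaseAndSync(Ignite ignite, Object topVer, int srvNodes) {
        IgniteCountDownLatch latch = ignite.countDownLatch(
            "exchange-release-" + topVer, srvNodes, /*autoDelete*/ true, /*create*/ true);

        waitForPartitionsRelease(); // Hypothetical local wait.

        latch.countDown(); // This node finished its local wait.
        latch.await();     // Block until ALL nodes finished theirs.
    }

    private static void waitForPartitionsRelease() {
        // Placeholder for the node-local partitions-release wait.
    }
}
{code}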
[jira] [Created] (IGNITE-7833) Find out possible ways to handle partition update counters inconsistency
Pavel Kovalenko created IGNITE-7833: --- Summary: Find out possible ways to handle partition update counters inconsistency Key: IGNITE-7833 URL: https://issues.apache.org/jira/browse/IGNITE-7833 Project: Ignite Issue Type: Improvement Components: cache Reporter: Pavel Kovalenko
We should think about possible ways to resolve the situation when we observe that the update counters for the same partition (on primary and backup) differ across nodes.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-7795) Correct handling partitions restored in RENTING state
Pavel Kovalenko created IGNITE-7795: --- Summary: Correct handling partitions restored in RENTING state Key: IGNITE-7795 URL: https://issues.apache.org/jira/browse/IGNITE-7795 Project: Ignite Issue Type: Bug Components: cache, persistence Affects Versions: 2.3, 2.2, 2.1, 2.4 Reporter: Pavel Kovalenko Fix For: 2.5
Suppose a node has a partition in the RENTING state after start. This can happen if the node was stopped during partition eviction, and the restarted node is the only owner of this partition by affinity. Currently we own such a partition during the rebalance preparation phase, which does not seem correct. If some partitions have no owners, we should fail the activation process, move such partitions to the MOVING state, and clear them.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-7773) Add getRebalanceClearingPartitionsLeft JMX metric to .NET
Pavel Kovalenko created IGNITE-7773: --- Summary: Add getRebalanceClearingPartitionsLeft JMX metric to .NET Key: IGNITE-7773 URL: https://issues.apache.org/jira/browse/IGNITE-7773 Project: Ignite Issue Type: Task Components: platforms Reporter: Pavel Kovalenko Fix For: 2.5
The new metric was introduced in IGNITE-6113.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-7750) testMultiThreadStatisticsEnable is flaky on TC
Pavel Kovalenko created IGNITE-7750: --- Summary: testMultiThreadStatisticsEnable is flaky on TC Key: IGNITE-7750 URL: https://issues.apache.org/jira/browse/IGNITE-7750 Project: Ignite Issue Type: Bug Components: cache Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko
{code:java}
class org.apache.ignite.IgniteException: Cache not found [cacheName=cache2]
at org.apache.ignite.internal.util.IgniteUtils.convertException(IgniteUtils.java:985)
at org.apache.ignite.internal.cluster.IgniteClusterImpl.enableStatistics(IgniteClusterImpl.java:497)
at org.apache.ignite.internal.processors.cache.CacheMetricsEnableRuntimeTest$3.run(CacheMetricsEnableRuntimeTest.java:181)
at org.apache.ignite.testframework.GridTestUtils$9.call(GridTestUtils.java:1275)
at org.apache.ignite.testframework.GridTestThread.run(GridTestThread.java:86)
Caused by: class org.apache.ignite.IgniteCheckedException: Cache not found [cacheName=cache2]
at org.apache.ignite.internal.processors.cache.GridCacheProcessor.enableStatistics(GridCacheProcessor.java:4227)
at org.apache.ignite.internal.cluster.IgniteClusterImpl.enableStatistics(IgniteClusterImpl.java:494)
... 3 more
{code}
The problem with the test: we don't wait for exchange future completion after "cache2" is started, which may lead to a NullPointerException when we try to obtain a reference to "cache2" on a node that has not yet completed the exchange future and initialized the cache proxy.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-7749) testDiscoCacheReuseOnNodeJoin fails on TC
Pavel Kovalenko created IGNITE-7749: --- Summary: testDiscoCacheReuseOnNodeJoin fails on TC Key: IGNITE-7749 URL: https://issues.apache.org/jira/browse/IGNITE-7749 Project: Ignite Issue Type: Bug Components: cache Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko
{code:java}
java.lang.ClassCastException: org.apache.ignite.internal.util.GridConcurrentHashSet cannot be cast to java.lang.String
at org.apache.ignite.spi.discovery.IgniteDiscoveryCacheReuseSelfTest.assertDiscoCacheReuse(IgniteDiscoveryCacheReuseSelfTest.java:93)
at org.apache.ignite.spi.discovery.IgniteDiscoveryCacheReuseSelfTest.testDiscoCacheReuseOnNodeJoin(IgniteDiscoveryCacheReuseSelfTest.java:64)
{code}
There are two problems in the test:
1) We don't wait until the final topology version is set on all nodes, and we start checking discovery caches immediately after the grids start. This can lead to a NullPointerException while accessing the discovery cache history.
2) We don't use the explicit assertEquals(String, Object, Object) overload intended for comparing Objects, so Java can choose the assertEquals(String, String) method to compare the discovery cache fields that we obtain at runtime via reflection (see the sketch below).
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
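A sketch of fix (2): referencing the reflectively obtained values through Object-typed expressions pins JUnit's assertEquals(String, Object, Object) overload regardless of the fields' runtime types. The variable and parameter names are hypothetical:
{code:java}
import java.lang.reflect.Field;

import static junit.framework.Assert.assertEquals;

final class DiscoCacheAssertSketch {
    static void assertFieldsEqual(Object discoCache1, Object discoCache2, Field field)
        throws IllegalAccessException {
        Object exp = field.get(discoCache1); // Static type Object, whatever the runtime type.
        Object act = field.get(discoCache2);

        // Explicit Object casts make the overload choice unambiguous and deliberate.
        assertEquals("Discovery cache field differs: " + field.getName(),
            (Object)exp, (Object)act);
    }
}
{code}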
[jira] [Created] (IGNITE-7717) testAssignmentAfterRestarts is flaky on TC
Pavel Kovalenko created IGNITE-7717: --- Summary: testAssignmentAfterRestarts is flaky on TC Key: IGNITE-7717 URL: https://issues.apache.org/jira/browse/IGNITE-7717 Project: Ignite Issue Type: Bug Reporter: Pavel Kovalenko -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-7500) Partition update counters may be inconsistent after rebalancing
Pavel Kovalenko created IGNITE-7500: --- Summary: Partition update counters may be inconsistent after rebalancing Key: IGNITE-7500 URL: https://issues.apache.org/jira/browse/IGNITE-7500 Project: Ignite Issue Type: Bug Components: cache Affects Versions: 2.3 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko
Problem: if a partition rebalance requires more than one batch, we do not send the `Clear` flag for it in the last supply message and, as a result, do not set the update counter to the right value.
Temporary solution: send `Clear` flags for partitions that were fully rebalanced in the last supply message. However, we still have a race condition when setting the update counter during concurrent rebalance and cache load.
General solution: https://issues.apache.org/jira/browse/IGNITE-6113
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-6029) Refactor WAL Record serialization and introduce RecordV2Serializer
Pavel Kovalenko created IGNITE-6029: --- Summary: Refactor WAL Record serialization and introduce RecordV2Serializer Key: IGNITE-6029 URL: https://issues.apache.org/jira/browse/IGNITE-6029 Project: Ignite Issue Type: Improvement Affects Versions: 2.1 Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Fix For: 2.2
Currently the RecordSerializer interface and the default RecordV1Serializer implementation are not easily extensible. We should refactor the RecordSerializer interface and introduce a new RecordV2Serializer with only the basic functionality of delegating everything to RecordV1Serializer (see the sketch below).
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
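A minimal sketch of the delegation idea with deliberately simplified signatures; the real RecordSerializer works over WAL buffers and file I/O, not byte arrays:
{code:java}
// Simplified stand-in for the WAL record base type, for illustration only.
abstract class WALRecord {
}

interface RecordSerializer {
    /** Serializer format version. */
    int version();

    byte[] writeRecord(WALRecord rec);

    WALRecord readRecord(byte[] data);
}

final class RecordV2Serializer implements RecordSerializer {
    /** V1 serializer that does all the actual work for now. */
    private final RecordSerializer delegate;

    RecordV2Serializer(RecordSerializer v1) {
        delegate = v1;
    }

    @Override public int version() {
        return 2;
    }

    // Base functionality: delegate everything to V1. V2-specific behavior can
    // then be layered on selectively as the format evolves, without touching V1.
    @Override public byte[] writeRecord(WALRecord rec) {
        return delegate.writeRecord(rec);
    }

    @Override public WALRecord readRecord(byte[] data) {
        return delegate.readRecord(data);
    }
}
{code}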
[jira] [Created] (IGNITE-6018) Introduce WAL backward compatibility for new DataPage insert/update records
Pavel Kovalenko created IGNITE-6018: --- Summary: Introduce WAL backward compatibility for new DataPage insert/update records Key: IGNITE-6018 URL: https://issues.apache.org/jira/browse/IGNITE-6018 Project: Ignite Issue Type: Sub-task Reporter: Pavel Kovalenko Assignee: Pavel Kovalenko Priority: Blocker Fix For: 2.2
Once we store a reference to the DataRecord for DataPage insert/update records, we should be able to read and write both versions of those records (with a reference or with a payload) for backward compatibility purposes (see the sketch below).
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
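A sketch of what the compatibility branch could look like, with entirely hypothetical record and field names; the actual WAL format details differ:
{code:java}
import java.io.DataInput;
import java.io.IOException;

final class DataPageRecordIO {
    /** Hypothetical record carrying either an inline payload (V1) or a WAL link (V2). */
    static final class DataPageUpdate {
        final byte[] payload; // Non-null only for the V1 format.
        final long link;      // Reference to the DataRecord, used by the V2 format.

        DataPageUpdate(byte[] payload, long link) {
            this.payload = payload;
            this.link = link;
        }
    }

    static DataPageUpdate read(DataInput in, int serializerVer) throws IOException {
        if (serializerVer >= 2) {
            // V2: only a reference to the DataRecord is stored.
            return new DataPageUpdate(null, in.readLong());
        }

        // V1: the full payload is stored inline.
        byte[] payload = new byte[in.readInt()];
        in.readFully(payload);

        return new DataPageUpdate(payload, 0L);
    }
}
{code}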
[jira] [Created] (IGNITE-6017) Ignite IGFS: IgfsStreamsSelfTest#testCreateFileFragmented fails
Pavel Kovalenko created IGNITE-6017: --- Summary: Ignite IGFS: IgfsStreamsSelfTest#testCreateFileFragmented fails Key: IGNITE-6017 URL: https://issues.apache.org/jira/browse/IGNITE-6017 Project: Ignite Issue Type: Bug Affects Versions: 2.1 Reporter: Pavel Kovalenko Priority: Minor Fix For: 2.2
The failure can almost never be reproduced locally. Presumably it is the same problem as in IGNITE-5957.
{noformat}
junit.framework.AssertionFailedError: expected:<2> but was:<1>
at junit.framework.Assert.fail(Assert.java:57)
at junit.framework.Assert.failNotEquals(Assert.java:329)
at junit.framework.Assert.assertEquals(Assert.java:78)
at junit.framework.Assert.assertEquals(Assert.java:234)
at junit.framework.Assert.assertEquals(Assert.java:241)
at junit.framework.TestCase.assertEquals(TestCase.java:409)
at org.apache.ignite.internal.processors.igfs.IgfsStreamsSelfTest.testCreateFileFragmented(IgfsStreamsSelfTest.java:264)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at junit.framework.TestCase.runTest(TestCase.java:176)
at org.apache.ignite.testframework.junits.GridAbstractTest.runTestInternal(GridAbstractTest.java:2000)
at org.apache.ignite.testframework.junits.GridAbstractTest.access$000(GridAbstractTest.java:132)
at org.apache.ignite.testframework.junits.GridAbstractTest$5.run(GridAbstractTest.java:1915)
at java.lang.Thread.run(Thread.java:745)

Aug 09, 2017 1:15:56 AM org.apache.ignite.logger.java.JavaLogger error
SEVERE: DataStreamer operation failed.
class org.apache.ignite.IgniteCheckedException: Data streamer has been cancelled: DataStreamerImpl [rcvr=org.apache.ignite.internal.processors.datastreamer.DataStreamerCacheUpdaters$BatchedSorted@13c54950, ioPlcRslvr=null, cacheName=igfs-internal-igfs-data, bufSize=512, parallelOps=16, timeout=-1, autoFlushFreq=0, bufMappings={908d1a4c-b352-4af5-b039-ded60c20=Buffer [node=TcpDiscoveryNode [id=908d1a4c-b352-4af5-b039-ded60c20, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1502241356486, loc=true, ver=2.2.0#19700101-sha1:, isClient=false], isLocNode=true, idGen=0, sem=java.util.concurrent.Semaphore@2bdbd7f0[Permits = 16], batchTopVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], entriesCnt=1, locFutsSize=0, reqsSize=0]}, cacheObjProc=GridProcessorAdapter [], cacheObjCtx=org.apache.ignite.internal.processors.cache.binary.CacheObjectBinaryContext@6e3de40e, cancelled=true, failCntr=0, activeFuts=GridConcurrentHashSet [elements=[GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, hash=431192362], GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, hash=625896337], GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, hash=1440203156]]], jobPda=null, depCls=null, fut=DataStreamerFuture [super=GridFutureAdapter [ignoreInterrupts=false, state=DONE, res=null, hash=1612913644]], publicFut=IgniteFuture [orig=DataStreamerFuture [super=GridFutureAdapter [ignoreInterrupts=false, state=DONE, res=null, hash=1612913644]]], disconnectErr=null, closed=true, lastFlushTime=1502241356435, skipStore=false, keepBinary=false, maxRemapCnt=32, remapSem=java.util.concurrent.Semaphore@194e0ba1[Permits = 2147483647], remapOwning=false]
at org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$5.apply(DataStreamerImpl.java:865)
at org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$5.apply(DataStreamerImpl.java:834)
at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:382)
at org.apache.ignite.internal.util.future.GridFutureAdapter.unblock(GridFutureAdapter.java:346)
at org.apache.ignite.internal.util.future.GridFutureAdapter.unblockAll(GridFutureAdapter.java:334)
at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:494)
at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:473)
at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:461)
at org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$Buffer.onNodeLeft(DataStreamerImpl.java:1757)
at org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$6.run(DataStreamerImpl.java:952)
at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6687)
at org.apache.ignite.internal.processors.closure.GridClosureProcessor$1.body(GridClosureProcessor.java:817)
at