Re: Thin client: transactions support
Sergey, yes, close is something like a silent rollback. But we can also implement this on the client side, just by sending a rollback and ignoring errors in the response.

On Wed, Mar 27, 2019 at 00:04, Sergey Kozlov wrote:

> Nikolay
>
> Do I understand your points correctly:
>
> - close: rollback
> - commit, close: do nothing
> - rollback, close: do what? (I suppose nothing)
>
> Also, do you assume that after commit/rollback we may need to free some
> resources on the server node(s), or only on the client that started the TX?
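The client-side variant of close() mentioned in this reply (send a rollback and ignore errors in the response) could look roughly like the sketch below. All names here (SilentCloseSketch, RollbackOp, Tx) are hypothetical and are not part of the Ignite thin client; RollbackOp only stands in for whatever would send OP_TX_ROLLBACK.

```java
/** Sketch only: every name here is hypothetical, not an Ignite API. */
final class SilentCloseSketch {
    /** Stand-in for the call that sends a rollback request to the server. */
    interface RollbackOp {
        void run() throws Exception;
    }

    /** A transaction whose close() is a rollback with errors swallowed. */
    static final class Tx implements AutoCloseable {
        private final RollbackOp rollbackOp;

        Tx(RollbackOp rollbackOp) {
            this.rollbackOp = rollbackOp;
        }

        /** Explicit rollback: an error in the response reaches the caller. */
        void rollback() throws Exception {
            rollbackOp.run();
        }

        /** Close: the same request, but any error in the response is ignored. */
        @Override public void close() {
            try {
                rollbackOp.run();
            }
            catch (Exception ignored) {
                // For example, the transaction was already committed or rolled back.
            }
        }
    }

    /** Returns true if close() completes normally although the rollback fails. */
    static boolean closeSwallowsErrors() {
        Tx tx = new Tx(() -> { throw new IllegalStateException("tx already finished"); });
        tx.close();
        return true;
    }

    /** Returns true if an explicit rollback() reports the same failure. */
    static boolean rollbackPropagatesErrors() {
        Tx tx = new Tx(() -> { throw new IllegalStateException("tx already finished"); });
        try {
            tx.rollback();
            return false;
        }
        catch (Exception e) {
            return true;
        }
    }
}
```

The trade-off of this variant is that close() cannot tell "already committed" apart from a real failure, which is one argument for a dedicated OP_TX_CLOSE.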
Re: GridDhtInvalidPartitionException takes the cluster down
Folks, thanks for sharing details and inputs. This is helpful. Since I spend a lot of time working with Ignite users, I'll look into this topic in a couple of days and propose some changes. In the meantime, here is a fresh report on the user list:
http://apache-ignite-users.70518.x6.nabble.com/Triggering-Rebalancing-Programmatically-get-error-while-requesting-td27651.html

-
Denis

On Tue, Mar 26, 2019 at 9:04 AM Andrey Gura wrote:

> CleanupWorker termination can lead to the following effects:
>
> - Queries can retrieve data that should have expired, so the application
> will behave incorrectly.
> - Memory and/or disk can overflow because entries weren't expired.
> - Performance degradation is possible due to unmanageable data set growth.
>
> On Tue, Mar 26, 2019 at 4:58 PM Roman Shtykh wrote:
> >
> > Vyacheslav, if you are talking about this particular case I described, I
> > believe it has no influence on PME. What could happen is that the
> > CleanupWorker thread dies (which is not good either). But I believe we are
> > talking in a wider scope.
> >
> > -- Roman
> >
> > On Tuesday, March 26, 2019, 10:23:30 p.m. GMT+9, Vyacheslav Daradur <daradu...@gmail.com> wrote:
> >
> > In general I agree with Andrey, the handler is very useful in itself. It
> > lets us know that 'GridDhtInvalidPartitionException' is not
> > processed properly by the worker in the PME process.
> >
> > Nikolay, look at the code: if the Failure Handler handles an exception, this
> > means that the while-true loop in the worker's body has been interrupted by
> > an unexpected exception and the thread has completed its lifecycle.
> >
> > Without the Failure Handler, in the current case, the cluster will hang,
> > because the node is unable to participate in the PME process.
> >
> > So, the problem is the incorrect handling of the exception in PME's task,
> > which should be fixed.
> >
> > On Tue, Mar 26, 2019 at 14:24, Andrey Kuznetsov wrote:
> >
> > > Nikolay,
> > >
> > > Feel free to suggest better error messages to indicate internal/critical
> > > failures. User actions in response to critical failures are rather limited:
> > > mail to the user list or maybe file an issue. As for repetitive warnings, it
> > > makes sense, but requires additional stuff to deliver such signals; mere
> > > spamming to the log will not have an effect.
> > >
> > > Anyway, when experienced committers suggest disabling failure handling and
> > > hiding existing issues, I feel as if they are pulling my leg.
> > >
> > > Best regards,
> > > Andrey Kuznetsov.
> > >
> > > On Tue, Mar 26, 2019, 13:30 Nikolay Izhikov nizhi...@apache.org wrote:
> > >
> > > > Andrey.
> > > >
> > > > > the thread can be made non-critical, and we can restart it every time
> > > > > it dies
> > > >
> > > > Why can't we restart a critical thread?
> > > > What is the root difference between critical and non-critical threads?
> > > >
> > > > > It's much simpler to catch and handle all exceptions in critical
> > > > > threads
> > > >
> > > > I don't agree with you.
> > > > We develop Ignite not because it is simple!
> > > > We must spend extra time to make it robust and resilient to failures.
> > > >
> > > > > Failure handling is a last-chance tool that reveals internal Ignite
> > > > > errors
> > > > > 100% agree with you: overcome, but not hide.
> > > >
> > > > Logging a stack trace with a proper explanation is not hiding.
> > > > Killing nodes and the whole cluster is not "handling".
> > > >
> > > > > As far as I see from user-list messages, our users are qualified enough
> > > > > to provide necessary information from their cluster-wide logs.
> > > >
> > > > We shouldn't develop our product only for users who are able to read
> > > > Ignite sources to decrypt the failure reason behind "starvation in
> > > > striped pool".
> > > >
> > > > Some of my questions remain unanswered :) :
> > > >
> > > > 1. How can a user know it's an Ignite bug? Where should this bug be
> > > > reported?
> > > > 2. Do we log it somewhere?
> > > > 3. Do we warn the user several times before shutdown?
> > > > 4. "starvation in striped pool": I think it's not a clear error message.
> > > > Let's make it more specific!
> > > > 5. Let's write to the user log what he or she should do to prevent this
> > > > error in the future.
> > > >
> > > > On Tue, Mar 26, 2019 at 12:13, Andrey Kuznetsov wrote:
> > > >
> > > > > Nikolay,
> > > > >
> > > > > > Why we can't restart some thread?
> > > > > Technically, we can. It's just a matter of design: the thread can be made
> > > > > non-critical, and we can restart it every time it dies. But such a design
> > > > > looks poor to me. It's much simpler to catch and handle all exceptions in
> > > > > critical threads. Failure handling is a last-chance tool that reveals
> > > > > internal Ignite errors. It's not pleasant for us when users see these
> > > > > errors, but it's better than hiding them.
> > > > >
> > > > > > Actually, distributed systems are designed to overcome some bugs,
> > > > > > thread
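To make the two positions in this thread concrete, here is a minimal sketch of a worker loop that handles expected exceptions per iteration and hands anything unexpected to a last-chance failure handler, after which the loop (and so the thread) is finished. This is not Ignite code: WorkerLoopSketch, Task, and the choice of IllegalStateException as the "expected" exception type are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch only: hypothetical names, not Ignite's worker or failure handler API. */
final class WorkerLoopSketch {
    /** One unit of work processed by the loop. */
    interface Task {
        void run() throws Exception;
    }

    /**
     * Runs the tasks in order and returns how many completed normally.
     * Expected exceptions are handled inside the loop; any other throwable
     * ends the loop and is passed to the failure handler.
     */
    static int runWorker(List<Task> tasks, Consumer<Throwable> failureHandler) {
        int done = 0;

        try {
            for (Task task : tasks) {
                try {
                    task.run();
                    done++;
                }
                catch (IllegalStateException expected) {
                    // Known, recoverable condition: handle it here and keep looping.
                }
            }
        }
        catch (Throwable unexpected) {
            // The loop has been left: this worker processes nothing from here on.
            failureHandler.accept(unexpected);
        }

        return done;
    }

    /** Returns {completed tasks, failures reported} for a small fixed scenario. */
    static int[] demo() {
        List<Task> tasks = new ArrayList<>();
        tasks.add(() -> { });
        tasks.add(() -> { throw new IllegalStateException("expected"); });
        tasks.add(() -> { });
        tasks.add(() -> { throw new Exception("unexpected"); });
        tasks.add(() -> { }); // Never runs: the loop is already dead.

        List<Throwable> failures = new ArrayList<>();
        int done = runWorker(tasks, failures::add);

        return new int[] {done, failures.size()};
    }
}
```

In this scenario the last task is never processed, which is the situation described above: once the failure handler sees the exception, the worker has already stopped, so the real fix is to handle the exception inside the loop.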
[jira] [Created] (IGNITE-11634) SQL delete query failed to deserialize DmlStatementsProcessor$ModifyingEntryProcessor
Roman Guseinov created IGNITE-11634:
---

             Summary: SQL delete query failed to deserialize DmlStatementsProcessor$ModifyingEntryProcessor
                 Key: IGNITE-11634
                 URL: https://issues.apache.org/jira/browse/IGNITE-11634
             Project: Ignite
          Issue Type: Bug
          Components: sql
    Affects Versions: 2.7
            Reporter: Roman Guseinov
            Assignee: Roman Guseinov

Here is a stack trace
{code:java}
Exception in thread "main" javax.cache.CacheException: Failed to deserialize object [typeName=org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor$ModifyingEntryProcessor]
	at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.query(IgniteCacheProxyImpl.java:635)
	at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.query(IgniteCacheProxyImpl.java:574)
	at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.query(GatewayProtectedCacheProxy.java:356)
	at org.gridgain.reproducers.sql.JavaSqlClient.deleteRow(JavaSqlClient.java:42)
	at org.gridgain.reproducers.sql.JavaSqlClient.run(JavaSqlClient.java:33)
	at org.gridgain.reproducers.sql.JavaSqlClient.main(JavaSqlClient.java:28)
Caused by: class org.apache.ignite.internal.processors.query.IgniteSQLException: Failed to deserialize object [typeName=org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor$ModifyingEntryProcessor]
	at org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor.doDelete(DmlStatementsProcessor.java:686)
	at org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor.processDmlSelectResult(DmlStatementsProcessor.java:587)
	at org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor.executeUpdateStatement(DmlStatementsProcessor.java:539)
	at org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor.updateSqlFields(DmlStatementsProcessor.java:171)
	at org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor.updateSqlFieldsDistributed(DmlStatementsProcessor.java:345)
	at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.doRunPrepared(IgniteH2Indexing.java:1753)
	at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.querySqlFields(IgniteH2Indexing.java:1718)
	at org.apache.ignite.internal.processors.query.GridQueryProcessor$3.applyx(GridQueryProcessor.java:2007)
	at org.apache.ignite.internal.processors.query.GridQueryProcessor$3.applyx(GridQueryProcessor.java:2002)
	at org.apache.ignite.internal.util.lang.IgniteOutClosureX.apply(IgniteOutClosureX.java:36)
	at org.apache.ignite.internal.processors.query.GridQueryProcessor.executeQuery(GridQueryProcessor.java:2550)
	at org.apache.ignite.internal.processors.query.GridQueryProcessor.querySqlFields(GridQueryProcessor.java:2016)
	at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.query(IgniteCacheProxyImpl.java:623)
	... 5 more
Caused by: java.sql.SQLException: Failed to deserialize object [typeName=org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor$ModifyingEntryProcessor]
	at org.apache.ignite.internal.processors.query.h2.dml.DmlBatchSender.processPage(DmlBatchSender.java:225)
	at org.apache.ignite.internal.processors.query.h2.dml.DmlBatchSender.sendBatch(DmlBatchSender.java:184)
	at org.apache.ignite.internal.processors.query.h2.dml.DmlBatchSender.flush(DmlBatchSender.java:135)
	at org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor.doDelete(DmlStatementsProcessor.java:668)
	... 17 more
Caused by: class org.apache.ignite.IgniteCheckedException: Failed to deserialize object [typeName=org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor$ModifyingEntryProcessor]
	at org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:10045)
	at org.apache.ignite.internal.processors.cache.GridCacheMessage.unmarshalCollection(GridCacheMessage.java:650)
	at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicFullUpdateRequest.finishUnmarshal(GridNearAtomicFullUpdateRequest.java:405)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.unmarshall(GridCacheIoManager.java:1609)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:586)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:392)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:318)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:109)
	at
Re: Thin client: transactions support
Nikolay

Do I understand your points correctly:

- close: rollback
- commit, close: do nothing
- rollback, close: do what? (I suppose nothing)

Also, do you assume that after commit/rollback we may need to free some resources on the server node(s), or only on the client that started the TX?

On Tue, Mar 26, 2019 at 10:41 PM Alex Plehanov wrote:

> Sergey, we have the close() method in the thick client; its behavior is
> slightly different from the rollback() method (it should roll back if the
> transaction is not committed and do nothing if the transaction is already
> committed). I think we should support try-with-resources semantics in the
> thin client, and OP_TX_CLOSE will be useful here.
>
> Nikolay, suspend/resume doesn't work yet for pessimistic transactions. Also,
> the main goal of the suspend/resume operations is to support passing a
> transaction between threads. In the thin client, the transaction is bound to
> the client connection, not the client thread. I think passing a transaction
> between different client connections is not a very useful case.
Re: Thin client: transactions support
Sergey, we have the close() method in the thick client; its behavior is slightly different from the rollback() method (it should roll back if the transaction is not committed and do nothing if the transaction is already committed). I think we should support try-with-resources semantics in the thin client, and OP_TX_CLOSE will be useful here.

Nikolay, suspend/resume doesn't work yet for pessimistic transactions. Also, the main goal of the suspend/resume operations is to support passing a transaction between threads. In the thin client, the transaction is bound to the client connection, not the client thread. I think passing a transaction between different client connections is not a very useful case.

On Tue, Mar 26, 2019 at 22:17, Nikolay Izhikov wrote:

> Hello, Alex.
>
> We also have suspend and resume operations.
> I think we should support them.
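The close() behavior described in this message (roll back if the transaction is not committed, do nothing if it is already finished) is what makes try-with-resources usable. A minimal in-memory sketch follows; TxCloseSemanticsSketch and its members are hypothetical names, and throwing IllegalStateException on a second commit or rollback is an assumption of this sketch, not something stated in the thread.

```java
/** Sketch only: an in-memory model of the close() semantics, not an Ignite API. */
final class TxCloseSemanticsSketch {
    enum State { ACTIVE, COMMITTED, ROLLED_BACK }

    /** What the body of the try block does before close() runs. */
    enum Action { NONE, COMMIT, ROLLBACK }

    static final class Tx implements AutoCloseable {
        private State state = State.ACTIVE;

        void commit() {
            finish(State.COMMITTED);
        }

        void rollback() {
            finish(State.ROLLED_BACK);
        }

        /** Rolls back only if the transaction is still active; otherwise does nothing. */
        @Override public void close() {
            if (state == State.ACTIVE)
                state = State.ROLLED_BACK;
        }

        State state() {
            return state;
        }

        /** Assumption of this sketch: finishing an already finished transaction is an error. */
        private void finish(State target) {
            if (state != State.ACTIVE)
                throw new IllegalStateException("Transaction already finished: " + state);

            state = target;
        }
    }

    /** Runs the given action inside try-with-resources and returns the final state. */
    static State runThenClose(Action action) {
        Tx tx = new Tx();

        try (Tx t = tx) {
            if (action == Action.COMMIT)
                t.commit();
            else if (action == Action.ROLLBACK)
                t.rollback();
        }

        return tx.state();
    }
}
```

With this model the three sequences asked about earlier in the thread come out as: close alone rolls back, commit then close stays committed, and rollback then close stays rolled back.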
Re: Thin client: transactions support
Hello, Alex.

We also have suspend and resume operations.
I think we should support them.

On Tue, Mar 26, 2019, 22:07 Sergey Kozlov wrote:

> Hi
>
> Looks like I missed something, but why do we need the OP_TX_CLOSE operation?
>
> Also, I suggest reserving a code for a SAVEPOINT operation, which is very
> useful for understanding where a transaction has been rolled back.
Re: Thin client: transactions support
Hi Looks like I missed something but why we need OP_TX_CLOSE operation? Also I suggest to reserve a code for SAVEPOINT operation which very useful to understand where transaction has been rolled back On Tue, Mar 26, 2019 at 6:07 PM Alex Plehanov wrote: > Hello Igniters! > > I want to pick up the ticket IGNITE-7369 and add transactions support to > our thin client implementation. > I've looked at our current implementation and have some proposals to > support transactions: > > Add new operations to thin client protocol: > > OP_TX_GET, 4000, Get current transaction for client connection > OP_TX_START, 4001, Start a new transaction > OP_TX_COMMIT, 4002, Commit transaction > OP_TX_ROLLBACK, 4003, Rollback transaction > OP_TX_CLOSE, 4004, Close transaction > > From the client side (java) new interfaces will be added: > > public interface ClientTransactions { > public ClientTransaction txStart(); > public ClientTransaction txStart(TransactionConcurrency concurrency, > TransactionIsolation isolation); > public ClientTransaction txStart(TransactionConcurrency concurrency, > TransactionIsolation isolation, long timeout, int txSize); > public ClientTransaction tx(); // Get current connection transaction > public ClientTransactions withLabel(String lb); > } > > public interface ClientTransaction extends AutoCloseable { > public IgniteUuid xid(); // Do we need it? > public TransactionIsolation isolation(); > public TransactionConcurrency concurrency(); > public long timeout(); > public String label(); > > public void commit(); > public void rollback(); > public void close(); > } > > From the server side, I think as a first step (while transactions > suspend/resume is not fully implemented) we can use the same approach as > for JDBC: add a new worker to each ClientRequestHandler and process > requests by this worker if the transaction is started explicitly. 
> ClientRequestHandler is bound to client connection, so there will be 1:1 > relation between client connection and thread, which process operations in > a transaction. > > Also, there is a couple of issues I want to discuss: > > We have overloaded method txStart with a different set of arguments. Some > of the arguments may be missing. To pass arguments with OP_TX_START > operation we have the next options: > * Serialize full set of arguments and use some value for missing > arguments. For example -1 for int/long types and null for string type. We > can't use 0 for int/long types since 0 it's a valid value for concurrency, > isolation and timeout arguments. > * Serialize arguments as a collection of property-value pairs (like it's > implemented now for CacheConfiguration). In this case only explicitly > provided arguments will be serialized. > Which way is better? The simplest solution is to use the first option and I > want to use it if there were no objections. > > Do we need transaction id (xid) on the client side? > If yes, we can pass xid along with OP_TX_COMMIT, OP_TX_ROLLBACK, > OP_TX_CLOSE operations back to the server and do additional check on the > server side (current transaction id for connection == transaction id passed > from client side). This, perhaps, will protect clients against some errors > (for example when client try to commit outdated transaction). But > currently, we don't have data type IgniteUuid in thin client protocol. Do > we need to add it too? > Also, we can pass xid as a string just to inform the client and do not pass > it back to the server with commit/rollback operation. > Or not to pass xid at all (.NET thick client works this way as far as I > know). > > What do you think? > > ср, 7 мар. 2018 г. в 16:22, Vladimir Ozerov : > > > We already have transactions support in JDBC driver in TX SQL branch > > (ignite-4191). Currently it is implemented through separate thread, which > > is not that efficient. 
Ideally we need to finish decoupling transactions > > from threads. But alternatively we can change the logic on how we assign > > thread ID to specific transaction and "impersonate" thin client worker > > threads when serving requests from multiple users. > > > > > > > > On Tue, Mar 6, 2018 at 10:01 PM, Denis Magda wrote: > > > > > Here is an original discussion with a reference to the JIRA ticket: > > > http://apache-ignite-developers.2346864.n4.nabble. > > > com/Re-Transaction-operations-using-the-Ignite-Thin-Client- > > > Protocol-td25914.html > > > > > > -- > > > Denis > > > > > > On Tue, Mar 6, 2018 at 9:18 AM, Dmitriy Setrakyan < > dsetrak...@apache.org > > > > > > wrote: > > > > > > > Hi Dmitriy. I don't think we have a design proposal for transaction > > > support > > > > in thin clients. Do you mind taking this initiative and creating an > IEP > > > on > > > > Wiki? > > > > > > > > D. > > > > > > > > On Tue, Mar 6, 2018 at 8:46 AM, Dmitriy Govorukhin < > > > > dmitriy.govoruk...@gmail.com> wrote: > > > > > > > > > Hi, Igniters. > > > > > > > >
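On the OP_TX_CLOSE question above: the proposed ClientTransaction extends AutoCloseable, and elsewhere in this thread close() is described as "roll back if not committed, do nothing if already committed". A minimal sketch of that behaviour under try-with-resources follows. SketchTx and TxCloseSketch are hypothetical stand-ins, not the proposed API: they only record operation names, and whether a close after commit sends nothing at all (as here) or is sent and ignored by the server is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the proposed ClientTransaction; it only records what it would send. */
class SketchTx implements AutoCloseable {
    /** Operation names that would be sent to the server, in order. */
    final List<String> sent = new ArrayList<>();

    /** True once the transaction has been committed, rolled back or closed. */
    private boolean finished;

    void commit() {
        sent.add("OP_TX_COMMIT");
        finished = true;
    }

    void rollback() {
        sent.add("OP_TX_ROLLBACK");
        finished = true;
    }

    /** Acts only if the transaction has not been finished yet (assumed "silent rollback"). */
    @Override public void close() {
        if (!finished) {
            sent.add("OP_TX_CLOSE");
            finished = true;
        }
    }
}

class TxCloseSketch {
    /** Runs a transaction body under try-with-resources and returns the operations sent. */
    static List<String> run(boolean commit, boolean fail) {
        SketchTx tx = new SketchTx();

        try (SketchTx t = tx) {
            if (fail)
                throw new IllegalStateException("simulated failure inside the transaction");

            if (commit)
                t.commit();
        }
        catch (IllegalStateException e) {
            // close() has already run by the time control reaches this block.
        }

        return tx.sent;
    }

    public static void main(String[] args) {
        System.out.println(run(true, false));  // committed, then closed
        System.out.println(run(false, false)); // closed without commit
        System.out.println(run(false, true));  // body failed before commit
    }
}
```

The point of the sketch is that the caller never has to decide between commit and rollback in a finally block; close() covers the "not committed" paths, which is what a plain rollback() cannot do without raising an error after a successful commit.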
Re: Ignite 2.7.5 Release scope
Hi, I've cherry-picked this commit. It seems it is critical because it also fixes storage corruption. Sincerely, Dmitriy Pavlov вт, 26 мар. 2019 г. в 14:14, Zhenya Stanilovsky : > I suppose this ticket [1] : is very useful too. > > > [1] https://issues.apache.org/jira/browse/IGNITE-10873 [ > CorruptedTreeException during simultaneous cache put operations ] > > > > > > >--- Forwarded message --- > >From: "Alexey Goncharuk" < alexey.goncha...@gmail.com > > >To: dev < dev@ignite.apache.org > > >Cc: > >Subject: Re: Ignite 2.7.5 Release scope > >Date: Tue, 26 Mar 2019 13:42:59 +0300 > > > >Hello Ilya, > > > >I do not see any issues with the mentioned test. I see the following > output > >in the logs: > > > >[21:41:44] : [Step 4/5] [2019-03-22 21:41:44,970][INFO ][main][root] >>> > >Stopping test: > > >TcpDiscoveryCoordinatorFailureTest#testCoordinatorFailedNoAddFinishedMessageStartOneNode > >in 37768 ms <<< > >[21:41:44] : [Step 4/5] [2019-03-22 21:41:44,971][INFO ][main][root] >>> > >Stopping test class: TcpDiscoveryCoordinatorFailureTest <<< > >[21:41:44] : [Step 4/5] [2019-03-22 21:41:44,971][INFO ][main][root] >>> > >Starting test class: IgniteClientConnectTest <<< > > > >The issue with Windows may be long connection timeouts, in this case we > >should either split the suite into multiple ones or decrease the SPI > >timeouts. > > > >пн, 25 мар. 2019 г. в 11:24, Ilya Kasnacheev < ilya.kasnach...@gmail.com > >: > > > >> Hello! > >> > >> It seems that I can no longer test this case, on account of > >> > >> > TcpDiscoveryCoordinatorFailureTest#testClusterFailedNewCoordinatorInitialized > >> hanging every time under Java 11 on Windows. > >> > >> Alexey, Ivan, can you please take a look? > >> > >> > >> > https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__ > >> > >> Regards, > >> > >> -- > >> Ilya Kasnacheev > >> > >> > >> пт, 22 мар. 2019 г. 
в 16:59, Ilya Kasnacheev < > ilya.kasnach...@gmail.com >: > >> > >> > Hello! > >> > > >> > Basically there is a test that explicitly highlights this problem, > that > >> is > >> > running SSL tests on Windows + Java 11. They will hang on Master but > >> pass > >> > with this patch. > >> > > >> > I have started that on TC, results will probably be available later > >> today: > >> > > >> > > >> > https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__ > >> > (mind the Java version). > >> > > >> > Regards, > >> > -- > >> > Ilya Kasnacheev > >> > > >> > > >> > пт, 22 мар. 2019 г. в 14:13, Maxim Muzafarov < maxmu...@gmail.com >: > >> > > >> >> Dmitry, Ilya, > >> >> > >> >> Yes, I've looked through those changes [1] as they can affect my > local > >> >> PR. Basically, changes look good to me. > >> >> > >> >> I'm not an expert with CommunicationSpi component, so can miss some > >> >> details and I haven't tested these changes under Java 11. One more > >> >> thing I'd like to say, I would add additional tests to PR that will > >> >> explicitly highlight the problem being solved. > >> >> > >> >> > >> >> [1] https://issues.apache.org/jira/browse/IGNITE-11299 > >> >> > >> >> On Thu, 21 Mar 2019 at 22:57, Dmitriy Pavlov < dpav...@apache.org > > >> wrote: > >> >> > > >> >> > Hi Igniters, > >> >> > > >> >> > fix https://issues.apache.org/jira/browse/IGNITE-11299 Avoid busy > >> wait > >> >> on > >> >> > processWrite during SSL handshake. > >> >> > seems to be blocker cause it is related to Java 11 > >> >> > > >> >> > I see Maxim M left some comments. Ilya K., Maxim M.were these > >> comments > >> >> > addressed? > >> >> > > >> >> > The ticket is in Patch Available. Reviewer needed. Changes located > >> in > >> >> > GridNioServer. > >> >> > > >> >> > Sincerely, > >> >> > Dmitriy Pavlov > >> >> > > >> >> > P.S. 
a quite obvious ticket came to sope, as well: > >> >> > https://issues.apache.org/jira/browse/IGNITE-11600 > >> >> > > >> >> > > >> >> > чт, 21 мар. 2019 г. в 16:55, Petr Ivanov < mr.wei...@gmail.com >: > >> >> > > >> >> > > Huge +1 > >> >> > > > >> >> > > Will try to add new JDK in nearest time to our Teamcity. > >> >> > > > >> >> > > > >> >> > > > On 21 Mar 2019, at 16:27, Dmitriy Pavlov < dpav...@apache.org > > > >> >> wrote: > >> >> > > > > >> >> > > > Hi Igniters, > >> >> > > > > >> >> > > > Meanwhile, Java 12 GA is available. I suggest at least test our > >> new > >> >> tests > >> >> > > > scripts with a couple of Java builds. WDYT? > >> >> > > > > >> >> > > > Sincerely, > >> >> > > > Dmitriy Pavlov > >> >> > > > > >> >> > > > ср, 20 мар. 2019 г. в 19:21, Dmitriy Pavlov > >> < dpav...@apache.org >: > >> >> > > > > >> >> > > >> Hi Ignite Developers, > >> >> > > >> > >> >> > > >> In a separate discussion, I've shared a log with all commits. > >> >> > > >> > >> >> > > >> As far as I can see, nobody removed commits from this sheet, > so > >> the > >> >> > > scope > >> >> > > >> of release will be
Re: Ignite 2.7.5 Release scope
Hello.

Yes, locally this test seems to pass. However, no luck on TC. Maybe my commit sits on top of an especially unlucky HEAD.

Anyway, my point was that TcpDiscoverySslTrustedUntrustedTest (or any other intra-node SSL test) is a sufficient test for IGNITE-11299. It will very reliably hang on Windows/Java 11 without the patch and will always pass with my patch (and TLSv1.2). So no additional test is needed - we are testing a known regression here.

Regards,
--
Ilya Kasnacheev

On Tue, Mar 26, 2019 at 13:43, Alexey Goncharuk wrote:

> Hello Ilya,
>
> I do not see any issues with the mentioned test. I see the following output in the logs:
>
> [21:41:44] : [Step 4/5] [2019-03-22 21:41:44,970][INFO ][main][root] >>> Stopping test: TcpDiscoveryCoordinatorFailureTest#testCoordinatorFailedNoAddFinishedMessageStartOneNode in 37768 ms <<<
> [21:41:44] : [Step 4/5] [2019-03-22 21:41:44,971][INFO ][main][root] >>> Stopping test class: TcpDiscoveryCoordinatorFailureTest <<<
> [21:41:44] : [Step 4/5] [2019-03-22 21:41:44,971][INFO ][main][root] >>> Starting test class: IgniteClientConnectTest <<<
>
> The issue with Windows may be long connection timeouts, in this case we should either split the suite into multiple ones or decrease the SPI timeouts.
>
> On Mon, Mar 25, 2019 at 11:24, Ilya Kasnacheev wrote:
>
> > Hello!
> >
> > It seems that I can no longer test this case, on account of TcpDiscoveryCoordinatorFailureTest#testClusterFailedNewCoordinatorInitialized hanging every time under Java 11 on Windows.
> >
> > Alexey, Ivan, can you please take a look?
> >
> > https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__
> >
> > Regards,
> >
> > --
> > Ilya Kasnacheev
> >
> > On Fri, Mar 22, 2019 at 16:59, Ilya Kasnacheev wrote:
> >
> > > Hello!
> > >
> > > Basically there is a test that explicitly highlights this problem, that is running SSL tests on Windows + Java 11.
They will hang on Master but > pass > > > with this patch. > > > > > > I have started that on TC, results will probably be available later > > today: > > > > > > > > > https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__ > > > (mind the Java version). > > > > > > Regards, > > > -- > > > Ilya Kasnacheev > > > > > > > > > пт, 22 мар. 2019 г. в 14:13, Maxim Muzafarov : > > > > > >> Dmitry, Ilya, > > >> > > >> Yes, I've looked through those changes [1] as they can affect my local > > >> PR. Basically, changes look good to me. > > >> > > >> I'm not an expert with CommunicationSpi component, so can miss some > > >> details and I haven't tested these changes under Java 11. One more > > >> thing I'd like to say, I would add additional tests to PR that will > > >> explicitly highlight the problem being solved. > > >> > > >> > > >> [1] https://issues.apache.org/jira/browse/IGNITE-11299 > > >> > > >> On Thu, 21 Mar 2019 at 22:57, Dmitriy Pavlov > > wrote: > > >> > > > >> > Hi Igniters, > > >> > > > >> > fix https://issues.apache.org/jira/browse/IGNITE-11299 Avoid busy > > wait > > >> on > > >> > processWrite during SSL handshake. > > >> > seems to be blocker cause it is related to Java 11 > > >> > > > >> > I see Maxim M left some comments. Ilya K., Maxim M.were these > comments > > >> > addressed? > > >> > > > >> > The ticket is in Patch Available. Reviewer needed. Changes located > in > > >> > GridNioServer. > > >> > > > >> > Sincerely, > > >> > Dmitriy Pavlov > > >> > > > >> > P.S. a quite obvious ticket came to sope, as well: > > >> > https://issues.apache.org/jira/browse/IGNITE-11600 > > >> > > > >> > > > >> > чт, 21 мар. 2019 г. в 16:55, Petr Ivanov : > > >> > > > >> > > Huge +1 > > >> > > > > >> > > Will try to add new JDK in nearest time to our Teamcity. 
> > >> > > > > >> > > > > >> > > > On 21 Mar 2019, at 16:27, Dmitriy Pavlov > > >> wrote: > > >> > > > > > >> > > > Hi Igniters, > > >> > > > > > >> > > > Meanwhile, Java 12 GA is available. I suggest at least test our > > new > > >> tests > > >> > > > scripts with a couple of Java builds. WDYT? > > >> > > > > > >> > > > Sincerely, > > >> > > > Dmitriy Pavlov > > >> > > > > > >> > > > ср, 20 мар. 2019 г. в 19:21, Dmitriy Pavlov >: > > >> > > > > > >> > > >> Hi Ignite Developers, > > >> > > >> > > >> > > >> In a separate discussion, I've shared a log with all commits. > > >> > > >> > > >> > > >> As far as I can see, nobody removed commits from this sheet, so > > the > > >> > > scope > > >> > > >> of release will be discussed in another way: only explicitly > > >> declared > > >> > > >> commits will be cherry-picked. > > >> > > >> > > >> > > >> Sincerely, > > >> > > >> Dmitriy Pavlov > > >> > > >> > > >> > > > > >> > > > > >> > > > > > >
Re: GridDhtInvalidPartitionException takes the cluster down
CleanupWorker termination can lead to the following effects:

- Queries can retrieve data that should already have expired, so the application will behave incorrectly.
- Memory and/or disk can overflow because entries weren't expired.
- Performance can degrade because of unmanaged data set growth.

On Tue, Mar 26, 2019 at 4:58 PM Roman Shtykh wrote:
>
> Vyacheslav, if you are talking about this particular case I described, I believe it has no influence on PME. What could happen is having CleanupWorker thread dead (which is not good too). But I believe we are talking in a wider scope.
>
> -- Roman
>
> On Tuesday, March 26, 2019, 10:23:30 p.m. GMT+9, Vyacheslav Daradur wrote:
>
> In general I agree with Andrey, the handler is very useful itself. It allows us to become know that ‘GridDhtInvalidPartitionException’ is not processed properly in PME process by worker.
>
> Nikolay, look at the code, if Failure Handler handles an exception - this means that while-true loop in worker’s body has been interrupted with unexpected exception and thread is completed his lifecycle.
>
> Without Failure Handler, in the current case, the cluster will hang, because of unable to participate in PME process.
>
> So, the problem is the incorrect handling of the exception in PME’s task which should be fixed.
>
> On Tue, Mar 26, 2019 at 14:24, Andrey Kuznetsov wrote:
>
> > Nikolay,
> >
> > Feel free to suggest better error messages to indicate internal/critical failures. User actions in response to critical failures are rather limited: mail to user-list or maybe file an issue. As for repetitive warnings, it makes sense, but requires additional stuff to deliver such signals, mere spamming to log will not have an effect.
> >
> > Anyway, when experienced committers suggest to disable failure handling and hide existing issues, I feel as if they are pulling my leg.
> >
> > Best regards,
> > Andrey Kuznetsov.
> > > > вт, 26 марта 2019, 13:30 Nikolay Izhikov nizhi...@apache.org: > > > > > Andrey. > > > > > > > the thread can be made non-critical, and we can restart it every time > > it > > > dies > > > > > > Why we can't restart critical thread? > > > What is the root difference between critical and non critical threads? > > > > > > > It's much simpler to catch and handle all exceptions in critical > > threads > > > > > > I don't agree with you. > > > We develop Ignite not because it simple! > > > We must spend extra time to made it robust and resilient to the failures. > > > > > > > Failure handling is a last-chance tool that reveals internal Ignite > > > errors > > > > 100% agree with you: overcome, but not hide. > > > > > > Logging stack trace with proper explanation is not hiding. > > > Killing nodes and whole cluster is not "handling". > > > > > > > As far as I see from user-list messages, our users are qualified enough > > > to provide necessary information from their cluster-wide logs. > > > > > > We shouldn't develop our product only for users who are able to read > > Ignite > > > sources to decrypt the fail reason behind "starvation in stripped pool" > > > > > > Some of my questions remain unanswered :) : > > > > > > 1. How user can know it's an Ignite bug? Where this bug should be > > reported? > > > 2. Do we log it somewhere? > > > 3. Do we warn user before shutdown several times? > > > 4. "starvation in stripped pool" I think it's not clear error message. > > > Let's make it more specific! > > > 5. Let's write to the user log - what he or she should do to prevent this > > > error in future? > > > > > > > > > вт, 26 мар. 2019 г. в 12:13, Andrey Kuznetsov : > > > > > > > Nikolay, > > > > > > > > > Why we can't restart some thread? > > > > Technically, we can. It's just matter of design: the thread can be made > > > > non-critical, and we can restart it every time it dies. But such design > > > > looks poor to me. 
It's much simpler to catch and handle all exceptions > > in > > > > critical threads. Failure handling is a last-chance tool that reveals > > > > internal Ignite errors. It's not pleasant for us when users see these > > > > errors, but it's better than hiding. > > > > > > > > > Actually, distributed systems are designed to overcome some bugs, > > > thread > > > > failure, node failure, for example, isn't it? > > > > 100% agree with you: overcome, but not hide. > > > > > > > > > How user can know it's a bug? Where this bug should be reported? > > > > As far as I see from user-list messages, our users are qualified enough > > > to > > > > provide necessary information from their cluster-wide logs. > > > > > > > > > > > > вт, 26 мар. 2019 г. в 11:19, Nikolay Izhikov : > > > > > > > > > Andrey. > > > > > > > > > > > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no > > use > > > > to > > > > > wait for dead thread's magical resurrection. > > > > > > > > > > Why is it unrecoverable? > > > > > Why we can't restart some thread? > > > > > Is there some kind of
Re: GridDhtInvalidPartitionException takes the cluster down
Igniters,

1. First of all, I want to remind you why failure handlers were implemented. Please take a look at IEP-14 [1] and the corresponding discussion on the dev-list [2] (quite an emotional discussion). These sources also answer some questions from previous posts in this topic.

2. Note that the following failure types are ignored by default (BUT these fixes ARE NOT included in 2.7):

- SYSTEM_WORKER_BLOCKED: A critical thread being unresponsive for a long time is a problem, but we don't know why it happened (possibly a slow environment), so we just ignore this failure.
- SYSTEM_CRITICAL_OPERATION_TIMEOUT: At the moment it is related only to checkpoint read lock acquisition.

So we already have more or less adequate defaults.

3. About the SYSTEM_WORKER_TERMINATION failure type. Restarting the thread is a very bad idea because the system is already in an undefined state and its behavior is unpredictable from this point. For example, the discovery thread is a critical part of the discovery protocol. If the discovery thread on some node is terminated during discovery message processing then:

- The protocol is already broken because the message will not be sent to the next node in the ring, so we can't ignore this failure: the whole cluster will suffer in this case;
- We can restart the thread and even try to process the same message once again. And then what? The same error will happen with high probability and the discovery thread will be terminated again.

4. About enabling the failure handler for things like transactions or PME and having it off for checkpointing and something else. The failure handler is a general component. It isn't tied to a particular kind of functionality (e.g. tx, PME or checkpointing). We can only manage the behavior of the configured failure handler for a particular failure type. See p.2 above.

5. About providing hints on how to avoid the shutdown in the future. I really don't like analogies, but I believe one will be appropriate to our discussion.
What kind of hint can the JVM provide in case of an AssertionError? The same is true for the failure handler. The failure handler is the last resort, and the only thing the handler can provide is some information about the failure. In our case this information contains the failure context, thread name and thread dump.

6. About protection from a full cluster restart. The failure handler is a node-local entity. If the whole cluster is restarted/stopped due to some failure, it means only one thing: a critical failure happened on each cluster node. It means that we can't protect the cluster from shutting down in the current failure model. A more complex failure model can be implemented, which would require a decision about node stopping from all cluster nodes (or some subset - a quorum). But it requires additional research and discussion.

7. About user experience. Yes, the "starvation in stripped pool" message isn't clear enough for... hmmm... a user. But it is definitely clear for a developer. And I have no idea what a clear message for the user would be. So... Do you have an idea? You are welcome! It is easy to say that something is wrong but it is hard to make it right.

Also I believe that the user experience will not be better with a frozen cluster instead of a failed cluster. And the user will not be happier if we log more messages like "cluster will be stopped".

And unfortunately we can't explain to users what they should do in order to prevent this error in the future, because we ourselves don't know what to do in this case. Every failure is actually a bug that should be investigated and fixed. Fewer bugs is the thing that can improve user experience.

Links:

1. https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling
2. http://apache-ignite-developers.2346864.n4.nabble.com/Internal-problems-requiring-graceful-node-shutdown-reboot-etc-td24856.html

On Tue, Mar 26, 2019 at 4:58 PM Roman Shtykh wrote:
>
> Vyacheslav, if you are talking about this particular case I described, I believe it has no influence on PME.
What could happen is having CleanupWorker > thread dead (which is not good too).But I believe we are talking in a wider > scope. > > -- Roman > > > On Tuesday, March 26, 2019, 10:23:30 p.m. GMT+9, Vyacheslav Daradur > wrote: > > In general I agree with Andrey, the handler is very usefull itself. It > allows us to become know that ‘GridDhtInvalidPartitionException’ is not > processed properly in PME process by worker. > > Nikolay, look at the code, if Failure Handler hadles an exception - this > means that while-true loop in worker’s body has been interrupted with > unexpected exception and thread is completed his lifecycle. > > Without Failure Hanller, in the current case, the cluster will hang, > because of unable to participate in PME process. > > So, the problem is the incorrect handling of the exception in PME’s task > wich should be fixed. > > > вт, 26 марта 2019 г. в 14:24, Andrey Kuznetsov : > > > Nikolay, > > > > Feel free to suggest better error messages to indicate internal/critical > > failures. User actions in
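The two ideas discussed in this message (a worker whose loop is ended by an unexpected exception is reported as SYSTEM_WORKER_TERMINATION, and some failure types are ignored by default) can be sketched roughly as below. This is a self-contained illustration, not Ignite code: FailureHandlingSketch, Handler, onFailure and runWorker are hypothetical names, and the rule "stop the node unless the type is in the ignored set" is an assumption based on point 2 above.

```java
import java.util.EnumSet;
import java.util.Set;

class FailureHandlingSketch {
    /** Failure types named in the thread; the real enum may contain more values. */
    enum FailureType { SYSTEM_WORKER_TERMINATION, SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT }

    /** Hypothetical handler: logs the failure and says whether the node should be stopped. */
    static class Handler {
        private final Set<FailureType> ignored;

        Handler(Set<FailureType> ignored) {
            this.ignored = ignored;
        }

        /** Returns true when the node should be stopped, false when the failure type is ignored. */
        boolean onFailure(FailureType type, Throwable cause) {
            System.err.println("Critical failure [type=" + type +
                ", thread=" + Thread.currentThread().getName() + ", err=" + cause + ']');

            return !ignored.contains(type);
        }
    }

    /**
     * Runs a worker body for at most maxSteps iterations. An unexpected exception ends the loop,
     * so the worker is dead from this point on and the failure is passed to the handler.
     */
    static boolean runWorker(Runnable step, int maxSteps, Handler hnd) {
        try {
            for (int i = 0; i < maxSteps; i++)
                step.run();
        }
        catch (RuntimeException e) {
            return hnd.onFailure(FailureType.SYSTEM_WORKER_TERMINATION, e);
        }

        return false;
    }

    public static void main(String[] args) {
        Handler hnd = new Handler(EnumSet.of(
            FailureType.SYSTEM_WORKER_BLOCKED,
            FailureType.SYSTEM_CRITICAL_OPERATION_TIMEOUT));

        boolean stop = runWorker(() -> { throw new IllegalStateException("invalid partition"); }, 3, hnd);

        System.out.println("stop node: " + stop);
    }
}
```

Note that the sketch does not restart the worker: as argued in point 3, re-running the same step would most likely hit the same exception again.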
[jira] [Created] (IGNITE-11633) Fix errors in WAL disabled archive mode documentation
Alexey Goncharuk created IGNITE-11633:
-----------------------------------------

Summary: Fix errors in WAL disabled archive mode documentation
Key: IGNITE-11633
URL: https://issues.apache.org/jira/browse/IGNITE-11633
Project: Ignite
Issue Type: Task
Components: documentation
Reporter: Alexey Goncharuk

In https://apacheignite.readme.io/docs/write-ahead-log#section-disabling-wal-archiving there is an error. The documentation says that "instead, it will overwrite the active segments in a cyclical order". In fact, when walWork == walArchive, the whole folder behaves as a sequential log, where new files are sequentially created (0, 1, 2, 3, ...) and old files are eventually truncated.

Also, need to clarify the wal size setting in this mode. Ask [~dpavlov] and [~akalashnikov] for details.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
Re: Thin client: transactions support
Hello Igniters!

I want to pick up the ticket IGNITE-7369 and add transactions support to our thin client implementation. I've looked at our current implementation and have some proposals to support transactions:

Add new operations to thin client protocol:

    OP_TX_GET, 4000, Get current transaction for client connection
    OP_TX_START, 4001, Start a new transaction
    OP_TX_COMMIT, 4002, Commit transaction
    OP_TX_ROLLBACK, 4003, Rollback transaction
    OP_TX_CLOSE, 4004, Close transaction

From the client side (java) new interfaces will be added:

    public interface ClientTransactions {
        public ClientTransaction txStart();
        public ClientTransaction txStart(TransactionConcurrency concurrency, TransactionIsolation isolation);
        public ClientTransaction txStart(TransactionConcurrency concurrency, TransactionIsolation isolation, long timeout, int txSize);
        public ClientTransaction tx(); // Get current connection transaction
        public ClientTransactions withLabel(String lb);
    }

    public interface ClientTransaction extends AutoCloseable {
        public IgniteUuid xid(); // Do we need it?
        public TransactionIsolation isolation();
        public TransactionConcurrency concurrency();
        public long timeout();
        public String label();

        public void commit();
        public void rollback();
        public void close();
    }

From the server side, I think as a first step (while transactions suspend/resume is not fully implemented) we can use the same approach as for JDBC: add a new worker to each ClientRequestHandler and process requests by this worker if the transaction is started explicitly. ClientRequestHandler is bound to the client connection, so there will be a 1:1 relation between the client connection and the thread which processes operations in a transaction.

Also, there are a couple of issues I want to discuss:

We have the overloaded method txStart with different sets of arguments. Some of the arguments may be missing.
To pass arguments with the OP_TX_START operation we have the following options:

* Serialize the full set of arguments and use some value for missing arguments. For example -1 for int/long types and null for string type. We can't use 0 for int/long types since 0 is a valid value for the concurrency, isolation and timeout arguments.
* Serialize arguments as a collection of property-value pairs (like it's implemented now for CacheConfiguration). In this case only explicitly provided arguments will be serialized.

Which way is better? The simplest solution is to use the first option and I want to use it if there are no objections.

Do we need the transaction id (xid) on the client side?

If yes, we can pass xid along with OP_TX_COMMIT, OP_TX_ROLLBACK, OP_TX_CLOSE operations back to the server and do an additional check on the server side (current transaction id for connection == transaction id passed from client side). This, perhaps, will protect clients against some errors (for example when a client tries to commit an outdated transaction). But currently, we don't have the data type IgniteUuid in the thin client protocol. Do we need to add it too?

Also, we can pass xid as a string just to inform the client and not pass it back to the server with the commit/rollback operation.

Or not pass xid at all (.NET thick client works this way as far as I know).

What do you think?

On Wed, Mar 7, 2018 at 16:22, Vladimir Ozerov wrote:

> We already have transactions support in JDBC driver in TX SQL branch (ignite-4191). Currently it is implemented through separate thread, which is not that efficient. Ideally we need to finish decoupling transactions from threads. But alternatively we can change the logic on how we assign thread ID to specific transaction and "impersonate" thin client worker threads when serving requests from multiple users.
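On the OP_TX_START argument question raised in this message: the first option (always serialize the full argument set, with -1 standing for "not provided") could look roughly like the sketch below. This is only an illustration under assumptions: TxStartArgsSketch, encode and decode are hypothetical names, java.io.DataOutputStream is used for brevity and does not match the real thin client wire format, and the boolean "label present" flag is a stand-in for however the protocol would encode a null string.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class TxStartArgsSketch {
    /** Sentinel for "argument not provided"; 0 cannot be used because it is a valid value. */
    static final int MISSING = -1;

    /** Encodes the full argument set; a null parameter means the caller did not provide it. */
    static byte[] encode(Integer concurrency, Integer isolation, Long timeout, Integer txSize, String label) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);

            out.writeInt(concurrency == null ? MISSING : concurrency);
            out.writeInt(isolation == null ? MISSING : isolation);
            out.writeLong(timeout == null ? MISSING : timeout);
            out.writeInt(txSize == null ? MISSING : txSize);

            // Assumed null marker for the string argument.
            out.writeBoolean(label != null);

            if (label != null)
                out.writeUTF(label);

            return buf.toByteArray();
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Decodes to {concurrency, isolation, timeout, txSize, label}; null marks a missing argument. */
    static Object[] decode(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));

            int concurrency = in.readInt();
            int isolation = in.readInt();
            long timeout = in.readLong();
            int txSize = in.readInt();
            String label = in.readBoolean() ? in.readUTF() : null;

            return new Object[] {
                concurrency == MISSING ? null : concurrency,
                isolation == MISSING ? null : isolation,
                timeout == MISSING ? null : timeout,
                txSize == MISSING ? null : txSize,
                label
            };
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout every OP_TX_START request has the same fixed prefix regardless of which txStart overload was called, which is what makes the first option simpler than property-value pairs; the cost is that -1 can never become a legal value for any of these arguments.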
> > > > On Tue, Mar 6, 2018 at 10:01 PM, Denis Magda wrote: > > > Here is an original discussion with a reference to the JIRA ticket: > > http://apache-ignite-developers.2346864.n4.nabble. > > com/Re-Transaction-operations-using-the-Ignite-Thin-Client- > > Protocol-td25914.html > > > > -- > > Denis > > > > On Tue, Mar 6, 2018 at 9:18 AM, Dmitriy Setrakyan > > > wrote: > > > > > Hi Dmitriy. I don't think we have a design proposal for transaction > > support > > > in thin clients. Do you mind taking this initiative and creating an IEP > > on > > > Wiki? > > > > > > D. > > > > > > On Tue, Mar 6, 2018 at 8:46 AM, Dmitriy Govorukhin < > > > dmitriy.govoruk...@gmail.com> wrote: > > > > > > > Hi, Igniters. > > > > > > > > I've seen a lot of discussions about thin client and binary protocol, > > > but I > > > > did not hear anything about transactions support. Do we have some > draft > > > for > > > > this purpose? > > > > > > > > As I understand we have several problems: > > > > > > > >- thread and transaction have hard related (we use thread-local > > > variable > > > >and thread name) > > > >- we can process only one transaction at the same time in one > thread > > > (it > > > >mean we need
[jira] [Created] (IGNITE-11632) Node can't start if WAL is corrupted and the wal archiver disabled.
Stepachev Maksim created IGNITE-11632:
-----------------------------------------

Summary: Node can't start if WAL is corrupted and the wal archiver disabled.
Key: IGNITE-11632
URL: https://issues.apache.org/jira/browse/IGNITE-11632
Project: Ignite
Issue Type: Bug
Affects Versions: 2.7, 2.6, 2.5
Reporter: Stepachev Maksim
Assignee: Stepachev Maksim
Fix For: 2.7, 2.6, 2.5

If you start a node with the WAL archiver disabled and the last segment page has a wrong CRC, the node stops with an exception.

{code:java}
Caused by: class org.apache.ignite.IgniteCheckedException: Failed to read WAL record at position: 234728337 size: 268435456
    at org.apache.ignite.internal.processors.cache.persistence.wal.serializer.RecordV1Serializer.readWithCrc(RecordV1Serializer.java:394)
    at org.apache.ignite.internal.processors.cache.persistence.wal.serializer.RecordV2Serializer.readRecord(RecordV2Serializer.java:235)
    at org.apache.ignite.internal.processors.cache.persistence.wal.AbstractWalRecordsIterator.advanceRecord(AbstractWalRecordsIterator.java:243)
    ... 23 more
Caused by: class org.apache.ignite.internal.processors.cache.persistence.wal.crc.IgniteDataIntegrityViolationException: val: -202263192 writtenCrc: 0
    at org.apache.ignite.internal.processors.cache.persistence.wal.io.FileInput$Crc32CheckingFileInput.close(FileInput.java:106)
    at org.apache.ignite.internal.processors.cache.persistence.wal.serializer.RecordV1Serializer.readWithCrc(RecordV1Serializer.java:380)
    ... 25 more
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
Re: GridDhtInvalidPartitionException takes the cluster down
Vyacheslav, if you are talking about this particular case I described, I believe it has no influence on PME. What could happen is that the CleanupWorker thread dies (which is not good either). But I believe we are talking in a wider scope.

-- Roman

On Tuesday, March 26, 2019, 10:23:30 p.m. GMT+9, Vyacheslav Daradur wrote:

In general I agree with Andrey, the handler is very useful itself. It allows us to become know that ‘GridDhtInvalidPartitionException’ is not processed properly in PME process by worker.

Nikolay, look at the code, if Failure Handler handles an exception - this means that while-true loop in worker’s body has been interrupted with unexpected exception and thread is completed his lifecycle.

Without Failure Handler, in the current case, the cluster will hang, because of unable to participate in PME process.

So, the problem is the incorrect handling of the exception in PME’s task which should be fixed.

On Tue, Mar 26, 2019 at 14:24, Andrey Kuznetsov wrote:

> Nikolay,
>
> Feel free to suggest better error messages to indicate internal/critical failures. User actions in response to critical failures are rather limited: mail to user-list or maybe file an issue. As for repetitive warnings, it makes sense, but requires additional stuff to deliver such signals, mere spamming to log will not have an effect.
>
> Anyway, when experienced committers suggest to disable failure handling and hide existing issues, I feel as if they are pulling my leg.
>
> Best regards,
> Andrey Kuznetsov.
>
> On Tue, Mar 26, 2019, 13:30 Nikolay Izhikov nizhi...@apache.org wrote:
>
> > Andrey.
> >
> > > the thread can be made non-critical, and we can restart it every time it dies
> >
> > Why we can't restart critical thread?
> > What is the root difference between critical and non critical threads?
> >
> > > It's much simpler to catch and handle all exceptions in critical threads
> >
> > I don't agree with you.
> > We develop Ignite not because it simple!
> > We must spend extra time to made it robust and resilient to the failures. > > > > > Failure handling is a last-chance tool that reveals internal Ignite > > errors > > > 100% agree with you: overcome, but not hide. > > > > Logging stack trace with proper explanation is not hiding. > > Killing nodes and whole cluster is not "handling". > > > > > As far as I see from user-list messages, our users are qualified enough > > to provide necessary information from their cluster-wide logs. > > > > We shouldn't develop our product only for users who are able to read > Ignite > > sources to decrypt the fail reason behind "starvation in stripped pool" > > > > Some of my questions remain unanswered :) : > > > > 1. How user can know it's an Ignite bug? Where this bug should be > reported? > > 2. Do we log it somewhere? > > 3. Do we warn user before shutdown several times? > > 4. "starvation in stripped pool" I think it's not clear error message. > > Let's make it more specific! > > 5. Let's write to the user log - what he or she should do to prevent this > > error in future? > > > > > > вт, 26 мар. 2019 г. в 12:13, Andrey Kuznetsov : > > > > > Nikolay, > > > > > > > Why we can't restart some thread? > > > Technically, we can. It's just matter of design: the thread can be made > > > non-critical, and we can restart it every time it dies. But such design > > > looks poor to me. It's much simpler to catch and handle all exceptions > in > > > critical threads. Failure handling is a last-chance tool that reveals > > > internal Ignite errors. It's not pleasant for us when users see these > > > errors, but it's better than hiding. > > > > > > > Actually, distributed systems are designed to overcome some bugs, > > thread > > > failure, node failure, for example, isn't it? > > > 100% agree with you: overcome, but not hide. > > > > > > > How user can know it's a bug? Where this bug should be reported? 
> > > As far as I see from user-list messages, our users are qualified enough > > to > > > provide necessary information from their cluster-wide logs. > > > > > > > > > вт, 26 мар. 2019 г. в 11:19, Nikolay Izhikov : > > > > > > > Andrey. > > > > > > > > > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no > use > > > to > > > > wait for dead thread's magical resurrection. > > > > > > > > Why is it unrecoverable? > > > > Why we can't restart some thread? > > > > Is there some kind of nature limitation to not restart system thread? > > > > > > > > Actually, distributed systems are designed to overcome some bugs, > > thread > > > > failure, node failure, for example, isn't it? > > > > > if under some circumstances node> stop leads to cascade cluster > > crash, > > > > then it's a bug > > > > > > > > How user can know it's a bug? Where this bug should be reported? > > > > Do we log it somewhere? > > > > Do we warn user before shutdown one or several times? > > > > > > > > This feature kills user experience literally now. > > > > > > > > If I would be a user of
[jira] [Created] (IGNITE-11631) Server node with PDS and SSL fails on start with NPE
Sergey Antonov created IGNITE-11631:
---
Summary: Server node with PDS and SSL fails on start with NPE
Key: IGNITE-11631
URL: https://issues.apache.org/jira/browse/IGNITE-11631
Project: Ignite
Issue Type: Bug
Affects Versions: 2.7
Reporter: Sergey Antonov
Assignee: Sergey Antonov
Fix For: 2.8

The server node fails with an NPE if persistence and SSL are enabled. Stack trace:

{code:java}
java.lang.NullPointerException
    at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.createSocket(TcpDiscoverySpi.java:1565)
    at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.openSocket(TcpDiscoverySpi.java:1503)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl.sendMessageDirectly(ServerImpl.java:1309)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl.sendJoinRequestMessage(ServerImpl.java:1144)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:957)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:422)
    at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2089)
    at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:297)
    at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.start(GridDiscoveryManager.java:940)
    at org.apache.ignite.internal.IgniteKernal.startManager(IgniteKernal.java:1743)
    at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1085)
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:1992)
    at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1683)
    at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1109)
    at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:607)
    at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:984)
    at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:925)
    at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:913)
    at org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:879)
    at org.apache.ignite.testframework.junits.GridAbstractTest$4.call(GridAbstractTest.java:822)
    at org.apache.ignite.testframework.GridTestThread.run(GridTestThread.java:84)
{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-11630) Document changes to SQL views
Vladimir Ozerov created IGNITE-11630:
Summary: Document changes to SQL views
Key: IGNITE-11630
URL: https://issues.apache.org/jira/browse/IGNITE-11630
Project: Ignite
Issue Type: Task
Components: sql
Reporter: Vladimir Ozerov
Assignee: Artem Budnikov
Fix For: 2.8

The following changes were made to our views.

{{CACHE_GROUPS}}
# {{ID}} -> {{CACHE_GROUP_ID}}
# {{GROUP_NAME}} -> {{CACHE_GROUP_NAME}}

{{LOCAL_CACHE_GROUPS_IO}}
# {{GROUP_ID}} -> {{CACHE_GROUP_ID}}
# {{GROUP_NAME}} -> {{CACHE_GROUP_NAME}}

{{CACHES}}
# {{NAME}} -> {{CACHE_NAME}}
# {{GROUP_ID}} -> {{CACHE_GROUP_ID}}
# {{GROUP_NAME}} -> {{CACHE_GROUP_NAME}}

{{INDEXES}}
# {{GROUP_ID}} -> {{CACHE_GROUP_ID}}
# {{GROUP_NAME}} -> {{CACHE_GROUP_NAME}}

{{NODES}}
# {{ID}} -> {{NODE_ID}}

{{TABLES}}
# Added {{CACHE_GROUP_ID}}
# Added {{CACHE_GROUP_NAME}}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
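
For anyone updating queries or documentation against these views, the renames listed in IGNITE-11630 can be kept as a small lookup table. The following is only a sketch: the class name ViewColumnRenames and the method renamed are hypothetical helpers, not part of Ignite, and the table contains nothing beyond the renames listed in the ticket.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not an Ignite class): old -> new column names per SQL view. */
public class ViewColumnRenames {
    /** View name -> (old column name -> new column name), as listed in IGNITE-11630. */
    private static final Map<String, Map<String, String>> RENAMES = new LinkedHashMap<>();

    static {
        put("CACHE_GROUPS", "ID", "CACHE_GROUP_ID");
        put("CACHE_GROUPS", "GROUP_NAME", "CACHE_GROUP_NAME");
        put("LOCAL_CACHE_GROUPS_IO", "GROUP_ID", "CACHE_GROUP_ID");
        put("LOCAL_CACHE_GROUPS_IO", "GROUP_NAME", "CACHE_GROUP_NAME");
        put("CACHES", "NAME", "CACHE_NAME");
        put("CACHES", "GROUP_ID", "CACHE_GROUP_ID");
        put("CACHES", "GROUP_NAME", "CACHE_GROUP_NAME");
        put("INDEXES", "GROUP_ID", "CACHE_GROUP_ID");
        put("INDEXES", "GROUP_NAME", "CACHE_GROUP_NAME");
        put("NODES", "ID", "NODE_ID");
        // TABLES only gained new columns (CACHE_GROUP_ID, CACHE_GROUP_NAME); nothing was renamed there.
    }

    private static void put(String view, String oldCol, String newCol) {
        RENAMES.computeIfAbsent(view, v -> new LinkedHashMap<>()).put(oldCol, newCol);
    }

    /** Returns the new column name, or the given name unchanged if the ticket lists no rename. */
    public static String renamed(String view, String column) {
        Map<String, String> cols = RENAMES.get(view);
        return cols == null ? column : cols.getOrDefault(column, column);
    }

    public static void main(String[] args) {
        // Print every rename, one per line.
        RENAMES.forEach((view, cols) ->
            cols.forEach((oldCol, newCol) -> System.out.println(view + "." + oldCol + " -> " + newCol)));
    }
}
```

Columns that the ticket does not mention are returned unchanged, so the helper can be applied to every column of a query without special cases.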
Re: GridDhtInvalidPartitionException takes the cluster down
In general I agree with Andrey: the handler is very useful in itself. It lets us learn that ‘GridDhtInvalidPartitionException’ is not processed properly by the worker in the PME process. Nikolay, look at the code: if the Failure Handler handles an exception, it means that the while-true loop in the worker’s body has been interrupted by an unexpected exception and the thread has completed its lifecycle. Without the Failure Handler, in the current case, the cluster would hang because of the inability to participate in the PME process. So the problem is the incorrect handling of the exception in the PME task, which should be fixed. Tue, 26 Mar 2019 at 14:24, Andrey Kuznetsov : > Nikolay, > > Feel free to suggest better error messages to indicate internal/critical > failures. User actions in response to critical failures are rather limited: > mail to user-list or maybe file an issue. As for repetitive warnings, it > makes sense, but requires additional stuff to deliver such signals, mere > spamming to log will not have an effect. > > Anyway, when experienced committers suggest to disable failure handling and > hide existing issues, I feel as if they are pulling my leg. > > Best regards, > Andrey Kuznetsov. > > Tue, 26 Mar 2019, 13:30 Nikolay Izhikov nizhi...@apache.org: > > > Andrey. > > > > > the thread can be made non-critical, and we can restart it every time > it > > dies > > > > Why we can't restart critical thread? > > What is the root difference between critical and non critical threads? > > > > > It's much simpler to catch and handle all exceptions in critical > threads > > > > I don't agree with you. > > We develop Ignite not because it simple! > > We must spend extra time to made it robust and resilient to the failures. > > > > > Failure handling is a last-chance tool that reveals internal Ignite > > errors > > > 100% agree with you: overcome, but not hide. > > > > Logging stack trace with proper explanation is not hiding. > > Killing nodes and whole cluster is not "handling". 
> > > > > As far as I see from user-list messages, our users are qualified enough > > to provide necessary information from their cluster-wide logs. > > > > We shouldn't develop our product only for users who are able to read > Ignite > > sources to decrypt the fail reason behind "starvation in stripped pool" > > > > Some of my questions remain unanswered :) : > > > > 1. How user can know it's an Ignite bug? Where this bug should be > reported? > > 2. Do we log it somewhere? > > 3. Do we warn user before shutdown several times? > > 4. "starvation in stripped pool" I think it's not clear error message. > > Let's make it more specific! > > 5. Let's write to the user log - what he or she should do to prevent this > > error in future? > > > > > > вт, 26 мар. 2019 г. в 12:13, Andrey Kuznetsov : > > > > > Nikolay, > > > > > > > Why we can't restart some thread? > > > Technically, we can. It's just matter of design: the thread can be made > > > non-critical, and we can restart it every time it dies. But such design > > > looks poor to me. It's much simpler to catch and handle all exceptions > in > > > critical threads. Failure handling is a last-chance tool that reveals > > > internal Ignite errors. It's not pleasant for us when users see these > > > errors, but it's better than hiding. > > > > > > > Actually, distributed systems are designed to overcome some bugs, > > thread > > > failure, node failure, for example, isn't it? > > > 100% agree with you: overcome, but not hide. > > > > > > > How user can know it's a bug? Where this bug should be reported? > > > As far as I see from user-list messages, our users are qualified enough > > to > > > provide necessary information from their cluster-wide logs. > > > > > > > > > вт, 26 мар. 2019 г. в 11:19, Nikolay Izhikov : > > > > > > > Andrey. > > > > > > > > > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no > use > > > to > > > > wait for dead thread's magical resurrection. 
> > > > > > > > Why is it unrecoverable? > > > > Why we can't restart some thread? > > > > Is there some kind of nature limitation to not restart system thread? > > > > > > > > Actually, distributed systems are designed to overcome some bugs, > > thread > > > > failure, node failure, for example, isn't it? > > > > > if under some circumstances node> stop leads to cascade cluster > > crash, > > > > then it's a bug > > > > > > > > How user can know it's a bug? Where this bug should be reported? > > > > Do we log it somewhere? > > > > Do we warn user before shutdown one or several times? > > > > > > > > This feature kills user experience literally now. > > > > > > > > If I would be a user of the product that just shutdown with poor log > I > > > > would throw this product away. > > > > Do we want it for Ignite? > > > > > > > > From SO discussion I see following error message: ": >>> Possible > > > > starvation in striped pool." > > > > Are you sure this message are clear for Ignite user(not Ignite > hacker)? > > > > What
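
The worker lifecycle discussed in this thread (a while-true loop in the worker body that ends when an unexpected exception escapes, after which a handler is notified) can be sketched in plain Java. This is a simplified illustration only: WorkerSketch, startWorker and the onTermination callback are hypothetical names, not Ignite's actual worker or failure-handler classes, and the real implementation is more involved.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

/** Hypothetical sketch (not Ignite code) of a worker loop that reports its own termination. */
public class WorkerSketch {
    /**
     * Starts a thread that runs the body in a loop. If an unexpected exception escapes
     * the body, the loop ends, the thread finishes, and the callback is notified.
     */
    static Thread startWorker(Runnable body, Consumer<Throwable> onTermination) {
        Thread t = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted())
                    body.run();
            }
            catch (Throwable e) {
                // The loop is already broken here: the worker cannot continue by itself.
                onTermination.accept(e);
            }
        }, "worker-sketch");

        t.start();

        return t;
    }

    /** Runs a worker whose third iteration fails and returns what the callback received. */
    static Throwable runOnce() {
        AtomicReference<Throwable> reported = new AtomicReference<>();
        int[] iterations = {0};

        Thread t = startWorker(() -> {
            if (++iterations[0] == 3)
                throw new IllegalStateException("unexpected failure in iteration 3");
        }, reported::set);

        try {
            t.join();
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        return reported.get();
    }

    public static void main(String[] args) {
        System.out.println("Worker terminated: " + runOnce().getMessage());
    }
}
```

What the callback does is exactly the design question debated here: it could log and stop the node, or it could start a replacement worker. The sketch takes no side and only shows that without a callback the thread would end silently.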
Re: GridDhtInvalidPartitionException takes the cluster down
I do believe failure handling is useful, but it has to be revisited (including the above-mentioned suggestions) because what we have now is not what Ignite promises to do. Disabling it can be a temporary measure until it is improved. Andrey, when you say "hiding", I kind of understand you (even if I don't think we hide), but with the current behavior it's like doing stress tests on users' clusters -- any serious situation/bug can crash the cluster and, in turn, damage trust in Ignite. I think this discussion reveals another problem -- we might need something like Jepsen tests etc., which will hopefully help us find such issues. AFAIK, CockroachDB has been running them for a couple of years. -- Roman On Tuesday, March 26, 2019, 8:24:24 p.m. GMT+9, Andrey Kuznetsov wrote: Nikolay, Feel free to suggest better error messages to indicate internal/critical failures. User actions in response to critical failures are rather limited: mail to user-list or maybe file an issue. As for repetitive warnings, it makes sense, but requires additional stuff to deliver such signals, mere spamming to log will not have an effect. Anyway, when experienced committers suggest to disable failure handling and hide existing issues, I feel as if they are pulling my leg. Best regards, Andrey Kuznetsov. Tue, 26 Mar 2019, 13:30 Nikolay Izhikov nizhi...@apache.org: > Andrey. > > > the thread can be made non-critical, and we can restart it every time it > dies > > Why we can't restart critical thread? > What is the root difference between critical and non critical threads? > > > It's much simpler to catch and handle all exceptions in critical threads > > I don't agree with you. > We develop Ignite not because it simple! > We must spend extra time to made it robust and resilient to the failures. > > > Failure handling is a last-chance tool that reveals internal Ignite > errors > > 100% agree with you: overcome, but not hide. > > Logging stack trace with proper explanation is not hiding. 
> Killing nodes and whole cluster is not "handling". > > > As far as I see from user-list messages, our users are qualified enough > to provide necessary information from their cluster-wide logs. > > We shouldn't develop our product only for users who are able to read Ignite > sources to decrypt the fail reason behind "starvation in stripped pool" > > Some of my questions remain unanswered :) : > > 1. How user can know it's an Ignite bug? Where this bug should be reported? > 2. Do we log it somewhere? > 3. Do we warn user before shutdown several times? > 4. "starvation in stripped pool" I think it's not clear error message. > Let's make it more specific! > 5. Let's write to the user log - what he or she should do to prevent this > error in future? > > > вт, 26 мар. 2019 г. в 12:13, Andrey Kuznetsov : > > > Nikolay, > > > > > Why we can't restart some thread? > > Technically, we can. It's just matter of design: the thread can be made > > non-critical, and we can restart it every time it dies. But such design > > looks poor to me. It's much simpler to catch and handle all exceptions in > > critical threads. Failure handling is a last-chance tool that reveals > > internal Ignite errors. It's not pleasant for us when users see these > > errors, but it's better than hiding. > > > > > Actually, distributed systems are designed to overcome some bugs, > thread > > failure, node failure, for example, isn't it? > > 100% agree with you: overcome, but not hide. > > > > > How user can know it's a bug? Where this bug should be reported? > > As far as I see from user-list messages, our users are qualified enough > to > > provide necessary information from their cluster-wide logs. > > > > > > вт, 26 мар. 2019 г. в 11:19, Nikolay Izhikov : > > > > > Andrey. > > > > > > > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use > > to > > > wait for dead thread's magical resurrection. > > > > > > Why is it unrecoverable? > > > Why we can't restart some thread? 
> > > Is there some kind of nature limitation to not restart system thread? > > > > > > Actually, distributed systems are designed to overcome some bugs, > thread > > > failure, node failure, for example, isn't it? > > > > if under some circumstances node> stop leads to cascade cluster > crash, > > > then it's a bug > > > > > > How user can know it's a bug? Where this bug should be reported? > > > Do we log it somewhere? > > > Do we warn user before shutdown one or several times? > > > > > > This feature kills user experience literally now. > > > > > > If I would be a user of the product that just shutdown with poor log I > > > would throw this product away. > > > Do we want it for Ignite? > > > > > > From SO discussion I see following error message: ": >>> Possible > > > starvation in striped pool." > > > Are you sure this message are clear for Ignite user(not Ignite hacker)? > > > What user should do to prevent this error in future? > > > > > > В Вт, 26/03/2019 в 10:10 +0300, Andrey Kuznetsov
[jira] [Created] (IGNITE-11629) Cassandra dependencies missing from deliverable
Ilya Kasnacheev created IGNITE-11629:
Summary: Cassandra dependencies missing from deliverable
Key: IGNITE-11629
URL: https://issues.apache.org/jira/browse/IGNITE-11629
Project: Ignite
Issue Type: Bug
Components: cassandra
Affects Versions: 2.7
Reporter: Ilya Kasnacheev
Assignee: Ilya Kasnacheev

After IGNITE-9046 we lack an explicit netty-resolver dependency for the ignite-cassandra-store module. This means that tests still run and the project can be made to work by fixing dependencies, but the apache-ignite-bin deliverable's libs/optional/ignite-cassandra-store does not contain all required dependencies, since we only put explicit ones there. We need to add this dependency explicitly and check that it works.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (IGNITE-11628) Document the possibility to use JAR files in UriDeploymentSpi
Denis Mekhanikov created IGNITE-11628:
Summary: Document the possibility to use JAR files in UriDeploymentSpi
Key: IGNITE-11628
URL: https://issues.apache.org/jira/browse/IGNITE-11628
Project: Ignite
Issue Type: Task
Components: documentation
Reporter: Denis Mekhanikov
Assignee: Artem Budnikov
Fix For: 2.8

{{UriDeploymentSpi}} gained support for regular JAR files along with GARs in https://issues.apache.org/jira/browse/IGNITE-11380

This capability should be reflected in the documentation.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: GridDhtInvalidPartitionException takes the cluster down
Nikolay, Feel free to suggest better error messages to indicate internal/critical failures. User actions in response to critical failures are rather limited: mail to user-list or maybe file an issue. As for repetitive warnings, it makes sense, but requires additional stuff to deliver such signals, mere spamming to log will not have an effect. Anyway, when experienced committers suggest to disable failure handling and hide existing issues, I feel as if they are pulling my leg. Best regards, Andrey Kuznetsov. вт, 26 марта 2019, 13:30 Nikolay Izhikov nizhi...@apache.org: > Andrey. > > > the thread can be made non-critical, and we can restart it every time it > dies > > Why we can't restart critical thread? > What is the root difference between critical and non critical threads? > > > It's much simpler to catch and handle all exceptions in critical threads > > I don't agree with you. > We develop Ignite not because it simple! > We must spend extra time to made it robust and resilient to the failures. > > > Failure handling is a last-chance tool that reveals internal Ignite > errors > > 100% agree with you: overcome, but not hide. > > Logging stack trace with proper explanation is not hiding. > Killing nodes and whole cluster is not "handling". > > > As far as I see from user-list messages, our users are qualified enough > to provide necessary information from their cluster-wide logs. > > We shouldn't develop our product only for users who are able to read Ignite > sources to decrypt the fail reason behind "starvation in stripped pool" > > Some of my questions remain unanswered :) : > > 1. How user can know it's an Ignite bug? Where this bug should be reported? > 2. Do we log it somewhere? > 3. Do we warn user before shutdown several times? > 4. "starvation in stripped pool" I think it's not clear error message. > Let's make it more specific! > 5. Let's write to the user log - what he or she should do to prevent this > error in future? > > > вт, 26 мар. 2019 г. 
в 12:13, Andrey Kuznetsov : > > > Nikolay, > > > > > Why we can't restart some thread? > > Technically, we can. It's just matter of design: the thread can be made > > non-critical, and we can restart it every time it dies. But such design > > looks poor to me. It's much simpler to catch and handle all exceptions in > > critical threads. Failure handling is a last-chance tool that reveals > > internal Ignite errors. It's not pleasant for us when users see these > > errors, but it's better than hiding. > > > > > Actually, distributed systems are designed to overcome some bugs, > thread > > failure, node failure, for example, isn't it? > > 100% agree with you: overcome, but not hide. > > > > > How user can know it's a bug? Where this bug should be reported? > > As far as I see from user-list messages, our users are qualified enough > to > > provide necessary information from their cluster-wide logs. > > > > > > вт, 26 мар. 2019 г. в 11:19, Nikolay Izhikov : > > > > > Andrey. > > > > > > > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use > > to > > > wait for dead thread's magical resurrection. > > > > > > Why is it unrecoverable? > > > Why we can't restart some thread? > > > Is there some kind of nature limitation to not restart system thread? > > > > > > Actually, distributed systems are designed to overcome some bugs, > thread > > > failure, node failure, for example, isn't it? > > > > if under some circumstances node> stop leads to cascade cluster > crash, > > > then it's a bug > > > > > > How user can know it's a bug? Where this bug should be reported? > > > Do we log it somewhere? > > > Do we warn user before shutdown one or several times? > > > > > > This feature kills user experience literally now. > > > > > > If I would be a user of the product that just shutdown with poor log I > > > would throw this product away. > > > Do we want it for Ignite? 
> > > > > > From SO discussion I see following error message: ": >>> Possible > > > starvation in striped pool." > > > Are you sure this message are clear for Ignite user(not Ignite hacker)? > > > What user should do to prevent this error in future? > > > > > > В Вт, 26/03/2019 в 10:10 +0300, Andrey Kuznetsov пишет: > > > > By default, SYSTEM_WORKER_BLOCKED failure type is not handled. I > don't > > > like > > > > this behavior, but it may be useful sometimes: "frozen" threads have > a > > > > chance to become active again after load decreases. As for > > > > SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use to > wait > > > for > > > > dead thread's magical resurrection. Then, if under some circumstances > > > node > > > > stop leads to cascade cluster crash, then it's a bug, and it should > be > > > > fixed. Once and for all. Instead of hiding the flaw we have in the > > > product. > > > > > > > > вт, 26 мар. 2019 г. в 09:17, Roman Shtykh >: > > > > > > > > > + 1 for having the default settings revisited. > > > > > I understand Andrey's reasonings, but sometimes
Ignite 2.7.5 Release scope
I suppose this ticket [1] : is very useful too. [1] https://issues.apache.org/jira/browse/IGNITE-10873 [ CorruptedTreeException during simultaneous cache put operations ] > > >--- Forwarded message --- >From: "Alexey Goncharuk" < alexey.goncha...@gmail.com > >To: dev < dev@ignite.apache.org > >Cc: >Subject: Re: Ignite 2.7.5 Release scope >Date: Tue, 26 Mar 2019 13:42:59 +0300 > >Hello Ilya, > >I do not see any issues with the mentioned test. I see the following output >in the logs: > >[21:41:44] : [Step 4/5] [2019-03-22 21:41:44,970][INFO ][main][root] >>> >Stopping test: >TcpDiscoveryCoordinatorFailureTest#testCoordinatorFailedNoAddFinishedMessageStartOneNode >in 37768 ms <<< >[21:41:44] : [Step 4/5] [2019-03-22 21:41:44,971][INFO ][main][root] >>> >Stopping test class: TcpDiscoveryCoordinatorFailureTest <<< >[21:41:44] : [Step 4/5] [2019-03-22 21:41:44,971][INFO ][main][root] >>> >Starting test class: IgniteClientConnectTest <<< > >The issue with Windows may be long connection timeouts, in this case we >should either split the suite into multiple ones or decrease the SPI >timeouts. > >пн, 25 мар. 2019 г. в 11:24, Ilya Kasnacheev < ilya.kasnach...@gmail.com >: > >> Hello! >> >> It seems that I can no longer test this case, on account of >> >> TcpDiscoveryCoordinatorFailureTest#testClusterFailedNewCoordinatorInitialized >> hanging every time under Java 11 on Windows. >> >> Alexey, Ivan, can you please take a look? >> >> >> >> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__ >> >> Regards, >> >> -- >> Ilya Kasnacheev >> >> >> пт, 22 мар. 2019 г. в 16:59, Ilya Kasnacheev < ilya.kasnach...@gmail.com >: >> >> > Hello! >> > >> > Basically there is a test that explicitly highlights this problem, that >> is >> > running SSL tests on Windows + Java 11. They will hang on Master but >> pass >> > with this patch. 
>> > >> > I have started that on TC, results will probably be available later >> today: >> > >> > >> >> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__ >> > (mind the Java version). >> > >> > Regards, >> > -- >> > Ilya Kasnacheev >> > >> > >> > пт, 22 мар. 2019 г. в 14:13, Maxim Muzafarov < maxmu...@gmail.com >: >> > >> >> Dmitry, Ilya, >> >> >> >> Yes, I've looked through those changes [1] as they can affect my local >> >> PR. Basically, changes look good to me. >> >> >> >> I'm not an expert with CommunicationSpi component, so can miss some >> >> details and I haven't tested these changes under Java 11. One more >> >> thing I'd like to say, I would add additional tests to PR that will >> >> explicitly highlight the problem being solved. >> >> >> >> >> >> [1] https://issues.apache.org/jira/browse/IGNITE-11299 >> >> >> >> On Thu, 21 Mar 2019 at 22:57, Dmitriy Pavlov < dpav...@apache.org > >> wrote: >> >> > >> >> > Hi Igniters, >> >> > >> >> > fix https://issues.apache.org/jira/browse/IGNITE-11299 Avoid busy >> wait >> >> on >> >> > processWrite during SSL handshake. >> >> > seems to be blocker cause it is related to Java 11 >> >> > >> >> > I see Maxim M left some comments. Ilya K., Maxim M.were these >> comments >> >> > addressed? >> >> > >> >> > The ticket is in Patch Available. Reviewer needed. Changes located >> in >> >> > GridNioServer. >> >> > >> >> > Sincerely, >> >> > Dmitriy Pavlov >> >> > >> >> > P.S. a quite obvious ticket came to sope, as well: >> >> > https://issues.apache.org/jira/browse/IGNITE-11600 >> >> > >> >> > >> >> > чт, 21 мар. 2019 г. в 16:55, Petr Ivanov < mr.wei...@gmail.com >: >> >> > >> >> > > Huge +1 >> >> > > >> >> > > Will try to add new JDK in nearest time to our Teamcity. 
>> >> > > >> >> > > >> >> > > > On 21 Mar 2019, at 16:27, Dmitriy Pavlov < dpav...@apache.org > >> >> wrote: >> >> > > > >> >> > > > Hi Igniters, >> >> > > > >> >> > > > Meanwhile, Java 12 GA is available. I suggest at least test our >> new >> >> tests >> >> > > > scripts with a couple of Java builds. WDYT? >> >> > > > >> >> > > > Sincerely, >> >> > > > Dmitriy Pavlov >> >> > > > >> >> > > > ср, 20 мар. 2019 г. в 19:21, Dmitriy Pavlov >> < dpav...@apache.org >: >> >> > > > >> >> > > >> Hi Ignite Developers, >> >> > > >> >> >> > > >> In a separate discussion, I've shared a log with all commits. >> >> > > >> >> >> > > >> As far as I can see, nobody removed commits from this sheet, so >> the >> >> > > scope >> >> > > >> of release will be discussed in another way: only explicitly >> >> declared >> >> > > >> commits will be cherry-picked. >> >> > > >> >> >> > > >> Sincerely, >> >> > > >> Dmitriy Pavlov >> >> > > >> >> >> > > >> >> > > >> >> >> > -- Zhenya Stanilovsky
Re: Ignite 2.7.5 Release scope
Hello Ilya, I do not see any issues with the mentioned test. I see the following output in the logs: [21:41:44] : [Step 4/5] [2019-03-22 21:41:44,970][INFO ][main][root] >>> Stopping test: TcpDiscoveryCoordinatorFailureTest#testCoordinatorFailedNoAddFinishedMessageStartOneNode in 37768 ms <<< [21:41:44] : [Step 4/5] [2019-03-22 21:41:44,971][INFO ][main][root] >>> Stopping test class: TcpDiscoveryCoordinatorFailureTest <<< [21:41:44] : [Step 4/5] [2019-03-22 21:41:44,971][INFO ][main][root] >>> Starting test class: IgniteClientConnectTest <<< The issue with Windows may be long connection timeouts, in this case we should either split the suite into multiple ones or decrease the SPI timeouts. пн, 25 мар. 2019 г. в 11:24, Ilya Kasnacheev : > Hello! > > It seems that I can no longer test this case, on account of > > TcpDiscoveryCoordinatorFailureTest#testClusterFailedNewCoordinatorInitialized > hanging every time under Java 11 on Windows. > > Alexey, Ivan, can you please take a look? > > > https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__ > > Regards, > > -- > Ilya Kasnacheev > > > пт, 22 мар. 2019 г. в 16:59, Ilya Kasnacheev : > > > Hello! > > > > Basically there is a test that explicitly highlights this problem, that > is > > running SSL tests on Windows + Java 11. They will hang on Master but pass > > with this patch. > > > > I have started that on TC, results will probably be available later > today: > > > > > https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__ > > (mind the Java version). > > > > Regards, > > -- > > Ilya Kasnacheev > > > > > > пт, 22 мар. 2019 г. в 14:13, Maxim Muzafarov : > > > >> Dmitry, Ilya, > >> > >> Yes, I've looked through those changes [1] as they can affect my local > >> PR. Basically, changes look good to me. 
> >> > >> I'm not an expert with CommunicationSpi component, so can miss some > >> details and I haven't tested these changes under Java 11. One more > >> thing I'd like to say, I would add additional tests to PR that will > >> explicitly highlight the problem being solved. > >> > >> > >> [1] https://issues.apache.org/jira/browse/IGNITE-11299 > >> > >> On Thu, 21 Mar 2019 at 22:57, Dmitriy Pavlov > wrote: > >> > > >> > Hi Igniters, > >> > > >> > fix https://issues.apache.org/jira/browse/IGNITE-11299 Avoid busy > wait > >> on > >> > processWrite during SSL handshake. > >> > seems to be blocker cause it is related to Java 11 > >> > > >> > I see Maxim M left some comments. Ilya K., Maxim M.were these comments > >> > addressed? > >> > > >> > The ticket is in Patch Available. Reviewer needed. Changes located in > >> > GridNioServer. > >> > > >> > Sincerely, > >> > Dmitriy Pavlov > >> > > >> > P.S. a quite obvious ticket came to sope, as well: > >> > https://issues.apache.org/jira/browse/IGNITE-11600 > >> > > >> > > >> > чт, 21 мар. 2019 г. в 16:55, Petr Ivanov : > >> > > >> > > Huge +1 > >> > > > >> > > Will try to add new JDK in nearest time to our Teamcity. > >> > > > >> > > > >> > > > On 21 Mar 2019, at 16:27, Dmitriy Pavlov > >> wrote: > >> > > > > >> > > > Hi Igniters, > >> > > > > >> > > > Meanwhile, Java 12 GA is available. I suggest at least test our > new > >> tests > >> > > > scripts with a couple of Java builds. WDYT? > >> > > > > >> > > > Sincerely, > >> > > > Dmitriy Pavlov > >> > > > > >> > > > ср, 20 мар. 2019 г. в 19:21, Dmitriy Pavlov : > >> > > > > >> > > >> Hi Ignite Developers, > >> > > >> > >> > > >> In a separate discussion, I've shared a log with all commits. > >> > > >> > >> > > >> As far as I can see, nobody removed commits from this sheet, so > the > >> > > scope > >> > > >> of release will be discussed in another way: only explicitly > >> declared > >> > > >> commits will be cherry-picked. 
> >> > > >> > >> > > >> Sincerely, > >> > > >> Dmitriy Pavlov > >> > > >> > >> > > > >> > > > >> > > >
Re: GridDhtInvalidPartitionException takes the cluster down
Andrey. > the thread can be made non-critical, and we can restart it every time it dies Why can't we restart a critical thread? What is the root difference between critical and non-critical threads? > It's much simpler to catch and handle all exceptions in critical threads I don't agree with you. We develop Ignite not because it is simple! We must spend extra time to make it robust and resilient to failures. > Failure handling is a last-chance tool that reveals internal Ignite errors > 100% agree with you: overcome, but not hide. Logging a stack trace with a proper explanation is not hiding. Killing nodes and the whole cluster is not "handling". > As far as I see from user-list messages, our users are qualified enough to provide necessary information from their cluster-wide logs. We shouldn't develop our product only for users who are able to read Ignite sources to decipher the failure reason behind "starvation in striped pool". Some of my questions remain unanswered :) : 1. How can a user know it's an Ignite bug? Where should this bug be reported? 2. Do we log it somewhere? 3. Do we warn the user several times before shutdown? 4. "starvation in striped pool": I think it's not a clear error message. Let's make it more specific! 5. Let's write to the user log what he or she should do to prevent this error in future. Tue, 26 Mar 2019 at 12:13, Andrey Kuznetsov : > Nikolay, > > > Why we can't restart some thread? > Technically, we can. It's just matter of design: the thread can be made > non-critical, and we can restart it every time it dies. But such design > looks poor to me. It's much simpler to catch and handle all exceptions in > critical threads. Failure handling is a last-chance tool that reveals > internal Ignite errors. It's not pleasant for us when users see these > errors, but it's better than hiding. > > > Actually, distributed systems are designed to overcome some bugs, thread > failure, node failure, for example, isn't it? > 100% agree with you: overcome, but not hide. 
> > > How user can know it's a bug? Where this bug should be reported? > As far as I see from user-list messages, our users are qualified enough to > provide necessary information from their cluster-wide logs. > > > вт, 26 мар. 2019 г. в 11:19, Nikolay Izhikov : > > > Andrey. > > > > > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use > to > > wait for dead thread's magical resurrection. > > > > Why is it unrecoverable? > > Why we can't restart some thread? > > Is there some kind of nature limitation to not restart system thread? > > > > Actually, distributed systems are designed to overcome some bugs, thread > > failure, node failure, for example, isn't it? > > > if under some circumstances node> stop leads to cascade cluster crash, > > then it's a bug > > > > How user can know it's a bug? Where this bug should be reported? > > Do we log it somewhere? > > Do we warn user before shutdown one or several times? > > > > This feature kills user experience literally now. > > > > If I would be a user of the product that just shutdown with poor log I > > would throw this product away. > > Do we want it for Ignite? > > > > From SO discussion I see following error message: ": >>> Possible > > starvation in striped pool." > > Are you sure this message are clear for Ignite user(not Ignite hacker)? > > What user should do to prevent this error in future? > > > > В Вт, 26/03/2019 в 10:10 +0300, Andrey Kuznetsov пишет: > > > By default, SYSTEM_WORKER_BLOCKED failure type is not handled. I don't > > like > > > this behavior, but it may be useful sometimes: "frozen" threads have a > > > chance to become active again after load decreases. As for > > > SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use to wait > > for > > > dead thread's magical resurrection. Then, if under some circumstances > > node > > > stop leads to cascade cluster crash, then it's a bug, and it should be > > > fixed. Once and for all. 
Instead of hiding the flaw we have in the > > product. > > > > > > вт, 26 мар. 2019 г. в 09:17, Roman Shtykh : > > > > > > > + 1 for having the default settings revisited. > > > > I understand Andrey's reasonings, but sometimes taking nodes down is > > too > > > > radical (as in my case it was GridDhtInvalidPartitionException which > > could > > > > be ignored for a while when rebalancing <- I might be wrong here). > > > > > > > > -- Roman > > > > > > > > > > > > On Tuesday, March 26, 2019, 2:52:14 p.m. GMT+9, Denis Magda < > > > > dma...@apache.org> wrote: > > > > > > > > pNikolay, > > > > Thanks for kicking off this discussion. Surprisingly, planned to > start > > a > > > > similar one today and incidentally came across this thread. > > > > Agree that the failure handler should be off by default or the > default > > > > settings have to be revisited. That's true that people are > complaining > > of > > > > nodes shutdowns even on moderate workloads. For instance, that's the > > most > > > > recent
[jira] [Created] (IGNITE-11627) Test CheckpointFreeListTest.testRestoreFreeListCorrectlyAfterRandomStop always fails in DiskCompression suite
Anton Kalashnikov created IGNITE-11627:

Summary: Test CheckpointFreeListTest.testRestoreFreeListCorrectlyAfterRandomStop always fails in DiskCompression suite
Key: IGNITE-11627
URL: https://issues.apache.org/jira/browse/IGNITE-11627
Project: Ignite
Issue Type: Bug
Reporter: Anton Kalashnikov
Assignee: Anton Kalashnikov

https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=5828425958400232265=testDetails_IgniteTests24Java8=%3Cdefault%3E

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Ignite 2.7.5 Release scope
Hello!

If you ask me, I vote +0.5 as well. I am not entirely confident, but I answer a huge volume of questions on the user list which boil down to premature SYSTEM_WORKER_TERMINATION.

Regards,
--
Ilya Kasnacheev

Tue, 26 Mar 2019 at 11:24, Dmitriy Pavlov:

> +0.5 from me from release point of view. If community agrees with solution, I can cherry pick fix later.
Re: UriDeploymentSpi and GAR files
Hello!

This looked sensible to me, so I went forward and merged this change.

Regards,
--
Ilya Kasnacheev

Mon, 25 Mar 2019 at 17:59, Denis Mekhanikov:

> Folks,
>
> I prepared a patch for the second ticket:
> https://github.com/apache/ignite/pull/6177
> Ilya is concerned that if you had some JAR files lying next to your GARs in a repository that is referred to via UriDeploymentSpi, then these JARs will now be loaded as well. So, this is a behaviour change.
> I don't think that this is really a problem. I don't see a simple solution to this that wouldn't require an API change, and a complex change would be overkill here.
> Loading what's located in the repository is pretty natural, so you shouldn't be surprised when JARs start loading after an Ignite version upgrade.
>
> What do you think?
>
> Denis
>
> Thu, 21 Feb 2019 at 17:48, Denis Mekhanikov:
>
> > I created the following tickets:
> >
> > https://issues.apache.org/jira/browse/IGNITE-11379 - drop support of GARs
> > https://issues.apache.org/jira/browse/IGNITE-11380 - support JARs
> > https://issues.apache.org/jira/browse/IGNITE-11381 - document ignite.xml file format.
> >
> > Denis
> >
> > Wed, 20 Feb 2019 at 12:30, Nikolay Izhikov:
> >
> > > Hello, Denis.
> > >
> > > > This XML may contain task descriptors, but I couldn't find any documentation on this format.
> > > > This information can be provided in simple JAR files with the same file structure.
> > >
> > > I support your proposal. Let's:
> > >
> > > 1. Support JAR files instead of GAR.
> > > 2. Write down documentation about the XML config format.
> > > 3. Provide some examples.
> > >
> > > Can you create tickets for it?
> > >
> > > Wed, 20 Feb 2019 at 11:49, Denis Mekhanikov:
> > >
> > > > Denis,
> > > >
> > > > This XML may contain task descriptors, but I couldn't find any documentation on this format.
> >> > Also it may contain a userVersion [1] parameter, which can be used to > >> force > >> > tasks redeployment in some cases. > >> > > >> > This information can be provided in simple JAR files with the same > file > >> > structure. > >> > There is no need to confuse people and require their packages to have > a > >> GAR > >> > extension. > >> > > >> > Also if you don't specify the task descriptors, then all tasks in the > >> file > >> > will be registered. > >> > So, I doubt, that anybody will bother specifying the descriptors. XML > is > >> > not very user-friendly. > >> > This piece of configuration doesn't seem necessary to me. > >> > > >> > [1] > >> > > >> > > >> > https://apacheignite.readme.io/docs/deployment-modes#section-un-deployment-and-user-versions > >> > > >> > Denis > >> > > >> > ср, 20 февр. 2019 г. в 01:35, Denis Magda : > >> > > >> > > Denis, > >> > > > >> > > What was the purpose of having XML and other files within the GARs? > >> Guess > >> > > it was somehow versioning related - you might have several tasks of > >> the > >> > > same class but different versions running in a cluster. > >> > > > >> > > - > >> > > Denis > >> > > > >> > > > >> > > On Tue, Feb 19, 2019 at 8:40 AM Ilya Kasnacheev < > >> > ilya.kasnach...@gmail.com > >> > > > > >> > > wrote: > >> > > > >> > > > Hello! > >> > > > > >> > > > Yes, I think we should accept plain JARs if anybody needs this at > >> all. > >> > > > Might still keep meta info support for compatibility. > >> > > > > >> > > > Regards, > >> > > > -- > >> > > > Ilya Kasnacheev > >> > > > > >> > > > > >> > > > вт, 19 февр. 2019 г. в 19:38, Denis Mekhanikov < > >> dmekhani...@gmail.com > >> > >: > >> > > > > >> > > > > Hi! > >> > > > > > >> > > > > There is a feature in Ignite called DeploymentSpi [1], that > allows > >> > > adding > >> > > > > and changing implementation of compute tasks without nodes' > >> downtime. 
> >> > > > > The only usable implementation right now is UriDeploymentSpi > [2], > >> > which > >> > > > > lets you provide classes of compute tasks packaged as an archive > >> of a > >> > > > > special form. And this special form is the worst part. > >> > > > > GAR file is just like a JAR, but with some additional meta info. > >> It > >> > may > >> > > > > contain an XML with description of tasks, a checksum and also > >> > > > dependencies. > >> > > > > > >> > > > > We barely have any tools to build these files, and they can be > >> > replaced > >> > > > > with simple uber-JARs. > >> > > > > The only tool we have right now is IgniteDeploymentGarAntTask, > >> which > >> > is > >> > > > not > >> > > > > documented anywhere, and it's supposed to be used from a > >> > long-forgotten > >> > > > > Apache Ant build system. > >> > > > > > >> > > > > I don't think we need this file format. How about we deprecate > and > >> > > remove > >> > > > > it and make UriDeploymentSpi support plain JARs? > >> > > > > > >> > > > > [1] https://apacheignite.readme.io/docs/deployment-spi > >> > > > > [2] > >> > > > > > >> > > > > > >> > > > > >> > > > >> > > >> >
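The behaviour change discussed in this thread (plain JARs lying next to GARs in a repository start being picked up) can be illustrated with a small sketch. This is not the code from the pull request and not the UriDeploymentSpi implementation: `DeployableArchiveFilter`, `isGar`, `isDeployable` and `select` are hypothetical names, and the only assumption is that archives are selected by file extension.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Hypothetical helper, not part of Ignite: selects deployable archives by file extension. */
class DeployableArchiveFilter {
    /** Old behaviour in this sketch: only GAR archives are accepted. */
    static boolean isGar(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(".gar");
    }

    /** Proposed behaviour in this sketch: plain JARs are accepted as well. */
    static boolean isDeployable(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        return name.endsWith(".gar") || name.endsWith(".jar");
    }

    /** Returns the files from the repository listing that would be loaded. */
    static List<String> select(List<String> repoFiles, boolean acceptJars) {
        return repoFiles.stream()
            .filter(f -> acceptJars ? isDeployable(f) : isGar(f))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> repo = Arrays.asList("tasks.gar", "helper-lib.jar", "README.txt");

        System.out.println("GAR only:    " + select(repo, false));
        System.out.println("GAR and JAR: " + select(repo, true));
    }
}
```

With a listing like the one in `main`, the JAR that used to be skipped is now selected, which is the upgrade surprise Ilya was concerned about.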
Re: GridDhtInvalidPartitionException takes the cluster down
Nikolay,

> Why can't we restart a thread?

Technically, we can. It's just a matter of design: the thread can be made non-critical, and we can restart it every time it dies. But such a design looks poor to me. It's much simpler to catch and handle all exceptions in critical threads. Failure handling is a last-chance tool that reveals internal Ignite errors. It's not pleasant for us when users see these errors, but it's better than hiding.

> Actually, distributed systems are designed to overcome some bugs (thread failure or node failure, for example), aren't they?

100% agree with you: overcome, but not hide.

> How can a user know it's a bug? Where should this bug be reported?

As far as I see from user-list messages, our users are qualified enough to provide necessary information from their cluster-wide logs.

Tue, 26 Mar 2019 at 11:19, Nikolay Izhikov:

> Andrey.
>
> > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use to wait for dead thread's magical resurrection.
>
> Why is it unrecoverable?
> Why can't we restart a thread?
> Is there some natural limitation that prevents restarting a system thread?
>
> Actually, distributed systems are designed to overcome some bugs (thread failure or node failure, for example), aren't they?
>
> > if under some circumstances node stop leads to cascade cluster crash, then it's a bug
>
> How can a user know it's a bug? Where should this bug be reported?
> Do we log it somewhere?
> Do we warn the user once or several times before shutdown?
>
> This feature literally kills the user experience now.
>
> If I were a user of a product that just shut down with a poor log, I would throw this product away.
> Do we want that for Ignite?
>
> In the SO discussion I see the following error message: ">>> Possible starvation in striped pool."
> Are you sure this message is clear to an Ignite user (not an Ignite hacker)?
> What should the user do to prevent this error in the future?
Re: Ignite 2.7.5 Release scope
+0.5 from me from release point of view. If community agrees with solution, I can cherry pick fix later. вт, 26 мар. 2019 г., 8:59 Roman Shtykh : > Andrey, hmm, I don't think putting back the behavior (if it's safe) we > used to have with all those exceptions being logged etc. is hiding. I would > never propose something like that. > Btw, I have fixed the issue. If it looks good let's merge. > > -- Roman > > > On Tuesday, March 26, 2019, 2:46:08 p.m. GMT+9, Andrey Kuznetsov < > stku...@gmail.com> wrote: > > Roman, I think the worst thing we can do is to hide the bug you > discovered. The sane options are either fix it urgently or classify it as > non-critical and postpone. > вт, 26 мар. 2019 г. в 05:13, Roman Shtykh : > > Guys, what do you think about disabling SYSTEM_WORKER_TERMINATION > (introduced with IEP-14) before "cluster shutdown" bugs are fixed, as > suggested by Nikolay I. in "GridDhtInvalidPartitionException takes the > cluster down" thread? > > -- Roman > > > On Tuesday, March 26, 2019, 3:41:29 a.m. GMT+9, Dmitriy Pavlov < > dpav...@apache.org> wrote: > > Hi Ignite Developers, > > So because nobody raised any feature I would like to call for scope freeze > for 2.7.5. > > The scope is limited with corruption fix, Java 11 issues addressed. > https://cwiki.apache.org/confluence/display/IGNITE/Apache+Ignite+2.7.5 > > Also, launch scripts will be tested for Java 12. > > We entered the Rampdown phase. See more info in > https://cwiki.apache.org/confluence/display/IGNITE/Release+Process > > Issues can be added to the scope only through discussion. > > Sincerely, > Dmitriy Pavlov > > пн, 25 мар. 2019 г. в 11:24, Ilya Kasnacheev : > > > Hello! > > > > It seems that I can no longer test this case, on account of > > > > > TcpDiscoveryCoordinatorFailureTest#testClusterFailedNewCoordinatorInitialized > > hanging every time under Java 11 on Windows. > > > > Alexey, Ivan, can you please take a look? 
> > > > > > > https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__ > > > > Regards, > > > > -- > > Ilya Kasnacheev > > > > > > пт, 22 мар. 2019 г. в 16:59, Ilya Kasnacheev >: > > > > > Hello! > > > > > > Basically there is a test that explicitly highlights this problem, that > > is > > > running SSL tests on Windows + Java 11. They will hang on Master but > pass > > > with this patch. > > > > > > I have started that on TC, results will probably be available later > > today: > > > > > > > > > https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__ > > > (mind the Java version). > > > > > > Regards, > > > -- > > > Ilya Kasnacheev > > > > > > > > > пт, 22 мар. 2019 г. в 14:13, Maxim Muzafarov : > > > > > >> Dmitry, Ilya, > > >> > > >> Yes, I've looked through those changes [1] as they can affect my local > > >> PR. Basically, changes look good to me. > > >> > > >> I'm not an expert with CommunicationSpi component, so can miss some > > >> details and I haven't tested these changes under Java 11. One more > > >> thing I'd like to say, I would add additional tests to PR that will > > >> explicitly highlight the problem being solved. > > >> > > >> > > >> [1] https://issues.apache.org/jira/browse/IGNITE-11299 > > >> > > >> On Thu, 21 Mar 2019 at 22:57, Dmitriy Pavlov > > wrote: > > >> > > > >> > Hi Igniters, > > >> > > > >> > fix https://issues.apache.org/jira/browse/IGNITE-11299 Avoid busy > > wait > > >> on > > >> > processWrite during SSL handshake. > > >> > seems to be blocker cause it is related to Java 11 > > >> > > > >> > I see Maxim M left some comments. Ilya K., Maxim M.were these > comments > > >> > addressed? > > >> > > > >> > The ticket is in Patch Available. Reviewer needed. Changes located > in > > >> > GridNioServer. 
> > >> > > > >> > Sincerely, > > >> > Dmitriy Pavlov > > >> > > > >> > P.S. a quite obvious ticket came to sope, as well: > > >> > https://issues.apache.org/jira/browse/IGNITE-11600 > > >> > > > >> > > > >> > чт, 21 мар. 2019 г. в 16:55, Petr Ivanov : > > >> > > > >> > > Huge +1 > > >> > > > > >> > > Will try to add new JDK in nearest time to our Teamcity. > > >> > > > > >> > > > > >> > > > On 21 Mar 2019, at 16:27, Dmitriy Pavlov > > >> wrote: > > >> > > > > > >> > > > Hi Igniters, > > >> > > > > > >> > > > Meanwhile, Java 12 GA is available. I suggest at least test our > > new > > >> tests > > >> > > > scripts with a couple of Java builds. WDYT? > > >> > > > > > >> > > > Sincerely, > > >> > > > Dmitriy Pavlov > > >> > > > > > >> > > > ср, 20 мар. 2019 г. в 19:21, Dmitriy Pavlov >: > > >> > > > > > >> > > >> Hi Ignite Developers, > > >> > > >> > > >> > > >> In a separate discussion, I've shared a log with all commits. > > >> > > >> > > >> > > >> As far as I can see, nobody removed commits from this sheet, so > > the > > >> > > scope > > >> > > >> of release will be discussed in another way: only explicitly >
Re: GridDhtInvalidPartitionException takes the cluster down
Andrey.

> As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use to wait for dead thread's magical resurrection.

Why is it unrecoverable?
Why can't we restart a thread?
Is there some natural limitation that prevents restarting a system thread?

Actually, distributed systems are designed to overcome some bugs (thread failure or node failure, for example), aren't they?

> if under some circumstances node stop leads to cascade cluster crash, then it's a bug

How can a user know it's a bug? Where should this bug be reported?
Do we log it somewhere?
Do we warn the user once or several times before shutdown?

This feature literally kills the user experience now.

If I were a user of a product that just shut down with a poor log, I would throw this product away.
Do we want that for Ignite?

In the SO discussion I see the following error message: ">>> Possible starvation in striped pool."
Are you sure this message is clear to an Ignite user (not an Ignite hacker)?
What should the user do to prevent this error in the future?

On Tue, 26/03/2019 at 10:10 +0300, Andrey Kuznetsov wrote:

> By default, SYSTEM_WORKER_BLOCKED failure type is not handled. I don't like this behavior, but it may be useful sometimes: "frozen" threads have a chance to become active again after load decreases. As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use to wait for dead thread's magical resurrection. Then, if under some circumstances node stop leads to cascade cluster crash, then it's a bug, and it should be fixed. Once and for all. Instead of hiding the flaw we have in the product.
Re: GridDhtInvalidPartitionException takes the cluster down
By default, SYSTEM_WORKER_BLOCKED failure type is not handled. I don't like this behavior, but it may be useful sometimes: "frozen" threads have a chance to become active again after load decreases. As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use to wait for dead thread's magical resurrection. Then, if under some circumstances node stop leads to cascade cluster crash, then it's a bug, and it should be fixed. Once and for all. Instead of hiding the flaw we have in the product. вт, 26 мар. 2019 г. в 09:17, Roman Shtykh : > + 1 for having the default settings revisited. > I understand Andrey's reasonings, but sometimes taking nodes down is too > radical (as in my case it was GridDhtInvalidPartitionException which could > be ignored for a while when rebalancing <- I might be wrong here). > > -- Roman > > > On Tuesday, March 26, 2019, 2:52:14 p.m. GMT+9, Denis Magda < > dma...@apache.org> wrote: > > Nikolay, > Thanks for kicking off this discussion. Surprisingly, planned to start a > similar one today and incidentally came across this thread. > Agree that the failure handler should be off by default or the default > settings have to be revisited. That's true that people are complaining of > nodes shutdowns even on moderate workloads. For instance, that's the most > recent feedback related to slow checkpointing: > https://stackoverflow.com/questions/55299337/stripped-pool-starvation-in-wal-writing-causes-node-cluster-node-failure > > At a minimum, let's consider the following: >- A failure handler needs to provide hints on how to come around the > shutdown in the future. Take the checkpointing SO thread above. It's > unclear from the logs how to prevent the same situation next time (suggest > parameters for tuning, flash drives, etc). >- Is there any protection for a full cluster restart? We need to > distinguish a slow cluster from the stuck one. A node removal should not > lead to a meltdown of the whole storage. 
>- Should we enable the failure handler for things like transactions or > PME and have it off for checkpointing and something else? Let's have it > enabled for cases when we are 100% certain that a node shutdown is the > right thing and print out warnings with suggestions whenever we're not > confident that the removal is appropriate. > --Denis > > On Mon, Mar 25, 2019 at 5:52 AM Andrey Gura wrote: > > Failure handlers were introduced in order to avoid cluster hanging and > they kill nodes instead. > > If critical worker was terminated by GridDhtInvalidPartitionException > then your node is unable to work anymore. > > Unexpected cluster shutdown with reasons in logs that failure handlers > provide is better than hanging. So answer is NO. We mustn't disable > failure handlers. > > On Mon, Mar 25, 2019 at 2:47 PM Roman Shtykh > wrote: > > > > If it sticks to the behavior we had before introducing failure handler, > I think it's better to have disabled instead of killing the whole cluster, > as in my case, and create a parent issue for those ten bugs.Pavel, thanks > for the suggestion! > > > > > > > > On Monday, March 25, 2019, 7:07:20 p.m. GMT+9, Nikolay Izhikov < > nizhi...@apache.org> wrote: > > > > Guys. > > > > We should fix the SYSTEM_WORKER_TERMINATION once and for all. > > Seems, we have ten or more "cluster shutdown" bugs with this subsystem > > since it was introduced. > > > > Should we disable it by default in 2.7.5? > > > > > > пн, 25 мар. 2019 г. в 13:04, Pavel Kovalenko : > > > > > Hi Roman, > > > > > > I think this InvalidPartition case can be simply handled > > > in GridCacheTtlManager.expire method. > > > For workaround a custom FailureHandler can be configured that will not > stop > > > a node in case of such exception is thrown. > > > > > > пн, 25 мар. 2019 г. 
в 08:38, Roman Shtykh : > > > > > > > Igniters, > > > > > > > > Restarting a node when injecting data and having it expired, results > at > > > > GridDhtInvalidPartitionException which terminates nodes with > > > > SYSTEM_WORKER_TERMINATION one by one taking the whole cluster down. > This > > > is > > > > really bad and I didn't find the way to save the cluster from > > > disappearing. > > > > I created a JIRA issue > > > https://issues.apache.org/jira/browse/IGNITE-11620 > > > > with a test case. Any clues how to fix this inconsistency when > > > rebalancing? > > > > > > > > -- Roman > > > > > > > > > -- Best regards, Andrey Kuznetsov.
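Pavel's workaround quoted above (a custom FailureHandler that does not stop the node when this particular exception is thrown) could look roughly like the sketch below. To keep it self-contained, `FailureType`, `FailureContext` and `InvalidPartitionError` are simplified stand-ins declared in the snippet, not the real Ignite classes, and `TolerantFailureHandler` is a hypothetical name. As far as I know, a real handler would implement Ignite's `FailureHandler` interface and be registered through `IgniteConfiguration.setFailureHandler`; treat that wiring as an assumption to check against the Ignite version in use.

```java
import java.util.EnumSet;
import java.util.Set;

/** Stand-in for Ignite's failure type enum (only the values mentioned in this thread). */
enum FailureType { SYSTEM_WORKER_TERMINATION, SYSTEM_WORKER_BLOCKED }

/** Stand-in for Ignite's failure context: what failed and why. */
final class FailureContext {
    final FailureType type;
    final Throwable error;

    FailureContext(FailureType type, Throwable error) {
        this.type = type;
        this.error = error;
    }
}

/** Stand-in for GridDhtInvalidPartitionException. */
class InvalidPartitionError extends RuntimeException {
    InvalidPartitionError(String msg) {
        super(msg);
    }
}

/** Sketch: tolerate invalid-partition failures (log and continue), stop the node otherwise. */
class TolerantFailureHandler {
    /** Types ignored outright; the thread states SYSTEM_WORKER_BLOCKED is not handled by default. */
    private final Set<FailureType> ignored = EnumSet.of(FailureType.SYSTEM_WORKER_BLOCKED);

    /** Returns true if the node should be stopped, false if the failure is tolerated. */
    boolean onFailure(FailureContext ctx) {
        if (ignored.contains(ctx.type))
            return false;

        // The exception of interest may be wrapped, so walk the whole cause chain.
        for (Throwable t = ctx.error; t != null; t = t.getCause()) {
            if (t instanceof InvalidPartitionError) {
                System.err.println("Tolerated failure, node kept alive: " + t);
                return false;
            }
        }

        return true;
    }

    public static void main(String[] args) {
        TolerantFailureHandler hnd = new TolerantFailureHandler();

        FailureContext wrapped = new FailureContext(FailureType.SYSTEM_WORKER_TERMINATION,
            new RuntimeException(new InvalidPartitionError("partition is not available")));

        System.out.println("Stop node? " + hnd.onFailure(wrapped));
    }
}
```

Note that this only masks the symptom, which is the objection raised by Andrey Kuznetsov and Andrey Gura: the worker that threw the exception is still dead, so such a handler is a stopgap until the underlying bug is fixed.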
Re: Review IGNITE-11411 'Remove tearDown, setUp from JUnit3TestLegacySupport'
Ivan,

I noticed that you updated PR [1] recently and changed the execution flow of the setUp and tearDown methods in GridAbstractTest, making it similar to what we have in master now. What did not work in the initial implementation? I spent some time searching for the reason why we introduced JUnit3TestLegacySupport and ran into trouble. If we have some special case here, it sounds like a good idea to add the necessary comments in the code.

[1] https://github.com/apache/ignite/pull/6227

Tue, 19 Mar 2019 at 11:59, Ivan Fedotov:

> Hi Eduard.
>
> Thank you for your participation in the review. In case of any questions feel free to ask me.
>
> Tue, 19 Mar 2019 at 11:04, Eduard Shangareev:
>
> > Hi.
> >
> > I am interested. If nobody has done it, I will do it next week.
> >
> > On Tue, Mar 19, 2019 at 10:20 AM Ivan Fedotov wrote:
> >
> > > Hi Igniters!
> > >
> > > I am now working on IEP-30 [1], which covers the full JUnit 4 -> 5 migration and includes some items related to the JUnit 3 -> 4 migration.
> > > I am at the first stage and am finishing the ticket about removing tearDown and setUp from JUnit3TestLegacySupport [2].
> > >
> > > In a nutshell: I removed setUp and tearDown from JUnit3TestLegacySupport and replaced them with beforeTest and afterTest in the tests where they are used. That brings us to the JUnit 5 test scenario, because setUp and tearDown are used under the Rule annotation in GridAbstractTest.
> > >
> > > Could somebody review this ticket, please?
> > >
> > > [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-30%3A+Migration+to+JUnit+5
> > > [2] https://issues.apache.org/jira/browse/IGNITE-11411
> > >
> > > --
> > > Ivan Fedotov.
> > >
> > > ivanan...@gmail.com
>
> --
> Ivan Fedotov.
>
> ivanan...@gmail.com

--
Best regards,
Ivan Pavlukhin
Re: GridDhtInvalidPartitionException takes the cluster down
+1 for having the default settings revisited. I understand Andrey's reasoning, but sometimes taking nodes down is too radical (in my case it was a GridDhtInvalidPartitionException, which could probably be ignored for a while during rebalancing; I might be wrong here). -- Roman On Tuesday, March 26, 2019, 2:52:14 p.m. GMT+9, Denis Magda wrote: Nikolay, Thanks for kicking off this discussion. Surprisingly, I planned to start a similar one today and incidentally came across this thread. I agree that the failure handler should be off by default or the default settings have to be revisited. It's true that people are complaining about node shutdowns even on moderate workloads. For instance, this is the most recent feedback related to slow checkpointing: https://stackoverflow.com/questions/55299337/stripped-pool-starvation-in-wal-writing-causes-node-cluster-node-failure At a minimum, let's consider the following: - A failure handler needs to provide hints on how to avoid the shutdown in the future. Take the checkpointing SO thread above. It's unclear from the logs how to prevent the same situation next time (suggest parameters for tuning, flash drives, etc). - Is there any protection against a full cluster restart? We need to distinguish a slow cluster from a stuck one. A node removal should not lead to a meltdown of the whole storage. - Should we enable the failure handler for things like transactions or PME and have it off for checkpointing and something else? Let's have it enabled for cases when we are 100% certain that a node shutdown is the right thing and print out warnings with suggestions whenever we're not confident that the removal is appropriate. --Denis On Mon, Mar 25, 2019 at 5:52 AM Andrey Gura wrote: Failure handlers were introduced in order to avoid cluster hanging, and they kill nodes instead. If a critical worker was terminated by GridDhtInvalidPartitionException, then your node is unable to work anymore.
An unexpected cluster shutdown, with the reasons that failure handlers provide in the logs, is better than hanging. So the answer is NO. We mustn't disable failure handlers. On Mon, Mar 25, 2019 at 2:47 PM Roman Shtykh wrote: > > If it sticks to the behavior we had before introducing failure handler, I > think it's better to have disabled instead of killing the whole cluster, as > in my case, and create a parent issue for those ten bugs. Pavel, thanks for > the suggestion! > > > > On Monday, March 25, 2019, 7:07:20 p.m. GMT+9, Nikolay Izhikov > wrote: > > Guys. > > We should fix the SYSTEM_WORKER_TERMINATION once and for all. > Seems, we have ten or more "cluster shutdown" bugs with this subsystem > since it was introduced. > > Should we disable it by default in 2.7.5? > > > Mon, Mar 25, 2019 at 13:04, Pavel Kovalenko : > > > Hi Roman, > > > > I think this InvalidPartition case can be simply handled > > in GridCacheTtlManager.expire method. > > For workaround a custom FailureHandler can be configured that will not stop > > a node in case of such exception is thrown. > > > > Mon, Mar 25, 2019 at 08:38, Roman Shtykh : > > > > > Igniters, > > > > > > Restarting a node when injecting data and having it expired, results at > > > GridDhtInvalidPartitionException which terminates nodes with > > > SYSTEM_WORKER_TERMINATION one by one taking the whole cluster down. This > > is > > > really bad and I didn't find the way to save the cluster from > > disappearing. > > > I created a JIRA issue > > https://issues.apache.org/jira/browse/IGNITE-11620 > > > with a test case. Any clues how to fix this inconsistency when > > rebalancing? > > > > > > -- Roman > > > > >
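For reference, here is a minimal sketch of the workaround Pavel mentions: a handler that does not stop the node when one particular exception type shows up in the cause chain, and delegates everything else to the usual stop-the-node behavior. To keep it self-contained it does not use Ignite classes at all. NodeFailureHandler, IgnoringHandler and InvalidPartitionException are hypothetical stand-ins invented for this illustration, so the same decision logic would still have to be adapted to the real FailureHandler interface and configuration.

```java
public class SelectiveFailureHandlerSketch {
    /** Hypothetical stand-in for the exception discussed in the thread (not an Ignite class). */
    static class InvalidPartitionException extends RuntimeException {
        InvalidPartitionException(String msg) {
            super(msg);
        }
    }

    /** Hypothetical stand-in for a failure handler: returns true if the node must be stopped. */
    interface NodeFailureHandler {
        boolean onFailure(Throwable error);
    }

    /** Ignores failures caused by one exception type and delegates all others. */
    static class IgnoringHandler implements NodeFailureHandler {
        private final Class<? extends Throwable> ignored;
        private final NodeFailureHandler delegate;

        IgnoringHandler(Class<? extends Throwable> ignored, NodeFailureHandler delegate) {
            this.ignored = ignored;
            this.delegate = delegate;
        }

        @Override
        public boolean onFailure(Throwable error) {
            // Walk the cause chain: the interesting exception is often wrapped.
            for (Throwable t = error; t != null; t = t.getCause()) {
                if (ignored.isInstance(t)) {
                    System.err.println("Ignoring failure, node keeps running: " + t);
                    return false;
                }
            }
            // Anything else is handled the default way.
            return delegate.onFailure(error);
        }
    }

    public static void main(String[] args) {
        NodeFailureHandler stopAlways = err -> true;
        NodeFailureHandler handler =
            new IgnoringHandler(InvalidPartitionException.class, stopAlways);

        boolean stopOnOther = handler.onFailure(new IllegalStateException("boom"));
        boolean stopOnInvalidPart = handler.onFailure(
            new RuntimeException(new InvalidPartitionException("part=1")));

        System.out.println("stop on other failure: " + stopOnOther);
        System.out.println("stop on invalid partition: " + stopOnInvalidPart);
    }
}
```

Note that this only masks the symptom. As Andrey points out, a node whose critical worker has terminated may not be able to work anymore, so a handler like this is at best a stopgap until the expire path itself handles the exception.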