Re: Thin client: transactions support

2019-03-26 Thread Alex Plehanov
Sergey, yes, close is essentially a silent rollback. But we could also
implement this on the client side, just by calling rollback and ignoring
errors in the response.
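
For illustration, here is a minimal client-side sketch of that idea. It assumes
the ClientTransaction API proposed later in this thread; the "committed" flag and
the use of ClientException are assumptions for illustration, not part of the
proposal:

    // Hypothetical client-side close(): effectively a silent rollback.
    public void close() {
        if (committed)      // commit() already succeeded - nothing to do.
            return;

        try {
            rollback();     // Otherwise roll back the server-side transaction.
        }
        catch (ClientException ignored) {
            // Ignore errors in the response, e.g. if the transaction has
            // already been completed or the connection was lost.
        }
    }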

On Wed, Mar 27, 2019 at 00:04, Sergey Kozlov wrote:

> Nikolay
>
> Am I correctly understand you points:
>
>- close: rollback
>- commit, close: do nothing
>- rollback, close: do what? (I suppose nothing)
>
> Also you assume that after commit/rollback we may need to free some
> resources on server node(s)or just do on client started TX?
>
>
>
> On Tue, Mar 26, 2019 at 10:41 PM Alex Plehanov 
> wrote:
>
> > Sergey, we have the close() method in the thick client, it's behavior is
> > slightly different than rollback() method (it should rollback if the
> > transaction is not committed and do nothing if the transaction is already
> > committed). I think we should support try-with-resource semantics in the
> > thin client and OP_TX_CLOSE will be useful here.
> >
> > Nikolay, suspend/resume didn't work yet for pessimistic transactions.
> Also,
> > the main goal of suspend/resume operations is to support transaction
> > passing between threads. In the thin client, the transaction is bound to
> > the client connection, not client thread. I think passing a transaction
> > between different client connections is not a very useful case.
> >
> > вт, 26 мар. 2019 г. в 22:17, Nikolay Izhikov :
> >
> > > Hello, Alex.
> > >
> > > We also have suspend and resume operations.
> > > I think we should support them
> > >
> > > вт, 26 марта 2019 г., 22:07 Sergey Kozlov :
> > >
> > > > Hi
> > > >
> > > > Looks like I missed something but why we need OP_TX_CLOSE operation?
> > > >
> > > > Also I suggest to reserve a code for SAVEPOINT operation which very
> > > useful
> > > > to understand where transaction has been rolled back
> > > >
> > > > On Tue, Mar 26, 2019 at 6:07 PM Alex Plehanov <
> plehanov.a...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Hello Igniters!
> > > > >
> > > > > I want to pick up the ticket IGNITE-7369 and add transactions
> support
> > > to
> > > > > our thin client implementation.
> > > > > I've looked at our current implementation and have some proposals
> to
> > > > > support transactions:
> > > > >
> > > > > Add new operations to thin client protocol:
> > > > >
> > > > > OP_TX_GET, 4000, Get current transaction for client connection
> > > > > OP_TX_START, 4001, Start a new transaction
> > > > > OP_TX_COMMIT, 4002, Commit transaction
> > > > > OP_TX_ROLLBACK, 4003, Rollback transaction
> > > > > OP_TX_CLOSE, 4004, Close transaction
> > > > >
> > > > > From the client side (java) new interfaces will be added:
> > > > >
> > > > > public interface ClientTransactions {
> > > > > public ClientTransaction txStart();
> > > > > public ClientTransaction txStart(TransactionConcurrency
> > > concurrency,
> > > > > TransactionIsolation isolation);
> > > > > public ClientTransaction txStart(TransactionConcurrency
> > > concurrency,
> > > > > TransactionIsolation isolation, long timeout, int txSize);
> > > > > public ClientTransaction tx(); // Get current connection
> > > transaction
> > > > > public ClientTransactions withLabel(String lb);
> > > > > }
> > > > >
> > > > > public interface ClientTransaction extends AutoCloseable {
> > > > > public IgniteUuid xid(); // Do we need it?
> > > > > public TransactionIsolation isolation();
> > > > > public TransactionConcurrency concurrency();
> > > > > public long timeout();
> > > > > public String label();
> > > > >
> > > > > public void commit();
> > > > > public void rollback();
> > > > > public void close();
> > > > > }
> > > > >
> > > > > From the server side, I think as a first step (while transactions
> > > > > suspend/resume is not fully implemented) we can use the same
> approach
> > > as
> > > > > for JDBC: add a new worker to each ClientRequestHandler and process
> > > > > requests by this worker if the transaction is started explicitly.
> > > > > ClientRequestHandler is bound to client connection, so there will
> be
> > > 1:1
> > > > > relation between client connection and thread, which process
> > operations
> > > > in
> > > > > a transaction.
> > > > >
> > > > > Also, there is a couple of issues I want to discuss:
> > > > >
> > > > > We have overloaded method txStart with a different set of
> arguments.
> > > Some
> > > > > of the arguments may be missing. To pass arguments with OP_TX_START
> > > > > operation we have the next options:
> > > > >  * Serialize full set of arguments and use some value for missing
> > > > > arguments. For example -1 for int/long types and null for string
> > type.
> > > We
> > > > > can't use 0 for int/long types since 0 it's a valid value for
> > > > concurrency,
> > > > > isolation and timeout arguments.
> > > > >  * Serialize arguments as a collection of property-value pairs
> (like
> > > it's
> > > > > implemented now for CacheConfiguration). In this case only
> explicitly
> > > > > 

Re: GridDhtInvalidPartitionException takes the cluster down

2019-03-26 Thread Denis Magda
Folks, thanks for sharing details and inputs. This is helpful. Since I
spend a lot of time working with Ignite users, I'll look into this topic in
a couple of days and propose some changes. In the meantime, here is a fresh
report on the user list:
http://apache-ignite-users.70518.x6.nabble.com/Triggering-Rebalancing-Programmatically-get-error-while-requesting-td27651.html


-
Denis


On Tue, Mar 26, 2019 at 9:04 AM Andrey Gura  wrote:

> CleanupWorker termination can lead to the following effects:
>
> - Queries can retrieve data that have to expired so application will
> behave incorrectly.
> - Memory and/or disc can be overflowed because entries weren't expired.
> - Performance degradation is possible due to unmanageable data set grows.
>
> On Tue, Mar 26, 2019 at 4:58 PM Roman Shtykh 
> wrote:
> >
> > Vyacheslav, if you are talking about this particular case I described, I
> believe it has no influence on PME. What could happen is having
> CleanupWorker thread dead (which is not good too).But I believe we are
> talking in a wider scope.
> >
> > -- Roman
> >
> >
> > On Tuesday, March 26, 2019, 10:23:30 p.m. GMT+9, Vyacheslav Daradur <
> daradu...@gmail.com> wrote:
> >
> >  In general I agree with Andrey, the handler is very useful in itself. It
> > allows us to learn that ‘GridDhtInvalidPartitionException’ is not
> > processed properly in the PME process by the worker.
> >
> > Nikolay, look at the code: if the Failure Handler handles an exception, this
> > means that the while-true loop in the worker’s body has been interrupted by an
> > unexpected exception and the thread has completed its lifecycle.
> >
> > Without the Failure Handler, in the current case, the cluster will hang,
> > because it is unable to participate in the PME process.
> >
> > So, the problem is the incorrect handling of the exception in PME’s task,
> > which should be fixed.
> >
> >
> > вт, 26 марта 2019 г. в 14:24, Andrey Kuznetsov :
> >
> > > Nikolay,
> > >
> > > Feel free to suggest better error messages to indicate
> internal/critical
> > > failures. User actions in response to critical failures are rather
> limited:
> > > mail to user-list or maybe file an issue. As for repetitive warnings,
> it
> > > makes sense, but requires additional stuff to deliver such signals,
> mere
> > > spamming to log will not have an effect.
> > >
> > > Anyway, when experienced committers suggest to disable failure
> handling and
> > > hide existing issues, I feel as if they are pulling my leg.
> > >
> > > Best regards,
> > > Andrey Kuznetsov.
> > >
> > > вт, 26 марта 2019, 13:30 Nikolay Izhikov nizhi...@apache.org:
> > >
> > > > Andrey.
> > > >
> > > > >  the thread can be made non-critical, and we can restart it every
> time
> > > it
> > > > dies
> > > >
> > > > Why we can't restart critical thread?
> > > > What is the root difference between critical and non critical
> threads?
> > > >
> > > > > It's much simpler to catch and handle all exceptions in critical
> > > threads
> > > >
> > > > I don't agree with you.
> > > > We develop Ignite not because it simple!
> > > > We must spend extra time to made it robust and resilient to the
> failures.
> > > >
> > > > > Failure handling is a last-chance tool that reveals internal Ignite
> > > > errors
> > > > > 100% agree with you: overcome, but not hide.
> > > >
> > > > Logging stack trace with proper explanation is not hiding.
> > > > Killing nodes and whole cluster is not "handling".
> > > >
> > > > > As far as I see from user-list messages, our users are qualified
> enough
> > > > to provide necessary information from their cluster-wide logs.
> > > >
> > > > We shouldn't develop our product only for users who are able to read
> > > Ignite
> > > > sources to decrypt the fail reason behind "starvation in stripped
> pool"
> > > >
> > > > Some of my questions remain unanswered :) :
> > > >
> > > > 1. How user can know it's an Ignite bug? Where this bug should be
> > > reported?
> > > > 2. Do we log it somewhere?
> > > > 3. Do we warn user before shutdown several times?
> > > > 4. "starvation in stripped pool" I think it's not clear error
> message.
> > > > Let's make it more specific!
> > > > 5. Let's write to the user log - what he or she should do to prevent
> this
> > > > error in future?
> > > >
> > > >
> > > > вт, 26 мар. 2019 г. в 12:13, Andrey Kuznetsov :
> > > >
> > > > > Nikolay,
> > > > >
> > > > > >  Why we can't restart some thread?
> > > > > Technically, we can. It's just matter of design: the thread can be
> made
> > > > > non-critical, and we can restart it every time it dies. But such
> design
> > > > > looks poor to me. It's much simpler to catch and handle all
> exceptions
> > > in
> > > > > critical threads. Failure handling is a last-chance tool that
> reveals
> > > > > internal Ignite errors. It's not pleasant for us when users see
> these
> > > > > errors, but it's better than hiding.
> > > > >
> > > > > >  Actually, distributed systems are designed to overcome some
> bugs,
> > > > thread
> > > > > 

[jira] [Created] (IGNITE-11634) SQL delete query failed to deserialize DmlStatementsProcessor$ModifyingEntryProcessor

2019-03-26 Thread Roman Guseinov (JIRA)
Roman Guseinov created IGNITE-11634:
---

 Summary: SQL delete query failed to deserialize 
DmlStatementsProcessor$ModifyingEntryProcessor
 Key: IGNITE-11634
 URL: https://issues.apache.org/jira/browse/IGNITE-11634
 Project: Ignite
  Issue Type: Bug
  Components: sql
Affects Versions: 2.7
Reporter: Roman Guseinov
Assignee: Roman Guseinov


Here is a stack trace
{code:java}
Exception in thread "main" javax.cache.CacheException: Failed to deserialize 
object 
[typeName=org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor$ModifyingEntryProcessor]
at 
org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.query(IgniteCacheProxyImpl.java:635)
at 
org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.query(IgniteCacheProxyImpl.java:574)
at 
org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.query(GatewayProtectedCacheProxy.java:356)
at 
org.gridgain.reproducers.sql.JavaSqlClient.deleteRow(JavaSqlClient.java:42)
at org.gridgain.reproducers.sql.JavaSqlClient.run(JavaSqlClient.java:33)
at 
org.gridgain.reproducers.sql.JavaSqlClient.main(JavaSqlClient.java:28)
Caused by: class 
org.apache.ignite.internal.processors.query.IgniteSQLException: Failed to 
deserialize object 
[typeName=org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor$ModifyingEntryProcessor]
at 
org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor.doDelete(DmlStatementsProcessor.java:686)
at 
org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor.processDmlSelectResult(DmlStatementsProcessor.java:587)
at 
org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor.executeUpdateStatement(DmlStatementsProcessor.java:539)
at 
org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor.updateSqlFields(DmlStatementsProcessor.java:171)
at 
org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor.updateSqlFieldsDistributed(DmlStatementsProcessor.java:345)
at 
org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.doRunPrepared(IgniteH2Indexing.java:1753)
at 
org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.querySqlFields(IgniteH2Indexing.java:1718)
at 
org.apache.ignite.internal.processors.query.GridQueryProcessor$3.applyx(GridQueryProcessor.java:2007)
at 
org.apache.ignite.internal.processors.query.GridQueryProcessor$3.applyx(GridQueryProcessor.java:2002)
at 
org.apache.ignite.internal.util.lang.IgniteOutClosureX.apply(IgniteOutClosureX.java:36)
at 
org.apache.ignite.internal.processors.query.GridQueryProcessor.executeQuery(GridQueryProcessor.java:2550)
at 
org.apache.ignite.internal.processors.query.GridQueryProcessor.querySqlFields(GridQueryProcessor.java:2016)
at 
org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.query(IgniteCacheProxyImpl.java:623)
... 5 more
Caused by: java.sql.SQLException: Failed to deserialize object 
[typeName=org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor$ModifyingEntryProcessor]
at 
org.apache.ignite.internal.processors.query.h2.dml.DmlBatchSender.processPage(DmlBatchSender.java:225)
at 
org.apache.ignite.internal.processors.query.h2.dml.DmlBatchSender.sendBatch(DmlBatchSender.java:184)
at 
org.apache.ignite.internal.processors.query.h2.dml.DmlBatchSender.flush(DmlBatchSender.java:135)
at 
org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor.doDelete(DmlStatementsProcessor.java:668)
... 17 more
Caused by: class org.apache.ignite.IgniteCheckedException: Failed to 
deserialize object 
[typeName=org.apache.ignite.internal.processors.query.h2.DmlStatementsProcessor$ModifyingEntryProcessor]
at 
org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:10045)
at 
org.apache.ignite.internal.processors.cache.GridCacheMessage.unmarshalCollection(GridCacheMessage.java:650)
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicFullUpdateRequest.finishUnmarshal(GridNearAtomicFullUpdateRequest.java:405)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.unmarshall(GridCacheIoManager.java:1609)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:586)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:392)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:318)
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:109)
at 

Re: Thin client: transactions support

2019-03-26 Thread Sergey Kozlov
Nikolay

Do I understand your points correctly:

   - close: rollback
   - commit, close: do nothing
   - rollback, close: do what? (I suppose nothing)

Also, do you assume that after commit/rollback we may need to free some
resources on the server node(s), or only on the client that started the TX?



On Tue, Mar 26, 2019 at 10:41 PM Alex Plehanov 
wrote:

> Sergey, we have the close() method in the thick client, it's behavior is
> slightly different than rollback() method (it should rollback if the
> transaction is not committed and do nothing if the transaction is already
> committed). I think we should support try-with-resource semantics in the
> thin client and OP_TX_CLOSE will be useful here.
>
> Nikolay, suspend/resume didn't work yet for pessimistic transactions. Also,
> the main goal of suspend/resume operations is to support transaction
> passing between threads. In the thin client, the transaction is bound to
> the client connection, not client thread. I think passing a transaction
> between different client connections is not a very useful case.
>
> вт, 26 мар. 2019 г. в 22:17, Nikolay Izhikov :
>
> > Hello, Alex.
> >
> > We also have suspend and resume operations.
> > I think we should support them
> >
> > вт, 26 марта 2019 г., 22:07 Sergey Kozlov :
> >
> > > Hi
> > >
> > > Looks like I missed something but why we need OP_TX_CLOSE operation?
> > >
> > > Also I suggest to reserve a code for SAVEPOINT operation which very
> > useful
> > > to understand where transaction has been rolled back
> > >
> > > On Tue, Mar 26, 2019 at 6:07 PM Alex Plehanov  >
> > > wrote:
> > >
> > > > Hello Igniters!
> > > >
> > > > I want to pick up the ticket IGNITE-7369 and add transactions support
> > to
> > > > our thin client implementation.
> > > > I've looked at our current implementation and have some proposals to
> > > > support transactions:
> > > >
> > > > Add new operations to thin client protocol:
> > > >
> > > > OP_TX_GET, 4000, Get current transaction for client connection
> > > > OP_TX_START, 4001, Start a new transaction
> > > > OP_TX_COMMIT, 4002, Commit transaction
> > > > OP_TX_ROLLBACK, 4003, Rollback transaction
> > > > OP_TX_CLOSE, 4004, Close transaction
> > > >
> > > > From the client side (java) new interfaces will be added:
> > > >
> > > > public interface ClientTransactions {
> > > > public ClientTransaction txStart();
> > > > public ClientTransaction txStart(TransactionConcurrency
> > concurrency,
> > > > TransactionIsolation isolation);
> > > > public ClientTransaction txStart(TransactionConcurrency
> > concurrency,
> > > > TransactionIsolation isolation, long timeout, int txSize);
> > > > public ClientTransaction tx(); // Get current connection
> > transaction
> > > > public ClientTransactions withLabel(String lb);
> > > > }
> > > >
> > > > public interface ClientTransaction extends AutoCloseable {
> > > > public IgniteUuid xid(); // Do we need it?
> > > > public TransactionIsolation isolation();
> > > > public TransactionConcurrency concurrency();
> > > > public long timeout();
> > > > public String label();
> > > >
> > > > public void commit();
> > > > public void rollback();
> > > > public void close();
> > > > }
> > > >
> > > > From the server side, I think as a first step (while transactions
> > > > suspend/resume is not fully implemented) we can use the same approach
> > as
> > > > for JDBC: add a new worker to each ClientRequestHandler and process
> > > > requests by this worker if the transaction is started explicitly.
> > > > ClientRequestHandler is bound to client connection, so there will be
> > 1:1
> > > > relation between client connection and thread, which process
> operations
> > > in
> > > > a transaction.
> > > >
> > > > Also, there is a couple of issues I want to discuss:
> > > >
> > > > We have overloaded method txStart with a different set of arguments.
> > Some
> > > > of the arguments may be missing. To pass arguments with OP_TX_START
> > > > operation we have the next options:
> > > >  * Serialize full set of arguments and use some value for missing
> > > > arguments. For example -1 for int/long types and null for string
> type.
> > We
> > > > can't use 0 for int/long types since 0 it's a valid value for
> > > concurrency,
> > > > isolation and timeout arguments.
> > > >  * Serialize arguments as a collection of property-value pairs (like
> > it's
> > > > implemented now for CacheConfiguration). In this case only explicitly
> > > > provided arguments will be serialized.
> > > > Which way is better? The simplest solution is to use the first option
> > > and I
> > > > want to use it if there were no objections.
> > > >
> > > > Do we need transaction id (xid) on the client side?
> > > > If yes, we can pass xid along with OP_TX_COMMIT, OP_TX_ROLLBACK,
> > > > OP_TX_CLOSE operations back to the server and do additional check on
> > the
> > > > server side (current transaction id for connection == transaction id
> > > 

Re: Thin client: transactions support

2019-03-26 Thread Alex Plehanov
Sergey, we have the close() method in the thick client; its behavior is
slightly different from the rollback() method (it should roll back if the
transaction is not committed and do nothing if the transaction is already
committed). I think we should support try-with-resources semantics in the
thin client, and OP_TX_CLOSE will be useful here.

Nikolay, suspend/resume doesn't work yet for pessimistic transactions. Also,
the main goal of the suspend/resume operations is to support passing a
transaction between threads. In the thin client, the transaction is bound to
the client connection, not to a client thread. I think passing a transaction
between different client connections is not a very useful case.

On Tue, Mar 26, 2019 at 22:17, Nikolay Izhikov wrote:

> Hello, Alex.
>
> We also have suspend and resume operations.
> I think we should support them
>
> вт, 26 марта 2019 г., 22:07 Sergey Kozlov :
>
> > Hi
> >
> > Looks like I missed something but why we need OP_TX_CLOSE operation?
> >
> > Also I suggest to reserve a code for SAVEPOINT operation which very
> useful
> > to understand where transaction has been rolled back
> >
> > On Tue, Mar 26, 2019 at 6:07 PM Alex Plehanov 
> > wrote:
> >
> > > Hello Igniters!
> > >
> > > I want to pick up the ticket IGNITE-7369 and add transactions support
> to
> > > our thin client implementation.
> > > I've looked at our current implementation and have some proposals to
> > > support transactions:
> > >
> > > Add new operations to thin client protocol:
> > >
> > > OP_TX_GET, 4000, Get current transaction for client connection
> > > OP_TX_START, 4001, Start a new transaction
> > > OP_TX_COMMIT, 4002, Commit transaction
> > > OP_TX_ROLLBACK, 4003, Rollback transaction
> > > OP_TX_CLOSE, 4004, Close transaction
> > >
> > > From the client side (java) new interfaces will be added:
> > >
> > > public interface ClientTransactions {
> > > public ClientTransaction txStart();
> > > public ClientTransaction txStart(TransactionConcurrency
> concurrency,
> > > TransactionIsolation isolation);
> > > public ClientTransaction txStart(TransactionConcurrency
> concurrency,
> > > TransactionIsolation isolation, long timeout, int txSize);
> > > public ClientTransaction tx(); // Get current connection
> transaction
> > > public ClientTransactions withLabel(String lb);
> > > }
> > >
> > > public interface ClientTransaction extends AutoCloseable {
> > > public IgniteUuid xid(); // Do we need it?
> > > public TransactionIsolation isolation();
> > > public TransactionConcurrency concurrency();
> > > public long timeout();
> > > public String label();
> > >
> > > public void commit();
> > > public void rollback();
> > > public void close();
> > > }
> > >
> > > From the server side, I think as a first step (while transactions
> > > suspend/resume is not fully implemented) we can use the same approach
> as
> > > for JDBC: add a new worker to each ClientRequestHandler and process
> > > requests by this worker if the transaction is started explicitly.
> > > ClientRequestHandler is bound to client connection, so there will be
> 1:1
> > > relation between client connection and thread, which process operations
> > in
> > > a transaction.
> > >
> > > Also, there is a couple of issues I want to discuss:
> > >
> > > We have overloaded method txStart with a different set of arguments.
> Some
> > > of the arguments may be missing. To pass arguments with OP_TX_START
> > > operation we have the next options:
> > >  * Serialize full set of arguments and use some value for missing
> > > arguments. For example -1 for int/long types and null for string type.
> We
> > > can't use 0 for int/long types since 0 it's a valid value for
> > concurrency,
> > > isolation and timeout arguments.
> > >  * Serialize arguments as a collection of property-value pairs (like
> it's
> > > implemented now for CacheConfiguration). In this case only explicitly
> > > provided arguments will be serialized.
> > > Which way is better? The simplest solution is to use the first option
> > and I
> > > want to use it if there were no objections.
> > >
> > > Do we need transaction id (xid) on the client side?
> > > If yes, we can pass xid along with OP_TX_COMMIT, OP_TX_ROLLBACK,
> > > OP_TX_CLOSE operations back to the server and do additional check on
> the
> > > server side (current transaction id for connection == transaction id
> > passed
> > > from client side). This, perhaps, will protect clients against some
> > errors
> > > (for example when client try to commit outdated transaction). But
> > > currently, we don't have data type IgniteUuid in thin client protocol.
> Do
> > > we need to add it too?
> > > Also, we can pass xid as a string just to inform the client and do not
> > pass
> > > it back to the server with commit/rollback operation.
> > > Or not to pass xid at all (.NET thick client works this way as far as I
> > > know).
> > >
> > > What do you think?
> > >
> > > ср, 7 мар. 

Re: Thin client: transactions support

2019-03-26 Thread Nikolay Izhikov
Hello, Alex.

We also have suspend and resume operations.
I think we should support them.

On Tue, Mar 26, 2019 at 22:07, Sergey Kozlov wrote:

> Hi
>
> Looks like I missed something but why we need OP_TX_CLOSE operation?
>
> Also I suggest to reserve a code for SAVEPOINT operation which very useful
> to understand where transaction has been rolled back
>
> On Tue, Mar 26, 2019 at 6:07 PM Alex Plehanov 
> wrote:
>
> > Hello Igniters!
> >
> > I want to pick up the ticket IGNITE-7369 and add transactions support to
> > our thin client implementation.
> > I've looked at our current implementation and have some proposals to
> > support transactions:
> >
> > Add new operations to thin client protocol:
> >
> > OP_TX_GET, 4000, Get current transaction for client connection
> > OP_TX_START, 4001, Start a new transaction
> > OP_TX_COMMIT, 4002, Commit transaction
> > OP_TX_ROLLBACK, 4003, Rollback transaction
> > OP_TX_CLOSE, 4004, Close transaction
> >
> > From the client side (java) new interfaces will be added:
> >
> > public interface ClientTransactions {
> > public ClientTransaction txStart();
> > public ClientTransaction txStart(TransactionConcurrency concurrency,
> > TransactionIsolation isolation);
> > public ClientTransaction txStart(TransactionConcurrency concurrency,
> > TransactionIsolation isolation, long timeout, int txSize);
> > public ClientTransaction tx(); // Get current connection transaction
> > public ClientTransactions withLabel(String lb);
> > }
> >
> > public interface ClientTransaction extends AutoCloseable {
> > public IgniteUuid xid(); // Do we need it?
> > public TransactionIsolation isolation();
> > public TransactionConcurrency concurrency();
> > public long timeout();
> > public String label();
> >
> > public void commit();
> > public void rollback();
> > public void close();
> > }
> >
> > From the server side, I think as a first step (while transactions
> > suspend/resume is not fully implemented) we can use the same approach as
> > for JDBC: add a new worker to each ClientRequestHandler and process
> > requests by this worker if the transaction is started explicitly.
> > ClientRequestHandler is bound to client connection, so there will be 1:1
> > relation between client connection and thread, which process operations
> in
> > a transaction.
> >
> > Also, there is a couple of issues I want to discuss:
> >
> > We have overloaded method txStart with a different set of arguments. Some
> > of the arguments may be missing. To pass arguments with OP_TX_START
> > operation we have the next options:
> >  * Serialize full set of arguments and use some value for missing
> > arguments. For example -1 for int/long types and null for string type. We
> > can't use 0 for int/long types since 0 it's a valid value for
> concurrency,
> > isolation and timeout arguments.
> >  * Serialize arguments as a collection of property-value pairs (like it's
> > implemented now for CacheConfiguration). In this case only explicitly
> > provided arguments will be serialized.
> > Which way is better? The simplest solution is to use the first option
> and I
> > want to use it if there were no objections.
> >
> > Do we need transaction id (xid) on the client side?
> > If yes, we can pass xid along with OP_TX_COMMIT, OP_TX_ROLLBACK,
> > OP_TX_CLOSE operations back to the server and do additional check on the
> > server side (current transaction id for connection == transaction id
> passed
> > from client side). This, perhaps, will protect clients against some
> errors
> > (for example when client try to commit outdated transaction). But
> > currently, we don't have data type IgniteUuid in thin client protocol. Do
> > we need to add it too?
> > Also, we can pass xid as a string just to inform the client and do not
> pass
> > it back to the server with commit/rollback operation.
> > Or not to pass xid at all (.NET thick client works this way as far as I
> > know).
> >
> > What do you think?
> >
> > ср, 7 мар. 2018 г. в 16:22, Vladimir Ozerov :
> >
> > > We already have transactions support in JDBC driver in TX SQL branch
> > > (ignite-4191). Currently it is implemented through separate thread,
> which
> > > is not that efficient. Ideally we need to finish decoupling
> transactions
> > > from threads. But alternatively we can change the logic on how we
> assign
> > > thread ID to specific transaction and "impersonate" thin client worker
> > > threads when serving requests from multiple users.
> > >
> > >
> > >
> > > On Tue, Mar 6, 2018 at 10:01 PM, Denis Magda 
> wrote:
> > >
> > > > Here is an original discussion with a reference to the JIRA ticket:
> > > > http://apache-ignite-developers.2346864.n4.nabble.
> > > > com/Re-Transaction-operations-using-the-Ignite-Thin-Client-
> > > > Protocol-td25914.html
> > > >
> > > > --
> > > > Denis
> > > >
> > > > On Tue, Mar 6, 2018 at 9:18 AM, Dmitriy Setrakyan <
> > dsetrak...@apache.org
> > > >
> > > > wrote:
> > > >

Re: Thin client: transactions support

2019-03-26 Thread Sergey Kozlov
Hi

It looks like I missed something, but why do we need the OP_TX_CLOSE operation?

Also, I suggest reserving a code for a SAVEPOINT operation, which is very useful
for understanding where a transaction has been rolled back.

On Tue, Mar 26, 2019 at 6:07 PM Alex Plehanov 
wrote:

> Hello Igniters!
>
> I want to pick up the ticket IGNITE-7369 and add transactions support to
> our thin client implementation.
> I've looked at our current implementation and have some proposals to
> support transactions:
>
> Add new operations to thin client protocol:
>
> OP_TX_GET, 4000, Get current transaction for client connection
> OP_TX_START, 4001, Start a new transaction
> OP_TX_COMMIT, 4002, Commit transaction
> OP_TX_ROLLBACK, 4003, Rollback transaction
> OP_TX_CLOSE, 4004, Close transaction
>
> From the client side (java) new interfaces will be added:
>
> public interface ClientTransactions {
> public ClientTransaction txStart();
> public ClientTransaction txStart(TransactionConcurrency concurrency,
> TransactionIsolation isolation);
> public ClientTransaction txStart(TransactionConcurrency concurrency,
> TransactionIsolation isolation, long timeout, int txSize);
> public ClientTransaction tx(); // Get current connection transaction
> public ClientTransactions withLabel(String lb);
> }
>
> public interface ClientTransaction extends AutoCloseable {
> public IgniteUuid xid(); // Do we need it?
> public TransactionIsolation isolation();
> public TransactionConcurrency concurrency();
> public long timeout();
> public String label();
>
> public void commit();
> public void rollback();
> public void close();
> }
>
> From the server side, I think as a first step (while transactions
> suspend/resume is not fully implemented) we can use the same approach as
> for JDBC: add a new worker to each ClientRequestHandler and process
> requests by this worker if the transaction is started explicitly.
> ClientRequestHandler is bound to client connection, so there will be 1:1
> relation between client connection and thread, which process operations in
> a transaction.
>
> Also, there is a couple of issues I want to discuss:
>
> We have overloaded method txStart with a different set of arguments. Some
> of the arguments may be missing. To pass arguments with OP_TX_START
> operation we have the next options:
>  * Serialize full set of arguments and use some value for missing
> arguments. For example -1 for int/long types and null for string type. We
> can't use 0 for int/long types since 0 it's a valid value for concurrency,
> isolation and timeout arguments.
>  * Serialize arguments as a collection of property-value pairs (like it's
> implemented now for CacheConfiguration). In this case only explicitly
> provided arguments will be serialized.
> Which way is better? The simplest solution is to use the first option and I
> want to use it if there were no objections.
>
> Do we need transaction id (xid) on the client side?
> If yes, we can pass xid along with OP_TX_COMMIT, OP_TX_ROLLBACK,
> OP_TX_CLOSE operations back to the server and do additional check on the
> server side (current transaction id for connection == transaction id passed
> from client side). This, perhaps, will protect clients against some errors
> (for example when client try to commit outdated transaction). But
> currently, we don't have data type IgniteUuid in thin client protocol. Do
> we need to add it too?
> Also, we can pass xid as a string just to inform the client and do not pass
> it back to the server with commit/rollback operation.
> Or not to pass xid at all (.NET thick client works this way as far as I
> know).
>
> What do you think?
>
> ср, 7 мар. 2018 г. в 16:22, Vladimir Ozerov :
>
> > We already have transactions support in JDBC driver in TX SQL branch
> > (ignite-4191). Currently it is implemented through separate thread, which
> > is not that efficient. Ideally we need to finish decoupling transactions
> > from threads. But alternatively we can change the logic on how we assign
> > thread ID to specific transaction and "impersonate" thin client worker
> > threads when serving requests from multiple users.
> >
> >
> >
> > On Tue, Mar 6, 2018 at 10:01 PM, Denis Magda  wrote:
> >
> > > Here is an original discussion with a reference to the JIRA ticket:
> > > http://apache-ignite-developers.2346864.n4.nabble.
> > > com/Re-Transaction-operations-using-the-Ignite-Thin-Client-
> > > Protocol-td25914.html
> > >
> > > --
> > > Denis
> > >
> > > On Tue, Mar 6, 2018 at 9:18 AM, Dmitriy Setrakyan <
> dsetrak...@apache.org
> > >
> > > wrote:
> > >
> > > > Hi Dmitriy. I don't think we have a design proposal for transaction
> > > support
> > > > in thin clients. Do you mind taking this initiative and creating an
> IEP
> > > on
> > > > Wiki?
> > > >
> > > > D.
> > > >
> > > > On Tue, Mar 6, 2018 at 8:46 AM, Dmitriy Govorukhin <
> > > > dmitriy.govoruk...@gmail.com> wrote:
> > > >
> > > > > Hi, Igniters.
> > > > >
> > > 

Re: Ignite 2.7.5 Release scope

2019-03-26 Thread Dmitriy Pavlov
Hi,

I've cherry-picked this commit. It seems to be critical because it also
fixes a storage corruption issue.

Sincerely,
Dmitriy Pavlov

On Tue, Mar 26, 2019 at 14:14, Zhenya Stanilovsky wrote:

> I suppose this ticket [1] : is very useful too.
>
>
> [1] https://issues.apache.org/jira/browse/IGNITE-10873 [
> CorruptedTreeException during simultaneous cache put operations ]
>
> >
> >
> >--- Forwarded message ---
> >From: "Alexey Goncharuk" < alexey.goncha...@gmail.com >
> >To: dev < dev@ignite.apache.org >
> >Cc:
> >Subject: Re: Ignite 2.7.5 Release scope
> >Date: Tue, 26 Mar 2019 13:42:59 +0300
> >
> >Hello Ilya,
> >
> >I do not see any issues with the mentioned test. I see the following
> output
> >in the logs:
> >
> >[21:41:44] : [Step 4/5] [2019-03-22 21:41:44,970][INFO ][main][root] >>>
> >Stopping test:
>
> >TcpDiscoveryCoordinatorFailureTest#testCoordinatorFailedNoAddFinishedMessageStartOneNode
> >in 37768 ms <<<
> >[21:41:44] : [Step 4/5] [2019-03-22 21:41:44,971][INFO ][main][root] >>>
> >Stopping test class: TcpDiscoveryCoordinatorFailureTest <<<
> >[21:41:44] : [Step 4/5] [2019-03-22 21:41:44,971][INFO ][main][root] >>>
> >Starting test class: IgniteClientConnectTest <<<
> >
> >The issue with Windows may be long connection timeouts, in this case we
> >should either split the suite into multiple ones or decrease the SPI
> >timeouts.
> >
> >пн, 25 мар. 2019 г. в 11:24, Ilya Kasnacheev < ilya.kasnach...@gmail.com
> >:
> >
> >> Hello!
> >>
> >> It seems that I can no longer test this case, on account of
> >>
> >>
> TcpDiscoveryCoordinatorFailureTest#testClusterFailedNewCoordinatorInitialized
> >> hanging every time under Java 11 on Windows.
> >>
> >> Alexey, Ivan, can you please take a look?
> >>
> >>
> >>
> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__
> >>
> >> Regards,
> >>
> >> --
> >> Ilya Kasnacheev
> >>
> >>
> >> пт, 22 мар. 2019 г. в 16:59, Ilya Kasnacheev <
> ilya.kasnach...@gmail.com >:
> >>
> >> > Hello!
> >> >
> >> > Basically there is a test that explicitly highlights this problem,
> that
> >> is
> >> > running SSL tests on Windows + Java 11. They will hang on Master but
> >> pass
> >> > with this patch.
> >> >
> >> > I have started that on TC, results will probably be available later
> >> today:
> >> >
> >> >
> >>
> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__
> >> > (mind the Java version).
> >> >
> >> > Regards,
> >> > --
> >> > Ilya Kasnacheev
> >> >
> >> >
> >> > пт, 22 мар. 2019 г. в 14:13, Maxim Muzafarov < maxmu...@gmail.com >:
> >> >
> >> >> Dmitry, Ilya,
> >> >>
> >> >> Yes, I've looked through those changes [1] as they can affect my
> local
> >> >> PR.  Basically, changes look good to me.
> >> >>
> >> >> I'm not an expert with CommunicationSpi component, so can miss some
> >> >> details and I haven't tested these changes under Java 11. One more
> >> >> thing I'd like to say, I would add additional tests to PR that will
> >> >> explicitly highlight the problem being solved.
> >> >>
> >> >>
> >> >> [1]  https://issues.apache.org/jira/browse/IGNITE-11299
> >> >>
> >> >> On Thu, 21 Mar 2019 at 22:57, Dmitriy Pavlov < dpav...@apache.org >
> >> wrote:
> >> >> >
> >> >> > Hi Igniters,
> >> >> >
> >> >> > fix  https://issues.apache.org/jira/browse/IGNITE-11299 Avoid busy
> >> wait
> >> >> on
> >> >> > processWrite during SSL handshake.
> >> >> > seems to be blocker cause it is related to Java 11
> >> >> >
> >> >> > I see Maxim M left some comments. Ilya K., Maxim M.were these
> >> comments
> >> >> > addressed?
> >> >> >
> >> >> > The ticket is in Patch Available. Reviewer needed. Changes located
> >> in
> >> >> > GridNioServer.
> >> >> >
> >> >> > Sincerely,
> >> >> > Dmitriy Pavlov
> >> >> >
> >> >> > P.S. a quite obvious ticket came to sope, as well:
> >> >> >  https://issues.apache.org/jira/browse/IGNITE-11600
> >> >> >
> >> >> >
> >> >> > чт, 21 мар. 2019 г. в 16:55, Petr Ivanov < mr.wei...@gmail.com >:
> >> >> >
> >> >> > > Huge +1
> >> >> > >
> >> >> > > Will try to add new JDK in nearest time to our Teamcity.
> >> >> > >
> >> >> > >
> >> >> > > > On 21 Mar 2019, at 16:27, Dmitriy Pavlov < dpav...@apache.org
> >
> >> >> wrote:
> >> >> > > >
> >> >> > > > Hi Igniters,
> >> >> > > >
> >> >> > > > Meanwhile, Java 12 GA is available. I suggest at least test our
> >> new
> >> >> tests
> >> >> > > > scripts with a couple of Java builds. WDYT?
> >> >> > > >
> >> >> > > > Sincerely,
> >> >> > > > Dmitriy Pavlov
> >> >> > > >
> >> >> > > > ср, 20 мар. 2019 г. в 19:21, Dmitriy Pavlov
> >> < dpav...@apache.org >:
> >> >> > > >
> >> >> > > >> Hi Ignite Developers,
> >> >> > > >>
> >> >> > > >> In a separate discussion, I've shared a log with all commits.
> >> >> > > >>
> >> >> > > >> As far as I can see, nobody removed commits from this sheet,
> so
> >> the
> >> >> > > scope
> >> >> > > >> of release will be 

Re: Ignite 2.7.5 Release scope

2019-03-26 Thread Ilya Kasnacheev
Hello.

Yes, locally this test seems to pass. However, no luck on TC. Maybe my
commit is positioned on top of an especially unlucky HEAD.

Anyway, my point was that TcpDiscoverySslTrustedUntrustedTest (or any other
intra-node SSL test) is a sufficient test for IGNITE-11299.

It will very reliably hang on Windows/Java 11 without the patch and will always
pass with my patch (and TLSv1.2).

So no additional test is needed - we are testing a known regression here.

Regards,
-- 
Ilya Kasnacheev


On Tue, Mar 26, 2019 at 13:43, Alexey Goncharuk wrote:

> Hello Ilya,
>
> I do not see any issues with the mentioned test. I see the following output
> in the logs:
>
> [21:41:44] : [Step 4/5] [2019-03-22 21:41:44,970][INFO ][main][root] >>>
> Stopping test:
>
> TcpDiscoveryCoordinatorFailureTest#testCoordinatorFailedNoAddFinishedMessageStartOneNode
> in 37768 ms <<<
> [21:41:44] : [Step 4/5] [2019-03-22 21:41:44,971][INFO ][main][root] >>>
> Stopping test class: TcpDiscoveryCoordinatorFailureTest <<<
> [21:41:44] : [Step 4/5] [2019-03-22 21:41:44,971][INFO ][main][root] >>>
> Starting test class: IgniteClientConnectTest <<<
>
> The issue with Windows may be long connection timeouts, in this case we
> should either split the suite into multiple ones or decrease the SPI
> timeouts.
>
> пн, 25 мар. 2019 г. в 11:24, Ilya Kasnacheev :
>
> > Hello!
> >
> > It seems that I can no longer test this case, on account of
> >
> >
> TcpDiscoveryCoordinatorFailureTest#testClusterFailedNewCoordinatorInitialized
> > hanging every time under Java 11 on Windows.
> >
> > Alexey, Ivan, can you please take a look?
> >
> >
> >
> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__
> >
> > Regards,
> >
> > --
> > Ilya Kasnacheev
> >
> >
> > пт, 22 мар. 2019 г. в 16:59, Ilya Kasnacheev  >:
> >
> > > Hello!
> > >
> > > Basically there is a test that explicitly highlights this problem, that
> > is
> > > running SSL tests on Windows + Java 11. They will hang on Master but
> pass
> > > with this patch.
> > >
> > > I have started that on TC, results will probably be available later
> > today:
> > >
> > >
> >
> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__
> > > (mind the Java version).
> > >
> > > Regards,
> > > --
> > > Ilya Kasnacheev
> > >
> > >
> > > пт, 22 мар. 2019 г. в 14:13, Maxim Muzafarov :
> > >
> > >> Dmitry, Ilya,
> > >>
> > >> Yes, I've looked through those changes [1] as they can affect my local
> > >> PR.  Basically, changes look good to me.
> > >>
> > >> I'm not an expert with CommunicationSpi component, so can miss some
> > >> details and I haven't tested these changes under Java 11. One more
> > >> thing I'd like to say, I would add additional tests to PR that will
> > >> explicitly highlight the problem being solved.
> > >>
> > >>
> > >> [1] https://issues.apache.org/jira/browse/IGNITE-11299
> > >>
> > >> On Thu, 21 Mar 2019 at 22:57, Dmitriy Pavlov 
> > wrote:
> > >> >
> > >> > Hi Igniters,
> > >> >
> > >> > fix https://issues.apache.org/jira/browse/IGNITE-11299 Avoid busy
> > wait
> > >> on
> > >> > processWrite during SSL handshake.
> > >> > seems to be blocker cause it is related to Java 11
> > >> >
> > >> > I see Maxim M left some comments. Ilya K., Maxim M.were these
> comments
> > >> > addressed?
> > >> >
> > >> > The ticket is in Patch Available. Reviewer needed. Changes located
> in
> > >> > GridNioServer.
> > >> >
> > >> > Sincerely,
> > >> > Dmitriy Pavlov
> > >> >
> > >> > P.S. a quite obvious ticket came to sope, as well:
> > >> > https://issues.apache.org/jira/browse/IGNITE-11600
> > >> >
> > >> >
> > >> > чт, 21 мар. 2019 г. в 16:55, Petr Ivanov :
> > >> >
> > >> > > Huge +1
> > >> > >
> > >> > > Will try to add new JDK in nearest time to our Teamcity.
> > >> > >
> > >> > >
> > >> > > > On 21 Mar 2019, at 16:27, Dmitriy Pavlov 
> > >> wrote:
> > >> > > >
> > >> > > > Hi Igniters,
> > >> > > >
> > >> > > > Meanwhile, Java 12 GA is available. I suggest at least test our
> > new
> > >> tests
> > >> > > > scripts with a couple of Java builds. WDYT?
> > >> > > >
> > >> > > > Sincerely,
> > >> > > > Dmitriy Pavlov
> > >> > > >
> > >> > > > ср, 20 мар. 2019 г. в 19:21, Dmitriy Pavlov  >:
> > >> > > >
> > >> > > >> Hi Ignite Developers,
> > >> > > >>
> > >> > > >> In a separate discussion, I've shared a log with all commits.
> > >> > > >>
> > >> > > >> As far as I can see, nobody removed commits from this sheet, so
> > the
> > >> > > scope
> > >> > > >> of release will be discussed in another way: only explicitly
> > >> declared
> > >> > > >> commits will be cherry-picked.
> > >> > > >>
> > >> > > >> Sincerely,
> > >> > > >> Dmitriy Pavlov
> > >> > > >>
> > >> > >
> > >> > >
> > >>
> > >
> >
>


Re: GridDhtInvalidPartitionException takes the cluster down

2019-03-26 Thread Andrey Gura
CleanupWorker termination can lead to the following effects:

- Queries can retrieve data that should have expired, so the application will
behave incorrectly.
- Memory and/or disk can overflow because entries weren't expired.
- Performance degradation is possible due to unmanageable data set growth.

On Tue, Mar 26, 2019 at 4:58 PM Roman Shtykh  wrote:
>
> Vyacheslav, if you are talking about this particular case I described, I 
> believe it has no influence on PME. What could happen is having CleanupWorker 
> thread dead (which is not good too).But I believe we are talking in a wider 
> scope.
>
> -- Roman
>
>
> On Tuesday, March 26, 2019, 10:23:30 p.m. GMT+9, Vyacheslav Daradur 
>  wrote:
>
>  In general I agree with Andrey, the handler is very useful in itself. It
> allows us to learn that ‘GridDhtInvalidPartitionException’ is not
> processed properly in the PME process by the worker.
>
> Nikolay, look at the code: if the Failure Handler handles an exception, this
> means that the while-true loop in the worker’s body has been interrupted by an
> unexpected exception and the thread has completed its lifecycle.
>
> Without the Failure Handler, in the current case, the cluster will hang,
> because it is unable to participate in the PME process.
>
> So, the problem is the incorrect handling of the exception in PME’s task,
> which should be fixed.
>
>
> вт, 26 марта 2019 г. в 14:24, Andrey Kuznetsov :
>
> > Nikolay,
> >
> > Feel free to suggest better error messages to indicate internal/critical
> > failures. User actions in response to critical failures are rather limited:
> > mail to user-list or maybe file an issue. As for repetitive warnings, it
> > makes sense, but requires additional stuff to deliver such signals, mere
> > spamming to log will not have an effect.
> >
> > Anyway, when experienced committers suggest to disable failure handling and
> > hide existing issues, I feel as if they are pulling my leg.
> >
> > Best regards,
> > Andrey Kuznetsov.
> >
> > вт, 26 марта 2019, 13:30 Nikolay Izhikov nizhi...@apache.org:
> >
> > > Andrey.
> > >
> > > >  the thread can be made non-critical, and we can restart it every time
> > it
> > > dies
> > >
> > > Why we can't restart critical thread?
> > > What is the root difference between critical and non critical threads?
> > >
> > > > It's much simpler to catch and handle all exceptions in critical
> > threads
> > >
> > > I don't agree with you.
> > > We develop Ignite not because it simple!
> > > We must spend extra time to made it robust and resilient to the failures.
> > >
> > > > Failure handling is a last-chance tool that reveals internal Ignite
> > > errors
> > > > 100% agree with you: overcome, but not hide.
> > >
> > > Logging stack trace with proper explanation is not hiding.
> > > Killing nodes and whole cluster is not "handling".
> > >
> > > > As far as I see from user-list messages, our users are qualified enough
> > > to provide necessary information from their cluster-wide logs.
> > >
> > > We shouldn't develop our product only for users who are able to read
> > Ignite
> > > sources to decrypt the fail reason behind "starvation in stripped pool"
> > >
> > > Some of my questions remain unanswered :) :
> > >
> > > 1. How user can know it's an Ignite bug? Where this bug should be
> > reported?
> > > 2. Do we log it somewhere?
> > > 3. Do we warn user before shutdown several times?
> > > 4. "starvation in stripped pool" I think it's not clear error message.
> > > Let's make it more specific!
> > > 5. Let's write to the user log - what he or she should do to prevent this
> > > error in future?
> > >
> > >
> > > вт, 26 мар. 2019 г. в 12:13, Andrey Kuznetsov :
> > >
> > > > Nikolay,
> > > >
> > > > >  Why we can't restart some thread?
> > > > Technically, we can. It's just matter of design: the thread can be made
> > > > non-critical, and we can restart it every time it dies. But such design
> > > > looks poor to me. It's much simpler to catch and handle all exceptions
> > in
> > > > critical threads. Failure handling is a last-chance tool that reveals
> > > > internal Ignite errors. It's not pleasant for us when users see these
> > > > errors, but it's better than hiding.
> > > >
> > > > >  Actually, distributed systems are designed to overcome some bugs,
> > > thread
> > > > failure, node failure, for example, isn't it?
> > > > 100% agree with you: overcome, but not hide.
> > > >
> > > > >  How user can know it's a bug? Where this bug should be reported?
> > > > As far as I see from user-list messages, our users are qualified enough
> > > to
> > > > provide necessary information from their cluster-wide logs.
> > > >
> > > >
> > > > вт, 26 мар. 2019 г. в 11:19, Nikolay Izhikov :
> > > >
> > > > > Andrey.
> > > > >
> > > > > > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no
> > use
> > > > to
> > > > > wait for dead thread's magical resurrection.
> > > > >
> > > > > Why is it unrecoverable?
> > > > > Why we can't restart some thread?
> > > > > Is there some kind of 

Re: GridDhtInvalidPartitionException takes the cluster down

2019-03-26 Thread Andrey Gura
Igniters,

1. First of all, I want to remind you why failure handlers were
implemented. Please take a look at IEP-14 [1] and the corresponding
discussion on the dev list [2] (a quite emotional discussion). These sources
also answer some questions from previous posts in this topic.

2. Note that the following failure types are ignored by default (BUT
these fixes ARE NOT included in 2.7):
- SYSTEM_WORKER_BLOCKED: A critical thread being unresponsive for a long time
is a problem, but we don't know why it happened (possibly a slow
environment), so we just ignore this failure.
- SYSTEM_CRITICAL_OPERATION_TIMEOUT: At the moment it is related only
to checkpoint read lock acquisition.

So we already have more or less adequate defaults.
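
For reference, a minimal configuration sketch of those defaults. It assumes a
later Ignite version where AbstractFailureHandler exposes setIgnoredFailureTypes
(as noted above, these fixes are not in 2.7, so the exact setter is an assumption):

    import java.util.EnumSet;

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.failure.FailureType;
    import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

    public class FailureHandlerDefaults {
        public static void main(String[] args) {
            StopNodeOrHaltFailureHandler hnd = new StopNodeOrHaltFailureHandler();

            // Assumption: treat blocked workers and critical-operation timeouts
            // as non-fatal, mirroring the defaults described above.
            hnd.setIgnoredFailureTypes(EnumSet.of(
                FailureType.SYSTEM_WORKER_BLOCKED,
                FailureType.SYSTEM_CRITICAL_OPERATION_TIMEOUT));

            IgniteConfiguration cfg = new IgniteConfiguration().setFailureHandler(hnd);

            Ignition.start(cfg);
        }
    }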

3. About the SYSTEM_WORKER_TERMINATION failure type.

Restarting the thread is a very bad idea because the system is already in an
undefined state and its behavior is unpredictable from this point.

For example, the discovery thread is a critical part of the discovery protocol.
If the discovery thread on some node is terminated during discovery
message processing, then:
- The protocol is already broken because the message will not be sent to the next
node in the ring, so we can't ignore this failure because the whole
cluster will suffer in this case;
- We could restart the thread and even try to process the same message
once again. And then what? The same error will happen with high probability
and the discovery thread will be terminated again.

4. About enabling the failure handler for things like transactions or
PME and having it off for checkpointing and other things.

The failure handler is a general component. It isn't tied to a particular
kind of functionality (e.g. tx, PME or checkpointing). We can only
manage the behavior of the configured failure handler for a
particular failure type. See p.2 above.

5. About providing hints on how to work around the shutdown in the future

I really don't like analogies, but I believe one is appropriate for
our discussion. What kind of hint can the JVM provide in case of an
AssertionError? The same holds for the failure handler. The failure handler
is the last resort, and the only thing the handler can provide is some
information about the failure. In our case this information contains the
failure context, the thread name and a thread dump.

6. About protection against a full cluster restart

The failure handler is a node-local entity. If the whole cluster is
restarted/stopped due to some failure, it means only one thing: a critical
failure happened on each cluster node. It means that we can't
protect the cluster from shutting down in the current failure model.
A more complex failure model could be implemented, which would require a
decision about stopping a node to come from all cluster nodes (or some
subset - a quorum). But it requires additional research and discussion.

7. About user experience

Yes, the "starvation in stripped pool" message isn't clear enough for...
hmmm... a user. But it is definitely clear to a developer. And I have no
idea what a clear message for a user would be. So... Do you have an idea? You are
welcome!
It is easy to say that something is wrong, but it is hard to make it right.

Also, I believe that user experience will not be better with a frozen
cluster instead of a failed cluster. And a user will not be happier if we
log more messages like "cluster will be stopped". And unfortunately we
can't explain to users what they should do in order to prevent
this error in the future, because we ourselves don't know what to do in this
case. Every failure is actually a bug that should be investigated and
fixed. Fewer bugs is the thing that can improve user experience.


Links:

1. 
https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling
2. 
http://apache-ignite-developers.2346864.n4.nabble.com/Internal-problems-requiring-graceful-node-shutdown-reboot-etc-td24856.html

On Tue, Mar 26, 2019 at 4:58 PM Roman Shtykh  wrote:
>
> Vyacheslav, if you are talking about this particular case I described, I 
> believe it has no influence on PME. What could happen is having CleanupWorker 
> thread dead (which is not good too).But I believe we are talking in a wider 
> scope.
>
> -- Roman
>
>
> On Tuesday, March 26, 2019, 10:23:30 p.m. GMT+9, Vyacheslav Daradur 
>  wrote:
>
>  In general I agree with Andrey, the handler is very useful in itself. It
> allows us to learn that ‘GridDhtInvalidPartitionException’ is not
> processed properly in the PME process by the worker.
>
> Nikolay, look at the code: if the Failure Handler handles an exception, this
> means that the while-true loop in the worker’s body has been interrupted by an
> unexpected exception and the thread has completed its lifecycle.
>
> Without the Failure Handler, in the current case, the cluster will hang,
> because it is unable to participate in the PME process.
>
> So, the problem is the incorrect handling of the exception in PME’s task,
> which should be fixed.
>
>
> вт, 26 марта 2019 г. в 14:24, Andrey Kuznetsov :
>
> > Nikolay,
> >
> > Feel free to suggest better error messages to indicate internal/critical
> > failures. User actions in 

[jira] [Created] (IGNITE-11633) Fix errors in WAL disabled archive mode documentation

2019-03-26 Thread Alexey Goncharuk (JIRA)
Alexey Goncharuk created IGNITE-11633:
-

 Summary: Fix errors in WAL disabled archive mode documentation
 Key: IGNITE-11633
 URL: https://issues.apache.org/jira/browse/IGNITE-11633
 Project: Ignite
  Issue Type: Task
  Components: documentation
Reporter: Alexey Goncharuk


In 
https://apacheignite.readme.io/docs/write-ahead-log#section-disabling-wal-archiving
 there is an error. The documentation says that " instead, it will overwrite 
the active segments in a cyclical order". In fact, when walWork == walArchive, 
the whole folder behaves as a sequential log, where new files are sequentially 
created (0, 1, 2, 3, ...) and old files are eventually truncated. Also, the WAL 
size setting in this mode needs to be clarified.
Ask [~dpavlov] and [~akalashnikov] for details.
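
For context, a minimal sketch of the configuration that this documentation section
describes (directory paths are examples): archiving is effectively disabled when
the WAL work and archive directories point to the same location.

    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class WalArchivingDisabledConfig {
        public static IgniteConfiguration config() {
            DataStorageConfiguration storageCfg = new DataStorageConfiguration();

            // Same directory for work and archive => WAL archiving is disabled;
            // the folder then behaves as a sequential log (0, 1, 2, ...), with
            // old segments eventually truncated, as described above.
            storageCfg.setWalPath("/ignite/wal");
            storageCfg.setWalArchivePath("/ignite/wal");

            return new IgniteConfiguration().setDataStorageConfiguration(storageCfg);
        }
    }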



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Thin client: transactions support

2019-03-26 Thread Alex Plehanov
Hello Igniters!

I want to pick up the ticket IGNITE-7369 and add transactions support to
our thin client implementation.
I've looked at our current implementation and have some proposals to
support transactions:

Add new operations to thin client protocol:

OP_TX_GET, 4000, Get current transaction for client connection
OP_TX_START, 4001, Start a new transaction
OP_TX_COMMIT, 4002, Commit transaction
OP_TX_ROLLBACK, 4003, Rollback transaction
OP_TX_CLOSE, 4004, Close transaction

From the client side (Java), new interfaces will be added:

public interface ClientTransactions {
    public ClientTransaction txStart();
    public ClientTransaction txStart(TransactionConcurrency concurrency, TransactionIsolation isolation);
    public ClientTransaction txStart(TransactionConcurrency concurrency, TransactionIsolation isolation,
        long timeout, int txSize);
    public ClientTransaction tx(); // Get current connection transaction
    public ClientTransactions withLabel(String lb);
}

public interface ClientTransaction extends AutoCloseable {
    public IgniteUuid xid(); // Do we need it?
    public TransactionIsolation isolation();
    public TransactionConcurrency concurrency();
    public long timeout();
    public String label();

    public void commit();
    public void rollback();
    public void close();
}
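
For illustration, a possible usage pattern of the proposed API (the client.transactions() accessor and the cache variable are assumptions of this sketch, not part of the proposal):

ClientTransactions txs = client.transactions();

try (ClientTransaction tx = txs.withLabel("example")
    .txStart(TransactionConcurrency.PESSIMISTIC, TransactionIsolation.REPEATABLE_READ)) {
    cache.put(1, "one");

    tx.commit(); // ClientTransaction extends AutoCloseable, so try-with-resources works here
}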

From the server side, I think as a first step (while transaction
suspend/resume is not fully implemented) we can use the same approach as
for JDBC: add a new worker to each ClientRequestHandler and process
requests with this worker if a transaction is started explicitly.
ClientRequestHandler is bound to a client connection, so there will be a 1:1
relation between the client connection and the thread which processes
operations in a transaction.
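
To illustrate the idea, a rough sketch of such a per-connection worker (all class names and request types below are illustrative placeholders, not existing Ignite code):

// Sketch: while an explicit transaction is open, every request of this connection is
// executed by one dedicated thread, so the server-side transaction stays bound to it.
class TxAwareRequestHandler {
    private final ExecutorService txWorker = Executors.newSingleThreadExecutor();
    private volatile boolean txActive;

    ClientResponse handle(ClientRequest req) throws Exception {
        if (req instanceof TxStartRequest)
            txActive = true;

        ClientResponse res = txActive ? txWorker.submit(() -> process(req)).get() : process(req);

        if (req instanceof TxEndRequest) // commit, rollback or close
            txActive = false;

        return res;
    }

    private ClientResponse process(ClientRequest req) {
        return null; // delegate to the usual cache/transaction handling here
    }
}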

Also, there are a couple of issues I want to discuss:

We have overloaded txStart methods with different sets of arguments. Some
of the arguments may be missing. To pass arguments with the OP_TX_START
operation we have the following options:
 * Serialize the full set of arguments and use some sentinel value for missing
arguments, for example -1 for int/long types and null for the string type. We
can't use 0 for int/long types, since 0 is a valid value for the concurrency,
isolation and timeout arguments.
 * Serialize arguments as a collection of property-value pairs (like it's
implemented now for CacheConfiguration). In this case only explicitly
provided arguments will be serialized.
Which way is better? The simplest solution is to use the first option, and I
will use it if there are no objections (see the sketch below).
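
To make the first option concrete, a sketch of the OP_TX_START payload with sentinel values (the field order and the use of BinaryRawWriter are assumptions, not the final protocol):

// Sketch: the full argument set is always written; -1 / null mark omitted arguments.
void writeTxStartRequest(BinaryRawWriter out, TransactionConcurrency concurrency,
    TransactionIsolation isolation, long timeout, int txSize, String lb) {
    out.writeByte((byte)(concurrency == null ? -1 : concurrency.ordinal()));
    out.writeByte((byte)(isolation == null ? -1 : isolation.ordinal()));
    out.writeLong(timeout); // -1 == not set; 0 can't be the marker, it is a valid timeout
    out.writeInt(txSize);   // -1 == not set
    out.writeString(lb);    // null == no label
}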

Do we need the transaction id (xid) on the client side?
If yes, we can pass the xid along with the OP_TX_COMMIT, OP_TX_ROLLBACK and
OP_TX_CLOSE operations back to the server and do an additional check on the
server side (current transaction id for the connection == transaction id passed
from the client side). This, perhaps, will protect clients against some errors
(for example, when a client tries to commit an outdated transaction). But
currently we don't have the IgniteUuid data type in the thin client protocol. Do
we need to add it too?
Also, we can pass the xid as a string just to inform the client and not pass
it back to the server with the commit/rollback operation.
Or we can not pass the xid at all (the .NET thick client works this way as far as I
know).
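
And if we do pass the xid back, the server-side check could be as simple as this sketch (how the current transaction is obtained and the exception type are assumptions):

// Sketch: reject commit of a transaction this connection does not own.
void commitTx(Transaction currentTx, IgniteUuid xidFromClient) {
    if (currentTx == null || !currentTx.xid().equals(xidFromClient))
        throw new IllegalStateException("Transaction is outdated or already completed: " + xidFromClient);

    currentTx.commit();
}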

What do you think?

ср, 7 мар. 2018 г. в 16:22, Vladimir Ozerov :

> We already have transactions support in JDBC driver in TX SQL branch
> (ignite-4191). Currently it is implemented through separate thread, which
> is not that efficient. Ideally we need to finish decoupling transactions
> from threads. But alternatively we can change the logic on how we assign
> thread ID to specific transaction and "impersonate" thin client worker
> threads when serving requests from multiple users.
>
>
>
> On Tue, Mar 6, 2018 at 10:01 PM, Denis Magda  wrote:
>
> > Here is an original discussion with a reference to the JIRA ticket:
> > http://apache-ignite-developers.2346864.n4.nabble.
> > com/Re-Transaction-operations-using-the-Ignite-Thin-Client-
> > Protocol-td25914.html
> >
> > --
> > Denis
> >
> > On Tue, Mar 6, 2018 at 9:18 AM, Dmitriy Setrakyan  >
> > wrote:
> >
> > > Hi Dmitriy. I don't think we have a design proposal for transaction
> > support
> > > in thin clients. Do you mind taking this initiative and creating an IEP
> > on
> > > Wiki?
> > >
> > > D.
> > >
> > > On Tue, Mar 6, 2018 at 8:46 AM, Dmitriy Govorukhin <
> > > dmitriy.govoruk...@gmail.com> wrote:
> > >
> > > > Hi, Igniters.
> > > >
> > > > I've seen a lot of discussions about thin client and binary protocol,
> > > but I
> > > > did not hear anything about transactions support. Do we have some
> draft
> > > for
> > > > this purpose?
> > > >
> > > > As I understand we have several problems:
> > > >
> > > >- thread and transaction have hard related (we use thread-local
> > > variable
> > > >and thread name)
> > > >- we can process only one transaction at the same time in one
> thread
> > > (it
> > > >mean we need 

[jira] [Created] (IGNITE-11632) Node can't start if WAL is corrupted and the wal archiver disabled.

2019-03-26 Thread Stepachev Maksim (JIRA)
Stepachev Maksim created IGNITE-11632:
-

 Summary: Node can't start if WAL is corrupted and the wal archiver 
disabled.
 Key: IGNITE-11632
 URL: https://issues.apache.org/jira/browse/IGNITE-11632
 Project: Ignite
  Issue Type: Bug
Affects Versions: 2.7, 2.6, 2.5
Reporter: Stepachev Maksim
Assignee: Stepachev Maksim
 Fix For: 2.7, 2.6, 2.5


If you start a node without the WAL archiver and your last segment page has a 
wrong CRC, the node stops with an exception:
{code:java}
Caused by: class org.apache.ignite.IgniteCheckedException: Failed to read WAL 
record at position: 234728337 size: 268435456
at 
org.apache.ignite.internal.processors.cache.persistence.wal.serializer.RecordV1Serializer.readWithCrc(RecordV1Serializer.java:394)
at 
org.apache.ignite.internal.processors.cache.persistence.wal.serializer.RecordV2Serializer.readRecord(RecordV2Serializer.java:235)
at 
org.apache.ignite.internal.processors.cache.persistence.wal.AbstractWalRecordsIterator.advanceRecord(AbstractWalRecordsIterator.java:243)
... 23 more
Caused by: class 
org.apache.ignite.internal.processors.cache.persistence.wal.crc.IgniteDataIntegrityViolationException:
 val: -202263192 writtenCrc: 0
at 
org.apache.ignite.internal.processors.cache.persistence.wal.io.FileInput$Crc32CheckingFileInput.close(FileInput.java:106)
at 
org.apache.ignite.internal.processors.cache.persistence.wal.serializer.RecordV1Serializer.readWithCrc(RecordV1Serializer.java:380)
... 25 more
{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: GridDhtInvalidPartitionException takes the cluster down

2019-03-26 Thread Roman Shtykh
Vyacheslav, if you are talking about this particular case I described, I 
believe it has no influence on PME. What could happen is having the CleanupWorker 
thread dead (which is not good either). But I believe we are talking in a wider 
scope.

-- Roman
 

On Tuesday, March 26, 2019, 10:23:30 p.m. GMT+9, Vyacheslav Daradur 
 wrote:  
 
 In general I agree with Andrey, the handler is very usefull itself. It
allows us to become know that ‘GridDhtInvalidPartitionException’ is not
processed properly in PME process by worker.

Nikolay, look at the code, if Failure Handler hadles an exception - this
means that while-true loop in worker’s body has been interrupted with
unexpected exception and thread is completed his lifecycle.

Without Failure Hanller, in the current case, the cluster will hang,
because of unable to participate in PME process.

So, the problem is the incorrect handling of the exception in PME’s task
wich should be fixed.


вт, 26 марта 2019 г. в 14:24, Andrey Kuznetsov :

> Nikolay,
>
> Feel free to suggest better error messages to indicate internal/critical
> failures. User actions in response to critical failures are rather limited:
> mail to user-list or maybe file an issue. As for repetitive warnings, it
> makes sense, but requires additional stuff to deliver such signals, mere
> spamming to log will not have an effect.
>
> Anyway, when experienced committers suggest to disable failure handling and
> hide existing issues, I feel as if they are pulling my leg.
>
> Best regards,
> Andrey Kuznetsov.
>
> вт, 26 марта 2019, 13:30 Nikolay Izhikov nizhi...@apache.org:
>
> > Andrey.
> >
> > >  the thread can be made non-critical, and we can restart it every time
> it
> > dies
> >
> > Why we can't restart critical thread?
> > What is the root difference between critical and non critical threads?
> >
> > > It's much simpler to catch and handle all exceptions in critical
> threads
> >
> > I don't agree with you.
> > We develop Ignite not because it simple!
> > We must spend extra time to made it robust and resilient to the failures.
> >
> > > Failure handling is a last-chance tool that reveals internal Ignite
> > errors
> > > 100% agree with you: overcome, but not hide.
> >
> > Logging stack trace with proper explanation is not hiding.
> > Killing nodes and whole cluster is not "handling".
> >
> > > As far as I see from user-list messages, our users are qualified enough
> > to provide necessary information from their cluster-wide logs.
> >
> > We shouldn't develop our product only for users who are able to read
> Ignite
> > sources to decrypt the fail reason behind "starvation in stripped pool"
> >
> > Some of my questions remain unanswered :) :
> >
> > 1. How user can know it's an Ignite bug? Where this bug should be
> reported?
> > 2. Do we log it somewhere?
> > 3. Do we warn user before shutdown several times?
> > 4. "starvation in stripped pool" I think it's not clear error message.
> > Let's make it more specific!
> > 5. Let's write to the user log - what he or she should do to prevent this
> > error in future?
> >
> >
> > вт, 26 мар. 2019 г. в 12:13, Andrey Kuznetsov :
> >
> > > Nikolay,
> > >
> > > >  Why we can't restart some thread?
> > > Technically, we can. It's just matter of design: the thread can be made
> > > non-critical, and we can restart it every time it dies. But such design
> > > looks poor to me. It's much simpler to catch and handle all exceptions
> in
> > > critical threads. Failure handling is a last-chance tool that reveals
> > > internal Ignite errors. It's not pleasant for us when users see these
> > > errors, but it's better than hiding.
> > >
> > > >  Actually, distributed systems are designed to overcome some bugs,
> > thread
> > > failure, node failure, for example, isn't it?
> > > 100% agree with you: overcome, but not hide.
> > >
> > > >  How user can know it's a bug? Where this bug should be reported?
> > > As far as I see from user-list messages, our users are qualified enough
> > to
> > > provide necessary information from their cluster-wide logs.
> > >
> > >
> > > вт, 26 мар. 2019 г. в 11:19, Nikolay Izhikov :
> > >
> > > > Andrey.
> > > >
> > > > > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no
> use
> > > to
> > > > wait for dead thread's magical resurrection.
> > > >
> > > > Why is it unrecoverable?
> > > > Why we can't restart some thread?
> > > > Is there some kind of nature limitation to not restart system thread?
> > > >
> > > > Actually, distributed systems are designed to overcome some bugs,
> > thread
> > > > failure, node failure, for example, isn't it?
> > > > > if under some circumstances node> stop leads to cascade cluster
> > crash,
> > > > then it's a bug
> > > >
> > > > How user can know it's a bug? Where this bug should be reported?
> > > > Do we log it somewhere?
> > > > Do we warn user before shutdown one or several times?
> > > >
> > > > This feature kills user experience literally now.
> > > >
> > > > If I would be a user of 

[jira] [Created] (IGNITE-11631) Server node with PDS and SSL fails on start with NPE

2019-03-26 Thread Sergey Antonov (JIRA)
Sergey Antonov created IGNITE-11631:
---

 Summary: Server node with PDS and SSL fails on start with NPE
 Key: IGNITE-11631
 URL: https://issues.apache.org/jira/browse/IGNITE-11631
 Project: Ignite
  Issue Type: Bug
Affects Versions: 2.7
Reporter: Sergey Antonov
Assignee: Sergey Antonov
 Fix For: 2.8


Server node fails with an NPE if persistence and SSL are enabled.
Stack trace:

{code:java}
java.lang.NullPointerException
at 
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.createSocket(TcpDiscoverySpi.java:1565)
at 
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.openSocket(TcpDiscoverySpi.java:1503)
at 
org.apache.ignite.spi.discovery.tcp.ServerImpl.sendMessageDirectly(ServerImpl.java:1309)
at 
org.apache.ignite.spi.discovery.tcp.ServerImpl.sendJoinRequestMessage(ServerImpl.java:1144)
at 
org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:957)
at 
org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:422)
at 
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2089)
at 
org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:297)
at 
org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.start(GridDiscoveryManager.java:940)
at 
org.apache.ignite.internal.IgniteKernal.startManager(IgniteKernal.java:1743)
at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1085)
at 
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:1992)
at 
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1683)
at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1109)
at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:607)
at 
org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:984)
at 
org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:925)
at 
org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:913)
at 
org.apache.ignite.testframework.junits.GridAbstractTest.startGrid(GridAbstractTest.java:879)
at 
org.apache.ignite.testframework.junits.GridAbstractTest$4.call(GridAbstractTest.java:822)
at 
org.apache.ignite.testframework.GridTestThread.run(GridTestThread.java:84)
{code}
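
A minimal configuration sketch of the reported setup (keystore/truststore settings omitted; the exact reproduction steps are still to be confirmed):

{code:java}
// Sketch: persistence + SSL enabled, the combination reported to fail with the NPE above.
IgniteConfiguration cfg = new IgniteConfiguration()
    .setSslContextFactory(new SslContextFactory()) // key/trust store paths omitted here
    .setDataStorageConfiguration(new DataStorageConfiguration()
        .setDefaultDataRegionConfiguration(new DataRegionConfiguration()
            .setPersistenceEnabled(true)));

Ignition.start(cfg);
{code}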



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11630) Document changes to SQL views

2019-03-26 Thread Vladimir Ozerov (JIRA)
Vladimir Ozerov created IGNITE-11630:


 Summary: Document changes to SQL views
 Key: IGNITE-11630
 URL: https://issues.apache.org/jira/browse/IGNITE-11630
 Project: Ignite
  Issue Type: Task
  Components: sql
Reporter: Vladimir Ozerov
Assignee: Artem Budnikov
 Fix For: 2.8


The following changes were made to our views.

{{CACHE_GROUPS}}
 # {{ID}} -> {{CACHE_GROUP_ID}}
 # {{GROUP_NAME}} -> {{CACHE_GROUP_NAME}}

{{LOCAL_CACHE_GROUPS_IO}}
 # {{GROUP_ID}} -> {{CACHE_GROUP_ID}}
 # {{GROUP_NAME}} -> {{CACHE_GROUP_NAME}}

{{CACHES}}
# {{NAME}} -> {{CACHE_NAME}}
# {{GROUP_ID}} -> {{CACHE_GROUP_ID}}
# {{GROUP_NAME}} -> {{CACHE_GROUP_NAME}}

{{INDEXES}}
 # {{GROUP_ID}} -> {{CACHE_GROUP_ID}}
 # {{GROUP_NAME}} -> {{CACHE_GROUP_NAME}}

{{NODES}}
# {{ID}} -> {{NODE_ID}}

{{TABLES}}
# Added {{CACHE_GROUP_ID}}
# Added {{CACHE_GROUP_NAME}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: GridDhtInvalidPartitionException takes the cluster down

2019-03-26 Thread Vyacheslav Daradur
In general I agree with Andrey, the handler is very useful in itself. It
lets us learn that ‘GridDhtInvalidPartitionException’ is not
processed properly by the worker in the PME process.

Nikolay, look at the code: if the Failure Handler handles an exception, this
means that the while-true loop in the worker’s body has been interrupted by an
unexpected exception and the thread has completed its lifecycle.

Without the Failure Handler, in the current case, the cluster will hang,
because the node is unable to participate in the PME process.

So the problem is the incorrect handling of the exception in the PME task,
which should be fixed.


вт, 26 марта 2019 г. в 14:24, Andrey Kuznetsov :

> Nikolay,
>
> Feel free to suggest better error messages to indicate internal/critical
> failures. User actions in response to critical failures are rather limited:
> mail to user-list or maybe file an issue. As for repetitive warnings, it
> makes sense, but requires additional stuff to deliver such signals, mere
> spamming to log will not have an effect.
>
> Anyway, when experienced committers suggest to disable failure handling and
> hide existing issues, I feel as if they are pulling my leg.
>
> Best regards,
> Andrey Kuznetsov.
>
> вт, 26 марта 2019, 13:30 Nikolay Izhikov nizhi...@apache.org:
>
> > Andrey.
> >
> > >  the thread can be made non-critical, and we can restart it every time
> it
> > dies
> >
> > Why we can't restart critical thread?
> > What is the root difference between critical and non critical threads?
> >
> > > It's much simpler to catch and handle all exceptions in critical
> threads
> >
> > I don't agree with you.
> > We develop Ignite not because it simple!
> > We must spend extra time to made it robust and resilient to the failures.
> >
> > > Failure handling is a last-chance tool that reveals internal Ignite
> > errors
> > > 100% agree with you: overcome, but not hide.
> >
> > Logging stack trace with proper explanation is not hiding.
> > Killing nodes and whole cluster is not "handling".
> >
> > > As far as I see from user-list messages, our users are qualified enough
> > to provide necessary information from their cluster-wide logs.
> >
> > We shouldn't develop our product only for users who are able to read
> Ignite
> > sources to decrypt the fail reason behind "starvation in stripped pool"
> >
> > Some of my questions remain unanswered :) :
> >
> > 1. How user can know it's an Ignite bug? Where this bug should be
> reported?
> > 2. Do we log it somewhere?
> > 3. Do we warn user before shutdown several times?
> > 4. "starvation in stripped pool" I think it's not clear error message.
> > Let's make it more specific!
> > 5. Let's write to the user log - what he or she should do to prevent this
> > error in future?
> >
> >
> > вт, 26 мар. 2019 г. в 12:13, Andrey Kuznetsov :
> >
> > > Nikolay,
> > >
> > > >  Why we can't restart some thread?
> > > Technically, we can. It's just matter of design: the thread can be made
> > > non-critical, and we can restart it every time it dies. But such design
> > > looks poor to me. It's much simpler to catch and handle all exceptions
> in
> > > critical threads. Failure handling is a last-chance tool that reveals
> > > internal Ignite errors. It's not pleasant for us when users see these
> > > errors, but it's better than hiding.
> > >
> > > >  Actually, distributed systems are designed to overcome some bugs,
> > thread
> > > failure, node failure, for example, isn't it?
> > > 100% agree with you: overcome, but not hide.
> > >
> > > >  How user can know it's a bug? Where this bug should be reported?
> > > As far as I see from user-list messages, our users are qualified enough
> > to
> > > provide necessary information from their cluster-wide logs.
> > >
> > >
> > > вт, 26 мар. 2019 г. в 11:19, Nikolay Izhikov :
> > >
> > > > Andrey.
> > > >
> > > > > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no
> use
> > > to
> > > > wait for dead thread's magical resurrection.
> > > >
> > > > Why is it unrecoverable?
> > > > Why we can't restart some thread?
> > > > Is there some kind of nature limitation to not restart system thread?
> > > >
> > > > Actually, distributed systems are designed to overcome some bugs,
> > thread
> > > > failure, node failure, for example, isn't it?
> > > > > if under some circumstances node> stop leads to cascade cluster
> > crash,
> > > > then it's a bug
> > > >
> > > > How user can know it's a bug? Where this bug should be reported?
> > > > Do we log it somewhere?
> > > > Do we warn user before shutdown one or several times?
> > > >
> > > > This feature kills user experience literally now.
> > > >
> > > > If I would be a user of the product that just shutdown with poor log
> I
> > > > would throw this product away.
> > > > Do we want it for Ignite?
> > > >
> > > > From SO discussion I see following error message: ": >>> Possible
> > > > starvation in striped pool."
> > > > Are you sure this message are clear for Ignite user(not Ignite
> hacker)?
> > > > What 

Re: GridDhtInvalidPartitionException takes the cluster down

2019-03-26 Thread Roman Shtykh
I do believe failure handling is useful, but it has to be revisited (including 
the above-mentioned suggestions) because what we have now is not what Ignite 
promises to do. Disabling it can be a temporary measure until it is improved.

Andrey, when you say "hiding", I kind of understand you (even if I don't think 
we hide), but with the current behavior it's like running stress tests on users' 
clusters -- any serious situation/bug can crash the cluster and, in turn, the 
trust in Ignite.

I think this discussion reveals another problem -- we might need something like 
Jepsen tests, which would hopefully help us find such issues. AFAIK, CockroachDB 
has had them running for a couple of years.
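
For completeness, relaxing or disabling the handler is already possible on the user side via configuration; a sketch, assuming the public failure-handler API and that setIgnoredFailureTypes is available in the target version:

// Sketch: relax failure handling per node instead of patching Ignite itself.
IgniteConfiguration cfg = new IgniteConfiguration();

// Option 1: keep the handler, but ignore worker termination/blocking.
StopNodeOrHaltFailureHandler hnd = new StopNodeOrHaltFailureHandler();
hnd.setIgnoredFailureTypes(EnumSet.of(
    FailureType.SYSTEM_WORKER_TERMINATION,
    FailureType.SYSTEM_WORKER_BLOCKED));
cfg.setFailureHandler(hnd);

// Option 2: only log failures and never stop the node.
// cfg.setFailureHandler(new NoOpFailureHandler());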

-- Roman
 

On Tuesday, March 26, 2019, 8:24:24 p.m. GMT+9, Andrey Kuznetsov 
 wrote:  
 
 Nikolay,

Feel free to suggest better error messages to indicate internal/critical
failures. User actions in response to critical failures are rather limited:
mail to user-list or maybe file an issue. As for repetitive warnings, it
makes sense, but requires additional stuff to deliver such signals, mere
spamming to log will not have an effect.

Anyway, when experienced committers suggest to disable failure handling and
hide existing issues, I feel as if they are pulling my leg.

Best regards,
Andrey Kuznetsov.

вт, 26 марта 2019, 13:30 Nikolay Izhikov nizhi...@apache.org:

> Andrey.
>
> >  the thread can be made non-critical, and we can restart it every time it
> dies
>
> Why we can't restart critical thread?
> What is the root difference between critical and non critical threads?
>
> > It's much simpler to catch and handle all exceptions in critical threads
>
> I don't agree with you.
> We develop Ignite not because it simple!
> We must spend extra time to made it robust and resilient to the failures.
>
> > Failure handling is a last-chance tool that reveals internal Ignite
> errors
> > 100% agree with you: overcome, but not hide.
>
> Logging stack trace with proper explanation is not hiding.
> Killing nodes and whole cluster is not "handling".
>
> > As far as I see from user-list messages, our users are qualified enough
> to provide necessary information from their cluster-wide logs.
>
> We shouldn't develop our product only for users who are able to read Ignite
> sources to decrypt the fail reason behind "starvation in stripped pool"
>
> Some of my questions remain unanswered :) :
>
> 1. How user can know it's an Ignite bug? Where this bug should be reported?
> 2. Do we log it somewhere?
> 3. Do we warn user before shutdown several times?
> 4. "starvation in stripped pool" I think it's not clear error message.
> Let's make it more specific!
> 5. Let's write to the user log - what he or she should do to prevent this
> error in future?
>
>
> вт, 26 мар. 2019 г. в 12:13, Andrey Kuznetsov :
>
> > Nikolay,
> >
> > >  Why we can't restart some thread?
> > Technically, we can. It's just matter of design: the thread can be made
> > non-critical, and we can restart it every time it dies. But such design
> > looks poor to me. It's much simpler to catch and handle all exceptions in
> > critical threads. Failure handling is a last-chance tool that reveals
> > internal Ignite errors. It's not pleasant for us when users see these
> > errors, but it's better than hiding.
> >
> > >  Actually, distributed systems are designed to overcome some bugs,
> thread
> > failure, node failure, for example, isn't it?
> > 100% agree with you: overcome, but not hide.
> >
> > >  How user can know it's a bug? Where this bug should be reported?
> > As far as I see from user-list messages, our users are qualified enough
> to
> > provide necessary information from their cluster-wide logs.
> >
> >
> > вт, 26 мар. 2019 г. в 11:19, Nikolay Izhikov :
> >
> > > Andrey.
> > >
> > > > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use
> > to
> > > wait for dead thread's magical resurrection.
> > >
> > > Why is it unrecoverable?
> > > Why we can't restart some thread?
> > > Is there some kind of nature limitation to not restart system thread?
> > >
> > > Actually, distributed systems are designed to overcome some bugs,
> thread
> > > failure, node failure, for example, isn't it?
> > > > if under some circumstances node> stop leads to cascade cluster
> crash,
> > > then it's a bug
> > >
> > > How user can know it's a bug? Where this bug should be reported?
> > > Do we log it somewhere?
> > > Do we warn user before shutdown one or several times?
> > >
> > > This feature kills user experience literally now.
> > >
> > > If I would be a user of the product that just shutdown with poor log I
> > > would throw this product away.
> > > Do we want it for Ignite?
> > >
> > > From SO discussion I see following error message: ": >>> Possible
> > > starvation in striped pool."
> > > Are you sure this message are clear for Ignite user(not Ignite hacker)?
> > > What user should do to prevent this error in future?
> > >
> > > В Вт, 26/03/2019 в 10:10 +0300, Andrey Kuznetsov 

[jira] [Created] (IGNITE-11629) Cassandra dependencies missing from deliverable

2019-03-26 Thread Ilya Kasnacheev (JIRA)
Ilya Kasnacheev created IGNITE-11629:


 Summary: Cassandra dependencies missing from deliverable
 Key: IGNITE-11629
 URL: https://issues.apache.org/jira/browse/IGNITE-11629
 Project: Ignite
  Issue Type: Bug
  Components: cassandra
Affects Versions: 2.7
Reporter: Ilya Kasnacheev
Assignee: Ilya Kasnacheev


After IGNITE-9046 we lack an explicit netty-resolver dependency for the 
ignite-cassandra-store module. This means that tests still run and the project can be 
made to work by fixing dependencies, but the apache-ignite-bin deliverable's 
libs/optional/ignite-cassandra-store does not contain all required dependencies, 
since we only put explicit ones there.

Need to add this dependency explicitly and check that it works.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-11628) Document the possibility to use JAR files in UriDeploymentSpi

2019-03-26 Thread Denis Mekhanikov (JIRA)
Denis Mekhanikov created IGNITE-11628:
-

 Summary: Document the possibility to use JAR files in 
UriDeploymentSpi
 Key: IGNITE-11628
 URL: https://issues.apache.org/jira/browse/IGNITE-11628
 Project: Ignite
  Issue Type: Task
  Components: documentation
Reporter: Denis Mekhanikov
Assignee: Artem Budnikov
 Fix For: 2.8


{{UriDeploymentSpi}} got the ability to support regular JAR files along with 
GARs in https://issues.apache.org/jira/browse/IGNITE-11380
This ability should be reflected in the documentation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: GridDhtInvalidPartitionException takes the cluster down

2019-03-26 Thread Andrey Kuznetsov
Nikolay,

Feel free to suggest better error messages to indicate internal/critical
failures. User actions in response to critical failures are rather limited:
mail the user list or maybe file an issue. As for repetitive warnings, it
makes sense, but requires additional machinery to deliver such signals; mere
spamming to the log will not have an effect.

Anyway, when experienced committers suggest disabling failure handling and
hiding existing issues, I feel as if they are pulling my leg.

Best regards,
Andrey Kuznetsov.

вт, 26 марта 2019, 13:30 Nikolay Izhikov nizhi...@apache.org:

> Andrey.
>
> >  the thread can be made non-critical, and we can restart it every time it
> dies
>
> Why we can't restart critical thread?
> What is the root difference between critical and non critical threads?
>
> > It's much simpler to catch and handle all exceptions in critical threads
>
> I don't agree with you.
> We develop Ignite not because it simple!
> We must spend extra time to made it robust and resilient to the failures.
>
> > Failure handling is a last-chance tool that reveals internal Ignite
> errors
> > 100% agree with you: overcome, but not hide.
>
> Logging stack trace with proper explanation is not hiding.
> Killing nodes and whole cluster is not "handling".
>
> > As far as I see from user-list messages, our users are qualified enough
> to provide necessary information from their cluster-wide logs.
>
> We shouldn't develop our product only for users who are able to read Ignite
> sources to decrypt the fail reason behind "starvation in stripped pool"
>
> Some of my questions remain unanswered :) :
>
> 1. How user can know it's an Ignite bug? Where this bug should be reported?
> 2. Do we log it somewhere?
> 3. Do we warn user before shutdown several times?
> 4. "starvation in stripped pool" I think it's not clear error message.
> Let's make it more specific!
> 5. Let's write to the user log - what he or she should do to prevent this
> error in future?
>
>
> вт, 26 мар. 2019 г. в 12:13, Andrey Kuznetsov :
>
> > Nikolay,
> >
> > >  Why we can't restart some thread?
> > Technically, we can. It's just matter of design: the thread can be made
> > non-critical, and we can restart it every time it dies. But such design
> > looks poor to me. It's much simpler to catch and handle all exceptions in
> > critical threads. Failure handling is a last-chance tool that reveals
> > internal Ignite errors. It's not pleasant for us when users see these
> > errors, but it's better than hiding.
> >
> > >  Actually, distributed systems are designed to overcome some bugs,
> thread
> > failure, node failure, for example, isn't it?
> > 100% agree with you: overcome, but not hide.
> >
> > >  How user can know it's a bug? Where this bug should be reported?
> > As far as I see from user-list messages, our users are qualified enough
> to
> > provide necessary information from their cluster-wide logs.
> >
> >
> > вт, 26 мар. 2019 г. в 11:19, Nikolay Izhikov :
> >
> > > Andrey.
> > >
> > > > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use
> > to
> > > wait for dead thread's magical resurrection.
> > >
> > > Why is it unrecoverable?
> > > Why we can't restart some thread?
> > > Is there some kind of nature limitation to not restart system thread?
> > >
> > > Actually, distributed systems are designed to overcome some bugs,
> thread
> > > failure, node failure, for example, isn't it?
> > > > if under some circumstances node> stop leads to cascade cluster
> crash,
> > > then it's a bug
> > >
> > > How user can know it's a bug? Where this bug should be reported?
> > > Do we log it somewhere?
> > > Do we warn user before shutdown one or several times?
> > >
> > > This feature kills user experience literally now.
> > >
> > > If I would be a user of the product that just shutdown with poor log I
> > > would throw this product away.
> > > Do we want it for Ignite?
> > >
> > > From SO discussion I see following error message: ": >>> Possible
> > > starvation in striped pool."
> > > Are you sure this message are clear for Ignite user(not Ignite hacker)?
> > > What user should do to prevent this error in future?
> > >
> > > В Вт, 26/03/2019 в 10:10 +0300, Andrey Kuznetsov пишет:
> > > > By default, SYSTEM_WORKER_BLOCKED failure type is not handled. I
> don't
> > > like
> > > > this behavior, but it may be useful sometimes: "frozen" threads have
> a
> > > > chance to become active again after load decreases. As for
> > > > SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use to
> wait
> > > for
> > > > dead thread's magical resurrection. Then, if under some circumstances
> > > node
> > > > stop leads to cascade cluster crash, then it's a bug, and it should
> be
> > > > fixed. Once and for all. Instead of hiding the flaw we have in the
> > > product.
> > > >
> > > > вт, 26 мар. 2019 г. в 09:17, Roman Shtykh  >:
> > > >
> > > > > + 1 for having the default settings revisited.
> > > > > I understand Andrey's reasonings, but sometimes 

Ignite 2.7.5 Release scope

2019-03-26 Thread Zhenya Stanilovsky
I suppose this ticket [1] is very useful too.


[1] https://issues.apache.org/jira/browse/IGNITE-10873 [ CorruptedTreeException 
during simultaneous cache put operations ]

>
>
>--- Forwarded message ---
>From: "Alexey Goncharuk" < alexey.goncha...@gmail.com >
>To: dev < dev@ignite.apache.org >
>Cc:
>Subject: Re: Ignite 2.7.5 Release scope
>Date: Tue, 26 Mar 2019 13:42:59 +0300
>
>Hello Ilya,
>
>I do not see any issues with the mentioned test. I see the following output
>in the logs:
>
>[21:41:44] : [Step 4/5] [2019-03-22 21:41:44,970][INFO ][main][root] >>>
>Stopping test:
>TcpDiscoveryCoordinatorFailureTest#testCoordinatorFailedNoAddFinishedMessageStartOneNode
>in 37768 ms <<<
>[21:41:44] : [Step 4/5] [2019-03-22 21:41:44,971][INFO ][main][root] >>>
>Stopping test class: TcpDiscoveryCoordinatorFailureTest <<<
>[21:41:44] : [Step 4/5] [2019-03-22 21:41:44,971][INFO ][main][root] >>>
>Starting test class: IgniteClientConnectTest <<<
>
>The issue with Windows may be long connection timeouts, in this case we
>should either split the suite into multiple ones or decrease the SPI
>timeouts.
>
>пн, 25 мар. 2019 г. в 11:24, Ilya Kasnacheev < ilya.kasnach...@gmail.com >:
>
>> Hello!
>>
>> It seems that I can no longer test this case, on account of
>>
>> TcpDiscoveryCoordinatorFailureTest#testClusterFailedNewCoordinatorInitialized
>> hanging every time under Java 11 on Windows.
>>
>> Alexey, Ivan, can you please take a look?
>>
>>
>>  
>> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__
>>
>> Regards,
>>
>> --
>> Ilya Kasnacheev
>>
>>
>> пт, 22 мар. 2019 г. в 16:59, Ilya Kasnacheev < ilya.kasnach...@gmail.com >:
>>
>> > Hello!
>> >
>> > Basically there is a test that explicitly highlights this problem, that
>> is
>> > running SSL tests on Windows + Java 11. They will hang on Master but 
>> pass
>> > with this patch.
>> >
>> > I have started that on TC, results will probably be available later
>> today:
>> >
>> >
>>  
>> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__
>> > (mind the Java version).
>> >
>> > Regards,
>> > --
>> > Ilya Kasnacheev
>> >
>> >
>> > пт, 22 мар. 2019 г. в 14:13, Maxim Muzafarov < maxmu...@gmail.com >:
>> >
>> >> Dmitry, Ilya,
>> >>
>> >> Yes, I've looked through those changes [1] as they can affect my local
>> >> PR.  Basically, changes look good to me.
>> >>
>> >> I'm not an expert with CommunicationSpi component, so can miss some
>> >> details and I haven't tested these changes under Java 11. One more
>> >> thing I'd like to say, I would add additional tests to PR that will
>> >> explicitly highlight the problem being solved.
>> >>
>> >>
>> >> [1]  https://issues.apache.org/jira/browse/IGNITE-11299
>> >>
>> >> On Thu, 21 Mar 2019 at 22:57, Dmitriy Pavlov < dpav...@apache.org >
>> wrote:
>> >> >
>> >> > Hi Igniters,
>> >> >
>> >> > fix  https://issues.apache.org/jira/browse/IGNITE-11299 Avoid busy
>> wait
>> >> on
>> >> > processWrite during SSL handshake.
>> >> > seems to be blocker cause it is related to Java 11
>> >> >
>> >> > I see Maxim M left some comments. Ilya K., Maxim M.were these 
>> comments
>> >> > addressed?
>> >> >
>> >> > The ticket is in Patch Available. Reviewer needed. Changes located 
>> in
>> >> > GridNioServer.
>> >> >
>> >> > Sincerely,
>> >> > Dmitriy Pavlov
>> >> >
>> >> > P.S. a quite obvious ticket came to sope, as well:
>> >> >  https://issues.apache.org/jira/browse/IGNITE-11600
>> >> >
>> >> >
>> >> > чт, 21 мар. 2019 г. в 16:55, Petr Ivanov < mr.wei...@gmail.com >:
>> >> >
>> >> > > Huge +1
>> >> > >
>> >> > > Will try to add new JDK in nearest time to our Teamcity.
>> >> > >
>> >> > >
>> >> > > > On 21 Mar 2019, at 16:27, Dmitriy Pavlov < dpav...@apache.org >
>> >> wrote:
>> >> > > >
>> >> > > > Hi Igniters,
>> >> > > >
>> >> > > > Meanwhile, Java 12 GA is available. I suggest at least test our
>> new
>> >> tests
>> >> > > > scripts with a couple of Java builds. WDYT?
>> >> > > >
>> >> > > > Sincerely,
>> >> > > > Dmitriy Pavlov
>> >> > > >
>> >> > > > ср, 20 мар. 2019 г. в 19:21, Dmitriy Pavlov 
>> < dpav...@apache.org >:
>> >> > > >
>> >> > > >> Hi Ignite Developers,
>> >> > > >>
>> >> > > >> In a separate discussion, I've shared a log with all commits.
>> >> > > >>
>> >> > > >> As far as I can see, nobody removed commits from this sheet, so
>> the
>> >> > > scope
>> >> > > >> of release will be discussed in another way: only explicitly
>> >> declared
>> >> > > >> commits will be cherry-picked.
>> >> > > >>
>> >> > > >> Sincerely,
>> >> > > >> Dmitriy Pavlov
>> >> > > >>
>> >> > >
>> >> > >
>> >>
>> >


-- 
Zhenya Stanilovsky


Re: Ignite 2.7.5 Release scope

2019-03-26 Thread Alexey Goncharuk
Hello Ilya,

I do not see any issues with the mentioned test. I see the following output
in the logs:

[21:41:44] : [Step 4/5] [2019-03-22 21:41:44,970][INFO ][main][root] >>>
Stopping test:
TcpDiscoveryCoordinatorFailureTest#testCoordinatorFailedNoAddFinishedMessageStartOneNode
in 37768 ms <<<
[21:41:44] : [Step 4/5] [2019-03-22 21:41:44,971][INFO ][main][root] >>>
Stopping test class: TcpDiscoveryCoordinatorFailureTest <<<
[21:41:44] : [Step 4/5] [2019-03-22 21:41:44,971][INFO ][main][root] >>>
Starting test class: IgniteClientConnectTest <<<

The issue with Windows may be long connection timeouts; in this case we
should either split the suite into multiple ones or decrease the SPI
timeouts.
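
A sketch of the timeout option (standard TcpDiscoverySpi setters; the concrete values are only an illustration):

// Sketch: tighter discovery timeouts so a stuck connection fails fast instead of hanging the suite.
TcpDiscoverySpi spi = new TcpDiscoverySpi();
spi.setSocketTimeout(2_000);   // ms
spi.setAckTimeout(2_000);
spi.setNetworkTimeout(5_000);
spi.setJoinTimeout(10_000);

IgniteConfiguration cfg = new IgniteConfiguration()
    .setFailureDetectionTimeout(10_000)
    .setDiscoverySpi(spi);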

пн, 25 мар. 2019 г. в 11:24, Ilya Kasnacheev :

> Hello!
>
> It seems that I can no longer test this case, on account of
>
> TcpDiscoveryCoordinatorFailureTest#testClusterFailedNewCoordinatorInitialized
> hanging every time under Java 11 on Windows.
>
> Alexey, Ivan, can you please take a look?
>
>
> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__
>
> Regards,
>
> --
> Ilya Kasnacheev
>
>
> пт, 22 мар. 2019 г. в 16:59, Ilya Kasnacheev :
>
> > Hello!
> >
> > Basically there is a test that explicitly highlights this problem, that
> is
> > running SSL tests on Windows + Java 11. They will hang on Master but pass
> > with this patch.
> >
> > I have started that on TC, results will probably be available later
> today:
> >
> >
> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__
> > (mind the Java version).
> >
> > Regards,
> > --
> > Ilya Kasnacheev
> >
> >
> > пт, 22 мар. 2019 г. в 14:13, Maxim Muzafarov :
> >
> >> Dmitry, Ilya,
> >>
> >> Yes, I've looked through those changes [1] as they can affect my local
> >> PR.  Basically, changes look good to me.
> >>
> >> I'm not an expert with CommunicationSpi component, so can miss some
> >> details and I haven't tested these changes under Java 11. One more
> >> thing I'd like to say, I would add additional tests to PR that will
> >> explicitly highlight the problem being solved.
> >>
> >>
> >> [1] https://issues.apache.org/jira/browse/IGNITE-11299
> >>
> >> On Thu, 21 Mar 2019 at 22:57, Dmitriy Pavlov 
> wrote:
> >> >
> >> > Hi Igniters,
> >> >
> >> > fix https://issues.apache.org/jira/browse/IGNITE-11299 Avoid busy
> wait
> >> on
> >> > processWrite during SSL handshake.
> >> > seems to be blocker cause it is related to Java 11
> >> >
> >> > I see Maxim M left some comments. Ilya K., Maxim M.were these comments
> >> > addressed?
> >> >
> >> > The ticket is in Patch Available. Reviewer needed. Changes located in
> >> > GridNioServer.
> >> >
> >> > Sincerely,
> >> > Dmitriy Pavlov
> >> >
> >> > P.S. a quite obvious ticket came to sope, as well:
> >> > https://issues.apache.org/jira/browse/IGNITE-11600
> >> >
> >> >
> >> > чт, 21 мар. 2019 г. в 16:55, Petr Ivanov :
> >> >
> >> > > Huge +1
> >> > >
> >> > > Will try to add new JDK in nearest time to our Teamcity.
> >> > >
> >> > >
> >> > > > On 21 Mar 2019, at 16:27, Dmitriy Pavlov 
> >> wrote:
> >> > > >
> >> > > > Hi Igniters,
> >> > > >
> >> > > > Meanwhile, Java 12 GA is available. I suggest at least test our
> new
> >> tests
> >> > > > scripts with a couple of Java builds. WDYT?
> >> > > >
> >> > > > Sincerely,
> >> > > > Dmitriy Pavlov
> >> > > >
> >> > > > ср, 20 мар. 2019 г. в 19:21, Dmitriy Pavlov :
> >> > > >
> >> > > >> Hi Ignite Developers,
> >> > > >>
> >> > > >> In a separate discussion, I've shared a log with all commits.
> >> > > >>
> >> > > >> As far as I can see, nobody removed commits from this sheet, so
> the
> >> > > scope
> >> > > >> of release will be discussed in another way: only explicitly
> >> declared
> >> > > >> commits will be cherry-picked.
> >> > > >>
> >> > > >> Sincerely,
> >> > > >> Dmitriy Pavlov
> >> > > >>
> >> > >
> >> > >
> >>
> >
>


Re: GridDhtInvalidPartitionException takes the cluster down

2019-03-26 Thread Nikolay Izhikov
Andrey.

>  the thread can be made non-critical, and we can restart it every time it
dies

Why can't we restart a critical thread?
What is the root difference between critical and non-critical threads?

> It's much simpler to catch and handle all exceptions in critical threads

I don't agree with you.
We don't develop Ignite because it is simple!
We must spend extra time to make it robust and resilient to failures.

> Failure handling is a last-chance tool that reveals internal Ignite errors
> 100% agree with you: overcome, but not hide.

Logging a stack trace with a proper explanation is not hiding.
Killing nodes and the whole cluster is not "handling".

> As far as I see from user-list messages, our users are qualified enough
to provide necessary information from their cluster-wide logs.

We shouldn't develop our product only for users who are able to read the Ignite
sources to decrypt the failure reason behind "starvation in striped pool".

Some of my questions remain unanswered :) :

1. How can a user know it's an Ignite bug? Where should this bug be reported?
2. Do we log it somewhere?
3. Do we warn the user several times before shutdown?
4. "Starvation in striped pool" is not a clear error message, I think.
Let's make it more specific!
5. Let's write to the user log what he or she should do to prevent this
error in the future.


вт, 26 мар. 2019 г. в 12:13, Andrey Kuznetsov :

> Nikolay,
>
> >  Why we can't restart some thread?
> Technically, we can. It's just matter of design: the thread can be made
> non-critical, and we can restart it every time it dies. But such design
> looks poor to me. It's much simpler to catch and handle all exceptions in
> critical threads. Failure handling is a last-chance tool that reveals
> internal Ignite errors. It's not pleasant for us when users see these
> errors, but it's better than hiding.
>
> >  Actually, distributed systems are designed to overcome some bugs, thread
> failure, node failure, for example, isn't it?
> 100% agree with you: overcome, but not hide.
>
> >  How user can know it's a bug? Where this bug should be reported?
> As far as I see from user-list messages, our users are qualified enough to
> provide necessary information from their cluster-wide logs.
>
>
> вт, 26 мар. 2019 г. в 11:19, Nikolay Izhikov :
>
> > Andrey.
> >
> > > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use
> to
> > wait for dead thread's magical resurrection.
> >
> > Why is it unrecoverable?
> > Why we can't restart some thread?
> > Is there some kind of nature limitation to not restart system thread?
> >
> > Actually, distributed systems are designed to overcome some bugs, thread
> > failure, node failure, for example, isn't it?
> > > if under some circumstances node> stop leads to cascade cluster crash,
> > then it's a bug
> >
> > How user can know it's a bug? Where this bug should be reported?
> > Do we log it somewhere?
> > Do we warn user before shutdown one or several times?
> >
> > This feature kills user experience literally now.
> >
> > If I would be a user of the product that just shutdown with poor log I
> > would throw this product away.
> > Do we want it for Ignite?
> >
> > From SO discussion I see following error message: ": >>> Possible
> > starvation in striped pool."
> > Are you sure this message are clear for Ignite user(not Ignite hacker)?
> > What user should do to prevent this error in future?
> >
> > В Вт, 26/03/2019 в 10:10 +0300, Andrey Kuznetsov пишет:
> > > By default, SYSTEM_WORKER_BLOCKED failure type is not handled. I don't
> > like
> > > this behavior, but it may be useful sometimes: "frozen" threads have a
> > > chance to become active again after load decreases. As for
> > > SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use to wait
> > for
> > > dead thread's magical resurrection. Then, if under some circumstances
> > node
> > > stop leads to cascade cluster crash, then it's a bug, and it should be
> > > fixed. Once and for all. Instead of hiding the flaw we have in the
> > product.
> > >
> > > вт, 26 мар. 2019 г. в 09:17, Roman Shtykh :
> > >
> > > > + 1 for having the default settings revisited.
> > > > I understand Andrey's reasonings, but sometimes taking nodes down is
> > too
> > > > radical (as in my case it was GridDhtInvalidPartitionException which
> > could
> > > > be ignored for a while when rebalancing <- I might be wrong here).
> > > >
> > > > -- Roman
> > > >
> > > >
> > > > On Tuesday, March 26, 2019, 2:52:14 p.m. GMT+9, Denis Magda <
> > > > dma...@apache.org> wrote:
> > > >
> > > > pNikolay,
> > > > Thanks for kicking off this discussion. Surprisingly, planned to
> start
> > a
> > > > similar one today and incidentally came across this thread.
> > > > Agree that the failure handler should be off by default or the
> default
> > > > settings have to be revisited. That's true that people are
> complaining
> > of
> > > > nodes shutdowns even on moderate workloads. For instance, that's the
> > most
> > > > recent 

[jira] [Created] (IGNITE-11627) Test CheckpointFreeListTest.testRestoreFreeListCorrectlyAfterRandomStop always fails in DiskCompression suite

2019-03-26 Thread Anton Kalashnikov (JIRA)
Anton Kalashnikov created IGNITE-11627:
--

 Summary: Test 
CheckpointFreeListTest.testRestoreFreeListCorrectlyAfterRandomStop always fails 
in DiskCompression suite
 Key: IGNITE-11627
 URL: https://issues.apache.org/jira/browse/IGNITE-11627
 Project: Ignite
  Issue Type: Bug
Reporter: Anton Kalashnikov
Assignee: Anton Kalashnikov


https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=5828425958400232265=testDetails_IgniteTests24Java8=%3Cdefault%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Ignite 2.7.5 Release scope

2019-03-26 Thread Ilya Kasnacheev
Hello!

If you ask me, I vote +0.5 as well. I am not entirely confident, but I answer
a huge volume of questions on the user list which boil down to premature
SYSTEM_WORKER_TERMINATION.

Regards,
-- 
Ilya Kasnacheev


вт, 26 мар. 2019 г. в 11:24, Dmitriy Pavlov :

> +0.5 from me from release point of view. If community agrees with solution,
> I can cherry pick fix later.
>
> вт, 26 мар. 2019 г., 8:59 Roman Shtykh :
>
> > Andrey, hmm, I don't think putting back the behavior (if it's safe) we
> > used to have with all those exceptions being logged etc. is hiding. I
> would
> > never propose something like that.
> > Btw, I have fixed the issue. If it looks good let's merge.
> >
> > -- Roman
> >
> >
> > On Tuesday, March 26, 2019, 2:46:08 p.m. GMT+9, Andrey Kuznetsov <
> > stku...@gmail.com> wrote:
> >
> >  Roman, I think the worst thing we can do is to hide the bug you
> > discovered. The sane options are either fix it urgently or classify it as
> > non-critical and postpone.
> > вт, 26 мар. 2019 г. в 05:13, Roman Shtykh :
> >
> > Guys, what do you think about disabling SYSTEM_WORKER_TERMINATION
> > (introduced with IEP-14) before "cluster shutdown" bugs are fixed, as
> > suggested by Nikolay I. in "GridDhtInvalidPartitionException takes the
> > cluster down" thread?
> >
> > -- Roman
> >
> >
> > On Tuesday, March 26, 2019, 3:41:29 a.m. GMT+9, Dmitriy Pavlov <
> > dpav...@apache.org> wrote:
> >
> >  Hi Ignite Developers,
> >
> > So because nobody raised any feature I would like to call for scope
> freeze
> > for 2.7.5.
> >
> > The scope is limited with corruption fix, Java 11 issues addressed.
> > https://cwiki.apache.org/confluence/display/IGNITE/Apache+Ignite+2.7.5
> >
> > Also, launch scripts will be tested for Java 12.
> >
> > We entered the Rampdown phase. See more info in
> > https://cwiki.apache.org/confluence/display/IGNITE/Release+Process
> >
> > Issues can be added to the scope only through discussion.
> >
> > Sincerely,
> > Dmitriy Pavlov
> >
> > пн, 25 мар. 2019 г. в 11:24, Ilya Kasnacheev  >:
> >
> > > Hello!
> > >
> > > It seems that I can no longer test this case, on account of
> > >
> > >
> >
> TcpDiscoveryCoordinatorFailureTest#testClusterFailedNewCoordinatorInitialized
> > > hanging every time under Java 11 on Windows.
> > >
> > > Alexey, Ivan, can you please take a look?
> > >
> > >
> > >
> >
> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__
> > >
> > > Regards,
> > >
> > > --
> > > Ilya Kasnacheev
> > >
> > >
> > > пт, 22 мар. 2019 г. в 16:59, Ilya Kasnacheev <
> ilya.kasnach...@gmail.com
> > >:
> > >
> > > > Hello!
> > > >
> > > > Basically there is a test that explicitly highlights this problem,
> that
> > > is
> > > > running SSL tests on Windows + Java 11. They will hang on Master but
> > pass
> > > > with this patch.
> > > >
> > > > I have started that on TC, results will probably be available later
> > > today:
> > > >
> > > >
> > >
> >
> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__
> > > > (mind the Java version).
> > > >
> > > > Regards,
> > > > --
> > > > Ilya Kasnacheev
> > > >
> > > >
> > > > пт, 22 мар. 2019 г. в 14:13, Maxim Muzafarov :
> > > >
> > > >> Dmitry, Ilya,
> > > >>
> > > >> Yes, I've looked through those changes [1] as they can affect my
> local
> > > >> PR.  Basically, changes look good to me.
> > > >>
> > > >> I'm not an expert with CommunicationSpi component, so can miss some
> > > >> details and I haven't tested these changes under Java 11. One more
> > > >> thing I'd like to say, I would add additional tests to PR that will
> > > >> explicitly highlight the problem being solved.
> > > >>
> > > >>
> > > >> [1] https://issues.apache.org/jira/browse/IGNITE-11299
> > > >>
> > > >> On Thu, 21 Mar 2019 at 22:57, Dmitriy Pavlov 
> > > wrote:
> > > >> >
> > > >> > Hi Igniters,
> > > >> >
> > > >> > fix https://issues.apache.org/jira/browse/IGNITE-11299 Avoid busy
> > > wait
> > > >> on
> > > >> > processWrite during SSL handshake.
> > > >> > seems to be blocker cause it is related to Java 11
> > > >> >
> > > >> > I see Maxim M left some comments. Ilya K., Maxim M.were these
> > comments
> > > >> > addressed?
> > > >> >
> > > >> > The ticket is in Patch Available. Reviewer needed. Changes located
> > in
> > > >> > GridNioServer.
> > > >> >
> > > >> > Sincerely,
> > > >> > Dmitriy Pavlov
> > > >> >
> > > >> > P.S. a quite obvious ticket came to sope, as well:
> > > >> > https://issues.apache.org/jira/browse/IGNITE-11600
> > > >> >
> > > >> >
> > > >> > чт, 21 мар. 2019 г. в 16:55, Petr Ivanov :
> > > >> >
> > > >> > > Huge +1
> > > >> > >
> > > >> > > Will try to add new JDK in nearest time to our Teamcity.
> > > >> > >
> > > >> > >
> > > >> > > > On 21 Mar 2019, at 16:27, Dmitriy Pavlov 
> > > >> wrote:
> > > >> > > >
> > > >> > > > Hi Igniters,
> > > >> > > >
> > > >> > > > Meanwhile, 

Re: UriDeploymentSpi and GAR files

2019-03-26 Thread Ilya Kasnacheev
Hello!

This looked sensible to me so I went forward and merged this change.
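
For anyone following along, a minimal sketch of what the merged change enables, i.e. pointing UriDeploymentSpi at a directory that may now contain plain JARs (the URI value is illustrative):

// Sketch: deploy compute task classes from plain JARs located in a local directory.
UriDeploymentSpi deploymentSpi = new UriDeploymentSpi();
deploymentSpi.setUriList(Collections.singletonList("file:///opt/ignite/deployment"));

IgniteConfiguration cfg = new IgniteConfiguration()
    .setDeploymentSpi(deploymentSpi)
    .setPeerClassLoadingEnabled(false); // DeploymentSpi is used instead of peer class loading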

Regards,
-- 
Ilya Kasnacheev


пн, 25 мар. 2019 г. в 17:59, Denis Mekhanikov :

> Folks,
>
> I prepared a patch for the second ticket:
> https://github.com/apache/ignite/pull/6177
> Ilya is concerned, that if you had some JAR files, lying next to your GARs
> in a repository, which is referred to over UriDeploymentSpi, then these
> JARs will now be loaded as well. So, this is a behaviour change.
> I don't think, that this is really a problem. I don't see a simple solution
> to this, that wouldn't require an API change. And a complex change would be
> an overkill here.
> Loading what's located in the repository is pretty natural, so you
> shouldn't be surprised, when JARs start loading after an Ignite version
> upgrade.
>
> What do you think?
>
> Denis
>
> чт, 21 февр. 2019 г. в 17:48, Denis Mekhanikov :
>
> > I created the following tickets:
> >
> > https://issues.apache.org/jira/browse/IGNITE-11379 – drop support of
> GARs
> > https://issues.apache.org/jira/browse/IGNITE-11380 – support JARs
> > https://issues.apache.org/jira/browse/IGNITE-11381 – document ignite.xml
> > file format.
> >
> > Denis
> >
> > ср, 20 февр. 2019 г. в 12:30, Nikolay Izhikov :
> >
> >> Hello, Denis.
> >>
> >> > This XML may contain task descriptors, but I couldn't find any
> >> documentation on this format.
> >> > This information can be provided in simple JAR files with the same
> file
> >> structure.
> >>
> >> I support you proposal. Let's:
> >>
> >> 1. Support jar files instead of gar.
> >> 2. Write down documentation about XML config format.
> >> 3. Provide some examples.
> >>
> >> Can you crate a tickets for it?
> >>
> >>
> >> ср, 20 февр. 2019 г. в 11:49, Denis Mekhanikov :
> >>
> >> > Denis,
> >> >
> >> > This XML may contain task descriptors, but I couldn't find any
> >> > documentation on this format.
> >> > Also it may contain a userVersion [1] parameter, which can be used to
> >> force
> >> > tasks redeployment in some cases.
> >> >
> >> > This information can be provided in simple JAR files with the same
> file
> >> > structure.
> >> > There is no need to confuse people and require their packages to have
> a
> >> GAR
> >> > extension.
> >> >
> >> > Also if you don't specify the task descriptors, then all tasks in the
> >> file
> >> > will be registered.
> >> > So, I doubt, that anybody will bother specifying the descriptors. XML
> is
> >> > not very user-friendly.
> >> > This piece of configuration doesn't seem necessary to me.
> >> >
> >> > [1]
> >> >
> >> >
> >>
> https://apacheignite.readme.io/docs/deployment-modes#section-un-deployment-and-user-versions
> >> >
> >> > Denis
> >> >
> >> > ср, 20 февр. 2019 г. в 01:35, Denis Magda :
> >> >
> >> > > Denis,
> >> > >
> >> > > What was the purpose of having XML and other files within the GARs?
> >> Guess
> >> > > it was somehow versioning related - you might have several tasks of
> >> the
> >> > > same class but different versions running in a cluster.
> >> > >
> >> > > -
> >> > > Denis
> >> > >
> >> > >
> >> > > On Tue, Feb 19, 2019 at 8:40 AM Ilya Kasnacheev <
> >> > ilya.kasnach...@gmail.com
> >> > > >
> >> > > wrote:
> >> > >
> >> > > > Hello!
> >> > > >
> >> > > > Yes, I think we should accept plain JARs if anybody needs this at
> >> all.
> >> > > > Might still keep meta info support for compatibility.
> >> > > >
> >> > > > Regards,
> >> > > > --
> >> > > > Ilya Kasnacheev
> >> > > >
> >> > > >
> >> > > > вт, 19 февр. 2019 г. в 19:38, Denis Mekhanikov <
> >> dmekhani...@gmail.com
> >> > >:
> >> > > >
> >> > > > > Hi!
> >> > > > >
> >> > > > > There is a feature in Ignite called DeploymentSpi [1], that
> allows
> >> > > adding
> >> > > > > and changing implementation of compute tasks without nodes'
> >> downtime.
> >> > > > > The only usable implementation right now is UriDeploymentSpi
> [2],
> >> > which
> >> > > > > lets you provide classes of compute tasks packaged as an archive
> >> of a
> >> > > > > special form. And this special form is the worst part.
> >> > > > > GAR file is just like a JAR, but with some additional meta info.
> >> It
> >> > may
> >> > > > > contain an XML with description of tasks, a checksum and also
> >> > > > dependencies.
> >> > > > >
> >> > > > > We barely have any tools to build these files, and they can be
> >> > replaced
> >> > > > > with simple uber-JARs.
> >> > > > > The only tool we have right now is IgniteDeploymentGarAntTask,
> >> which
> >> > is
> >> > > > not
> >> > > > > documented anywhere, and it's supposed to be used from a
> >> > long-forgotten
> >> > > > > Apache Ant build system.
> >> > > > >
> >> > > > > I don't think we need this file format. How about we deprecate
> and
> >> > > remove
> >> > > > > it and make UriDeploymentSpi support plain JARs?
> >> > > > >
> >> > > > > [1] https://apacheignite.readme.io/docs/deployment-spi
> >> > > > > [2]
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> 

Re: GridDhtInvalidPartitionException takes the cluster down

2019-03-26 Thread Andrey Kuznetsov
Nikolay,

>  Why we can't restart some thread?
Technically, we can. It's just a matter of design: the thread can be made
non-critical, and we can restart it every time it dies. But such a design
looks poor to me. It's much simpler to catch and handle all exceptions in
critical threads. Failure handling is a last-chance tool that reveals
internal Ignite errors. It's not pleasant for us when users see these
errors, but it's better than hiding them.

>  Actually, distributed systems are designed to overcome some bugs, thread
failure, node failure, for example, isn't it?
100% agree with you: overcome, but not hide.

>  How user can know it's a bug? Where this bug should be reported?
As far as I see from user-list messages, our users are qualified enough to
provide necessary information from their cluster-wide logs.


вт, 26 мар. 2019 г. в 11:19, Nikolay Izhikov :

> Andrey.
>
> > As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use to
> wait for dead thread's magical resurrection.
>
> Why is it unrecoverable?
> Why we can't restart some thread?
> Is there some kind of nature limitation to not restart system thread?
>
> Actually, distributed systems are designed to overcome some bugs, thread
> failure, node failure, for example, isn't it?
> > if under some circumstances node stop leads to cascade cluster crash,
> then it's a bug
>
> How user can know it's a bug? Where this bug should be reported?
> Do we log it somewhere?
> Do we warn user before shutdown one or several times?
>
> This feature kills user experience literally now.
>
> If I would be a user of the product that just shutdown with poor log I
> would throw this product away.
> Do we want it for Ignite?
>
> From SO discussion I see following error message: ": >>> Possible
> starvation in striped pool."
> Are you sure this message are clear for Ignite user(not Ignite hacker)?
> What user should do to prevent this error in future?
>
> В Вт, 26/03/2019 в 10:10 +0300, Andrey Kuznetsov пишет:
> > By default, SYSTEM_WORKER_BLOCKED failure type is not handled. I don't
> like
> > this behavior, but it may be useful sometimes: "frozen" threads have a
> > chance to become active again after load decreases. As for
> > SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use to wait
> for
> > dead thread's magical resurrection. Then, if under some circumstances
> node
> > stop leads to cascade cluster crash, then it's a bug, and it should be
> > fixed. Once and for all. Instead of hiding the flaw we have in the
> product.
> >
> > вт, 26 мар. 2019 г. в 09:17, Roman Shtykh :
> >
> > > + 1 for having the default settings revisited.
> > > I understand Andrey's reasonings, but sometimes taking nodes down is
> too
> > > radical (as in my case it was GridDhtInvalidPartitionException which
> could
> > > be ignored for a while when rebalancing <- I might be wrong here).
> > >
> > > -- Roman
> > >
> > >
> > > On Tuesday, March 26, 2019, 2:52:14 p.m. GMT+9, Denis Magda <
> > > dma...@apache.org> wrote:
> > >
> > > Nikolay,
> > > Thanks for kicking off this discussion. Surprisingly, planned to start
> a
> > > similar one today and incidentally came across this thread.
> > > Agree that the failure handler should be off by default or the default
> > > settings have to be revisited. That's true that people are complaining
> of
> > > nodes shutdowns even on moderate workloads. For instance, that's the
> most
> > > recent feedback related to slow checkpointing:
> > >
> https://stackoverflow.com/questions/55299337/stripped-pool-starvation-in-wal-writing-causes-node-cluster-node-failure
> > >
> > > At a minimum, let's consider the following:
> > >- A failure handler needs to provide hints on how to come around the
> > > shutdown in the future. Take the checkpointing SO thread above. It's
> > > unclear from the logs how to prevent the same situation next time
> (suggest
> > > parameters for tuning, flash drives, etc).
> > >- Is there any protection for a full cluster restart? We need to
> > > distinguish a slow cluster from the stuck one. A node removal should
> not
> > > lead to a meltdown of the whole storage.
> > >- Should we enable the failure handler for things like transactions
> or
> > > PME and have it off for checkpointing and something else? Let's have it
> > > enabled for cases when we are 100% certain that a node shutdown is the
> > > right thing and print out warnings with suggestions whenever we're not
> > > confident that the removal is appropriate.
> > > --Denis
> > >
> > > On Mon, Mar 25, 2019 at 5:52 AM Andrey Gura  wrote:
> > >
> > > Failure handlers were introduced in order to avoid cluster hanging and
> > > they kill nodes instead.
> > >
> > > If critical worker was terminated by GridDhtInvalidPartitionException
> > > then your node is unable to work anymore.
> > >
> > > Unexpected cluster shutdown with reasons in logs that failure handlers
> > > provide is better than hanging. So answer is NO. We mustn't disable
> > > 

Re: Ignite 2.7.5 Release scope

2019-03-26 Thread Dmitriy Pavlov
+0.5 from me from the release point of view. If the community agrees with the
solution, I can cherry-pick the fix later.

вт, 26 мар. 2019 г., 8:59 Roman Shtykh :

> Andrey, hmm, I don't think that putting back the behavior we used to have
> (if it's safe), with all those exceptions being logged etc., is hiding. I would
> never propose something like that.
> Btw, I have fixed the issue. If it looks good, let's merge.
>
> -- Roman
>
>
> On Tuesday, March 26, 2019, 2:46:08 p.m. GMT+9, Andrey Kuznetsov <
> stku...@gmail.com> wrote:
>
>  Roman, I think the worst thing we can do is to hide the bug you
> discovered. The sane options are either fix it urgently or classify it as
> non-critical and postpone.
> вт, 26 мар. 2019 г. в 05:13, Roman Shtykh :
>
> Guys, what do you think about disabling SYSTEM_WORKER_TERMINATION
> (introduced with IEP-14) before "cluster shutdown" bugs are fixed, as
> suggested by Nikolay I. in "GridDhtInvalidPartitionException takes the
> cluster down" thread?
>
> -- Roman
>
>
> On Tuesday, March 26, 2019, 3:41:29 a.m. GMT+9, Dmitriy Pavlov <
> dpav...@apache.org> wrote:
>
>  Hi Ignite Developers,
>
> Since nobody raised any additional features, I would like to call for a scope
> freeze for 2.7.5.
>
> The scope is limited to the corruption fix, with the Java 11 issues addressed.
> https://cwiki.apache.org/confluence/display/IGNITE/Apache+Ignite+2.7.5
>
> Also, launch scripts will be tested for Java 12.
>
> We entered the Rampdown phase. See more info in
> https://cwiki.apache.org/confluence/display/IGNITE/Release+Process
>
> Issues can be added to the scope only through discussion.
>
> Sincerely,
> Dmitriy Pavlov
>
> пн, 25 мар. 2019 г. в 11:24, Ilya Kasnacheev :
>
> > Hello!
> >
> > It seems that I can no longer test this case, on account of
> >
> >
> TcpDiscoveryCoordinatorFailureTest#testClusterFailedNewCoordinatorInitialized
> > hanging every time under Java 11 on Windows.
> >
> > Alexey, Ivan, can you please take a look?
> >
> >
> >
> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__
> >
> > Regards,
> >
> > --
> > Ilya Kasnacheev
> >
> >
> > пт, 22 мар. 2019 г. в 16:59, Ilya Kasnacheev  >:
> >
> > > Hello!
> > >
> > > Basically there is a test that explicitly highlights this problem, that
> > is
> > > running SSL tests on Windows + Java 11. They will hang on Master but
> pass
> > > with this patch.
> > >
> > > I have started that on TC, results will probably be available later
> > today:
> > >
> > >
> >
> https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_SpiWindows=buildTypeStatusDiv_IgniteTests24Java8=__all_branches__
> > > (mind the Java version).
> > >
> > > Regards,
> > > --
> > > Ilya Kasnacheev
> > >
> > >
> > > пт, 22 мар. 2019 г. в 14:13, Maxim Muzafarov :
> > >
> > >> Dmitry, Ilya,
> > >>
> > >> Yes, I've looked through those changes [1] as they can affect my local
> > >> PR. Basically, the changes look good to me.
> > >>
> > >> I'm not an expert in the CommunicationSpi component, so I can miss some
> > >> details, and I haven't tested these changes under Java 11. One more
> > >> thing I'd like to say: I would add additional tests to the PR that will
> > >> explicitly highlight the problem being solved.
> > >>
> > >>
> > >> [1] https://issues.apache.org/jira/browse/IGNITE-11299
> > >>
> > >> On Thu, 21 Mar 2019 at 22:57, Dmitriy Pavlov 
> > wrote:
> > >> >
> > >> > Hi Igniters,
> > >> >
> > >> > fix https://issues.apache.org/jira/browse/IGNITE-11299 Avoid busy
> > wait
> > >> on
> > >> > processWrite during SSL handshake.
> > >> > seems to be blocker cause it is related to Java 11
> > >> >
> > >> > I see Maxim M left some comments. Ilya K., Maxim M.were these
> comments
> > >> > addressed?
> > >> >
> > >> > The ticket is in Patch Available. Reviewer needed. Changes located
> in
> > >> > GridNioServer.
> > >> >
> > >> > Sincerely,
> > >> > Dmitriy Pavlov
> > >> >
> > >> > P.S. a quite obvious ticket came into scope as well:
> > >> > https://issues.apache.org/jira/browse/IGNITE-11600
> > >> >
> > >> >
> > >> > чт, 21 мар. 2019 г. в 16:55, Petr Ivanov :
> > >> >
> > >> > > Huge +1
> > >> > >
> > >> > > Will try to add new JDK in nearest time to our Teamcity.
> > >> > >
> > >> > >
> > >> > > > On 21 Mar 2019, at 16:27, Dmitriy Pavlov 
> > >> wrote:
> > >> > > >
> > >> > > > Hi Igniters,
> > >> > > >
> > >> > > > Meanwhile, Java 12 GA is available. I suggest we at least test our new
> > >> > > > test scripts with a couple of Java builds. WDYT?
> > >> > > >
> > >> > > > Sincerely,
> > >> > > > Dmitriy Pavlov
> > >> > > >
> > >> > > > ср, 20 мар. 2019 г. в 19:21, Dmitriy Pavlov  >:
> > >> > > >
> > >> > > >> Hi Ignite Developers,
> > >> > > >>
> > >> > > >> In a separate discussion, I've shared a log with all commits.
> > >> > > >>
> > >> > > >> As far as I can see, nobody removed commits from this sheet, so
> > the
> > >> > > scope
> > >> > > >> of release will be discussed in another way: only explicitly
> 

Re: GridDhtInvalidPartitionException takes the cluster down

2019-03-26 Thread Nikolay Izhikov
Andrey.

> As for SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use to wait 
> for dead thread's magical resurrection.

Why is it unrecoverable?
Why we can't restart some thread?
Is there some kind of nature limitation to not restart system thread?

Actually, distributed systems are designed to overcome some bugs, thread 
failure, node failure, for example, isn't it?
> if under some circumstances node stop leads to cascade cluster crash, then
> it's a bug

How user can know it's a bug? Where this bug should be reported?
Do we log it somewhere?
Do we warn user before shutdown one or several times?

This feature kills user experience literally now.

If I would be a user of the product that just shutdown with poor log I would 
throw this product away.
Do we want it for Ignite?

From SO discussion I see following error message: ": >>> Possible starvation in 
striped pool."
Are you sure this message are clear for Ignite user(not Ignite hacker)?
What user should do to prevent this error in future?

В Вт, 26/03/2019 в 10:10 +0300, Andrey Kuznetsov пишет:
> By default, SYSTEM_WORKER_BLOCKED failure type is not handled. I don't like
> this behavior, but it may be useful sometimes: "frozen" threads have a
> chance to become active again after load decreases. As for
> SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use to wait for
> dead thread's magical resurrection. Then, if under some circumstances node
> stop leads to cascade cluster crash, then it's a bug, and it should be
> fixed. Once and for all. Instead of hiding the flaw we have in the product.
> 
> вт, 26 мар. 2019 г. в 09:17, Roman Shtykh :
> 
> > + 1 for having the default settings revisited.
> > I understand Andrey's reasonings, but sometimes taking nodes down is too
> > radical (as in my case it was GridDhtInvalidPartitionException which could
> > be ignored for a while when rebalancing <- I might be wrong here).
> > 
> > -- Roman
> > 
> > 
> > On Tuesday, March 26, 2019, 2:52:14 p.m. GMT+9, Denis Magda <
> > dma...@apache.org> wrote:
> > 
> > Nikolay,
> > Thanks for kicking off this discussion. Surprisingly, planned to start a
> > similar one today and incidentally came across this thread.
> > Agree that the failure handler should be off by default or the default
> > settings have to be revisited. That's true that people are complaining of
> > nodes shutdowns even on moderate workloads. For instance, that's the most
> > recent feedback related to slow checkpointing:
> > https://stackoverflow.com/questions/55299337/stripped-pool-starvation-in-wal-writing-causes-node-cluster-node-failure
> > 
> > At a minimum, let's consider the following:
> >- A failure handler needs to provide hints on how to come around the
> > shutdown in the future. Take the checkpointing SO thread above. It's
> > unclear from the logs how to prevent the same situation next time (suggest
> > parameters for tuning, flash drives, etc).
> >- Is there any protection for a full cluster restart? We need to
> > distinguish a slow cluster from the stuck one. A node removal should not
> > lead to a meltdown of the whole storage.
> >- Should we enable the failure handler for things like transactions or
> > PME and have it off for checkpointing and something else? Let's have it
> > enabled for cases when we are 100% certain that a node shutdown is the
> > right thing and print out warnings with suggestions whenever we're not
> > confident that the removal is appropriate.
> > --Denis
> > 
> > On Mon, Mar 25, 2019 at 5:52 AM Andrey Gura  wrote:
> > 
> > Failure handlers were introduced in order to avoid cluster hanging and
> > they kill nodes instead.
> > 
> > If critical worker was terminated by GridDhtInvalidPartitionException
> > then your node is unable to work anymore.
> > 
> > Unexpected cluster shutdown with reasons in logs that failure handlers
> > provide is better than hanging. So answer is NO. We mustn't disable
> > failure handlers.
> > 
> > On Mon, Mar 25, 2019 at 2:47 PM Roman Shtykh 
> > wrote:
> > > 
> > > If it sticks to the behavior we had before introducing failure handler,
> > 
> > I think it's better to have disabled instead of killing the whole cluster,
> > as in my case, and create a parent issue for those ten bugs.Pavel, thanks
> > for the suggestion!
> > > 
> > > 
> > > 
> > > On Monday, March 25, 2019, 7:07:20 p.m. GMT+9, Nikolay Izhikov <
> > 
> > nizhi...@apache.org> wrote:
> > > 
> > >  Guys.
> > > 
> > > We should fix the SYSTEM_WORKER_TERMINATION once and for all.
> > > Seems, we have ten or more "cluster shutdown" bugs with this subsystem
> > > since it was introduced.
> > > 
> > > Should we disable it by default in 2.7.5?
> > > 
> > > 
> > > пн, 25 мар. 2019 г. в 13:04, Pavel Kovalenko :
> > > 
> > > > Hi Roman,
> > > > 
> > > > I think this InvalidPartition case can be simply handled
> > > > in GridCacheTtlManager.expire method.
> > > > For workaround a custom FailureHandler can be configured that will not
> > 

Re: GridDhtInvalidPartitionException takes the cluster down

2019-03-26 Thread Andrey Kuznetsov
By default, SYSTEM_WORKER_BLOCKED failure type is not handled. I don't like
this behavior, but it may be useful sometimes: "frozen" threads have a
chance to become active again after load decreases. As for
SYSTEM_WORKER_TERMINATION, it's unrecoverable, there is no use to wait for
dead thread's magical resurrection. Then, if under some circumstances node
stop leads to cascade cluster crash, then it's a bug, and it should be
fixed. Once and for all. Instead of hiding the flaw we have in the product.

вт, 26 мар. 2019 г. в 09:17, Roman Shtykh :

> + 1 for having the default settings revisited.
> I understand Andrey's reasonings, but sometimes taking nodes down is too
> radical (as in my case it was GridDhtInvalidPartitionException which could
> be ignored for a while when rebalancing <- I might be wrong here).
>
> -- Roman
>
>
> On Tuesday, March 26, 2019, 2:52:14 p.m. GMT+9, Denis Magda <
> dma...@apache.org> wrote:
>
>  Nikolay,
> Thanks for kicking off this discussion. Surprisingly, planned to start a
> similar one today and incidentally came across this thread.
> Agree that the failure handler should be off by default or the default
> settings have to be revisited. That's true that people are complaining of
> nodes shutdowns even on moderate workloads. For instance, that's the most
> recent feedback related to slow checkpointing:
> https://stackoverflow.com/questions/55299337/stripped-pool-starvation-in-wal-writing-causes-node-cluster-node-failure
>
> At a minimum, let's consider the following:
>- A failure handler needs to provide hints on how to come around the
> shutdown in the future. Take the checkpointing SO thread above. It's
> unclear from the logs how to prevent the same situation next time (suggest
> parameters for tuning, flash drives, etc).
>- Is there any protection for a full cluster restart? We need to
> distinguish a slow cluster from the stuck one. A node removal should not
> lead to a meltdown of the whole storage.
>- Should we enable the failure handler for things like transactions or
> PME and have it off for checkpointing and something else? Let's have it
> enabled for cases when we are 100% certain that a node shutdown is the
> right thing and print out warnings with suggestions whenever we're not
> confident that the removal is appropriate.
> --Denis
>
> On Mon, Mar 25, 2019 at 5:52 AM Andrey Gura  wrote:
>
> Failure handlers were introduced in order to avoid cluster hanging and
> they kill nodes instead.
>
> If critical worker was terminated by GridDhtInvalidPartitionException
> then your node is unable to work anymore.
>
> Unexpected cluster shutdown with reasons in logs that failure handlers
> provide is better than hanging. So answer is NO. We mustn't disable
> failure handlers.
>
> On Mon, Mar 25, 2019 at 2:47 PM Roman Shtykh 
> wrote:
> >
> > If it sticks to the behavior we had before introducing failure handler,
> I think it's better to have disabled instead of killing the whole cluster,
> as in my case, and create a parent issue for those ten bugs.Pavel, thanks
> for the suggestion!
> >
> >
> >
> > On Monday, March 25, 2019, 7:07:20 p.m. GMT+9, Nikolay Izhikov <
> nizhi...@apache.org> wrote:
> >
> >  Guys.
> >
> > We should fix the SYSTEM_WORKER_TERMINATION once and for all.
> > Seems, we have ten or more "cluster shutdown" bugs with this subsystem
> > since it was introduced.
> >
> > Should we disable it by default in 2.7.5?
> >
> >
> > пн, 25 мар. 2019 г. в 13:04, Pavel Kovalenko :
> >
> > > Hi Roman,
> > >
> > > I think this InvalidPartition case can be simply handled
> > > in GridCacheTtlManager.expire method.
> > > For workaround a custom FailureHandler can be configured that will not
> stop
> > > a node in case of such exception is thrown.
> > >
> > > пн, 25 мар. 2019 г. в 08:38, Roman Shtykh :
> > >
> > > > Igniters,
> > > >
> > > > Restarting a node when injecting data and having it expired, results
> at
> > > > GridDhtInvalidPartitionException which terminates nodes with
> > > > SYSTEM_WORKER_TERMINATION one by one taking the whole cluster down.
> This
> > > is
> > > > really bad and I didn't find the way to save the cluster from
> > > disappearing.
> > > > I created a JIRA issue
> > > https://issues.apache.org/jira/browse/IGNITE-11620
> > > > with a test case. Any clues how to fix this inconsistency when
> > > rebalancing?
> > > >
> > > > -- Roman
> > > >
> > >
>
>



-- 
Best regards,
  Andrey Kuznetsov.


Re: Review IGNITE-11411 'Remove tearDown, setUp from JUnit3TestLegacySupport'

2019-03-26 Thread Павлухин Иван
Ivan,

I noticed that you updated PR [1] recently and changed the execution
flow of the setUp and tearDown methods in GridAbstractTest, making it
similar to what we have in master now. What did not work in the initial
implementation? I spent some time searching for the reason why we
introduced JUnit3TestLegacySupport and ran into trouble. If we have some
special case here, it sounds like a good idea to add the necessary
comments in the code.

[1] https://github.com/apache/ignite/pull/6227

вт, 19 мар. 2019 г. в 11:59, Ivan Fedotov :
>
> Hi Eduard.
>
> Thank you for your participation in the review. In case of any questions
> feel free to ask me.
>
> вт, 19 мар. 2019 г. в 11:04, Eduard Shangareev  >:
>
> > Hi.
> >
> > I am interested in. If nobody did it I would do it next week.
> >
> > On Tue, Mar 19, 2019 at 10:20 AM Ivan Fedotov  wrote:
> >
> > > Hi Igniters!
> > >
> > > Now I am working on IEP-30 [1], which is about the full JUnit 4->5 migration and
> > > includes some points related to the JUnit 3->4 migration.
> > > I am on the first stage and finishing the ticket about removing tearDown and
> > > setUp from JUnit3TestLegacySupport [2].
> > >
> > > In a nutshell: I removed setUp and tearDown from JUnit3TestLegacySupport and
> > > replaced them with beforeTest and afterTest in the tests where they are used. That
> > > brings us to the JUnit 5 test scenario, because setUp and tearDown are used
> > > under the Rule annotation in GridAbstractTest.
> > >
> > > Could somebody review this ticket, please?
> > >
> > > [1]
> > >
> > >
> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-30%3A+Migration+to+JUnit+5
> > > [2] https://issues.apache.org/jira/browse/IGNITE-11411
> > >
> > > --
> > > Ivan Fedotov.
> > >
> > > ivanan...@gmail.com
> > >
> >
>
>
> --
> Ivan Fedotov.
>
> ivanan...@gmail.com



-- 
Best regards,
Ivan Pavlukhin
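For context, a minimal sketch of the pattern described in the quoted message: instead of the JUnit 3 style setUp/tearDown, a test class overrides GridAbstractTest's beforeTest/afterTest callbacks and marks test methods with the JUnit 4 @Test annotation. The test class and cache name are hypothetical.

    import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;
    import org.junit.Test;

    public class ExampleSelfTest extends GridCommonAbstractTest {
        /** Replaces the JUnit 3 style setUp(): runs before every test method. */
        @Override protected void beforeTest() throws Exception {
            startGrid(0);
        }

        /** Replaces the JUnit 3 style tearDown(): runs after every test method. */
        @Override protected void afterTest() throws Exception {
            stopAllGrids();
        }

        /** JUnit 4 style test method. */
        @Test
        public void testCacheIsEmptyOnStart() throws Exception {
            assertEquals(0, grid(0).getOrCreateCache("c").size());
        }
    }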


Re: GridDhtInvalidPartitionException takes the cluster down

2019-03-26 Thread Roman Shtykh
+ 1 for having the default settings revisited.
I understand Andrey's reasonings, but sometimes taking nodes down is too 
radical (as in my case it was GridDhtInvalidPartitionException which could be 
ignored for a while when rebalancing <- I might be wrong here).

-- Roman
 

On Tuesday, March 26, 2019, 2:52:14 p.m. GMT+9, Denis Magda 
 wrote:  
 
 Nikolay,
Thanks for kicking off this discussion. Surprisingly, planned to start a 
similar one today and incidentally came across this thread.
Agree that the failure handler should be off by default or the default settings 
have to be revisited. That's true that people are complaining of nodes 
shutdowns even on moderate workloads. For instance, that's the most recent 
feedback related to slow 
checkpointing: https://stackoverflow.com/questions/55299337/stripped-pool-starvation-in-wal-writing-causes-node-cluster-node-failure

At a minimum, let's consider the following:   
   - A failure handler needs to provide hints on how to come around the 
shutdown in the future. Take the checkpointing SO thread above. It's unclear 
from the logs how to prevent the same situation next time (suggest parameters 
for tuning, flash drives, etc).
   - Is there any protection for a full cluster restart? We need to distinguish 
a slow cluster from the stuck one. A node removal should not lead to a meltdown 
of the whole storage.
   - Should we enable the failure handler for things like transactions or PME 
and have it off for checkpointing and something else? Let's have it enabled for 
cases when we are 100% certain that a node shutdown is the right thing and 
print out warnings with suggestions whenever we're not confident that the 
removal is appropriate.
--Denis

On Mon, Mar 25, 2019 at 5:52 AM Andrey Gura  wrote:

Failure handlers were introduced in order to avoid cluster hanging and
they kill nodes instead.

If critical worker was terminated by GridDhtInvalidPartitionException
then your node is unable to work anymore.

Unexpected cluster shutdown with reasons in logs that failure handlers
provide is better than hanging. So answer is NO. We mustn't disable
failure handlers.

On Mon, Mar 25, 2019 at 2:47 PM Roman Shtykh  wrote:
>
> If it sticks to the behavior we had before introducing failure handler, I 
> think it's better to have disabled instead of killing the whole cluster, as 
> in my case, and create a parent issue for those ten bugs.Pavel, thanks for 
> the suggestion!
>
>
>
>     On Monday, March 25, 2019, 7:07:20 p.m. GMT+9, Nikolay Izhikov 
> wrote:
>
>  Guys.
>
> We should fix the SYSTEM_WORKER_TERMINATION once and for all.
> Seems, we have ten or more "cluster shutdown" bugs with this subsystem
> since it was introduced.
>
> Should we disable it by default in 2.7.5?
>
>
> пн, 25 мар. 2019 г. в 13:04, Pavel Kovalenko :
>
> > Hi Roman,
> >
> > I think this InvalidPartition case can be simply handled
> > in GridCacheTtlManager.expire method.
> > For workaround a custom FailureHandler can be configured that will not stop
> > a node in case of such exception is thrown.
> >
> > пн, 25 мар. 2019 г. в 08:38, Roman Shtykh :
> >
> > > Igniters,
> > >
> > > Restarting a node when injecting data and having it expired, results at
> > > GridDhtInvalidPartitionException which terminates nodes with
> > > SYSTEM_WORKER_TERMINATION one by one taking the whole cluster down. This
> > is
> > > really bad and I didn't find the way to save the cluster from
> > disappearing.
> > > I created a JIRA issue
> > https://issues.apache.org/jira/browse/IGNITE-11620
> > > with a test case. Any clues how to fix this inconsistency when
> > rebalancing?
> > >
> > > -- Roman
> > >
> >