Re: [DISCUSSION] Ignite 3.0 and to be removed list

2019-07-16 Thread Denis Magda
Alex, Igniters, sorry for the delay. I got swamped with other duties.

Can this wait till next week? I'll make sure to dedicate some time to
it. Or, if we'd like to move faster, I'd appreciate it if someone else
steps in and prepares the list this week. I'll help review and solidify it.

-
Denis


On Tue, Jul 16, 2019 at 7:58 AM Alexey Goncharuk 
wrote:

> Denis,
>
> Are we ready to present the list to the user list?
>
> Tue, Jul 2, 2019 at 00:27, Denis Magda :
>
> > I wouldn't kick off dozens of voting discussions. Instead, the content on
> > the wiki page needs to be cleaned and rearranged. This will make the
> > content readable and comprehensible. I can do that.
> >
> > Next, let's ask the user community for an opinion. After reviewing and
> > incorporating it, we can hold one more dev list discussion with a
> > last call for opinions. Next will be the voting time. If there is a
> > feature someone from the dev list is against removing, then we can start
> > a separate vote for it later. But let's get to those cases first.
> >
> > -
> > Denis
> >
> >
> > On Mon, Jul 1, 2019 at 1:47 AM Dmitriy Pavlov 
> wrote:
> >
> > > I propose that each removal should have a separate formal vote thread with
> > > consensus approval (since it is a code modification).
> > >
> > > This means a single binding objection with justification is a blocker for
> > > removal.
> > >
> > > We need separation to let community members pick up an interesting topic
> > > from the email subject. Not all members read each post in mile-long
> > > threads carefully.
> > >
> > > Mon, Jul 1, 2019 at 11:17, Anton Vinogradov :
> > >
> > > > +1 to an email survey with the following types of votes:
> > > > - silence (agree with all proposed removals)
> > > > - we have to keep XXX because ...
> > > >
> > > > As a result, we will get the lists:
> > > > "to be removed" - no one objected
> > > > "can be removed" - a single objection
> > > > "should be kept" - multiple objections
> > > >
> > > > Denis or Dmitry Pavlov, could you please lead this thread?
> > > >
> > > > On Sat, Jun 29, 2019 at 12:27 AM Denis Magda 
> > wrote:
> > > >
> > > > > Alex,
> > > > >
> > > > > I would do an email survey to hear opinions on why someone believes
> > > > > feature A has to stay. It makes sense to ask about the APIs to be
> > > > > removed as well as the integrations to go out of community support [1]
> > > > > in the same thread.
> > > > >
> > > > > Has everyone expressed an opinion? If yes, I can go ahead and format
> > > > > the wishlist page and make it structured for the user thread.
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Ignite-Modularization-td42486.html
> > > > > -
> > > > > Denis
> > > > >
> > > > >
> > > > > On Fri, Jun 28, 2019 at 8:54 AM Alexey Goncharuk <
> > > > > alexey.goncha...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Anton, good point.
> > > > > >
> > > > > > Do you have any idea how we can keep track of the voting? Should
> we
> > > > > launch
> > > > > > a google survey or survey monkey? Voting by email?
> > > > > >
> > > > > > Fri, Jun 28, 2019 at 11:24, Anton Vinogradov :
> > > > > >
> > > > > > > Alexey,
> > > > > > >
> > > > > > > Thanks for keeping an eye on page updates.
> > > > > > > Near caches are not a bad feature, but they should be used with
> > > > > > > caution. At the very least, we have to explain on readme.io how they
> > > > > > > work and why and when they should be used, because misuse can
> > > > > > > decrease performance instead of increasing it.
> > > > > > >
> > > > > > > Anyway, I added near caches because I've never heard of anyone using
> > > > > > > them meaningfully, rather than as a silver bullet.
> > > > > > > So, that's just a proposal :)
> > > > > > >
> > > > > > > Also, I'd like to propose a vote on the full list later to produce
> > > > > > > "must be removed", "can be removed" and "should be kept" lists.
> > > > > > >
> > > > > > > On Thu, Jun 27, 2019 at 1:03 PM Alexey Goncharuk <
> > > > > > > alexey.goncha...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Anton,
> > > > > > > >
> > > > > > > > I would like to pull up the discussion regarding near caches - I
> > > > > > > > cannot agree that this is a feature that needs to be removed. Near
> > > > > > > > caches provide significant read performance improvements and, to
> > > > > > > > the best of my knowledge, are used in production in several cases.
> > > > > > > > Can you elaborate on the shortcomings you faced? Maybe we can
> > > > > > > > improve both the internal code and the user experience?
> > > > > > > >
> > > > > > > > Fri, Jun 21, 2019 at 10:42, Dmitry Melnichuk <
> > > > > > > > dmitry.melnic...@nobitlost.com>:
> > > > > > > >
> > > > > > > > > Dmitry,
> > > > > > > > > As a Python thin client developer, I 

[jira] [Created] (IGNITE-11986) Failed to deserialize object with given class loader: sun.misc.Launcher$AppClassLoader

2019-07-16 Thread JIRA
Jean-Denis Giguère created IGNITE-11986:
---

 Summary:  Failed to deserialize object with given class loader: 
sun.misc.Launcher$AppClassLoader
 Key: IGNITE-11986
 URL: https://issues.apache.org/jira/browse/IGNITE-11986
 Project: Ignite
  Issue Type: Bug
Affects Versions: mas
 Environment: Ignite master: commit 
{{1a2c35caf805769ca4e3f169d7a5c72c31147e41}}

spark 2.4.3

hadoop 3.1.2

OpenJDK 8


scala 2.11.12

 
Reporter: Jean-Denis Giguère
 Attachments: spark.log

h1. Current situation

Trying to connect to a remote Ignite cluster from {{spark-submit}}, I 
get the error message given in the attached error log.

See the code snippet here: 
https://github.com/jdenisgiguere/ignite_failed_unmarshal_discovery_data

h2. Expected situation

We should be able to connect to a remote Ignite cluster even when using Hadoop 
3.1.x. 

h3. Steps to reproduce

See: [https://github.com/jdenisgiguere/ignite_failed_unmarshal_discovery_data]





--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[MTCGA]: new failures in builds [4337855] need to be handled

2019-07-16 Thread dpavlov . tasks
Hi Igniters,

 I've detected a new issue on TeamCity that needs to be handled. You are more than 
welcome to help.

 If your changes could have led to this failure(s): we're grateful that you 
volunteered to contribute to this project, but things change and you 
may no longer be able to finalize your contribution.
 Could you respond to this email and indicate whether you wish to continue and fix 
the test failures, or step down so that a committer may revert your commit. 

 *New test failure in master-nightly 
IgniteWalFlushLogOnlyWithMmapBufferSelfTest.testFailAfterStart 
https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=1669201947848381002=%3Cdefault%3E=testDetails

 *New test failure in master-nightly 
IgniteWalFlushBackgroundWithMmapBufferSelfTest.testFailAfterStart 
https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=5185752390317183577=%3Cdefault%3E=testDetails
 Changes that may have led to the failure were made by 
 - kaa@yandex.ru 
https://ci.ignite.apache.org/viewModification.html?modId=887940

 - Here's a reminder of what contributors were agreed to do 
https://cwiki.apache.org/confluence/display/IGNITE/How+to+Contribute 
 - Should you have any questions please contact dev@ignite.apache.org 

Best Regards,
Apache Ignite TeamCity Bot 
https://github.com/apache/ignite-teamcity-bot
Notification generated at 19:18:58 16-07-2019 


Re: [DISCUSSION] Ignite 3.0 and to be removed list

2019-07-16 Thread Alexey Goncharuk
Denis,

Are we ready to present the list to the user list?

Tue, Jul 2, 2019 at 00:27, Denis Magda :

> I wouldn't kick off dozens of voting discussions. Instead, the content on
> the wiki page needs to be cleaned and rearranged. This will make the
> content readable and comprehensible. I can do that.
>
> Next, let's ask the user community for an opinion. After reviewing and
> incorporating it, we can hold one more dev list discussion with a
> last call for opinions. Next will be the voting time. If there is a
> feature someone from the dev list is against removing, then we can start
> a separate vote for it later. But let's get to those cases first.
>
> -
> Denis
>
>
> On Mon, Jul 1, 2019 at 1:47 AM Dmitriy Pavlov  wrote:
>
> > I propose that each removal should have a separate formal vote thread with
> > consensus approval (since it is a code modification).
> >
> > This means a single binding objection with justification is a blocker for
> > removal.
> >
> > We need separation to let community members pick up an interesting topic
> > from the email subject. Not all members read each post in mile-long
> > threads carefully.
> >
> > Mon, Jul 1, 2019 at 11:17, Anton Vinogradov :
> >
> > > +1 to an email survey with the following types of votes:
> > > - silence (agree with all proposed removals)
> > > - we have to keep XXX because ...
> > >
> > > As a result, we will get the lists:
> > > "to be removed" - no one objected
> > > "can be removed" - a single objection
> > > "should be kept" - multiple objections
> > >
> > > Denis or Dmitry Pavlov, could you please lead this thread?
> > >
> > > On Sat, Jun 29, 2019 at 12:27 AM Denis Magda 
> wrote:
> > >
> > > > Alex,
> > > >
> > > > I would do an email survey to hear opinions on why someone believes
> > > > feature A has to stay. It makes sense to ask about the APIs to be
> > > > removed as well as the integrations to go out of community support [1]
> > > > in the same thread.
> > > >
> > > > Has everyone expressed an opinion? If yes, I can go ahead and format
> > > > the wishlist page and make it structured for the user thread.
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Ignite-Modularization-td42486.html
> > > > -
> > > > Denis
> > > >
> > > >
> > > > On Fri, Jun 28, 2019 at 8:54 AM Alexey Goncharuk <
> > > > alexey.goncha...@gmail.com>
> > > > wrote:
> > > >
> > > > > Anton, good point.
> > > > >
> > > > > Do you have any idea how we can keep track of the voting? Should we
> > > > launch
> > > > > a google survey or survey monkey? Voting by email?
> > > > >
> > > > > Fri, Jun 28, 2019 at 11:24, Anton Vinogradov :
> > > > >
> > > > > > Alexey,
> > > > > >
> > > > > > Thanks for keeping an eye on page updates.
> > > > > > Near caches are not a bad feature, but they should be used with
> > > > > > caution. At the very least, we have to explain on readme.io how they
> > > > > > work and why and when they should be used, because misuse can
> > > > > > decrease performance instead of increasing it.
> > > > > >
> > > > > > Anyway, I added near caches because I've never heard of anyone using
> > > > > > them meaningfully, rather than as a silver bullet.
> > > > > > So, that's just a proposal :)
> > > > > >
> > > > > > Also, I'd like to propose a vote on the full list later to produce
> > > > > > "must be removed", "can be removed" and "should be kept" lists.
> > > > > >
> > > > > > On Thu, Jun 27, 2019 at 1:03 PM Alexey Goncharuk <
> > > > > > alexey.goncha...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Anton,
> > > > > > >
> > > > > > > I would like to pull up the discussion regarding near caches - I
> > > > > > > cannot agree that this is a feature that needs to be removed. Near
> > > > > > > caches provide significant read performance improvements and, to
> > > > > > > the best of my knowledge, are used in production in several cases.
> > > > > > > Can you elaborate on the shortcomings you faced? Maybe we can
> > > > > > > improve both the internal code and the user experience?
> > > > > > >
> > > > > > > Fri, Jun 21, 2019 at 10:42, Dmitry Melnichuk <
> > > > > > > dmitry.melnic...@nobitlost.com>:
> > > > > > >
> > > > > > > > Dmitry,
> > > > > > > > As a Python thin client developer, I think that a separate
> > > > > > > > repository is a truly great idea!
> > > > > > > > On Tue, 2019-06-18 at 21:29 +0300, Dmitriy Pavlov wrote:
> > > > > > > > > - Move to separate repositories: thin clients (at least
> > > non-Java
> > > > > > > > >
> > > > > > > > > > ones)
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Partition map exchange metrics

2019-07-16 Thread Nikolay Izhikov
I think an administrator of an Ignite cluster should be able to monitor all Ignite 
processes, including non-blocking PME.

On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> BTW,
> Found a PME metric - getCurrentPmeDuration().
> It seems it shows the total PME time, which makes it not so useful.
> The goal is to show exactly the blocking period.
> When PME causes no blocking, it's a good PME and I see no reason to have
> monitoring related to it :)
> 
> On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov  wrote:
> 
> > Anton.
> > 
> > Why do we need to postpone the implementation of these metrics?
> > For now, implementing a new metric is very simple.
> > 
> > I think we can implement these metrics as a single contribution.
> > 
> > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> > > Nikita,
> > > 
> > > Looks like all we need now is one simple metric: are operations blocked?
> > > Just true or false.
> > > Let's start with this.
> > > All other metrics can be extracted from the logs for now and can be implemented
> > > later.
> > > 
> > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov 
> > > wrote:
> > > 
> > > > +1.
> > > > 
> > > > Nikita, please, go ahead.
> > > > 
> > > > 
> > > > Tue, Jul 16, 2019, 11:45 Nikita Amelchev :
> > > > 
> > > > > Hello, Igniters.
> > > > > 
> > > > > I suggest adding some useful metrics about the partition map exchange
> > > > > (PME). Currently, the durations of PME stages are available only in log
> > > > > files and cannot be obtained using JMX or other external tools. [1]
> > > > >
> > > > > I made a list of local node metrics that help to understand the
> > > > > actual status of the current PME:
> > > > >
> > > > > 1. initialVersion. Topology version that initiated the exchange.
> > > > > 2. initTime. Time when PME was started.
> > > > > 3. initEvent. Event that triggered PME.
> > > > > 4. partitionReleaseTime. Time when the node finished waiting for all
> > > > > updates and transactions on the previous topology.
> > > > > 5. sendSingleMessageTime. Time when the node sent its single message.
> > > > > 6. recieveFullMessageTime. Time when the node received the full message.
> > > > > 7. finishTime. Time when PME ended.
> > > > >
> > > > > When a new PME starts, all these metrics are reset.
> > > > >
> > > > > These metrics help to understand:
> > > > > - how long the PME took (current or previous);
> > > > > - how long it took to wait for all updates to complete;
> > > > > - which node blocked PME (didn't send its single message);
> > > > > - what triggered PME.
> > > > > 
> > > > > Thoughts?
> > > > > 
> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > 
> > > > > --
> > > > > Best wishes,
> > > > > Amelchev Nikita
> > > > > 




Re: Partition map exchange metrics

2019-07-16 Thread Anton Vinogradov
BTW,
Found a PME metric - getCurrentPmeDuration().
It seems it shows the total PME time, which makes it not so useful.
The goal is to show exactly the blocking period.
When PME causes no blocking, it's a good PME and I see no reason to have
monitoring related to it :)

On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov  wrote:

> Anton.
>
> Why do we need to postpone the implementation of these metrics?
> For now, implementing a new metric is very simple.
>
> I think we can implement these metrics as a single contribution.
>
> On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> > Nikita,
> >
> > Looks like all we need now is one simple metric: are operations blocked?
> > Just true or false.
> > Let's start with this.
> > All other metrics can be extracted from the logs for now and can be implemented
> > later.
> >
> > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov 
> > wrote:
> >
> > > +1.
> > >
> > > Nikita, please, go ahead.
> > >
> > >
> > > Tue, Jul 16, 2019, 11:45 Nikita Amelchev :
> > >
> > > > Hello, Igniters.
> > > >
> > > > I suggest adding some useful metrics about the partition map exchange
> > > > (PME). Currently, the durations of PME stages are available only in log
> > > > files and cannot be obtained using JMX or other external tools. [1]
> > > >
> > > > I made a list of local node metrics that help to understand the
> > > > actual status of the current PME:
> > > >
> > > > 1. initialVersion. Topology version that initiated the exchange.
> > > > 2. initTime. Time when PME was started.
> > > > 3. initEvent. Event that triggered PME.
> > > > 4. partitionReleaseTime. Time when the node finished waiting for all
> > > > updates and transactions on the previous topology.
> > > > 5. sendSingleMessageTime. Time when the node sent its single message.
> > > > 6. recieveFullMessageTime. Time when the node received the full message.
> > > > 7. finishTime. Time when PME ended.
> > > >
> > > > When a new PME starts, all these metrics are reset.
> > > >
> > > > These metrics help to understand:
> > > > - how long the PME took (current or previous);
> > > > - how long it took to wait for all updates to complete;
> > > > - which node blocked PME (didn't send its single message);
> > > > - what triggered PME.
> > > >
> > > > Thoughts?
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > >
> > > > --
> > > > Best wishes,
> > > > Amelchev Nikita
> > > >
>


Re: Partition map exchange metrics

2019-07-16 Thread Nikolay Izhikov
Anton.

Why do we need to postpone the implementation of these metrics?
For now, implementing a new metric is very simple.

I think we can implement these metrics as a single contribution.

On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> Nikita,
> 
> Looks like all we need now is one simple metric: are operations blocked?
> Just true or false.
> Let's start with this.
> All other metrics can be extracted from the logs for now and can be implemented
> later.
> 
> On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov 
> wrote:
> 
> > +1.
> > 
> > Nikita, please, go ahead.
> > 
> > 
> > Tue, Jul 16, 2019, 11:45 Nikita Amelchev :
> > 
> > > Hello, Igniters.
> > > 
> > > I suggest adding some useful metrics about the partition map exchange
> > > (PME). Currently, the durations of PME stages are available only in log
> > > files and cannot be obtained using JMX or other external tools. [1]
> > > 
> > > I made a list of local node metrics that help to understand the
> > > actual status of the current PME:
> > > 
> > > 1. initialVersion. Topology version that initiated the exchange.
> > > 2. initTime. Time when PME was started.
> > > 3. initEvent. Event that triggered PME.
> > > 4. partitionReleaseTime. Time when the node finished waiting for all
> > > updates and transactions on the previous topology.
> > > 5. sendSingleMessageTime. Time when the node sent its single message.
> > > 6. recieveFullMessageTime. Time when the node received the full message.
> > > 7. finishTime. Time when PME ended.
> > > 
> > > When a new PME starts, all these metrics are reset.
> > > 
> > > These metrics help to understand:
> > > - how long the PME took (current or previous);
> > > - how long it took to wait for all updates to complete;
> > > - which node blocked PME (didn't send its single message);
> > > - what triggered PME.
> > > 
> > > Thoughts?
> > > 
> > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > 
> > > --
> > > Best wishes,
> > > Amelchev Nikita
> > > 




Re: Partition map exchange metrics

2019-07-16 Thread Anton Vinogradov
Nikita,

Looks like all we need now is one simple metric: are operations blocked?
Just true or false.
Let's start with this.
All other metrics can be extracted from the logs for now and can be implemented
later.
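
As a sketch of what that minimal version could look like (the bean, its object
name and the flag name below are made up for illustration, not an existing API):

import java.lang.management.ManagementFactory;

import javax.management.JMX;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Hypothetical single-flag view: is an exchange currently blocking cache operations? */
public interface ExchangeBlockingMXBean {
    /** @return {@code true} while PME blocks cache operations. */
    public boolean isOperationsBlocked();
}

/** Example of how a monitoring tool could poll the flag; the object name is made up. */
class BlockingPoller {
    static boolean isBlocked() throws Exception {
        // Look up the hypothetical bean on the local platform MBean server.
        MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

        ObjectName name = new ObjectName("org.example:group=Exchange,name=ExchangeBlocking");

        return JMX.newMXBeanProxy(srv, name, ExchangeBlockingMXBean.class).isOperationsBlocked();
    }
}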

On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov 
wrote:

> +1.
>
> Nikita, please, go ahead.
>
>
> Tue, Jul 16, 2019, 11:45 Nikita Amelchev :
>
> > Hello, Igniters.
> >
> > I suggest adding some useful metrics about the partition map exchange
> > (PME). Currently, the durations of PME stages are available only in log
> > files and cannot be obtained using JMX or other external tools. [1]
> >
> > I made a list of local node metrics that help to understand the
> > actual status of the current PME:
> >
> > 1. initialVersion. Topology version that initiated the exchange.
> > 2. initTime. Time when PME was started.
> > 3. initEvent. Event that triggered PME.
> > 4. partitionReleaseTime. Time when the node finished waiting for all
> > updates and transactions on the previous topology.
> > 5. sendSingleMessageTime. Time when the node sent its single message.
> > 6. recieveFullMessageTime. Time when the node received the full message.
> > 7. finishTime. Time when PME ended.
> >
> > When a new PME starts, all these metrics are reset.
> >
> > These metrics help to understand:
> > - how long the PME took (current or previous);
> > - how long it took to wait for all updates to complete;
> > - which node blocked PME (didn't send its single message);
> > - what triggered PME.
> >
> > Thoughts?
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> >
> > --
> > Best wishes,
> > Amelchev Nikita
> >
>


Re: Partition map exchange metrics

2019-07-16 Thread Nikolay Izhikov
+1.

Nikita, please, go ahead.


Tue, Jul 16, 2019, 11:45 Nikita Amelchev :

> Hello, Igniters.
>
> I suggest adding some useful metrics about the partition map exchange
> (PME). Currently, the durations of PME stages are available only in log
> files and cannot be obtained using JMX or other external tools. [1]
>
> I made a list of local node metrics that help to understand the
> actual status of the current PME:
>
> 1. initialVersion. Topology version that initiated the exchange.
> 2. initTime. Time when PME was started.
> 3. initEvent. Event that triggered PME.
> 4. partitionReleaseTime. Time when the node finished waiting for all
> updates and transactions on the previous topology.
> 5. sendSingleMessageTime. Time when the node sent its single message.
> 6. recieveFullMessageTime. Time when the node received the full message.
> 7. finishTime. Time when PME ended.
>
> When a new PME starts, all these metrics are reset.
>
> These metrics help to understand:
> - how long the PME took (current or previous);
> - how long it took to wait for all updates to complete;
> - which node blocked PME (didn't send its single message);
> - what triggered PME.
>
> Thoughts?
>
> [1] https://issues.apache.org/jira/browse/IGNITE-11961
>
> --
> Best wishes,
> Amelchev Nikita
>


Partition map exchange metrics

2019-07-16 Thread Nikita Amelchev
Hello, Igniters.

I suggest adding some useful metrics about the partition map exchange
(PME). Currently, the durations of PME stages are available only in log files
and cannot be obtained using JMX or other external tools. [1]

I made a list of local node metrics that help to understand the
actual status of the current PME:

1. initialVersion. Topology version that initiated the exchange.
2. initTime. Time when PME was started.
3. initEvent. Event that triggered PME.
4. partitionReleaseTime. Time when the node finished waiting for all
updates and transactions on the previous topology.
5. sendSingleMessageTime. Time when the node sent its single message.
6. recieveFullMessageTime. Time when the node received the full message.
7. finishTime. Time when PME ended.

When a new PME starts, all these metrics are reset.

These metrics help to understand:
- how long the PME took (current or previous);
- how long it took to wait for all updates to complete;
- which node blocked PME (didn't send its single message);
- what triggered PME.
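
Just to illustrate, a possible JMX view of these metrics could look like the
sketch below (a hypothetical interface, not an existing Ignite bean; the names
follow the list above):

import org.apache.ignite.mxbean.MXBeanDescription;

/**
 * Hypothetical JMX view of the proposed local PME metrics.
 * Timestamps are epoch milliseconds; everything resets when a new PME starts.
 */
public interface PartitionMapExchangeMXBean {
    @MXBeanDescription("Topology version that initiated the current/last exchange.")
    public String getInitialVersion();

    @MXBeanDescription("Time when PME was started.")
    public long getInitTime();

    @MXBeanDescription("Event that triggered PME.")
    public String getInitEvent();

    @MXBeanDescription("Time when the node finished waiting for updates on the previous topology.")
    public long getPartitionReleaseTime();

    @MXBeanDescription("Time when the node sent its single (partitions) message.")
    public long getSendSingleMessageTime();

    @MXBeanDescription("Time when the node received the full (partitions) message.")
    public long getReceiveFullMessageTime();

    @MXBeanDescription("Time when PME finished.")
    public long getFinishTime();
}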

Thoughts?

[1] https://issues.apache.org/jira/browse/IGNITE-11961

-- 
Best wishes,
Amelchev Nikita


Re: Tx lock partial happens before

2019-07-16 Thread Павлухин Иван
Anton,

Thank you for clarification.

Tue, Jul 16, 2019 at 09:24, Anton Vinogradov :
>
> Ivan R.
>
> Thanks.
> I'll try to implement the approach you proposed.
>
> Ivan P.
>
> >> what prevents primary partition relocation when
> >> Read Repair is in progress? Is there a transaction or an explicit lock?
> Did you mean partition eviction?
> RR is almost a regular get with the same logic. It maps onto some topology
> and performs regular gets.
> In case a node failed or is no longer an owner, we'll just ignore it.
> See the code for details:
>
> if (invalidNodeSet.contains(affNode) || !cctx.discovery().alive(affNode)) {
> onDone(Collections.emptyMap()); // Finishing mini future with just "ok".
>
> On Tue, Jul 16, 2019 at 9:04 AM Павлухин Иван  wrote:
>
> > Anton,
> >
> > You referred to failover scenarios. I believe that everything is
> > described in the IEP. But to make this discussion self-sufficient, could
> > you please outline what prevents primary partition relocation while
> > Read Repair is in progress? Is there a transaction or an explicit lock?
> >
> > Mon, Jul 15, 2019 at 23:49, Ivan Rakov :
> > >
> > > Anton,
> > >
> > > > Step-by-step:
> > > > 1) primary locked on key mention (get/put) at
> > pessimistic/!read-committed tx
> > > > 2) backups locked on prepare
> > > > 3) primary unlocked on finish
> > > > 4) backups unlocked on finish (after the primary)
> > > > correct?
> > > Yes, this corresponds to my understanding of transactions protocol. With
> > > minor exception: steps 3 and 4 are inverted in case of one-phase commit.
> > >
> > > > Agree, but seems there is no need to acquire the lock, we have just to
> > wait
> > > > until entry becomes unlocked.
> > > > - entry locked means that previous tx's "finish" phase is in progress
> > > > - entry unlocked means reading value is up-to-date (previous "finish"
> > phase
> > > > finished)
> > > > correct?
> > > Diving deeper, entry is locked if its GridCacheMapEntry.localCandidates
> > > queue is not empty (first item in queue is actually the transaction that
> > > owns lock).
> > >
> > > > we have just to wait
> > > > until entry becomes unlocked.
> > > This may work.
> > > If consistency checking code has acquired lock on primary, backup can be
> > > in two states:
> > > - not locked - and new locks won't appear as we are holding lock on
> > primary
> > > - still locked by transaction that owned lock on primary just before our
> > > checking code - in such case checking code should just wait for lock
> > release
> > >
> > > Best Regards,
> > > Ivan Rakov
> > >
> > > On 15.07.2019 9:34, Anton Vinogradov wrote:
> > > > Ivan R.
> > > >
> > > > Thanks for joining!
> > > >
> > > > Got the idea, but I'm not sure I see a way to fix it.
> > > >
> > > > AFAIK (can be wrong, please correct if necessary), at 2PC, locks are
> > > > acquired on backups during the "prepare" phase and released at "finish"
> > > > phase after primary fully committed.
> > > > Step-by-step:
> > > > 1) primary locked on key mention (get/put) at
> > pessimistic/!read-committed tx
> > > > 2) backups locked on prepare
> > > > 3) primary unlocked on finish
> > > > 4) backups unlocked on finish (after the primary)
> > > > correct?
> > > >
> > > > So, acquiring locks on backups, not at the "prepare" phase, may cause
> > > > unexpected behavior in case of primary fail or other errors.
> > > > It's definitely possible to update the failover logic to solve this
> > > > issue, but it seems to be an overcomplicated way.
> > > > The main question here: is there any simple way?
> > > >
> > > >>> checking read from backup will just wait for commit if it's in
> > progress.
> > > > Agree, but seems there is no need to acquire the lock, we have just to
> > wait
> > > > until entry becomes unlocked.
> > > > - entry locked means that previous tx's "finish" phase is in progress
> > > > - entry unlocked means reading value is up-to-date (previous "finish"
> > phase
> > > > finished)
> > > > correct?
> > > >
> > > > On Mon, Jul 15, 2019 at 8:37 AM Павлухин Иван 
> > wrote:
> > > >
> > > >> Anton,
> > > >>
> > > >> I did not know about the mechanics of locking entries on backups during
> > > >> the prepare phase. Thank you for pointing that out!
> > > >>
> > > >> Fri, Jul 12, 2019 at 22:45, Ivan Rakov :
> > > >>> Hi Anton,
> > > >>>
> > >  Each get method now checks the consistency.
> > >  Check means:
> > >  1) tx lock acquired on primary
> > >  2) gained data from each owner (primary and backups)
> > >  3) data compared
> > > >>> Did you consider acquiring locks on backups as well during your
> > check,
> > > >>> just like 2PC prepare does?
> > > >>> If there's HB between steps 1 (lock primary) and 2 (update primary +
> > > >>> lock backup + update backup), you may be sure that there will be no
> > > >>> false-positive results and no deadlocks as well. Protocol won't be
> > > >>> complicated: checking read from backup will just wait for commit if
> > it's
> > > >>> in progress.
> > > >>>
> > > >>> Best Regards,
> 

Re: Tx lock partial happens before

2019-07-16 Thread Anton Vinogradov
Ivan R.

Thanks.
I'll try to implement the approach you proposed.

Ivan P.

>> what prevents primary partition relocation when
>> Read Repair is in progress? Is there a transaction or an explicit lock?
Did you mean partition eviction?
RR is almost a regular get with the same logic. It maps onto some topology
and performs regular gets.
In case a node failed or is no longer an owner, we'll just ignore it.
See the code for details:

if (invalidNodeSet.contains(affNode) || !cctx.discovery().alive(affNode)) {
onDone(Collections.emptyMap()); // Finishing mini future with just "ok".

On Tue, Jul 16, 2019 at 9:04 AM Павлухин Иван  wrote:

> Anton,
>
> You referred to failover scenarios. I believe that everything is
> described in the IEP. But to make this discussion self-sufficient, could
> you please outline what prevents primary partition relocation while
> Read Repair is in progress? Is there a transaction or an explicit lock?
>
> Mon, Jul 15, 2019 at 23:49, Ivan Rakov :
> >
> > Anton,
> >
> > > Step-by-step:
> > > 1) primary locked on key mention (get/put) at
> pessimistic/!read-committed tx
> > > 2) backups locked on prepare
> > > 3) primary unlocked on finish
> > > 4) backups unlocked on finish (after the primary)
> > > correct?
> > Yes, this corresponds to my understanding of transactions protocol. With
> > minor exception: steps 3 and 4 are inverted in case of one-phase commit.
> >
> > > Agree, but seems there is no need to acquire the lock, we have just to
> wait
> > > until entry becomes unlocked.
> > > - entry locked means that previous tx's "finish" phase is in progress
> > > - entry unlocked means reading value is up-to-date (previous "finish"
> phase
> > > finished)
> > > correct?
> > Diving deeper, entry is locked if its GridCacheMapEntry.localCandidates
> > queue is not empty (first item in queue is actually the transaction that
> > owns lock).
> >
> > > we have just to wait
> > > until entry becomes unlocked.
> > This may work.
> > If consistency checking code has acquired lock on primary, backup can be
> > in two states:
> > - not locked - and new locks won't appear as we are holding lock on
> primary
> > - still locked by transaction that owned lock on primary just before our
> > checking code - in such case checking code should just wait for lock
> release
> >
> > Best Regards,
> > Ivan Rakov
> >
> > On 15.07.2019 9:34, Anton Vinogradov wrote:
> > > Ivan R.
> > >
> > > Thanks for joining!
> > >
> > > Got the idea, but I'm not sure I see a way to fix it.
> > >
> > > AFAIK (can be wrong, please correct if necessary), at 2PC, locks are
> > > acquired on backups during the "prepare" phase and released at "finish"
> > > phase after primary fully committed.
> > > Step-by-step:
> > > 1) primary locked on key mention (get/put) at
> pessimistic/!read-committed tx
> > > 2) backups locked on prepare
> > > 3) primary unlocked on finish
> > > 4) backups unlocked on finish (after the primary)
> > > correct?
> > >
> > > So, acquiring locks on backups, not at the "prepare" phase, may cause
> > > unexpected behavior in case of primary fail or other errors.
> > > It's definitely possible to update the failover logic to solve this
> > > issue, but it seems to be an overcomplicated way.
> > > The main question here: is there any simple way?
> > >
> > >>> checking read from backup will just wait for commit if it's in
> progress.
> > > Agree, but seems there is no need to acquire the lock, we have just to
> wait
> > > until entry becomes unlocked.
> > > - entry locked means that previous tx's "finish" phase is in progress
> > > - entry unlocked means reading value is up-to-date (previous "finish"
> phase
> > > finished)
> > > correct?
> > >
> > > On Mon, Jul 15, 2019 at 8:37 AM Павлухин Иван 
> wrote:
> > >
> > >> Anton,
> > >>
> > >> I did not know about the mechanics of locking entries on backups during
> > >> the prepare phase. Thank you for pointing that out!
> > >>
> > >> Fri, Jul 12, 2019 at 22:45, Ivan Rakov :
> > >>> Hi Anton,
> > >>>
> >  Each get method now checks the consistency.
> >  Check means:
> >  1) tx lock acquired on primary
> >  2) gained data from each owner (primary and backups)
> >  3) data compared
> > >>> Did you consider acquiring locks on backups as well during your
> check,
> > >>> just like 2PC prepare does?
> > >>> If there's HB between steps 1 (lock primary) and 2 (update primary +
> > >>> lock backup + update backup), you may be sure that there will be no
> > >>> false-positive results and no deadlocks as well. Protocol won't be
> > >>> complicated: checking read from backup will just wait for commit if
> it's
> > >>> in progress.
> > >>>
> > >>> Best Regards,
> > >>> Ivan Rakov
> > >>>
> > >>> On 12.07.2019 9:47, Anton Vinogradov wrote:
> >  Igniters,
> > 
> >  Let me explain the problem in detail.
> >  Read Repair at a pessimistic tx (locks acquired on primary, full sync, 2PC)
> >  is able to see a consistency violation because the backups are not updated yet.
> >  This seems 

Re: Read Repair (ex. Consistency Check) - review request #2

2019-07-16 Thread Anton Vinogradov
Slava,

>> Could you please take a look at PR:
Going to review today, thanks for attaching the bot visa.

>> 1. Should I consider that my cluster is broken? There is no answer! The
>> false-positive result is possible.
That's a question of the nature of atomic caches.
It's not possible to lock an atomic entry to perform the check.
You should perform several attempts; it's your decision how many.
By default, atomic RR performs 3 attempts; you may increase this by setting
IGNITE_NEAR_GET_MAX_REMAPS or by just performing additional gets.
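
For illustration, a minimal retry sketch around such a get could look like this
(the class and the attempt count are made up; only withReadRepair()/getAll()
come from the snippet quoted below):

// A minimal retry sketch, assuming that a repeated successful read means the
// previously reported violation was a false positive on an atomic cache.
import java.util.Map;
import java.util.Set;

import javax.cache.CacheException;

import org.apache.ignite.IgniteCache;

public class AtomicReadRepairRetry {
    /** Number of attempts; purely the caller's choice. */
    private static final int MAX_ATTEMPTS = 3;

    public static <K, V> Map<K, V> getWithReadRepair(IgniteCache<K, V> cache, Set<K> keys) {
        CacheException last = null;

        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                // Same call as in the snippet quoted below; consistency is checked on each get.
                return cache.withReadRepair().getAll(keys);
            }
            catch (CacheException e) {
                last = e; // Possibly a false positive - retry.
            }
        }

        throw last; // Still failing after several attempts - treat as a real violation.
    }
}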

>> 2. What should be done here in order to check/resolve the issue?
Perhaps, I
>> should restart a node (which one?), restart the whole cluster, put a new
>> value...
It's not possible, currently, to fix atomic caches.
You may only check the consistency. And it's better than nothing, I think.
We should find a way to fix atomic consistency first.
A possible strategy is to use an EntryProcessor which will replace all owners'
values with the "latest" one and do nothing in case a value newer than the
"latest" is found (the opposite of the preloading approach).
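
To make that strategy concrete, here is a rough sketch (not an existing API): it
assumes the caller has already chosen a candidate "latest" value and that values
carry a comparable version field, which is a made-up assumption for this
illustration.

import java.io.Serializable;

import javax.cache.processor.EntryProcessorException;
import javax.cache.processor.MutableEntry;

import org.apache.ignite.cache.CacheEntryProcessor;

/**
 * Sketch of the repair idea above: write the chosen "latest" value, but back off
 * if the entry already holds something newer. VersionedValue and its ver field
 * are hypothetical - real values would need some comparable notion of freshness.
 */
public class RepairToLatestProcessor implements CacheEntryProcessor<Object, VersionedValue, Void> {
    private final VersionedValue latest;

    public RepairToLatestProcessor(VersionedValue latest) {
        this.latest = latest;
    }

    @Override public Void process(MutableEntry<Object, VersionedValue> entry, Object... args)
        throws EntryProcessorException {
        VersionedValue cur = entry.getValue();

        if (cur != null && cur.ver > latest.ver)
            return null; // A newer value already arrived - do nothing.

        entry.setValue(latest); // Otherwise overwrite this owner's copy with the "latest" value.

        return null;
    }
}

/** Hypothetical value type carrying an application-level version. */
class VersionedValue implements Serializable {
    final long ver;
    final Object payload;

    VersionedValue(long ver, Object payload) {
        this.ver = ver;
        this.payload = payload;
    }
}

It would be applied per broken key with something like
cache.invoke(key, new RepairToLatestProcessor(latest)); whether such overwrite
semantics are acceptable for atomic caches is exactly the open question.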

>> 3. IgniteConsistencyViolationException is absolutely useless. It does not
>> provide any information about the issue and possible way to fix it.
It means that some keys from your get operation are broken.
IgniteConsistencyViolationException CAN be extended with a list of broken
keys in the future.

>> It seems that transactional caches are covered much better.
Correct.
Tx cache consistency is more important than atomic consistency; that's why
it was implemented first.
BTW, AFAIK, atomics were also not fixed at 10078 [1].

>> Well, near caches are widely used and fully transactional, so I think it
>> makes sense to support the feature for near caches too.
As I said before, it will be nice to implement this in the future, but we
have more important tasks for now.
The main goal was to cover tx caches, to be able to fix them in case of a
real problem in production.

Summarizing the roadmap:
My goal now is to finish the tx case; we currently have an issue with
false-positive consistency violations [2].
Also, we're going to update the Jepsen tests [3] with RR to ensure tx caches
are fixed.
The next main goal is to use RR in TC checks [4]; help with this issue is
appreciated.

[1] https://issues.apache.org/jira/browse/IGNITE-10078
[2] https://issues.apache.org/jira/browse/IGNITE-11973
[3] https://issues.apache.org/jira/browse/IGNITE-11972
[4] https://issues.apache.org/jira/browse/IGNITE-11971


On Mon, Jul 15, 2019 at 4:51 PM Dmitriy Pavlov  wrote:

> Ok,  thank you
>
> > Mon, Jul 15, 2019, 16:46 Nikolay Izhikov :
>
> > I did the review.
> >
> > > Mon, Jul 15, 2019, 16:15 Dmitriy Pavlov :
> >
> > > Igniters, who did a review of
> > > https://issues.apache.org/jira/browse/IGNITE-10663 before the merge?
> > I've
> > > checked both PR   https://github.com/apache/ignite/pull/5656  and
> Issue,
> > > and dev.list thread and didn't find any LGTM.
> > >
> > > Anton, since you've rejected lazy consensus in our process, we have RTC
> > in
> > > that (core) module. So I'd like to know if the fix was covered by the
> > > review.
> > >
> > > Because you're a committer, a reviewer can be non-committer. So, who
> was
> > a
> > > reviewer? Or was process ignored?
> > >
> > > > Mon, Jul 15, 2019 at 15:37, Вячеслав Коптилин <
> slava.kopti...@gmail.com
> > >:
> > >
> > > > Hello Anton,
> > > >
> > > > > I'd like to propose you to provide fixes as a PR since you have a
> > > vision
> > > > of how it should be made. I'll review them and merge shortly.
> > > > Could you please take a look at PR:
> > > > https://github.com/apache/ignite/pull/6689
> > > >
> > > > > Since your comments mostly about Javadoc (does this mean that my
> > > solution
> > > > is so great that you ask me only to fix Javadocs :) ?),
> > > > In my humble opinion, I would consider this feature an experimental one
> > > > (it does not seem production-ready).
> > > > Let me clarify this with the following simple example:
> > > >
> > > > try {
> > > >     atomicCache.withReadRepair().getAll(keys);
> > > > }
> > > > catch (CacheException e) {
> > > >     // What should be done here from the end-user point of view?
> > > > }
> > > >
> > > > 1. Should I consider that my cluster is broken? There is no answer!
> The
> > > > false-positive result is possible.
> > > > 2. What should be done here in order to check/resolve the issue?
> > > Perhaps, I
> > > > should restart a node (which one?), restart the whole cluster, put a
> > new
> > > > value...
> > > > 3. IgniteConsistencyViolationException is absolutely useless. It does
> > not
> > > > provide any information about the issue and possible way to fix it.
> > > >
> > > > It seems that transactional caches are covered much better.
> > > >
> > > > > Mostly agree with you, but
> > > > > - MVCC is not production ready,
> > > > > - not sure near support really required,
> > > > > - metrics are better for monitoring, but the Event is enough for my
> > > wish
> 

Re: Tx lock partial happens before

2019-07-16 Thread Павлухин Иван
Anton,

You referred to failover scenarios. I believe that everything is
described in the IEP. But to make this discussion self-sufficient, could
you please outline what prevents primary partition relocation while
Read Repair is in progress? Is there a transaction or an explicit lock?

Mon, Jul 15, 2019 at 23:49, Ivan Rakov :
>
> Anton,
>
> > Step-by-step:
> > 1) primary locked on key mention (get/put) at pessimistic/!read-committed tx
> > 2) backups locked on prepare
> > 3) primary unlocked on finish
> > 4) backups unlocked on finish (after the primary)
> > correct?
> Yes, this corresponds to my understanding of transactions protocol. With
> minor exception: steps 3 and 4 are inverted in case of one-phase commit.
>
> > Agree, but seems there is no need to acquire the lock, we have just to wait
> > until entry becomes unlocked.
> > - entry locked means that previous tx's "finish" phase is in progress
> > - entry unlocked means reading value is up-to-date (previous "finish" phase
> > finished)
> > correct?
> Diving deeper, entry is locked if its GridCacheMapEntry.localCandidates
> queue is not empty (first item in queue is actually the transaction that
> owns lock).
>
> > we have just to wait
> > until entry becomes unlocked.
> This may work.
> If consistency checking code has acquired lock on primary, backup can be
> in two states:
> - not locked - and new locks won't appear as we are holding lock on primary
> - still locked by transaction that owned lock on primary just before our
> checking code - in such case checking code should just wait for lock release
>
> Best Regards,
> Ivan Rakov
>
> On 15.07.2019 9:34, Anton Vinogradov wrote:
> > Ivan R.
> >
> > Thanks for joining!
> >
> > Got the idea, but I'm not sure I see a way to fix it.
> >
> > AFAIK (can be wrong, please correct if necessary), at 2PC, locks are
> > acquired on backups during the "prepare" phase and released at "finish"
> > phase after primary fully committed.
> > Step-by-step:
> > 1) primary locked on key mention (get/put) at pessimistic/!read-committed tx
> > 2) backups locked on prepare
> > 3) primary unlocked on finish
> > 4) backups unlocked on finish (after the primary)
> > correct?
> >
> > So, acquiring locks on backups, not at the "prepare" phase, may cause
> > unexpected behavior in case of primary fail or other errors.
> > It's definitely possible to update the failover logic to solve this issue,
> > but it seems to be an overcomplicated way.
> > The main question here: is there any simple way?
> >
> >>> checking read from backup will just wait for commit if it's in progress.
> > Agree, but seems there is no need to acquire the lock, we have just to wait
> > until entry becomes unlocked.
> > - entry locked means that previous tx's "finish" phase is in progress
> > - entry unlocked means reading value is up-to-date (previous "finish" phase
> > finished)
> > correct?
> >
> > On Mon, Jul 15, 2019 at 8:37 AM Павлухин Иван  wrote:
> >
> >> Anton,
> >>
> >> I did not know about the mechanics of locking entries on backups during
> >> the prepare phase. Thank you for pointing that out!
> >>
> >> Fri, Jul 12, 2019 at 22:45, Ivan Rakov :
> >>> Hi Anton,
> >>>
>  Each get method now checks the consistency.
>  Check means:
>  1) tx lock acquired on primary
>  2) gained data from each owner (primary and backups)
>  3) data compared
> >>> Did you consider acquiring locks on backups as well during your check,
> >>> just like 2PC prepare does?
> >>> If there's HB between steps 1 (lock primary) and 2 (update primary +
> >>> lock backup + update backup), you may be sure that there will be no
> >>> false-positive results and no deadlocks as well. Protocol won't be
> >>> complicated: checking read from backup will just wait for commit if it's
> >>> in progress.
> >>>
> >>> Best Regards,
> >>> Ivan Rakov
> >>>
> >>> On 12.07.2019 9:47, Anton Vinogradov wrote:
>  Igniters,
> 
>  Let me explain the problem in detail.
>  Read Repair at a pessimistic tx (locks acquired on primary, full sync, 2PC)
>  is able to see a consistency violation because the backups are not updated yet.
>  It does not seem to be a good idea to "fix" the code to unlock the primary only
>  when the backups are updated; this will definitely cause a performance drop.
>  Currently, there is no explicit sync feature that allows waiting for the backups
>  to be updated during the previous tx.
>  Previous tx just sends GridNearTxFinishResponse to the originating
> >> node.
>  Bad ideas how to handle this:
>  - retry some times (still possible to gain false positive)
>  - lock tx entry on backups (will definitely break failover logic)
>  - wait for same entry version on backups during some timeout (will
> >> require
>  huge changes at "get" logic and false positive still possible)
> 
>  Is there any simple fix for this issue?
>  Thanks for tips in advance.
> 
>  Ivan,
>  thanks for your interest
> 
> >> 4. Very fast and lucky txB writes a value