Re: [ANNOUNCE] New ZooKeeper committer: Michael Han

2017-01-03 Thread Marshall McMullen
Congrats Michael! Well deserved.

On Tue, Jan 3, 2017 at 1:16 PM, Abraham Fine  wrote:

> Congratulations Michael!
>
> On Tue, Jan 3, 2017, at 11:40, Jordan Zimmerman wrote:
> > Saludos!
> >
> > > On Jan 3, 2017, at 2:29 PM, Patrick Hunt  wrote:
> > >
> > > The Apache ZooKeeper PMC recently extended committer karma to Michael
> and
> > > he has accepted. Michael has made some great contributions and we are
> > > looking forward to even more :)
> > >
> > > Congratulations and welcome aboard, Michael!
> > > Patrick
> >
>


[jira] [Commented] (ZOOKEEPER-2455) unexpected server response ZRUNTIMEINCONSISTENCY

2016-06-29 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355782#comment-15355782
 ] 

Marshall McMullen commented on ZOOKEEPER-2455:
--

Oh, neat! I was not aware of that. Thanks for filling in the gaps for me, Alex.

> unexpected server response ZRUNTIMEINCONSISTENCY
> 
>
> Key: ZOOKEEPER-2455
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2455
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.5.1
>Reporter: pradeep
> Fix For: 3.5.3, 3.6.0
>
>
> Hi Folks,
> I am hitting an error in my C client code. Below is the set of operations 
> I perform:
>   1.  The ZooKeeper client is connected to ZooKeeper server S1 and a new server S2 
> gets added.
>   2.  Monitor the ZooKeeper server config at the client and, on a change of the 
> server config, call zoo_set_servers from the client.
>   3.  The client can issue operations like zoo_get just after the call to 
> zoo_set_servers.
>   4.  I can see that the ZooKeeper thread logs a connection to the new server just 
> after the zoo_get call:
> 2016-04-11 03:46:50,655:1207(0xf26ffb40):ZOO_INFO@check_events@2345: 
> initiated connection
> to server [128.0.0.5:61728]
> 2016-04-11 03:46:50,658:1207(0xf26ffb40):ZOO_INFO@check_events@2397: session 
> establishment
> complete on server [128.0.0.5:61728], sessionId=0x401852c000c, negotiated 
> timeout=2
>   5.  Sometimes I find errors like the one below:
> 2016-04-11 
> 03:46:50,662:1207(0xf26ffb40):ZOO_ERROR@handle_socket_error_msg@2923: Socket 
> [128.0.0.5:61728]
> zk retcode=-2, errno=115(Operation now in progress): unexpected server 
> response: expected
> 0x570b82fa, but received 0x570b82f9
> zoo_get returns (-2), indicating 
> ZRUNTIMEINCONSISTENCY<http://zookeeper.sourcearchive.com/documentation/3.2.2plus-pdfsg3/zookeeper_8h_bb1a0a179f313b2e44ee92369c438a4c.html#bb1a0a179f313b2e44ee92369c438a4c9eabb281ab14c74db3aff9ab456fa7fe>
> What is the issue here? Should I retry the zoo_get operation, or should I wait 
> for the zoo_set_servers call to complete (i.e. wait for the connection 
> establishment notification)?
> Thanks,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2455) unexpected server response ZRUNTIMEINCONSISTENCY

2016-06-29 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355566#comment-15355566
 ] 

Marshall McMullen commented on ZOOKEEPER-2455:
--

I'm confused. I didn't think you could do a dynamic reconfig from a single server. A 
single server is what's called "standalone" mode, whereas three or more puts you into 
"quorum" mode, and you cannot cross between those two stacks. Perhaps there was a 
change made in the reconfig code that I'm not aware of that lets you do this, but I 
don't think so. [~shralex] would be able to say for certain. Are you calling 
zoo_set_servers and giving it a new server that's not part of the ensemble? That 
would certainly cause this problem. Come to think of it, I don't think there's any 
protection against that sort of misuse.
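
For anyone hitting this, a minimal sketch of the "wait for the connection 
establishment notification" approach could look like the following (illustrative 
only; it assumes the multi-threaded C client, the helper names are made up, and 
the race between zoo_set_servers() and the resulting session events is glossed 
over):

{code}
/* Sketch only: block operations until the session is (re)established
 * after zoo_set_servers(). Error handling is mostly elided. */
#include <zookeeper/zookeeper.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int connected = 0;

static void session_watcher(zhandle_t *zh, int type, int state,
                            const char *path, void *ctx)
{
    if (type == ZOO_SESSION_EVENT) {
        pthread_mutex_lock(&lock);
        connected = (state == ZOO_CONNECTED_STATE);
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
    }
}

static void wait_for_connected(void)
{
    pthread_mutex_lock(&lock);
    while (!connected)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    zhandle_t *zh = zookeeper_init("s1:2181", session_watcher, 30000,
                                   NULL, NULL, 0);
    if (zh == NULL)
        return EXIT_FAILURE;
    wait_for_connected();

    /* Ensemble membership changed; hand the client the new list ...   */
    zoo_set_servers(zh, "s1:2181,s2:2181");
    /* ... and wait for the session event before issuing further reads. */
    wait_for_connected();

    char buf[512];
    int len = sizeof(buf);
    zoo_get(zh, "/mynode", 0, buf, &len, NULL);

    zookeeper_close(zh);
    return 0;
}
{code}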

> unexpected server response ZRUNTIMEINCONSISTENCY
> 
>
> Key: ZOOKEEPER-2455
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2455
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.5.1
>Reporter: pradeep
> Fix For: 3.5.3, 3.6.0
>
>
> Hi Folks,
> I am hitting an error in my C client code. Below is the set of operations 
> I perform:
>   1.  The ZooKeeper client is connected to ZooKeeper server S1 and a new server S2 
> gets added.
>   2.  Monitor the ZooKeeper server config at the client and, on a change of the 
> server config, call zoo_set_servers from the client.
>   3.  The client can issue operations like zoo_get just after the call to 
> zoo_set_servers.
>   4.  I can see that the ZooKeeper thread logs a connection to the new server just 
> after the zoo_get call:
> 2016-04-11 03:46:50,655:1207(0xf26ffb40):ZOO_INFO@check_events@2345: 
> initiated connection
> to server [128.0.0.5:61728]
> 2016-04-11 03:46:50,658:1207(0xf26ffb40):ZOO_INFO@check_events@2397: session 
> establishment
> complete on server [128.0.0.5:61728], sessionId=0x401852c000c, negotiated 
> timeout=2
>   5.  Sometimes I find errors like the one below:
> 2016-04-11 
> 03:46:50,662:1207(0xf26ffb40):ZOO_ERROR@handle_socket_error_msg@2923: Socket 
> [128.0.0.5:61728]
> zk retcode=-2, errno=115(Operation now in progress): unexpected server 
> response: expected
> 0x570b82fa, but received 0x570b82f9
> zoo_get returns (-2), indicating 
> ZRUNTIMEINCONSISTENCY<http://zookeeper.sourcearchive.com/documentation/3.2.2plus-pdfsg3/zookeeper_8h_bb1a0a179f313b2e44ee92369c438a4c.html#bb1a0a179f313b2e44ee92369c438a4c9eabb281ab14c74db3aff9ab456fa7fe>
> What is the issue here? Should I retry the zoo_get operation, or should I wait 
> for the zoo_set_servers call to complete (i.e. wait for the connection 
> establishment notification)?
> Thanks,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-1485) client xid overflow is not handled

2016-06-08 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321602#comment-15321602
 ] 

Marshall McMullen commented on ZOOKEEPER-1485:
--

Assigning this to [~makuchta] as he's been working this issue for us.

> client xid overflow is not handled
> --
>
> Key: ZOOKEEPER-1485
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1485
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client, java client
>Affects Versions: 3.4.3, 3.3.5
>Reporter: Michi Mutsuzaki
>Assignee: Martin Kuchta
>
> Both Java and C clients use signed 32-bit int as XIDs. XIDs are assumed to be 
> non-negative, and zookeeper uses some negative values as special XIDs (e.g. 
> -2 for ping, -4 for auth). However, neither Java nor C client ensures the 
> XIDs it generates are non-negative, and the server doesn't reject negative 
> XIDs.
> Pat had some suggestions on how to fix this:
> - (bin-compat) Expire the session when the client sends a negative XID.
> - (bin-incompat) In addition to expiring the session, use 64-bit int for XID 
> so that overflow will practically never happen.
> --Michi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (ZOOKEEPER-1485) client xid overflow is not handled

2016-06-08 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-1485:
-
Assignee: Martin Kuchta  (was: Bruce Gao)

> client xid overflow is not handled
> --
>
> Key: ZOOKEEPER-1485
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1485
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client, java client
>Affects Versions: 3.4.3, 3.3.5
>Reporter: Michi Mutsuzaki
>Assignee: Martin Kuchta
>
> Both Java and C clients use signed 32-bit int as XIDs. XIDs are assumed to be 
> non-negative, and zookeeper uses some negative values as special XIDs (e.g. 
> -2 for ping, -4 for auth). However, neither Java nor C client ensures the 
> XIDs it generates are non-negative, and the server doesn't reject negative 
> XIDs.
> Pat had some suggestions on how to fix this:
> - (bin-compat) Expire the session when the client sends a negative XID.
> - (bin-incompat) In addition to expiring the session, use 64-bit int for XID 
> so that overflow will practically never happen.
> --Michi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Unable to contribute on JIRA

2016-06-08 Thread Marshall McMullen
Yep, it works now. I was able to assign the Jira to Martin without problems.
Again, thanks.

On Wed, Jun 8, 2016 at 4:33 PM, Marshall McMullen <
marshall.mcmul...@gmail.com> wrote:

> Thank you very much for the assistance, Patrick.
>
> On Wed, Jun 8, 2016 at 4:32 PM, Patrick Hunt <ph...@apache.org> wrote:
>
>> I've added Martin as a contributor, give it another try.
>>
>> Patrick
>>
>> On Wed, Jun 8, 2016 at 3:21 PM, Marshall McMullen <
>> marshall.mcmul...@gmail.com> wrote:
>>
>> > That makes sense. I would appreciate it if a committer could change Martin's
>> > role to contributor. Otherwise we'll reach out to the Infra team to get
>> > some assistance on that.
>> >
>> > Thanks!
>> >
>> > On Wed, Jun 8, 2016 at 4:04 PM, Michael Han <h...@cloudera.com> wrote:
>> >
>> > > I think someone (probably only a committer) just needs to give Martin the
>> > > 'contributor' role.
>> > >
>> > > The best way to contact Apache Infra is through their Hipchat channel
>> > > http://www.apache.org/dev/infra-contact
>> > >
>> > > On Wed, Jun 8, 2016 at 3:01 PM, Marshall McMullen <
>> > > marshall.mcmul...@gmail.com> wrote:
>> > >
>> > > > Should Martin contact the "Apache Infrastructure Team" regarding
>> this?
>> > If
>> > > > so, how does he do that?
>> > > >
>> > > > On Wed, Jun 8, 2016 at 4:00 PM, Marshall McMullen <
>> > > > marshall.mcmul...@gmail.com> wrote:
>> > > >
>> > > > > I tried to assign this Jira to him and got an error message back:
>> > > > >
>> > > > > User 'makuchta' cannot be assigned issues.
>> > > > >
>> > > > > On Wed, Jun 8, 2016 at 3:58 PM, Michael Han <h...@cloudera.com>
>> > wrote:
>> > > > >
>> > > > >> Martin,
>> > > > >>
>> > > > >> I had met a similar issue earlier; here is an email I sent earlier to the
>> > > > >> dev list:
>> > > > >>
>> > > > >> >>
>> > > > >> FYI, I met an issue today where I couldn't attach files to a JIRA issue
>> > > > >> with the role of 'contributor'. I contacted the Apache Infrastructure team
>> > > > >> and confirmed that:
>> > > > >>
>> > > > >> - For a given JIRA issue, only the *reporter*, the *assignee*, or a
>> > > > >> *committer* can attach files.
>> > > > >> - A contributor can only attach files to issues that are assigned to
>> > > > >> and/or reported by the contributor.
>> > > > >> - A workaround for a contributor to attach files to any issue is to first
>> > > > >> change the assignee to the contributor, then attach the files, then change
>> > > > >> the assignee back.
>> > > > >> >>
>> > > > >>
>> > > > >> I think someone just needs to assign ZOOKEEPER-2355 to you since you are
>> > > > >> working on it.
>> > > > >>
>> > > > >> On Wed, Jun 8, 2016 at 2:34 PM, Martin Kuchta <
>> > > mar...@martinkuchta.com>
>> > > > >> wrote:
>> > > > >>
>> > > > >> > Hi,
>> > > > >> >
>> > > > >> > Does anyone know if I need to do anything special to have the
>> > > ability
>> > > > to
>> > > > >> > submit attachments and be assigned issues on JIRA? I was
>> recently
>> > > > >> trying to
>> > > > >> > submit a patch for ZOOKEEPER-2355 and realized the option was
>> > > missing
>> > > > >> for
>> > > > >> > me. It's not present on any other ZooKeeper JIRAs that I can
>> see,
>> > > > >> although
>> > > > >> > I can see it on JIRAs from other Apache projects.
>> > > > >> >
>> > > > >> > I was working with Marshall McMullen to get the patch
>> submitted,
>> > and
>> > > > our
>> > > > >> > first thought was that the issue might need to be assigned to
>> me,
>> > > but
>> > > > >> even
>> > > > >> > though he was able to reassign the issue, I was not a valid
>> user
>> > to
>> > > > >> assign
>> > > > >> > it to.
>> > > > >> >
>> > > > >> > My account username is makuchta. I created it almost two weeks
>> ago
>> > > if
>> > > > >> > that's of any relevance.
>> > > > >> >
>> > > > >> >
>> > > > >> > Thanks,
>> > > > >> >
>> > > > >> > Martin
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >> --
>> > > > >> Cheers
>> > > > >> Michael.
>> > > > >>
>> > > > >
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Cheers
>> > > Michael.
>> > >
>> >
>>
>
>


[jira] [Updated] (ZOOKEEPER-2355) Ephemeral node is never deleted if follower fails while reading the proposal packet

2016-06-08 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2355:
-
Assignee: Martin Kuchta  (was: Marshall McMullen)

> Ephemeral node is never deleted if follower fails while reading the proposal 
> packet
> ---
>
> Key: ZOOKEEPER-2355
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum, server
>Reporter: Arshad Mohammad
>Assignee: Martin Kuchta
>Priority: Critical
> Fix For: 3.4.9
>
> Attachments: ZOOKEEPER-2355-01.patch, ZOOKEEPER-2355-02.patch, 
> ZOOKEEPER-2355-03.patch
>
>
> ZooKeeper ephemeral node is never deleted if a follower fails while reading the 
> proposal packet.
> The scenario is as follows:
> # Configure a three-node ZooKeeper cluster, let's say nodes A, B and C; 
> start all, and assume A is the leader and B and C are followers.
> # Connect to any of the servers and create ephemeral node /e1.
> # Close the session; ephemeral node /e1 will go for deletion.
> # While it is receiving the delete proposal, make Follower B fail with a 
> {{SocketTimeoutException}}. We need to do this to reproduce the scenario; 
> in a production environment it happens because of a network fault.
> # Remove the fault and check that the faulted follower is now connected to the 
> quorum.
> # Connect to any of the servers and create the same ephemeral node /e1; creation 
> succeeds.
> # Close the session; ephemeral node /e1 will go for deletion.
> # {color:red}/e1 is not deleted from the faulted Follower B. It should have 
> been deleted, as it was created again with another session.{color}
> # {color:green}/e1 is deleted from Leader A and the other Follower C.{color}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Unable to contribute on JIRA

2016-06-08 Thread Marshall McMullen
Thank you very much for the assistance, Patrick.

On Wed, Jun 8, 2016 at 4:32 PM, Patrick Hunt <ph...@apache.org> wrote:

> I've added Martin as a contributor, give it another try.
>
> Patrick
>
> On Wed, Jun 8, 2016 at 3:21 PM, Marshall McMullen <
> marshall.mcmul...@gmail.com> wrote:
>
> > That makes sense. I would appreciate it if a committer could change Martin's
> > role to contributor. Otherwise we'll reach out to the Infra team to get
> > some assistance on that.
> >
> > Thanks!
> >
> > On Wed, Jun 8, 2016 at 4:04 PM, Michael Han <h...@cloudera.com> wrote:
> >
> > > I think someone (probably only a committer) just needs to give Martin the
> > > 'contributor' role.
> > >
> > > The best way to contact Apache Infra is through their Hipchat channel
> > > http://www.apache.org/dev/infra-contact
> > >
> > > On Wed, Jun 8, 2016 at 3:01 PM, Marshall McMullen <
> > > marshall.mcmul...@gmail.com> wrote:
> > >
> > > > Should Martin contact the "Apache Infrastructure Team" regarding
> this?
> > If
> > > > so, how does he do that?
> > > >
> > > > On Wed, Jun 8, 2016 at 4:00 PM, Marshall McMullen <
> > > > marshall.mcmul...@gmail.com> wrote:
> > > >
> > > > > I tried to assign this Jira to him and got an error message back:
> > > > >
> > > > > User 'makuchta' cannot be assigned issues.
> > > > >
> > > > > On Wed, Jun 8, 2016 at 3:58 PM, Michael Han <h...@cloudera.com>
> > wrote:
> > > > >
> > > > >> Martin,
> > > > >>
> > > > >> I had met a similar issue earlier; here is an email I sent earlier to the
> > > > >> dev list:
> > > > >>
> > > > >> >>
> > > > >> FYI, I met an issue today where I couldn't attach files to a JIRA issue
> > > > >> with the role of 'contributor'. I contacted the Apache Infrastructure team
> > > > >> and confirmed that:
> > > > >>
> > > > >> - For a given JIRA issue, only the *reporter*, the *assignee*, or a
> > > > >> *committer* can attach files.
> > > > >> - A contributor can only attach files to issues that are assigned to
> > > > >> and/or reported by the contributor.
> > > > >> - A workaround for a contributor to attach files to any issue is to first
> > > > >> change the assignee to the contributor, then attach the files, then change
> > > > >> the assignee back.
> > > > >> >>
> > > > >>
> > > > >> I think someone just needs to assign ZOOKEEPER-2355 to you since you are
> > > > >> working on it.
> > > > >>
> > > > >> On Wed, Jun 8, 2016 at 2:34 PM, Martin Kuchta <
> > > mar...@martinkuchta.com>
> > > > >> wrote:
> > > > >>
> > > > >> > Hi,
> > > > >> >
> > > > >> > Does anyone know if I need to do anything special to have the
> > > ability
> > > > to
> > > > >> > submit attachments and be assigned issues on JIRA? I was
> recently
> > > > >> trying to
> > > > >> > submit a patch for ZOOKEEPER-2355 and realized the option was
> > > missing
> > > > >> for
> > > > >> > me. It's not present on any other ZooKeeper JIRAs that I can
> see,
> > > > >> although
> > > > >> > I can see it on JIRAs from other Apache projects.
> > > > >> >
> > > > >> > I was working with Marshall McMullen to get the patch submitted,
> > and
> > > > our
> > > > >> > first thought was that the issue might need to be assigned to
> me,
> > > but
> > > > >> even
> > > > >> > though he was able to reassign the issue, I was not a valid user
> > to
> > > > >> assign
> > > > >> > it to.
> > > > >> >
> > > > >> > My account username is makuchta. I created it almost two weeks
> ago
> > > if
> > > > >> > that's of any relevance.
> > > > >> >
> > > > >> >
> > > > >> > Thanks,
> > > > >> >
> > > > >> > Martin
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Cheers
> > > > >> Michael.
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Cheers
> > > Michael.
> > >
> >
>


Re: Unable to contribute on JIRA

2016-06-08 Thread Marshall McMullen
That makes sense. I would appreciate it if a committer could change Martin's
role to contributor. Otherwise we'll reach out to the Infra team to get
some assistance on that.

Thanks!

On Wed, Jun 8, 2016 at 4:04 PM, Michael Han <h...@cloudera.com> wrote:

> I think someone (probably only a committer) just needs to give Martin the
> 'contributor' role.
>
> The best way to contact Apache Infra is through their Hipchat channel
> http://www.apache.org/dev/infra-contact
>
> On Wed, Jun 8, 2016 at 3:01 PM, Marshall McMullen <
> marshall.mcmul...@gmail.com> wrote:
>
> > Should Martin contact the "Apache Infrastructure Team" regarding this? If
> > so, how does he do that?
> >
> > On Wed, Jun 8, 2016 at 4:00 PM, Marshall McMullen <
> > marshall.mcmul...@gmail.com> wrote:
> >
> > > I tried to assign this Jira to him and got an error message back:
> > >
> > > User 'makuchta' cannot be assigned issues.
> > >
> > > On Wed, Jun 8, 2016 at 3:58 PM, Michael Han <h...@cloudera.com> wrote:
> > >
> > >> Martin,
> > >>
> > >> I had met a similar issue earlier; here is an email I sent earlier to the
> > >> dev list:
> > >>
> > >> >>
> > >> FYI, I met an issue today where I couldn't attach files to a JIRA issue
> > >> with the role of 'contributor'. I contacted the Apache Infrastructure team
> > >> and confirmed that:
> > >>
> > >> - For a given JIRA issue, only the *reporter*, the *assignee*, or a
> > >> *committer* can attach files.
> > >> - A contributor can only attach files to issues that are assigned to
> > >> and/or reported by the contributor.
> > >> - A workaround for a contributor to attach files to any issue is to first
> > >> change the assignee to the contributor, then attach the files, then change
> > >> the assignee back.
> > >> >>
> > >>
> > >> I think someone just needs to assign ZOOKEEPER-2355 to you since you are
> > >> working on it.
> > >>
> > >> On Wed, Jun 8, 2016 at 2:34 PM, Martin Kuchta <
> mar...@martinkuchta.com>
> > >> wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > Does anyone know if I need to do anything special to have the
> ability
> > to
> > >> > submit attachments and be assigned issues on JIRA? I was recently
> > >> trying to
> > >> > submit a patch for ZOOKEEPER-2355 and realized the option was
> missing
> > >> for
> > >> > me. It's not present on any other ZooKeeper JIRAs that I can see,
> > >> although
> > >> > I can see it on JIRAs from other Apache projects.
> > >> >
> > >> > I was working with Marshall McMullen to get the patch submitted, and
> > our
> > >> > first thought was that the issue might need to be assigned to me,
> but
> > >> even
> > >> > though he was able to reassign the issue, I was not a valid user to
> > >> assign
> > >> > it to.
> > >> >
> > >> > My account username is makuchta. I created it almost two weeks ago
> if
> > >> > that's of any relevance.
> > >> >
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Martin
> > >>
> > >>
> > >>
> > >>
> > >> --
> > >> Cheers
> > >> Michael.
> > >>
> > >
> > >
> >
>
>
>
> --
> Cheers
> Michael.
>


Re: Unable to contribute on JIRA

2016-06-08 Thread Marshall McMullen
Should Martin contact the "Apache Infrastructure Team" regarding this? If
so, how does he do that?

On Wed, Jun 8, 2016 at 4:00 PM, Marshall McMullen <
marshall.mcmul...@gmail.com> wrote:

> I tried to assign this Jira to him and got an error message back:
>
> User 'makuchta' cannot be assigned issues.
>
> On Wed, Jun 8, 2016 at 3:58 PM, Michael Han <h...@cloudera.com> wrote:
>
>> Martin,
>>
>> I had met a similar issue earlier; here is an email I sent earlier to the
>> dev list:
>>
>> >>
>> FYI, I met an issue today where I couldn't attach files to a JIRA issue
>> with the role of 'contributor'. I contacted the Apache Infrastructure team
>> and confirmed that:
>>
>> - For a given JIRA issue, only the *reporter*, the *assignee*, or a
>> *committer* can attach files.
>> - A contributor can only attach files to issues that are assigned to
>> and/or reported by the contributor.
>> - A workaround for a contributor to attach files to any issue is to first
>> change the assignee to the contributor, then attach the files, then change
>> the assignee back.
>> >>
>>
>> I think someone just needs to assign ZOOKEEPER-2355 to you since you are
>> working on it.
>>
>> On Wed, Jun 8, 2016 at 2:34 PM, Martin Kuchta <mar...@martinkuchta.com>
>> wrote:
>>
>> > Hi,
>> >
>> > Does anyone know if I need to do anything special to have the ability to
>> > submit attachments and be assigned issues on JIRA? I was recently
>> trying to
>> > submit a patch for ZOOKEEPER-2355 and realized the option was missing
>> for
>> > me. It's not present on any other ZooKeeper JIRAs that I can see,
>> although
>> > I can see it on JIRAs from other Apache projects.
>> >
>> > I was working with Marshall McMullen to get the patch submitted, and our
>> > first thought was that the issue might need to be assigned to me, but
>> even
>> > though he was able to reassign the issue, I was not a valid user to
>> assign
>> > it to.
>> >
>> > My account username is makuchta. I created it almost two weeks ago if
>> > that's of any relevance.
>> >
>> >
>> > Thanks,
>> >
>> > Martin
>>
>>
>>
>>
>> --
>> Cheers
>> Michael.
>>
>
>


Re: Unable to contribute on JIRA

2016-06-08 Thread Marshall McMullen
I tried to assign this Jira to him and got an error message back:

User 'makuchta' cannot be assigned issues.

On Wed, Jun 8, 2016 at 3:58 PM, Michael Han <h...@cloudera.com> wrote:

> Martin,
>
> I had met a similar issue earlier; here is an email I sent earlier to the
> dev list:
>
> >>
> FYI, I met an issue today where I couldn't attach files to a JIRA issue
> with the role of 'contributor'. I contacted the Apache Infrastructure team
> and confirmed that:
>
> - For a given JIRA issue, only the *reporter*, the *assignee*, or a
> *committer* can attach files.
> - A contributor can only attach files to issues that are assigned to
> and/or reported by the contributor.
> - A workaround for a contributor to attach files to any issue is to first
> change the assignee to the contributor, then attach the files, then change
> the assignee back.
> >>
>
> I think someone just needs to assign ZOOKEEPER-2355 to you since you are
> working on it.
>
> On Wed, Jun 8, 2016 at 2:34 PM, Martin Kuchta <mar...@martinkuchta.com>
> wrote:
>
> > Hi,
> >
> > Does anyone know if I need to do anything special to have the ability to
> > submit attachments and be assigned issues on JIRA? I was recently trying
> to
> > submit a patch for ZOOKEEPER-2355 and realized the option was missing for
> > me. It's not present on any other ZooKeeper JIRAs that I can see,
> although
> > I can see it on JIRAs from other Apache projects.
> >
> > I was working with Marshall McMullen to get the patch submitted, and our
> > first thought was that the issue might need to be assigned to me, but
> even
> > though he was able to reassign the issue, I was not a valid user to
> assign
> > it to.
> >
> > My account username is makuchta. I created it almost two weeks ago if
> > that's of any relevance.
> >
> >
> > Thanks,
> >
> > Martin
>
>
>
>
> --
> Cheers
> Michael.
>


[jira] [Comment Edited] (ZOOKEEPER-2355) Ephemeral node is never deleted if follower fails while reading the proposal packet

2016-06-08 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321410#comment-15321410
 ] 

Marshall McMullen edited comment on ZOOKEEPER-2355 at 6/8/16 8:43 PM:
--

[~makuchta] - I'll leave you to investigate the failure reported above.


was (Author: marshall):
@makuchta - I'll leave you to investigate the failure reported above.

> Ephemeral node is never deleted if follower fails while reading the proposal 
> packet
> ---
>
> Key: ZOOKEEPER-2355
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum, server
>Reporter: Arshad Mohammad
>    Assignee: Marshall McMullen
>Priority: Critical
> Fix For: 3.4.9
>
> Attachments: ZOOKEEPER-2355-01.patch, ZOOKEEPER-2355-02.patch, 
> ZOOKEEPER-2355-03.patch
>
>
> ZooKeeper ephemeral node is never deleted if a follower fails while reading the 
> proposal packet.
> The scenario is as follows:
> # Configure a three-node ZooKeeper cluster, let's say nodes A, B and C; 
> start all, and assume A is the leader and B and C are followers.
> # Connect to any of the servers and create ephemeral node /e1.
> # Close the session; ephemeral node /e1 will go for deletion.
> # While it is receiving the delete proposal, make Follower B fail with a 
> {{SocketTimeoutException}}. We need to do this to reproduce the scenario; 
> in a production environment it happens because of a network fault.
> # Remove the fault and check that the faulted follower is now connected to the 
> quorum.
> # Connect to any of the servers and create the same ephemeral node /e1; creation 
> succeeds.
> # Close the session; ephemeral node /e1 will go for deletion.
> # {color:red}/e1 is not deleted from the faulted Follower B. It should have 
> been deleted, as it was created again with another session.{color}
> # {color:green}/e1 is deleted from Leader A and the other Follower C.{color}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2355) Ephemeral node is never deleted if follower fails while reading the proposal packet

2016-06-08 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321410#comment-15321410
 ] 

Marshall McMullen commented on ZOOKEEPER-2355:
--

@makuchta - I'll leave you to investigate the failure reported above.

> Ephemeral node is never deleted if follower fails while reading the proposal 
> packet
> ---
>
> Key: ZOOKEEPER-2355
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum, server
>Reporter: Arshad Mohammad
>    Assignee: Marshall McMullen
>Priority: Critical
> Fix For: 3.4.9
>
> Attachments: ZOOKEEPER-2355-01.patch, ZOOKEEPER-2355-02.patch, 
> ZOOKEEPER-2355-03.patch
>
>
> ZooKeeper ephemeral node is never deleted if a follower fails while reading the 
> proposal packet.
> The scenario is as follows:
> # Configure a three-node ZooKeeper cluster, let's say nodes A, B and C; 
> start all, and assume A is the leader and B and C are followers.
> # Connect to any of the servers and create ephemeral node /e1.
> # Close the session; ephemeral node /e1 will go for deletion.
> # While it is receiving the delete proposal, make Follower B fail with a 
> {{SocketTimeoutException}}. We need to do this to reproduce the scenario; 
> in a production environment it happens because of a network fault.
> # Remove the fault and check that the faulted follower is now connected to the 
> quorum.
> # Connect to any of the servers and create the same ephemeral node /e1; creation 
> succeeds.
> # Close the session; ephemeral node /e1 will go for deletion.
> # {color:red}/e1 is not deleted from the faulted Follower B. It should have 
> been deleted, as it was created again with another session.{color}
> # {color:green}/e1 is deleted from Leader A and the other Follower C.{color}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (ZOOKEEPER-2355) Ephemeral node is never deleted if follower fails while reading the proposal packet

2016-06-08 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2355:
-
Attachment: ZOOKEEPER-2355-03.patch

Updated patch with Martin's proposed solution.

> Ephemeral node is never deleted if follower fails while reading the proposal 
> packet
> ---
>
> Key: ZOOKEEPER-2355
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum, server
>Reporter: Arshad Mohammad
>    Assignee: Marshall McMullen
>Priority: Critical
> Fix For: 3.4.9
>
> Attachments: ZOOKEEPER-2355-01.patch, ZOOKEEPER-2355-02.patch, 
> ZOOKEEPER-2355-03.patch
>
>
> ZooKeeper ephemeral node is never deleted if a follower fails while reading the 
> proposal packet.
> The scenario is as follows:
> # Configure a three-node ZooKeeper cluster, let's say nodes A, B and C; 
> start all, and assume A is the leader and B and C are followers.
> # Connect to any of the servers and create ephemeral node /e1.
> # Close the session; ephemeral node /e1 will go for deletion.
> # While it is receiving the delete proposal, make Follower B fail with a 
> {{SocketTimeoutException}}. We need to do this to reproduce the scenario; 
> in a production environment it happens because of a network fault.
> # Remove the fault and check that the faulted follower is now connected to the 
> quorum.
> # Connect to any of the servers and create the same ephemeral node /e1; creation 
> succeeds.
> # Close the session; ephemeral node /e1 will go for deletion.
> # {color:red}/e1 is not deleted from the faulted Follower B. It should have 
> been deleted, as it was created again with another session.{color}
> # {color:green}/e1 is deleted from Leader A and the other Follower C.{color}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (ZOOKEEPER-2355) Ephemeral node is never deleted if follower fails while reading the proposal packet

2016-06-08 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen reassigned ZOOKEEPER-2355:


Assignee: Marshall McMullen  (was: Arshad Mohammad)

> Ephemeral node is never deleted if follower fails while reading the proposal 
> packet
> ---
>
> Key: ZOOKEEPER-2355
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum, server
>Reporter: Arshad Mohammad
>    Assignee: Marshall McMullen
>Priority: Critical
> Fix For: 3.4.9
>
> Attachments: ZOOKEEPER-2355-01.patch, ZOOKEEPER-2355-02.patch
>
>
> ZooKeeper ephemeral node is never deleted if a follower fails while reading the 
> proposal packet.
> The scenario is as follows:
> # Configure a three-node ZooKeeper cluster, let's say nodes A, B and C; 
> start all, and assume A is the leader and B and C are followers.
> # Connect to any of the servers and create ephemeral node /e1.
> # Close the session; ephemeral node /e1 will go for deletion.
> # While it is receiving the delete proposal, make Follower B fail with a 
> {{SocketTimeoutException}}. We need to do this to reproduce the scenario; 
> in a production environment it happens because of a network fault.
> # Remove the fault and check that the faulted follower is now connected to the 
> quorum.
> # Connect to any of the servers and create the same ephemeral node /e1; creation 
> succeeds.
> # Close the session; ephemeral node /e1 will go for deletion.
> # {color:red}/e1 is not deleted from the faulted Follower B. It should have 
> been deleted, as it was created again with another session.{color}
> # {color:green}/e1 is deleted from Leader A and the other Follower C.{color}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-1485) client xid overflow is not handled

2016-05-31 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15307987#comment-15307987
 ] 

Marshall McMullen commented on ZOOKEEPER-1485:
--

[~fpj] - I agree we should fix ZOOKEEPER-22. Does it make sense to fix this 
case first and then come back to ZOOKEEPER-22? We should handle overflow safely 
either way, and in that regard I think ZOOKEEPER-22 would be good follow-on work 
to do after this one.

I think the issue that [~makuchta] brought up with regard to closing the session 
comes down to not understanding how the client reacts to having the session closed.

> client xid overflow is not handled
> --
>
> Key: ZOOKEEPER-1485
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1485
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client, java client
>Affects Versions: 3.4.3, 3.3.5
>Reporter: Michi Mutsuzaki
>Assignee: Bruce Gao
>
> Both Java and C clients use signed 32-bit int as XIDs. XIDs are assumed to be 
> non-negative, and zookeeper uses some negative values as special XIDs (e.g. 
> -2 for ping, -4 for auth). However, neither Java nor C client ensures the 
> XIDs it generates are non-negative, and the server doesn't reject negative 
> XIDs.
> Pat had some suggestions on how to fix this:
> - (bin-compat) Expire the session when the client sends a negative XID.
> - (bin-incompat) In addition to expiring the session, use 64-bit int for XID 
> so that overflow will practically never happen.
> --Michi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-1485) client xid overflow is not handled

2016-05-27 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304564#comment-15304564
 ] 

Marshall McMullen commented on ZOOKEEPER-1485:
--

[~fanster.z], [~fpj] or [~michim] - any of you have any thoughts on this?

> client xid overflow is not handled
> --
>
> Key: ZOOKEEPER-1485
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1485
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client, java client
>Affects Versions: 3.4.3, 3.3.5
>Reporter: Michi Mutsuzaki
>Assignee: Bruce Gao
>
> Both Java and C clients use signed 32-bit int as XIDs. XIDs are assumed to be 
> non-negative, and zookeeper uses some negative values as special XIDs (e.g. 
> -2 for ping, -4 for auth). However, neither Java nor C client ensures the 
> XIDs it generates are non-negative, and the server doesn't reject negative 
> XIDs.
> Pat had some suggestions on how to fix this:
> - (bin-compat) Expire the session when the client sends a negative XID.
> - (bin-incompat) In addition to expiring the session, use 64-bit int for XID 
> so that overflow will practically never happen.
> --Michi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2152) Intermittent failure in TestReconfig.cc

2016-05-27 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304556#comment-15304556
 ] 

Marshall McMullen commented on ZOOKEEPER-2152:
--

[~makuchta] - This intermittent test failure and the thoughts folks had on it 
may interest you, since I think you're seeing this as well.

> Intermittent failure in TestReconfig.cc
> ---
>
> Key: ZOOKEEPER-2152
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2152
> Project: ZooKeeper
>  Issue Type: Sub-task
>  Components: c client
>Reporter: Michi Mutsuzaki
>Assignee: Michael Han
>  Labels: reconfiguration
> Fix For: 3.6.0
>
>
> I'm seeing this failure in the c client test once in a while:
> {noformat}
> [exec] 
> /home/jenkins/jenkins-slave/workspace/ZooKeeper-trunk/trunk/src/c/tests/TestReconfig.cc:474:
>  Assertion: assertion failed [Expression: found != string::npos, 
> 10.10.10.4:2004 not in newComing list]
> {noformat}
> https://builds.apache.org/job/ZooKeeper-trunk/2640/console



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-1485) client xid overflow is not handled

2016-05-26 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303532#comment-15303532
 ] 

Marshall McMullen commented on ZOOKEEPER-1485:
--

I think that [~makuchta] is right on this one as well. If he's right, and the 
only purpose of the C client xid is to match operations submitted to the server 
with the responses that come back, then the simplest and most correct thing to do 
here seems to be the following:

1. In get_xid(), initialize xid to 0 rather than time(0). Starting at zero instead 
of the time since the epoch gives us as much runway as possible before we wrap.

2. As Martin suggests, inside get_xid(), if we would overflow INT32_MAX, simply 
wrap back to 0. I don't think there's any risk of collisions here, since that 
gives us the maximum number of operations before wrapping. The odds of an 
in-flight operation from 2,147,483,647 operations ago still lingering around or 
causing any confusion seem beyond unlikely IMO.

The nice thing about this is we don't have to make any changes to the server or 
worry about compatibility.
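
Concretely, (1) and (2) together could look something like this sketch 
(illustrative only, not the actual get_xid() in zookeeper.c, which would also 
need to remain thread-safe, e.g. via an atomic fetch-and-add):

{code}
#include <stdint.h>

/* Sketch of the proposal: start the counter at 0 and wrap it back to 0
 * before it can ever go negative, so the client never emits one of the
 * reserved negative XIDs (ping, auth, ...). Synchronization is omitted. */
static int32_t next_xid = 0;

static int32_t get_xid(void)
{
    int32_t xid = next_xid;
    next_xid = (next_xid == INT32_MAX) ? 0 : next_xid + 1;
    return xid;
}
{code}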

[~phunt] what do you think?

> client xid overflow is not handled
> --
>
> Key: ZOOKEEPER-1485
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1485
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client, java client
>Affects Versions: 3.4.3, 3.3.5
>Reporter: Michi Mutsuzaki
>Assignee: Bruce Gao
>
> Both Java and C clients use signed 32-bit int as XIDs. XIDs are assumed to be 
> non-negative, and zookeeper uses some negative values as special XIDs (e.g. 
> -2 for ping, -4 for auth). However, neither Java nor C client ensures the 
> XIDs it generates are non-negative, and the server doesn't reject negative 
> XIDs.
> Pat had some suggestions on how to fix this:
> - (bin-compat) Expire the session when the client sends a negative XID.
> - (bin-incompat) In addition to expiring the session, use 64-bit int for XID 
> so that overflow will practically never happen.
> --Michi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (ZOOKEEPER-2318) segfault in auth_completion_func

2016-05-26 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen resolved ZOOKEEPER-2318.
--
Resolution: Duplicate

> segfault in auth_completion_func
> 
>
> Key: ZOOKEEPER-2318
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2318
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.5.0
>    Reporter: Marshall McMullen
>
> We have seen some sporadic issues with unexplained segfaults inside 
> auth_completion_func. The interesting thing is we are not using any auth 
> mechanism at all. This happened against this version of the code:
> svn.apache.org/repos/asf/zookeeper/trunk@1547702
> Here's the stacktrace we are seeing:
> {code}
> Thread 1 (Thread 0x7f21d13ff700 ? (LWP 5230)):
> #0  0x7f21efff42f0 in auth_completion_func (rc=0, zh=0x7f21e7470800) at 
> src/zookeeper.c:1696
> #1  0x7f21efff7898 in zookeeper_process (zh=0x7f21e7470800, events=2) at 
> src/zookeeper.c:2708
> #2  0x7f21f0006583 in do_io (v=0x7f21e7470800) at src/mt_adaptor.c:440
> #3  0x7f21eeab7e9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #4  0x7f21ed1803fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #5  0x in ?? ()
> {code}
> The offending line in our case is:
> 1696LOG_INFO(LOGCALLBACK(zh), "Authentication scheme %s 
> succeeded", zh->auth_h.auth->scheme);
> It must be the case that zh->auth_h.auth is NULL for this to happen since the 
> code path returns if zh is NULL.
> Interesting log messages around this time:
> {code}
> Socket [10.170.243.7:2181] zk retcode=-2, errno=115(Operation now in 
> progress): unexpected server response: expected 0xfff9, but received 
> 0xfff8
> Priming connection to [10.170.243.4:2181]: last_zxid=0x370eb4d
> initiated connection to server [10.170.243.4:2181]
> Oct 13 12:03:21.273384 zookeeper - INFO  
> [NIOServerCxnFactory.AcceptThread:/10.170.243.4:2181:NIOServerCnxnFactory$AcceptThread@296]
>  - Accepted socket connection from /10.170.243.4:48523
> Oct 13 12:03:21.274321 zookeeper - WARN  
> [NIOWorkerThread-24:ZooKeeperServer@822] - Connection request from old client 
> /10.170.243.4:48523; will be dropped if server is in r-o mode
> Oct 13 12:03:21.274452 zookeeper - INFO  
> [NIOWorkerThread-24:ZooKeeperServer@869] - Client attempting to renew session 
> 0x311596d004a at /10.170.243.4:48523; client last zxid is 0x30370eb4d; 
> server last zxid is 0x30370eb4d
> Oct 13 12:03:21.274584 zookeeper - INFO  [NIOWorkerThread-24:Learner@115] - 
> Revalidating client: 0x311596d004a
> session establishment complete on server [10.170.243.4:2181], 
> sessionId=0x311596d004a, negotiated timeout=2
> Oct 13 12:03:21.275693 zookeeper - INFO  
> [QuorumPeer[myid=1]/10.170.243.4:2181:ZooKeeperServer@611] - Established 
> session 0x311596d004a with negotiated timeout 2 for client 
> /10.170.243.4:48523
> Oct 13 12:03:24.229590 zookeeper - WARN  
> [NIOWorkerThread-8:NIOServerCnxn@361] - Unable to read additional data from 
> client sessionid 0x311596d004a, likely client has closed socket
> Oct 13 12:03:24.230018 zookeeper - INFO  
> [NIOWorkerThread-8:NIOServerCnxn@999] - Closed socket connection for client 
> /10.170.243.4:48523 which had sessionid 0x311596d004a
> Oct 13 12:03:24.230257 zookeeper - WARN  
> [NIOWorkerThread-19:NIOServerCnxn@361] - Unable to read additional data from 
> client sessionid 0x12743aa0001, likely client has closed socket
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2318) segfault in auth_completion_func

2016-05-26 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303521#comment-15303521
 ] 

Marshall McMullen commented on ZOOKEEPER-2318:
--

I agree with [~makuchta]: this looks identical to ZOOKEEPER-1485. The tell-tale 
is that right before this error, in every occurrence we've seen, we see this 
super-important indicator of ZOOKEEPER-1485:

{code}
Socket [10.170.243.7:2181] zk retcode=-2, errno=115(Operation now in progress): 
unexpected server response: expected 0xfff9, but received 0xfff8
{code}

I'll close this as a duplicate of ZOOKEEPER-1485. Nice sleuthing on this one, 
[~makuchta].

> segfault in auth_completion_func
> 
>
> Key: ZOOKEEPER-2318
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2318
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.5.0
>    Reporter: Marshall McMullen
>
> We have seen some sporadic issues with unexplained segfaults inside 
> auth_completion_func. The interesting thing is we are not using any auth 
> mechanism at all. This happened against this version of the code:
> svn.apache.org/repos/asf/zookeeper/trunk@1547702
> Here's the stacktrace we are seeing:
> {code}
> Thread 1 (Thread 0x7f21d13ff700 ? (LWP 5230)):
> #0  0x7f21efff42f0 in auth_completion_func (rc=0, zh=0x7f21e7470800) at 
> src/zookeeper.c:1696
> #1  0x7f21efff7898 in zookeeper_process (zh=0x7f21e7470800, events=2) at 
> src/zookeeper.c:2708
> #2  0x7f21f0006583 in do_io (v=0x7f21e7470800) at src/mt_adaptor.c:440
> #3  0x7f21eeab7e9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #4  0x7f21ed1803fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #5  0x in ?? ()
> {code}
> The offending line in our case is:
> 1696LOG_INFO(LOGCALLBACK(zh), "Authentication scheme %s 
> succeeded", zh->auth_h.auth->scheme);
> It must be the case that zh->auth_h.auth is NULL for this to happen since the 
> code path returns if zh is NULL.
> Interesting log messages around this time:
> {code}
> Socket [10.170.243.7:2181] zk retcode=-2, errno=115(Operation now in 
> progress): unexpected server response: expected 0xfff9, but received 
> 0xfff8
> Priming connection to [10.170.243.4:2181]: last_zxid=0x370eb4d
> initiated connection to server [10.170.243.4:2181]
> Oct 13 12:03:21.273384 zookeeper - INFO  
> [NIOServerCxnFactory.AcceptThread:/10.170.243.4:2181:NIOServerCnxnFactory$AcceptThread@296]
>  - Accepted socket connection from /10.170.243.4:48523
> Oct 13 12:03:21.274321 zookeeper - WARN  
> [NIOWorkerThread-24:ZooKeeperServer@822] - Connection request from old client 
> /10.170.243.4:48523; will be dropped if server is in r-o mode
> Oct 13 12:03:21.274452 zookeeper - INFO  
> [NIOWorkerThread-24:ZooKeeperServer@869] - Client attempting to renew session 
> 0x311596d004a at /10.170.243.4:48523; client last zxid is 0x30370eb4d; 
> server last zxid is 0x30370eb4d
> Oct 13 12:03:21.274584 zookeeper - INFO  [NIOWorkerThread-24:Learner@115] - 
> Revalidating client: 0x311596d004a
> session establishment complete on server [10.170.243.4:2181], 
> sessionId=0x311596d004a, negotiated timeout=2
> Oct 13 12:03:21.275693 zookeeper - INFO  
> [QuorumPeer[myid=1]/10.170.243.4:2181:ZooKeeperServer@611] - Established 
> session 0x311596d004a with negotiated timeout 2 for client 
> /10.170.243.4:48523
> Oct 13 12:03:24.229590 zookeeper - WARN  
> [NIOWorkerThread-8:NIOServerCnxn@361] - Unable to read additional data from 
> client sessionid 0x311596d004a, likely client has closed socket
> Oct 13 12:03:24.230018 zookeeper - INFO  
> [NIOWorkerThread-8:NIOServerCnxn@999] - Closed socket connection for client 
> /10.170.243.4:48523 which had sessionid 0x311596d004a
> Oct 13 12:03:24.230257 zookeeper - WARN  
> [NIOWorkerThread-19:NIOServerCnxn@361] - Unable to read additional data from 
> client sessionid 0x12743aa0001, likely client has closed socket
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2152) Intermittent failure in TestReconfig.cc

2016-05-05 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272510#comment-15272510
 ] 

Marshall McMullen commented on ZOOKEEPER-2152:
--

[~shralex] and [~hanm] - I've been so unbearably swamped at work the last 6 
months that I've not been able to come up for air at all. I'm happy to help 
advise and review changes on this but don't have the bandwidth to commit to 
working on this myself in the near term. I'm hoping things will quiet down for 
me at work so I can start contributing more here as there are so many things 
I'd like to do! Thanks guys!

> Intermittent failure in TestReconfig.cc
> ---
>
> Key: ZOOKEEPER-2152
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2152
> Project: ZooKeeper
>  Issue Type: Sub-task
>  Components: c client
>Reporter: Michi Mutsuzaki
>Assignee: Michael Han
>  Labels: reconfiguration
> Fix For: 3.6.0
>
>
> I'm seeing this failure in the c client test once in a while:
> {noformat}
> [exec] 
> /home/jenkins/jenkins-slave/workspace/ZooKeeper-trunk/trunk/src/c/tests/TestReconfig.cc:474:
>  Assertion: assertion failed [Expression: found != string::npos, 
> 10.10.10.4:2004 not in newComing list]
> {noformat}
> https://builds.apache.org/job/ZooKeeper-trunk/2640/console



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2355) ZooKeeper ephemeral node is never deleted if follower fail while reading the proposal packet

2016-01-18 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105914#comment-15105914
 ] 

Marshall McMullen commented on ZOOKEEPER-2355:
--

I wonder if this is the same issue described in 
https://issues.apache.org/jira/browse/ZOOKEEPER-2145

> ZooKeeper ephemeral node is never deleted if follower fail while reading the 
> proposal packet
> 
>
> Key: ZOOKEEPER-2355
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum, server
>Reporter: Arshad Mohammad
>Assignee: Arshad Mohammad
>Priority: Critical
> Attachments: ZOOKEEPER-2355-01.patch
>
>
> ZooKeeper ephemeral node is never deleted if a follower fails while reading the 
> proposal packet.
> The scenario is as follows:
> # Configure a three-node ZooKeeper cluster, let's say nodes A, B and C; 
> start all, and assume A is the leader and B and C are followers.
> # Connect to any of the servers and create ephemeral node /e1.
> # Close the session; ephemeral node /e1 will go for deletion.
> # While it is receiving the delete proposal, make Follower B fail with a 
> {{SocketTimeoutException}}. We need to do this to reproduce the scenario; 
> in a production environment it happens because of a network fault.
> # Remove the fault and check that the faulted follower is now connected to the 
> quorum.
> # Connect to any of the servers and create the same ephemeral node /e1; creation 
> succeeds.
> # Close the session; ephemeral node /e1 will go for deletion.
> # {color:red}/e1 is not deleted from the faulted Follower B. It should have 
> been deleted, as it was created again with another session.{color}
> # {color:green}/e1 is deleted from Leader A and the other Follower C.{color}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2318) segfault in auth_completion_func

2016-01-06 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085877#comment-15085877
 ] 

Marshall McMullen commented on ZOOKEEPER-2318:
--

Anyone else seeing this? We haven't updated our internal ZooKeeper version in 
quite a while, so it's possible this is fixed in newer versions.

> segfault in auth_completion_func
> 
>
> Key: ZOOKEEPER-2318
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2318
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.5.0
>    Reporter: Marshall McMullen
>
> We have seen some sporadic issues with unexplained segfaults inside 
> auth_completion_func. The interesting thing is we are not using any auth 
> mechanism at all. This happened against this version of the code:
> svn.apache.org/repos/asf/zookeeper/trunk@1547702
> Here's the stacktrace we are seeing:
> {code}
> Thread 1 (Thread 0x7f21d13ff700 ? (LWP 5230)):
> #0  0x7f21efff42f0 in auth_completion_func (rc=0, zh=0x7f21e7470800) at 
> src/zookeeper.c:1696
> #1  0x7f21efff7898 in zookeeper_process (zh=0x7f21e7470800, events=2) at 
> src/zookeeper.c:2708
> #2  0x7f21f0006583 in do_io (v=0x7f21e7470800) at src/mt_adaptor.c:440
> #3  0x7f21eeab7e9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #4  0x7f21ed1803fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #5  0x in ?? ()
> {code}
> The offending line in our case is:
> 1696LOG_INFO(LOGCALLBACK(zh), "Authentication scheme %s 
> succeeded", zh->auth_h.auth->scheme);
> It must be the case that zh->auth_h.auth is NULL for this to happen since the 
> code path returns if zh is NULL.
> Interesting log messages around this time:
> {code}
> Socket [10.170.243.7:2181] zk retcode=-2, errno=115(Operation now in 
> progress): unexpected server response: expected 0xfff9, but received 
> 0xfff8
> Priming connection to [10.170.243.4:2181]: last_zxid=0x370eb4d
> initiated connection to server [10.170.243.4:2181]
> Oct 13 12:03:21.273384 zookeeper - INFO  
> [NIOServerCxnFactory.AcceptThread:/10.170.243.4:2181:NIOServerCnxnFactory$AcceptThread@296]
>  - Accepted socket connection from /10.170.243.4:48523
> Oct 13 12:03:21.274321 zookeeper - WARN  
> [NIOWorkerThread-24:ZooKeeperServer@822] - Connection request from old client 
> /10.170.243.4:48523; will be dropped if server is in r-o mode
> Oct 13 12:03:21.274452 zookeeper - INFO  
> [NIOWorkerThread-24:ZooKeeperServer@869] - Client attempting to renew session 
> 0x311596d004a at /10.170.243.4:48523; client last zxid is 0x30370eb4d; 
> server last zxid is 0x30370eb4d
> Oct 13 12:03:21.274584 zookeeper - INFO  [NIOWorkerThread-24:Learner@115] - 
> Revalidating client: 0x311596d004a
> session establishment complete on server [10.170.243.4:2181], 
> sessionId=0x311596d004a, negotiated timeout=2
> Oct 13 12:03:21.275693 zookeeper - INFO  
> [QuorumPeer[myid=1]/10.170.243.4:2181:ZooKeeperServer@611] - Established 
> session 0x311596d004a with negotiated timeout 2 for client 
> /10.170.243.4:48523
> Oct 13 12:03:24.229590 zookeeper - WARN  
> [NIOWorkerThread-8:NIOServerCnxn@361] - Unable to read additional data from 
> client sessionid 0x311596d004a, likely client has closed socket
> Oct 13 12:03:24.230018 zookeeper - INFO  
> [NIOWorkerThread-8:NIOServerCnxn@999] - Closed socket connection for client 
> /10.170.243.4:48523 which had sessionid 0x311596d004a
> Oct 13 12:03:24.230257 zookeeper - WARN  
> [NIOWorkerThread-19:NIOServerCnxn@361] - Unable to read additional data from 
> client sessionid 0x12743aa0001, likely client has closed socket
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2311) assert in setup_random

2015-12-05 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15043537#comment-15043537
 ] 

Marshall McMullen commented on ZOOKEEPER-2311:
--

[~rgs] - Yes, I agree. The short read is still a problem. I think the EBADF is 
actually a bug in our application, not in ZooKeeper. So unless I discover 
otherwise, I think we should ignore the EBADF for now. I'll open a separate 
JIRA if I find it's a real issue.

I will regenerate this patch, though, because I didn't create it properly the 
first time.
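
For reference, a short-read-safe version of the seed setup could look something 
like the sketch below (illustrative only, not the attached patch):

{code}
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch only: read the whole seed, retrying on short reads and EINTR,
 * and fall back to getpid() if /dev/urandom cannot be read in full. */
static void setup_random_sketch(void)
{
    int seed = 0;
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd == -1) {
        seed = getpid();
    } else {
        char *p = (char *)&seed;
        size_t left = sizeof(seed);
        while (left > 0) {
            ssize_t rc = read(fd, p, left);
            if (rc > 0) {
                p += rc;
                left -= (size_t)rc;
            } else if (rc == -1 && errno == EINTR) {
                continue;          /* interrupted: retry the read */
            } else {
                break;             /* EOF or a hard error */
            }
        }
        close(fd);
        if (left != 0)
            seed = getpid();       /* couldn't read a full seed */
    }
    srandom(seed);
    srand48(seed);
}
{code}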

> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>    Reporter: Marshall McMullen
> Attachments: ZOOKEEPER-2311.patch
>
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);
>  533 if (fd == -1) {
>  534 seed = getpid();
>  535 } else {
>  536 int rc = read(fd, &seed, sizeof(seed));
>  537 assert(rc == sizeof(seed));
>  538 close(fd);
>  539 }
>  540 srandom(seed);
>  541 srand48(seed);
>  542 #endif
> {code}
> The core files show:
> Program terminated with signal 6, Aborted.
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x7f9ff6652e42 in __assert_fail () from 
> /lib/x86_64-linux-gnu/libc.so.6
> #4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
> #5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
> hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
> avec=0x7f9fd87fab60) at src/zookeeper.c:730
> #6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
> src/zookeeper.c:801
> #7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
> fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
> src/zookeeper.c:1980
> #8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
> #9  0x7f9ff804de9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #11 0x in ?? ()
> I'm not sure what the underlying cause of this is... But POSIX always allows 
> for a short read(2), and any program MUST check for short reads... 
> Has anyone else encountered this issue? We are seeing it rather frequently 
> which is concerning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-12-05 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Attachment: ZOOKEEPER-2311.patch

Updated patch to be generated from the right directory this time.

> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>    Reporter: Marshall McMullen
> Attachments: ZOOKEEPER-2311.patch, ZOOKEEPER-2311.patch
>
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);
>  533 if (fd == -1) {
>  534 seed = getpid();
>  535 } else {
>  536 int rc = read(fd, &seed, sizeof(seed));
>  537 assert(rc == sizeof(seed));
>  538 close(fd);
>  539 }
>  540 srandom(seed);
>  541 srand48(seed);
>  542 #endif
> {code}
> The core files show:
> Program terminated with signal 6, Aborted.
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x7f9ff6652e42 in __assert_fail () from 
> /lib/x86_64-linux-gnu/libc.so.6
> #4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
> #5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
> hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
> avec=0x7f9fd87fab60) at src/zookeeper.c:730
> #6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
> src/zookeeper.c:801
> #7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
> fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
> src/zookeeper.c:1980
> #8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
> #9  0x7f9ff804de9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #11 0x in ?? ()
> I'm not sure what the underlying cause of this is... But POSIX always allows 
> for a short read(2), and any program MUST check for short reads... 
> Has anyone else encountered this issue? We are seeing it rather frequently 
> which is concerning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2311) assert in setup_random

2015-12-01 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034904#comment-15034904
 ] 

Marshall McMullen commented on ZOOKEEPER-2311:
--

I got another reproduction of this and this time captured a core file. And I was 
wrong originally. It's not a short read that is causing this. Instead the read is 
failing with a return code of -1 and errno is set to EBADF. The manpage for 
read(2) indicates this can only happen when:

{code}
   EBADF  fd is not a valid file descriptor or is not open for reading.
{code}

But we specifically opened it 2 lines of code above that and checked to ensure 
it wasn't -1. 

In the core file I also see that the fd is valid:

{code}
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
476 in src/zookeeper.c
(gdb) print errno
$3 = 9
(gdb) print fd
$4 = 140
(gdb) print seed
$5 = 32671
{code}

It's odd that seed has something in it. That could mean we read _something_, 
but it could also be because this code never initialized seed to zero and it's 
got whatever garbage was on the stack.

The only other thing that's very curious here is that I think when this happens 
it coincides with a call to zookeeper_close. But this is a local stack variable 
so I can't fathom how that could cause this failure scenario.
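
That said, even though the fd variable itself is local, the descriptor number is 
process-wide, so another thread closing that number (for example something racing 
with zookeeper_close) would make our read() fail with EBADF even though open() had 
just succeeded. A purely illustrative, self-contained snippet of that shape (not 
our code, just the pattern):

{code}
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd == -1)
        return 1;

    /* Simulate another thread closing the same descriptor number
     * between our open() and read(). */
    close(fd);

    int seed = 0;
    ssize_t rc = read(fd, &seed, sizeof(seed));
    printf("rc=%zd errno=%d (%s)\n", rc, errno, strerror(errno)); /* EBADF */
    return 0;
}
{code}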

I'll keep digging.

> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>    Reporter: Marshall McMullen
> Attachments: ZOOKEEPER-2311.patch
>
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);
>  533 if (fd == -1) {
>  534 seed = getpid();
>  535 } else {
>  536 int rc = read(fd, &seed, sizeof(seed));
>  537 assert(rc == sizeof(seed));
>  538 close(fd);
>  539 }
>  540 srandom(seed);
>  541 srand48(seed);
>  542 #endif
> {code}
> The core files show:
> Program terminated with signal 6, Aborted.
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x7f9ff6652e42 in __assert_fail () from 
> /lib/x86_64-linux-gnu/libc.so.6
> #4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
> #5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
> hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
> avec=0x7f9fd87fab60) at src/zookeeper.c:730
> #6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
> src/zookeeper.c:801
> #7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
> fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
> src/zookeeper.c:1980
> #8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
> #9  0x7f9ff804de9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #11 0x in ?? ()
> I'm not sure what the underlying cause of this is... But POSIX always allows 
> for a short read(2), and any program MUST check for short reads... 
> Has anyone else encountered this issue? We are seeing it rather frequently 
> which is concerning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2311) assert in setup_random

2015-11-30 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032536#comment-15032536
 ] 

Marshall McMullen commented on ZOOKEEPER-2311:
--

A very specific LKML thread related to this exact behavior: 
https://lkml.org/lkml/2005/1/13/485

This email thread indicates that there is in general an assumption that reading 
from /dev/urandom will never result in a short read. In actuality, in the face 
of signals, that's not really guaranteed. As with any call to read(2), the 
caller must handle short reads properly. 
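
For reference, a hardened setup_random along those lines might look roughly like 
this (just a sketch, not the attached patch): keep reading until sizeof(seed) 
bytes have arrived, retry on EINTR, and fall back to getpid() if /dev/urandom 
cannot be read.

{code}
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

static void setup_random(void)
{
#ifndef _WIN32
    int seed = getpid();                        /* fallback seed */
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd != -1) {
        size_t got = 0;
        while (got < sizeof(seed)) {
            ssize_t rc = read(fd, (char *)&seed + got, sizeof(seed) - got);
            if (rc > 0) {
                got += (size_t)rc;              /* short read: keep reading */
            } else if (rc == -1 && errno == EINTR) {
                continue;                       /* interrupted by a signal: retry */
            } else {
                seed = getpid();                /* EOF or hard error: fall back */
                break;
            }
        }
        close(fd);
    }
    srandom(seed);
    srand48(seed);
#endif
}
{code}

The getpid() fallback matches what the existing code already does when open() 
fails, so behavior only changes in the error paths.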

> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>    Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);
>  533 if (fd == -1) {
>  534 seed = getpid();
>  535 } else {
>  536 int rc = read(fd, &seed, sizeof(seed));
>  537 assert(rc == sizeof(seed));
>  538 close(fd);
>  539 }
>  540 srandom(seed);
>  541 srand48(seed);
>  542 #endif
> {code}
> The core files show:
> Program terminated with signal 6, Aborted.
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x7f9ff6652e42 in __assert_fail () from 
> /lib/x86_64-linux-gnu/libc.so.6
> #4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
> #5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
> hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
> avec=0x7f9fd87fab60) at src/zookeeper.c:730
> #6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
> src/zookeeper.c:801
> #7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
> fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
> src/zookeeper.c:1980
> #8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
> #9  0x7f9ff804de9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #11 0x in ?? ()
> I'm not sure what the underlying cause of this is... But POSIX always allows 
> for a short read(2), and any program MUST check for short reads... 
> Has anyone else encountered this issue? We are seeing it rather frequently 
> which is concerning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-30 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Attachment: ZOOKEEPER-2311.patch

> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>    Reporter: Marshall McMullen
> Attachments: ZOOKEEPER-2311.patch
>
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);
>  533 if (fd == -1) {
>  534 seed = getpid();
>  535 } else {
>  536 int rc = read(fd, &seed, sizeof(seed));
>  537 assert(rc == sizeof(seed));
>  538 close(fd);
>  539 }
>  540 srandom(seed);
>  541 srand48(seed);
>  542 #endif
> {code}
> The core files show:
> Program terminated with signal 6, Aborted.
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x7f9ff6652e42 in __assert_fail () from 
> /lib/x86_64-linux-gnu/libc.so.6
> #4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
> #5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
> hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
> avec=0x7f9fd87fab60) at src/zookeeper.c:730
> #6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
> src/zookeeper.c:801
> #7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
> fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
> src/zookeeper.c:1980
> #8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
> #9  0x7f9ff804de9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #11 0x in ?? ()
> I'm not sure what the underlying cause of this is... But POSIX always allows 
> for a short read(2), and any program MUST check for short reads... 
> Has anyone else encountered this issue? We are seeing it rather frequently 
> which is concerning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2311) assert in setup_random

2015-11-30 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032765#comment-15032765
 ] 

Marshall McMullen commented on ZOOKEEPER-2311:
--

Uploaded patch to harden setup_random against short reads from /dev/urandom per 
LKML thread indicating this is a valid non-error path.

> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>    Reporter: Marshall McMullen
> Attachments: ZOOKEEPER-2311.patch
>
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);
>  533 if (fd == -1) {
>  534 seed = getpid();
>  535 } else {
>  536 int rc = read(fd, &seed, sizeof(seed));
>  537 assert(rc == sizeof(seed));
>  538 close(fd);
>  539 }
>  540 srandom(seed);
>  541 srand48(seed);
>  542 #endif
> {code}
> The core files show:
> Program terminated with signal 6, Aborted.
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x7f9ff6652e42 in __assert_fail () from 
> /lib/x86_64-linux-gnu/libc.so.6
> #4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
> #5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
> hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
> avec=0x7f9fd87fab60) at src/zookeeper.c:730
> #6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
> src/zookeeper.c:801
> #7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
> fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
> src/zookeeper.c:1980
> #8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
> #9  0x7f9ff804de9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #11 0x in ?? ()
> I'm not sure what the underlying cause of this is... But POSIX always allows 
> for a short read(2), and any program MUST check for short reads... 
> Has anyone else encountered this issue? We are seeing it rather frequently 
> which is concerning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (ZOOKEEPER-2318) segfault in auth_completion_func

2015-11-09 Thread Marshall McMullen (JIRA)
Marshall McMullen created ZOOKEEPER-2318:


 Summary: segfault in auth_completion_func
 Key: ZOOKEEPER-2318
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2318
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.5.0
Reporter: Marshall McMullen


We have seen some sporadic issues with unexplained segfaults inside 
auth_completion_func. The interesting thing is we are not using any auth 
mechanism at all. This happened against this version of the code:

svn.apache.org/repos/asf/zookeeper/trunk@1547702

Here's the stacktrace we are seeing:

{code}
Thread 1 (Thread 0x7f21d13ff700 ? (LWP 5230)):
#0  0x7f21efff42f0 in auth_completion_func (rc=0, zh=0x7f21e7470800) at 
src/zookeeper.c:1696
#1  0x7f21efff7898 in zookeeper_process (zh=0x7f21e7470800, events=2) at 
src/zookeeper.c:2708
#2  0x7f21f0006583 in do_io (v=0x7f21e7470800) at src/mt_adaptor.c:440
#3  0x7f21eeab7e9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#4  0x7f21ed1803fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x in ?? ()
{code}

The offending line in our case is:

1696    LOG_INFO(LOGCALLBACK(zh), "Authentication scheme %s succeeded", 
zh->auth_h.auth->scheme);

It must be the case that zh->auth_h.auth is NULL for this to happen since the 
code path returns if zh is NULL.
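
Guarding the auth list before the dereference would at least avoid the crash. To 
make the shape concrete, here is a tiny self-contained mock (fake_handle and 
log_auth_success are made-up names that only mirror the layout of the real 
structures):

{code}
#include <stdio.h>

/* Made-up types that only mirror the shape of the real zhandle_t fields. */
struct auth_info      { const char *scheme; };
struct auth_list_head { struct auth_info *auth; };
struct fake_handle    { struct auth_list_head auth_h; };

static void log_auth_success(struct fake_handle *zh)
{
    /* Check the auth list as well as the handle before dereferencing,
     * since no auth may ever have been added. */
    if (zh == NULL || zh->auth_h.auth == NULL)
        return;
    printf("Authentication scheme %s succeeded\n", zh->auth_h.auth->scheme);
}

int main(void)
{
    struct fake_handle h = { { NULL } };   /* no auth mechanism in use */
    log_auth_success(&h);                  /* safely logs nothing */
    return 0;
}
{code}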

Interesting log messages around this time:

{code}
Socket [10.170.243.7:2181] zk retcode=-2, errno=115(Operation now in progress): 
unexpected server response: expected 0xfff9, but received 0xfff8
Priming connection to [10.170.243.4:2181]: last_zxid=0x370eb4d
initiated connection to server [10.170.243.4:2181]
Oct 13 12:03:21.273384 zookeeper - INFO  
[NIOServerCxnFactory.AcceptThread:/10.170.243.4:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.170.243.4:48523
Oct 13 12:03:21.274321 zookeeper - WARN  
[NIOWorkerThread-24:ZooKeeperServer@822] - Connection request from old client 
/10.170.243.4:48523; will be dropped if server is in r-o mode
Oct 13 12:03:21.274452 zookeeper - INFO  
[NIOWorkerThread-24:ZooKeeperServer@869] - Client attempting to renew session 
0x311596d004a at /10.170.243.4:48523; client last zxid is 0x30370eb4d; 
server last zxid is 0x30370eb4d
Oct 13 12:03:21.274584 zookeeper - INFO  [NIOWorkerThread-24:Learner@115] - 
Revalidating client: 0x311596d004a
session establishment complete on server [10.170.243.4:2181], 
sessionId=0x311596d004a, negotiated timeout=2
Oct 13 12:03:21.275693 zookeeper - INFO  
[QuorumPeer[myid=1]/10.170.243.4:2181:ZooKeeperServer@611] - Established 
session 0x311596d004a with negotiated timeout 2 for client 
/10.170.243.4:48523
Oct 13 12:03:24.229590 zookeeper - WARN  [NIOWorkerThread-8:NIOServerCnxn@361] 
- Unable to read additional data from client sessionid 0x311596d004a, 
likely client has closed socket
Oct 13 12:03:24.230018 zookeeper - INFO  [NIOWorkerThread-8:NIOServerCnxn@999] 
- Closed socket connection for client /10.170.243.4:48523 which had sessionid 
0x311596d004a
Oct 13 12:03:24.230257 zookeeper - WARN  [NIOWorkerThread-19:NIOServerCnxn@361] 
- Unable to read additional data from client sessionid 0x12743aa0001, 
likely client has closed socket
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

{{monospaced}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

{{monospaced}}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif

{{monospaced}}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {{monospaced}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

{code}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

{{monospaced}}
Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()
{{monospaced}}

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

{code}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

{monospaced}
Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()
{monospaced}

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
> 

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:


 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif


The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif


The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> 
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);
>  

[jira] [Commented] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985978#comment-14985978
 ] 

Marshall McMullen commented on ZOOKEEPER-2311:
--

Another interesting link related to this:

https://bugzilla.kernel.org/show_bug.cgi?id=80981

> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>    Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);
>  533 if (fd == -1) {
>  534 seed = getpid();
>  535 } else {
>  536 int rc = read(fd, &seed, sizeof(seed));
>  537 assert(rc == sizeof(seed));
>  538 close(fd);
>  539 }
>  540 srandom(seed);
>  541 srand48(seed);
>  542 #endif
> The core files show:
> Program terminated with signal 6, Aborted.
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x7f9ff6652e42 in __assert_fail () from 
> /lib/x86_64-linux-gnu/libc.so.6
> #4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
> #5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
> hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
> avec=0x7f9fd87fab60) at src/zookeeper.c:730
> #6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
> src/zookeeper.c:801
> #7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
> fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
> src/zookeeper.c:1980
> #8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
> #9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
> (this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
> #10 0x7f9ff804de9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #12 0x in ?? ()
> I'm not sure what the underlying cause of this is... But POSIX always allows 
> for a short read(2), and any program MUST check for short reads... 
> Has anyone else encountered this issue? We are seeing it rather frequently 
> which is concerning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)
Marshall McMullen created ZOOKEEPER-2311:


 Summary: assert in setup_random
 Key: ZOOKEEPER-2311
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Reporter: Marshall McMullen


We've started seeing an assert failing inside setup_random at line 537:

 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif


The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

{monospaced}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{monospaced}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom"

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

{code}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#10 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

{code}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);
>  533 if (fd == -1) {
>  534 seed = getpid();
>  535 } else {
>  536  

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

{code}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

{monospaced}
Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()
{monospaced}

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

{code}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

{code}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

{code}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

{{monospaced}}
Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()
{{monospaced}}

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

{code|borderStyle=solid}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code|borderStyle=solid}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int f

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

{code}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

{code|borderStyle=solid}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int f

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

{{monospaced}}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif

{{monospaced}}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:


 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif


The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {{monospaced}}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open(&

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

{{monospaced}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

{
{monospaced}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

{
{monospaced}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {
> {monospaced}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/de

[jira] [Commented] (ZOOKEEPER-2145) Node can be seen but not deleted

2015-06-16 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589343#comment-14589343
 ] 

Marshall McMullen commented on ZOOKEEPER-2145:
--

Has anyone had a chance to investigate this issue yet?

 Node can be seen but not deleted
 

 Key: ZOOKEEPER-2145
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2145
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.6
Reporter: Frans Lawaetz

 I have a three-server ensemble that appears to be working fine in every 
 respect but for the fact that I can ls or get a znode but can not rmr it.
 [zk: localhost:2181(CONNECTED) 0] get 
 /accumulo/9354e975-7e2a-4207-8c7b-5d36c0e7765d/masters/goal_state
 CLEAN_STOP
 cZxid = 0x15
 ctime = Fri Feb 20 13:37:59 CST 2015
 mZxid = 0x72
 mtime = Fri Feb 20 13:38:05 CST 2015
 pZxid = 0x15
 cversion = 0
 dataVersion = 2
 aclVersion = 0
 ephemeralOwner = 0x0
 dataLength = 10
 numChildren = 0
 [zk: localhost:2181(CONNECTED) 1] rmr 
 /accumulo/9354e975-7e2a-4207-8c7b-5d36c0e7765d/masters/goal_state
 Node does not exist: 
 /accumulo/9354e975-7e2a-4207-8c7b-5d36c0e7765d/masters/goal_state
 I have run a 'stat' against all three servers and they seem properly 
 structured with a leader and two followers.  An md5sum of all zoo.cfg shows 
 them to be identical.  
 The problem seems localized to the accumulo/935 directory as I can create 
 and delete znodes outside of that path fine but not inside of it.
 For example:
 [zk: localhost:2181(CONNECTED) 12] create 
 /accumulo/9354e975-7e2a-4207-8c7b-5d36c0e7765d/fubar asdf
 Node does not exist: /accumulo/9354e975-7e2a-4207-8c7b-5d36c0e7765d/fubar
 [zk: localhost:2181(CONNECTED) 13] create /accumulo/fubar asdf
 Created /accumulo/fubar
 [zk: localhost:2181(CONNECTED) 14] ls /accumulo/fubar
 []
 [zk: localhost:2181(CONNECTED) 15] rmr /accumulo/fubar
 [zk: localhost:2181(CONNECTED) 16]
 Here is my zoo.cfg:
 tickTime=2000
 initLimit=10
 syncLimit=15
 dataDir=/data/extera/zkeeper/data
 clientPort=2181
  maxClientCnxns=300
 autopurge.snapRetainCount=10
 autopurge.purgeInterval=1
 server.1=cdf61:2888:3888
 server.2=cdf62:2888:3888
 server.3=cdf63:2888:3888



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2163) Introduce new ZNode type: container

2015-06-01 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568532#comment-14568532
 ] 

Marshall McMullen commented on ZOOKEEPER-2163:
--

~shralex I would be happy to look into this. I probably won't be able to get to 
this until early next week though. But looking through this bug report it seems 
completely unrelated to ZOOKEEPER-2163. Perhaps we should just open a separate 
Jira to track the unstable TestConfig test? In any event, I'll add this to my 
list of things to look into.

 Introduce new ZNode type: container
 ---

 Key: ZOOKEEPER-2163
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2163
 Project: ZooKeeper
  Issue Type: New Feature
  Components: c client, java client, server
Affects Versions: 3.5.0
Reporter: Jordan Zimmerman
Assignee: Jordan Zimmerman
 Fix For: 3.6.0

 Attachments: zookeeper-2163.10.patch, zookeeper-2163.11.patch, 
 zookeeper-2163.12.patch, zookeeper-2163.13.patch, zookeeper-2163.3.patch, 
 zookeeper-2163.5.patch, zookeeper-2163.6.patch, zookeeper-2163.7.patch, 
 zookeeper-2163.8.patch, zookeeper-2163.9.patch


 BACKGROUND
 
 A recurring problem for ZooKeeper users is garbage collection of parent 
 nodes. Many recipes (e.g. locks, leaders, etc.) call for the creation of a 
 parent node under which participants create sequential nodes. When the 
 participant is done, it deletes its node. In practice, the ZooKeeper tree 
 begins to fill up with orphaned parent nodes that are no longer needed. The 
 ZooKeeper APIs don’t provide a way to clean these. Over time, ZooKeeper can 
 become unstable due to the number of these nodes.
 CURRENT SOLUTIONS
 ===
 Apache Curator has a workaround solution for this by providing the Reaper 
 class which runs in the background looking for orphaned parent nodes and 
 deleting them. This isn’t ideal and it would be better if ZooKeeper supported 
 this directly.
 PROPOSAL
 =
 ZOOKEEPER-723 and ZOOKEEPER-834 have been proposed to allow EPHEMERAL nodes 
 to contain child nodes. This is not optimum as EPHEMERALs are tied to a 
 session and the general use case of parent nodes is for PERSISTENT nodes. 
 This proposal adds a new node type, CONTAINER. A CONTAINER node is the same 
 as a PERSISTENT node with the additional property that when its last child is 
 deleted, it is deleted (and CONTAINER nodes recursively up the tree are 
 deleted if empty).
 CANONICAL USAGE
 
 {code}
 while ( true) { // or some reasonable limit
 try {
 zk.create(path, ...);
 break;
 } catch ( KeeperException.NoNodeException e ) {
 try {
 zk.createContainer(containerPath, ...);
 } catch ( KeeperException.NodeExistsException ignore) {
}
 }
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2163) Introduce new ZNode type: container

2015-06-01 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568530#comment-14568530
 ] 

Marshall McMullen commented on ZOOKEEPER-2163:
--

~shralex I would be happy to look into this. I probably won't be able to get to 
this until early next week though. But looking through this bug report it seems 
completely unrelated to ZOOKEEPER-2163. Perhaps we should just open a separate 
Jira to track the unstable TestConfig test? In any event, I'll add this to my 
list of things to look into.

 Introduce new ZNode type: container
 ---

 Key: ZOOKEEPER-2163
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2163
 Project: ZooKeeper
  Issue Type: New Feature
  Components: c client, java client, server
Affects Versions: 3.5.0
Reporter: Jordan Zimmerman
Assignee: Jordan Zimmerman
 Fix For: 3.6.0

 Attachments: zookeeper-2163.10.patch, zookeeper-2163.11.patch, 
 zookeeper-2163.12.patch, zookeeper-2163.13.patch, zookeeper-2163.3.patch, 
 zookeeper-2163.5.patch, zookeeper-2163.6.patch, zookeeper-2163.7.patch, 
 zookeeper-2163.8.patch, zookeeper-2163.9.patch


 BACKGROUND
 
 A recurring problem for ZooKeeper users is garbage collection of parent 
 nodes. Many recipes (e.g. locks, leaders, etc.) call for the creation of a 
 parent node under which participants create sequential nodes. When the 
 participant is done, it deletes its node. In practice, the ZooKeeper tree 
 begins to fill up with orphaned parent nodes that are no longer needed. The 
 ZooKeeper APIs don’t provide a way to clean these. Over time, ZooKeeper can 
 become unstable due to the number of these nodes.
 CURRENT SOLUTIONS
 ===
 Apache Curator has a workaround solution for this by providing the Reaper 
 class which runs in the background looking for orphaned parent nodes and 
 deleting them. This isn’t ideal and it would be better if ZooKeeper supported 
 this directly.
 PROPOSAL
 =
 ZOOKEEPER-723 and ZOOKEEPER-834 have been proposed to allow EPHEMERAL nodes 
 to contain child nodes. This is not optimum as EPHEMERALs are tied to a 
 session and the general use case of parent nodes is for PERSISTENT nodes. 
 This proposal adds a new node type, CONTAINER. A CONTAINER node is the same 
 as a PERSISTENT node with the additional property that when its last child is 
 deleted, it is deleted (and CONTAINER nodes recursively up the tree are 
 deleted if empty).
 CANONICAL USAGE
 
 {code}
 while ( true) { // or some reasonable limit
 try {
 zk.create(path, ...);
 break;
 } catch ( KeeperException.NoNodeException e ) {
 try {
 zk.createContainer(containerPath, ...);
 } catch ( KeeperException.NodeExistsException ignore) {
}
 }
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Changing sync() to need quorum ack

2015-03-10 Thread Marshall McMullen
+1. This is how we believed sync was implemented already. Getting these
semantics correct would be very important for us.
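
For context, the pattern these semantics protect is "sync then read": the client flushes the channel to the leader and only reads once the sync completion fires. A minimal C-client sketch of that pattern (illustrative only; the callback and function names are made up, and error handling is omitted):

{code}
#include <zookeeper/zookeeper.h>
#include <stdio.h>

static void read_done(int rc, const char *value, int value_len,
                      const struct Stat *stat, const void *data)
{
    if (rc == ZOK)
        printf("read %d bytes after sync\n", value_len);
}

static void sync_done(int rc, const char *path, const void *data)
{
    zhandle_t *zh = (zhandle_t *)data;
    /* Only read after the sync completes, so the read reflects every
     * write committed before the sync was issued. */
    if (rc == ZOK)
        zoo_aget(zh, path, 0 /* no watch */, read_done, NULL);
}

static void sync_then_read(zhandle_t *zh, const char *path)
{
    /* zoo_async() is the C binding's sync(): it flushes the channel
     * between this client's server and the leader. */
    zoo_async(zh, path, sync_done, zh);
}
{code}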
On Mar 10, 2015 2:57 AM, Flavio Junqueira fpjunque...@yahoo.com.invalid
wrote:

 For one thing, this should clean up the mess that we had to do in the code
 to have sync() the way it is, since it was neither a regular read nor a regular
 quorum write. I don't know why you say that it changes the behavior. It
 changes the internal behavior, but the expected behavior exposed through
 the API call remains the same, so no user should care about it, it doesn't
 break any code.

 -Flavio

  On 10 Mar 2015, at 03:31, Hongchao Deng hd...@cloudera.com wrote:
 
  Hi all,
 
  I recently worked on fixing flaky test -- testPortChange(), which is
  related to ZOOKEEPER-2000.
 
  This is what I have figured out:
 
  * Server (1) and (2) were followers, (3) was the leader.
  * client connected to (1), did a reconfig().
  * (1) and (2) formed a quorum, reconfig was successful, and returned.
  * (3) still thinks he's the leader, so using LeaderZooKeeperServer.
  * client connected to (3) did a sync(), and the sync didn't go through a
  quorum. THE CLIENT WHO DID SYNC() GETS WRONG BEHAVIOR. There's a split
  brain here for sync().
  * Then (3) gradually moves to the new quorum config.
 
  I'm proposing to change sync() to need quorum acks. I've privately talked
  with my friend Xiang Li who's working on etcd. He previously had similar
  experience and finally changed sync to go through quorum.
 
  Since this change affects the behavior of sync(), I'm asking in public if
  there's any concern/assumption? Let's discuss it here.
 
  Best,
  --
  *- Hongchao Deng*
  *Software Engineer*




One ensemble node shows massive number of 'Outstanding' requests

2015-02-17 Thread Marshall McMullen
Greetings,

We saw an issue recently that I've never seen before and am hoping I can
get some clarity on what may cause this and whether it's a known issue. We
had a 5 node ensemble and were unable to connect to one of the ZooKeeper
instances.  When trying to connect with zkCli it would timeout. When I
connected via telnet and issued the srvr four letter word, I was surprised
to see that this one server reported a massive number of 'Outstanding'
requests. I'd never seen that really be anything other than 0 before. On
the ZK dev guide it says:

outstanding is the number of queued requests, this increases when the
server is under load and is receiving more sustained requests than it can
process, ie the request queue. I looked at all the ZK servers in my
ensemble:

for ip in 101 102 103 104 105; do echo srvr | nc 172.21.20.${ip} 2181 |
grep Outstanding; done
Outstanding: 0
Outstanding: 0
Outstanding: 0
Outstanding: 0
Outstanding: 18876

I eventually killed ZK on the affected server and everything corrected
itself and Outstanding went to zero and I was able to connect again.

Is this something anyone's familiar with? I have logs if it would be
helpful.

Thanks!


Re: Review Request 30573: ZOOKEEPER-1366: Zookeeper should be tolerant of clock adjustments

2015-02-05 Thread Marshall McMullen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30573/#review71222
---



src/java/main/org/apache/zookeeper/common/Time.java
https://reviews.apache.org/r/30573/#comment116890

I *REALLY* like the addition of the Time class. Nice abstraction layer.



src/java/main/org/apache/zookeeper/common/Time.java
https://reviews.apache.org/r/30573/#comment116891

Can you please format the body of this method like we normally do so it's 
not all on one line?



src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java
https://reviews.apache.org/r/30573/#comment116893

Is it worth changing callers of ZooKeeperServer.java's getTime to instead 
call into the new Time.currentWallTime for increased clarity? Or is that a LOT 
of refactoring? I confess I didn't look.


- Marshall McMullen


On Feb. 5, 2015, 12:37 a.m., Hongchao Deng wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30573/
 ---
 
 (Updated Feb. 5, 2015, 12:37 a.m.)
 
 
 Review request for zookeeper.
 
 
 Repository: zookeeper-git
 
 
 Description
 ---
 
 Zookeeper should be tolerant of clock adjustments
 
 
 Diffs
 -
 
   src/java/main/org/apache/zookeeper/ClientCnxn.java 
 c85cc8d1b6dae0c0d0850d758420fb31a8dd1dcc 
   src/java/main/org/apache/zookeeper/ClientCnxnSocket.java 
 16cb9120686bf982b4c68a0172600d23b6119042 
   src/java/main/org/apache/zookeeper/Login.java 
 6d248ab37a0a6b11358f5f3adc9dc363b1a9c73b 
   src/java/main/org/apache/zookeeper/Shell.java 
 62169d797a7a103d921634c4676fffea878def51 
   src/java/main/org/apache/zookeeper/ZKUtil.java 
 4713a08a934175c2b297f69740e204c7288c078c 
   src/java/main/org/apache/zookeeper/common/Time.java PRE-CREATION 
   src/java/main/org/apache/zookeeper/server/ConnectionBean.java 
 917aacfdcdcd50576029faab65ca98b89cfb2df9 
   src/java/main/org/apache/zookeeper/server/ExpiryQueue.java 
 a037bf49235e386cc20ee68633ec162b1db013d1 
   src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java 
 a97be4a5452006fbd85d355c0dcb16276cbf1c59 
   src/java/main/org/apache/zookeeper/server/RateLogger.java 
 fc951cf5147bedbf1786ff1047a1e1a5fd7f5121 
   src/java/main/org/apache/zookeeper/server/Request.java 
 ee01dcfa63784a9dd380f91d768e1b3f28b9cce9 
   src/java/main/org/apache/zookeeper/server/ServerStats.java 
 c3246293e409d863412144ed76b2a91ca1ac98f2 
   src/java/main/org/apache/zookeeper/server/SessionTrackerImpl.java 
 0c2c042e276c557a86f47d7ab5333e6860e12bd9 
   src/java/main/org/apache/zookeeper/server/WorkerService.java 
 c55ff48f92e5e3ae7783ad5be0262a5d9899c521 
   src/java/main/org/apache/zookeeper/server/ZKDatabase.java 
 f336049f0afb7b539460223b4903d323e2558aed 
   src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java 
 30a0ed390bb7473ddb36757da97bc7d5f4281887 
   
 src/java/main/org/apache/zookeeper/server/quorum/AuthFastLeaderElection.java 
 6cd0af88292d9cb89652f1c6d2a80ec2726b5b6a 
   src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java 
 dfe692f4889a11b8a8eb3a4cbbd150ed5cac6a9f 
   src/java/main/org/apache/zookeeper/server/quorum/Follower.java 
 6dbb0b22a4e0658a6b04629e6efdf1ac722375e5 
   src/java/main/org/apache/zookeeper/server/quorum/Leader.java 
 20589045752a7ba4ae9c9090055a4fcbe86a8eda 
   
 src/java/main/org/apache/zookeeper/server/quorum/LearnerSnapshotThrottler.java
  97b48915321aab6ea31bd7db8fe1197165507feb 
   src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java 
 388ceeb45bd18c7cb8f0766a96ebd4a54a9e76de 
   src/java/systest/org/apache/zookeeper/test/system/GenerateLoad.java 
 4092c760f2cc4eda410ac6125e58ec399d1a6ca4 
   src/java/systest/org/apache/zookeeper/test/system/InstanceManager.java 
 809fa4819eed61aee3fcee1b5641ec85b967d479 
   src/java/systest/org/apache/zookeeper/test/system/SimpleSysTest.java 
 9cdf4d912a29e8a5341e4a9700fd07e1eeb015f3 
   src/java/test/org/apache/zookeeper/common/TimeTest.java PRE-CREATION 
   src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java 
 9abe47910f5d73195c57e9f33d9d2150a4861141 
   src/java/test/org/apache/zookeeper/test/ClientBase.java 
 a6229b50b4a4486b443daa6b3b92ac4ab5cf94cb 
   src/java/test/org/apache/zookeeper/test/ClientHammerTest.java 
 b807dbb0f4350b29190b5d5862c418de84a168c5 
   src/java/test/org/apache/zookeeper/test/CnxManagerTest.java 
 563c77c41c86c692edfd95ea48d397bc25154d26 
   src/java/test/org/apache/zookeeper/test/LoadFromLogTest.java 
 ab84146f58e8f97ef24517703c30ef6015a71c84 
   src/java/test/org/apache/zookeeper/test/ReadOnlyModeTest.java 
 0579858659cec892aee3fa4362d0c55d175d87a7 
   src/java/test/org/apache/zookeeper/test/StaticHostProviderTest.java 
 bf1dcef7fbca91fee6128096e8413013fa11e0e0 
   src/java/test/org/apache/zookeeper

[jira] [Commented] (ZOOKEEPER-1366) Zookeeper should be tolerant of clock adjustments

2015-02-05 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307646#comment-14307646
 ] 

Marshall McMullen commented on ZOOKEEPER-1366:
--

Latest version looks great to me. 

 Zookeeper should be tolerant of clock adjustments
 -

 Key: ZOOKEEPER-1366
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1366
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Ted Dunning
Assignee: Hongchao Deng
Priority: Critical
 Fix For: 3.5.1

 Attachments: ZOOKEEPER-1366-3.3.3.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, 
 zookeeper-3.4.5-ZK1366-SC01.patch


 If you want to wreak havoc on a ZK based system just do [date -s +1hour] 
 and watch the mayhem as all sessions expire at once.
 This shouldn't happen.  Zookeeper could easily handle elapsed times as 
 elapsed times rather than as differences between absolute times.  The 
 absolute times are subject to adjustment when the clock is set while a timer 
 is not subject to this problem.  In Java, System.currentTimeMillis() gives 
 you absolute time while System.nanoTime() gives you time based on a timer 
 from an arbitrary epoch.
 I have done this and have been running tests now for some tens of minutes 
 with no failures.  I will set up a test machine to redo the build again on 
 Ubuntu and post a patch here for discussion.
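
 A minimal C sketch of the monotonic-clock idea described above (illustrative
 only; the actual patch introduces a Java-side Time abstraction rather than
 this exact code):

{code}
#include <stdint.h>
#include <time.h>

/* Interval measurement that is immune to wall-clock adjustments:
 * CLOCK_MONOTONIC keeps ticking steadily when the system date is set. */
static int64_t monotonic_millis(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Usage: record t0 = monotonic_millis() when a session is touched and
 * compare against monotonic_millis() - t0, never against wall time. */
{code}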



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-1366) Zookeeper should be tolerant of clock adjustments

2015-02-04 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304763#comment-14304763
 ] 

Marshall McMullen commented on ZOOKEEPER-1366:
--

[~hdeng] - I will be happy to help review this tomorrow. It's important to us 
to pick up this fix as well so I'd love to see this rolled into the 3.5 
release. I'll make sure to review this and add comments to the review tomorrow.

 Zookeeper should be tolerant of clock adjustments
 -

 Key: ZOOKEEPER-1366
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1366
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Ted Dunning
Assignee: Hongchao Deng
Priority: Critical
 Fix For: 3.5.1

 Attachments: ZOOKEEPER-1366-3.3.3.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, zookeeper-3.4.5-ZK1366-SC01.patch


 If you want to wreak havoc on a ZK based system just do [date -s +1hour] 
 and watch the mayhem as all sessions expire at once.
 This shouldn't happen.  Zookeeper could easily handle elapsed times as 
 elapsed times rather than as differences between absolute times.  The 
 absolute times are subject to adjustment when the clock is set while a timer 
 is not subject to this problem.  In Java, System.currentTimeMillis() gives 
 you absolute time while System.nanoTime() gives you time based on a timer 
 from an arbitrary epoch.
 I have done this and have been running tests now for some tens of minutes 
 with no failures.  I will set up a test machine to redo the build again on 
 Ubuntu and post a patch here for discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2052) Unable to delete a node when the node has no children

2014-10-14 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171988#comment-14171988
 ] 

Marshall McMullen commented on ZOOKEEPER-2052:
--

I'm going to go look over the final version of this patch on RB, but I think 
you guys have absolutely nailed this problem. I wish I could give some useful 
insight into why it was originally implemented this way but I think it was just 
an oversight on our part. The particular use case of deleting a multi with 
intermixed ephemeral nodes is one we would never have encountered or tested 
against and thus I probably just didn't think of that... Anyhow, great find.
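
For readers skimming the archive, the kind of multi-op delete under discussion looks roughly like this from the C client (illustrative sketch; the parent path comes from the report, the child names are hypothetical, and a real caller would build the op list from the actual children):

{code}
#include <zookeeper/zookeeper.h>

/* Delete the remaining children and the parent atomically: either all
 * three deletes commit or none do. */
static int delete_parent_and_children(zhandle_t *zh)
{
    zoo_op_t ops[3];
    zoo_op_result_t results[3];

    /* In real code these paths come from zoo_get_children(). */
    zoo_delete_op_init(&ops[0], "/metadata/resources/child-a", -1);
    zoo_delete_op_init(&ops[1], "/metadata/resources/child-b", -1);
    zoo_delete_op_init(&ops[2], "/metadata/resources", -1);

    return zoo_multi(zh, 3, ops, results);   /* version -1 = any version */
}
{code}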

 Unable to delete a node when the node has no children
 -

 Key: ZOOKEEPER-2052
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2052
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.6, 3.5.0
 Environment: Red Hat Enterprise Linux 6.1 x86_64, standalone or 3 
 node ensemble (v3.4.6), 2 Java clients (v3.4.6)
Reporter: Yip Ng
Assignee: Hongchao Deng
 Fix For: 3.4.7, 3.5.1, 3.6.0

 Attachments: ZOOKEEPER-2052-v2.patch, 
 ZOOKEEPER-2052-v3-release.patch, ZOOKEEPER-2052-v3.patch, 
 ZOOKEEPER-2052-v4.patch, ZOOKEEPER-2052.patch, ZOOKEEPER-2052.patch, 
 ZOOKEEPER-2052.patch, test-jenkins.patch, zookeeper.log


 We stumbled upon a ZooKeeper bug where a node with no children cannot be 
 removed on our 3 node ZooKeeper ensemble or standalone ZooKeeper on Red Hat 
 Enterprise Linux x86_64 environment.  Here is an example scenario/setup:
 o Standalone ZooKeeper or 3 node ensemble (v3.4.6)
 o 2 Java clients (v3.4.6)
   - Client A creates a persistent node (e.g.:  /metadata/resources)
   - Client B creates ephemeral nodes under this persistent node 
 o Client A attempts to remove the /metadata/resources node via multi op  
delete but fails since there are children
 o Client B's session expired, all the ephemeral nodes are removed
 o Client A attempts to recursively remove /metadata/resources node via 
multi op, this is expected to succeed but got the following exception:
   org.apache.zookeeper.KeeperException$NotEmptyException: 
  KeeperErrorCode = Directory not empty
(Note that Client B is the only client that creates these ephemeral nodes)
 o After this, we use zkCli.sh to inspect the problematic node but the 
 zkCli.sh shows the /metadata/resources node indeed have no children but it 
 will not allow /metadata/resources node to get deleted.  (shown below)
 [zk: localhost:2181(CONNECTED) 0] ls /
 [zookeeper, metadata]
 [zk: localhost:2181(CONNECTED) 1] ls /metadata
 [resources]
 [zk: localhost:2181(CONNECTED) 2] get /metadata/resources
 null
 cZxid = 0x3
 ctime = Wed Oct 01 22:04:11 PDT 2014
 mZxid = 0x3
 mtime = Wed Oct 01 22:04:11 PDT 2014
 pZxid = 0x9
 cversion = 2
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x0
 dataLength = 0
 numChildren = 0
 [zk: localhost:2181(CONNECTED) 3] delete /metadata/resources
 Node not empty: /metadata/resources
 [zk: localhost:2181(CONNECTED) 4] get /metadata/resources   
 null
 cZxid = 0x3
 ctime = Wed Oct 01 22:04:11 PDT 2014
 mZxid = 0x3
 mtime = Wed Oct 01 22:04:11 PDT 2014
 pZxid = 0x9
 cversion = 2
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x0
 dataLength = 0
 numChildren = 0
 o The only ways to remove this node is to either:
a) Restart the ZooKeeper server
b) set data to /metadata/resources then followed by a subsequent delete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 26437: ZooKeeper-2052

2014-10-14 Thread Marshall McMullen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26437/#review56658
---



src/java/main/org/apache/zookeeper/server/PrepRequestProcessor.java
https://reviews.apache.org/r/26437/#comment97061

Thanks for adding this comment here.



src/java/test/org/apache/zookeeper/server/PrepRequestProcessorTest.java
https://reviews.apache.org/r/26437/#comment97062

Really good additional tests. Nice job.


- Marshall McMullen


On Oct. 8, 2014, 9:18 p.m., Hongchao Deng wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26437/
 ---
 
 (Updated Oct. 8, 2014, 9:18 p.m.)
 
 
 Review request for zookeeper.
 
 
 Repository: zookeeper-git
 
 
 Description
 ---
 
 ZooKeeper-2052
 
 
 Diffs
 -
 
   src/java/main/org/apache/zookeeper/server/PrepRequestProcessor.java 8542790 
   src/java/test/org/apache/zookeeper/server/PrepRequestProcessorTest.java 
 8caf419 
   src/java/test/org/apache/zookeeper/test/ClientBase.java a6229b5 
   src/java/test/org/apache/zookeeper/test/MultiTransactionTest.java a573180 
 
 Diff: https://reviews.apache.org/r/26437/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Hongchao Deng
 




[jira] [Commented] (ZOOKEEPER-2052) Unable to delete a node when the node has no children

2014-10-14 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171996#comment-14171996
 ] 

Marshall McMullen commented on ZOOKEEPER-2052:
--

I reviewed the RB and the changes look solid to me. +1 from me.

 Unable to delete a node when the node has no children
 -

 Key: ZOOKEEPER-2052
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2052
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.6, 3.5.0
 Environment: Red Hat Enterprise Linux 6.1 x86_64, standalone or 3 
 node ensemble (v3.4.6), 2 Java clients (v3.4.6)
Reporter: Yip Ng
Assignee: Hongchao Deng
 Fix For: 3.4.7, 3.5.1, 3.6.0

 Attachments: ZOOKEEPER-2052-v2.patch, 
 ZOOKEEPER-2052-v3-release.patch, ZOOKEEPER-2052-v3.patch, 
 ZOOKEEPER-2052-v4.patch, ZOOKEEPER-2052.patch, ZOOKEEPER-2052.patch, 
 ZOOKEEPER-2052.patch, test-jenkins.patch, zookeeper.log


 We stumbled upon a ZooKeeper bug where a node with no children cannot be 
 removed on our 3 node ZooKeeper ensemble or standalone ZooKeeper on Red Hat 
 Enterprise Linux x86_64 environment.  Here is an example scenario/setup:
 o Standalone ZooKeeper or 3 node ensemble (v3.4.6)
 o 2 Java clients (v3.4.6)
   - Client A creates a persistent node (e.g.:  /metadata/resources)
   - Client B creates ephemeral nodes under this persistent node 
 o Client A attempts to remove the /metadata/resources node via multi op  
delete but fails since there are children
 o Client B's session expired, all the ephemeral nodes are removed
 o Client A attempts to recursively remove /metadata/resources node via 
multi op, this is expected to succeed but got the following exception:
   org.apache.zookeeper.KeeperException$NotEmptyException: 
  KeeperErrorCode = Directory not empty
(Note that Client B is the only client that creates these ephemeral nodes)
 o After this, we use zkCli.sh to inspect the problematic node but the 
 zkCli.sh shows the /metadata/resources node indeed have no children but it 
 will not allow /metadata/resources node to get deleted.  (shown below)
 [zk: localhost:2181(CONNECTED) 0] ls /
 [zookeeper, metadata]
 [zk: localhost:2181(CONNECTED) 1] ls /metadata
 [resources]
 [zk: localhost:2181(CONNECTED) 2] get /metadata/resources
 null
 cZxid = 0x3
 ctime = Wed Oct 01 22:04:11 PDT 2014
 mZxid = 0x3
 mtime = Wed Oct 01 22:04:11 PDT 2014
 pZxid = 0x9
 cversion = 2
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x0
 dataLength = 0
 numChildren = 0
 [zk: localhost:2181(CONNECTED) 3] delete /metadata/resources
 Node not empty: /metadata/resources
 [zk: localhost:2181(CONNECTED) 4] get /metadata/resources   
 null
 cZxid = 0x3
 ctime = Wed Oct 01 22:04:11 PDT 2014
 mZxid = 0x3
 mtime = Wed Oct 01 22:04:11 PDT 2014
 pZxid = 0x9
 cversion = 2
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x0
 dataLength = 0
 numChildren = 0
 o The only ways to remove this node is to either:
a) Restart the ZooKeeper server
b) set data to /metadata/resources then followed by a subsequent delete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 26437: ZooKeeper-2052

2014-10-14 Thread Marshall McMullen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26437/#review56660
---

Ship it!


Ship It!

- Marshall McMullen


On Oct. 8, 2014, 9:18 p.m., Hongchao Deng wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26437/
 ---
 
 (Updated Oct. 8, 2014, 9:18 p.m.)
 
 
 Review request for zookeeper.
 
 
 Repository: zookeeper-git
 
 
 Description
 ---
 
 ZooKeeper-2052
 
 
 Diffs
 -
 
   src/java/main/org/apache/zookeeper/server/PrepRequestProcessor.java 8542790 
   src/java/test/org/apache/zookeeper/server/PrepRequestProcessorTest.java 
 8caf419 
   src/java/test/org/apache/zookeeper/test/ClientBase.java a6229b5 
   src/java/test/org/apache/zookeeper/test/MultiTransactionTest.java a573180 
 
 Diff: https://reviews.apache.org/r/26437/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Hongchao Deng
 




[jira] [Commented] (ZOOKEEPER-2052) Unable to delete a node when the node has no children

2014-10-10 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166509#comment-14166509
 ] 

Marshall McMullen commented on ZOOKEEPER-2052:
--

I'm just seeing this jira for the first time as well. It looks like a really 
fantastic find and definitely very concerning if the issue is indeed as you 
describe. I'm pretty swamped at work at present so it may take me a few days 
before I'll have a chance to dig into this but I'll be very happy to do so... 
Will update when I've had a chance to digest this issue and comment on it.

 Unable to delete a node when the node has no children
 -

 Key: ZOOKEEPER-2052
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2052
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.6, 3.5.0
 Environment: Red Hat Enterprise Linux 6.1 x86_64, standalone or 3 
 node ensemble (v3.4.6), 2 Java clients (v3.4.6)
Reporter: Yip Ng
Assignee: Hongchao Deng
 Attachments: ZOOKEEPER-2052-v2.patch, 
 ZOOKEEPER-2052-v3-release.patch, ZOOKEEPER-2052-v3.patch, 
 ZOOKEEPER-2052-v4.patch, ZOOKEEPER-2052.patch, ZOOKEEPER-2052.patch, 
 ZOOKEEPER-2052.patch, test-jenkins.patch, zookeeper.log


 We stumbled upon a ZooKeeper bug where a node with no children cannot be 
 removed on our 3 node ZooKeeper ensemble or standalone ZooKeeper on Red Hat 
 Enterprise Linux x86_64 environment.  Here is an example scenario/setup:
 o Standalone ZooKeeper or 3 node ensemble (v3.4.6)
 o 2 Java clients (v3.4.6)
   - Client A creates a persistent node (e.g.:  /metadata/resources)
   - Client B creates ephemeral nodes under this persistent node 
 o Client A attempts to remove the /metadata/resources node via multi op  
delete but fails since there are children
 o Client B's session expired, all the ephemeral nodes are removed
 o Client A attempts to recursively remove the /metadata/resources node via 
multi op; this is expected to succeed but fails with the following exception:
   org.apache.zookeeper.KeeperException$NotEmptyException: 
  KeeperErrorCode = Directory not empty
(Note that Client B is the only client that creates these ephemeral nodes)
 o After this, we use zkCli.sh to inspect the problematic node; 
 zkCli.sh shows the /metadata/resources node indeed has no children, but it 
 will not allow the /metadata/resources node to be deleted (shown below):
 [zk: localhost:2181(CONNECTED) 0] ls /
 [zookeeper, metadata]
 [zk: localhost:2181(CONNECTED) 1] ls /metadata
 [resources]
 [zk: localhost:2181(CONNECTED) 2] get /metadata/resources
 null
 cZxid = 0x3
 ctime = Wed Oct 01 22:04:11 PDT 2014
 mZxid = 0x3
 mtime = Wed Oct 01 22:04:11 PDT 2014
 pZxid = 0x9
 cversion = 2
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x0
 dataLength = 0
 numChildren = 0
 [zk: localhost:2181(CONNECTED) 3] delete /metadata/resources
 Node not empty: /metadata/resources
 [zk: localhost:2181(CONNECTED) 4] get /metadata/resources   
 null
 cZxid = 0x3
 ctime = Wed Oct 01 22:04:11 PDT 2014
 mZxid = 0x3
 mtime = Wed Oct 01 22:04:11 PDT 2014
 pZxid = 0x9
 cversion = 2
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x0
 dataLength = 0
 numChildren = 0
 o The only ways to remove this node are to either:
a) Restart the ZooKeeper server, or
b) set data on /metadata/resources and then issue the delete.
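
For illustration, a minimal C client sketch of the kind of multi-op delete that
hits this error once the ephemeral children are gone. The path and error
handling are illustrative only (not the reporter's code); it assumes the
standard zoo_multi transaction API.

    #include <stdio.h>
    #include <zookeeper/zookeeper.h>

    /* Attempt the parent delete as a single-op transaction, the way a
     * recursive delete typically issues it once the children are gone. */
    static int delete_resources_txn(zhandle_t *zh)
    {
        zoo_op_t ops[1];
        zoo_op_result_t results[1];

        zoo_delete_op_init(&ops[0], "/metadata/resources", -1 /* any version */);

        int rc = zoo_multi(zh, 1, ops, results);
        if (rc == ZNOTEMPTY) {
            /* The surprising outcome described above: numChildren is 0,
             * yet the server still rejects the delete as "not empty". */
            fprintf(stderr, "multi delete failed: ZNOTEMPTY\n");
        }
        return rc;
    }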



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-1636) c-client crash when zoo_amulti failed

2014-09-25 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147804#comment-14147804
 ] 

Marshall McMullen commented on ZOOKEEPER-1636:
--

Fantastic find, patch and unit tests. Looks like great hardening around this 
code path to me. Nice job.

 c-client crash when zoo_amulti failed 
 --

 Key: ZOOKEEPER-1636
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1636
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.4.3
Reporter: Thawan Kooburat
Assignee: Thawan Kooburat
Priority: Critical
 Fix For: 3.4.7, 3.5.1

 Attachments: ZOOKEEPER-1636.patch, ZOOKEEPER-1636.patch, 
 ZOOKEEPER-1636.patch, ZOOKEEPER-1636.patch, ZOOKEEPER-1636.patch


 deserialize_response for multi operation don't handle the case where the 
 server fail to send back response. (Eg. when multi packet is too large) 
 c-client will try to process completion of all sub-request as if the 
 operation is successful and will eventually cause SIGSEGV
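
A hedged sketch of the defensive pattern this hardening is about: check the
completion's rc before trusting any per-op result from zoo_amulti. The callback
and context wiring below are illustrative, not the actual patch.

    #include <stdio.h>
    #include <zookeeper/zookeeper.h>

    /* Completion for zoo_amulti. If the server failed to send back a valid
     * multi response (e.g. the multi packet was too large), rc is non-ZOK
     * and the per-op results must not be treated as successes. */
    static void multi_completion(int rc, const void *data)
    {
        const zoo_op_result_t *results = (const zoo_op_result_t *)data;
        if (rc != ZOK) {
            fprintf(stderr, "multi failed before results were valid: %d\n", rc);
            return;  /* do not inspect results[i].err here */
        }
        (void)results;  /* safe to walk individual sub-op results only on ZOK */
    }

    static int submit_multi(zhandle_t *zh, zoo_op_t *ops, int count,
                            zoo_op_result_t *results)
    {
        /* results must stay allocated until multi_completion has run. */
        return zoo_amulti(zh, count, ops, results, multi_completion, results);
    }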



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2016) Automate client-side rebalancing

2014-08-21 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105825#comment-14105825
 ] 

Marshall McMullen commented on ZOOKEEPER-2016:
--

[~shralex] - I agree this sounds useful but only if it is something we can 
opt-in for. Lots of application code which sits on top of the C bindings may 
prefer to have more direct control over this than having it automatically 
rebalance for them. 

 Automate client-side rebalancing
 

 Key: ZOOKEEPER-2016
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2016
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: Hongchao Deng

 ZOOKEEPER-1355 introduced client-side rebalancing, which is implemented in 
 both the C and Java client libraries. However, it requires the client to 
 detect a configuration change and call updateServerList with the new 
 connection string (see the reconfig manual). It may be better if the client just 
 indicates that it is interested in this feature when creating a ZK handle, and 
 we'll detect configuration changes and invoke updateServerList for it 
 under the hood.
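
For context, the manual flow this proposal would automate looks roughly like
the following in the C bindings. The sketch assumes the application already
watches the configuration znode and derives a client connection string from
it; on_config_change is a hypothetical application callback, not a library API.

    #include <zookeeper/zookeeper.h>

    /* Hypothetical application callback, invoked when the app notices that
     * the stored configuration changed (e.g. via a watch on /zookeeper/config).
     * new_hosts is a "host:port,host:port,..." string derived from the new
     * configuration. */
    static void on_config_change(zhandle_t *zh, const char *new_hosts)
    {
        /* zoo_set_servers triggers the probabilistic client-side rebalancing
         * from ZOOKEEPER-1355; the handle may drop its current connection and
         * move to another server to keep load evenly spread. */
        int rc = zoo_set_servers(zh, new_hosts);
        if (rc != ZOK) {
            /* keep the old server list and retry later */
        }
    }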



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1994) Backup config files.

2014-08-01 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14082700#comment-14082700
 ] 

Marshall McMullen commented on ZOOKEEPER-1994:
--

I strongly agree with Alex on this as well. I would like them to be named using 
zxid as well. As Alex explained, that is much safer from a consistency point of 
view and much easier to correlate to the reconfiguration as well as different 
replicas. +1 from me.

 Backup config files.
 

 Key: ZOOKEEPER-1994
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1994
 Project: ZooKeeper
  Issue Type: Improvement
Affects Versions: 3.5.0
Reporter: Hongchao Deng
Assignee: Hongchao Deng
 Fix For: 3.5.0


 We should create a backup file for a static or dynamic configuration file 
 before changing the file. 
 Since the static file is changed at most twice (once when removing the 
 ensemble definitions, at which point a dynamic file doesn't exist yet, and 
 once when removing clientPort information) it's probably fine to back up the 
 static file independently from the dynamic file. 
 To track backup history:
 Option 1: we could have a .bakXX extension for backup where XX is a sequence 
 number. 
 Option 2: have the configuration version be part of the file name for dynamic 
 configuration files (instead of in the file like now). Such as 
 zoo_replicated1.cfg.dynamic.100 then on reconfiguration simply create a 
 new dynamic file (with new version) and update the link in the static file to 
 point to the new dynamic one.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1998) C library calls getaddrinfo unconditionally from zookeeper_interest

2014-07-29 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078237#comment-14078237
 ] 

Marshall McMullen commented on ZOOKEEPER-1998:
--

[~rgs] - yep, you're right. I added that code as part of ZOOKEEPER-107 working 
with [~shralex]. But if I recall correctly, the original code also 
unconditionally called resolve_hosts. Though I'd have to go look at the 
original code to confirm that. I'm guessing you've done that already and that 
it did not do that? 

Do you have thoughts on how we could avoid this? I suppose we could easily just 
check if the addrvec is the same and, if it is, bypass resolving the hosts. 
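
A rough sketch of that guard (field names such as last_hosts and
reconnect_pending are illustrative, not zookeeper.c's actual internals): only
resolve when the host string changed or a (re)connect is pending.

    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative state, not the real client structures. */
    struct resolve_state {
        char *last_hosts;        /* host string we last resolved */
        bool  reconnect_pending; /* set when the socket must be re-established */
    };

    static bool should_resolve(struct resolve_state *st, const char *hosts)
    {
        if (st->reconnect_pending)
            return true;                     /* always resolve when (re)connecting */
        if (st->last_hosts == NULL || strcmp(st->last_hosts, hosts) != 0) {
            free(st->last_hosts);
            st->last_hosts = strdup(hosts);  /* remember what we resolved */
            return true;                     /* list changed: hit DNS once */
        }
        return false;  /* same list, still connected: skip getaddrinfo */
    }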

 C library calls getaddrinfo unconditionally from zookeeper_interest
 ---

 Key: ZOOKEEPER-1998
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1998
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.5.0
Reporter: Raul Gutierrez Segales
Assignee: Raul Gutierrez Segales
Priority: Critical
 Fix For: 3.5.0


 (commented this on ZOOKEEPER-338)
 I've just noticed that we call getaddrinfo from zookeeper_interest... on 
 every call. So from zookeeper_interest we always call update_addrs:
 https://github.com/apache/zookeeper/blob/trunk/src/c/src/zookeeper.c#L2082
 which in turn unconditionally calls resolve_hosts:
 https://github.com/apache/zookeeper/blob/trunk/src/c/src/zookeeper.c#L787
 which does the unconditional calls to getaddrinfo:
 https://github.com/apache/zookeeper/blob/trunk/src/c/src/zookeeper.c#L648
 We should fix this since it'll make 3.5.0 slower for people relying on DNS. I 
 think this happened as part of ZOOKEEPER-107, in which the list of servers 
 can be updated. 
 cc: [~shralex], [~phunt], [~fpj]



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1998) C library calls getaddrinfo unconditionally from zookeeper_interest

2014-07-29 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078406#comment-14078406
 ] 

Marshall McMullen commented on ZOOKEEPER-1998:
--

[~rgs] - Looking at the 3.4 code I agree with you. It seems like we should only 
do the lookup when we are connecting.

 C library calls getaddrinfo unconditionally from zookeeper_interest
 ---

 Key: ZOOKEEPER-1998
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1998
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.5.0
Reporter: Raul Gutierrez Segales
Assignee: Raul Gutierrez Segales
Priority: Critical
 Fix For: 3.5.0


 (commented this on ZOOKEEPER-338)
 I've just noticed that we call getaddrinfo from zookeeper_interest... on 
 every call. So from zookeeper_interest we always call update_addrs:
 https://github.com/apache/zookeeper/blob/trunk/src/c/src/zookeeper.c#L2082
 which in turn unconditionally calls resolve_hosts:
 https://github.com/apache/zookeeper/blob/trunk/src/c/src/zookeeper.c#L787
 which does the unconditional calls to getaddrinfo:
 https://github.com/apache/zookeeper/blob/trunk/src/c/src/zookeeper.c#L648
 We should fix this since it'll make 3.5.0 slower for people relying on DNS. I 
 think this happened as part of ZOOKEEPER-107, in which the list of servers 
 can be updated. 
 cc: [~shralex], [~phunt], [~fpj]



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1997) Why is there a standalone mode

2014-07-28 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076767#comment-14076767
 ] 

Marshall McMullen commented on ZOOKEEPER-1997:
--

With reconfig you still cannot grow from standalone to quorum mode. There are 
many, many use cases for the standalone mode -- most notably for embedded unit 
tests or for non-HA clusters which are used for simulations or test environments 
where we don't need quorum mode.

 Why is there a standalone mode
 --

 Key: ZOOKEEPER-1997
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1997
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Hongchao Deng

 It seems there is a special standalone mode.
 With the coming of reconfig, this doesn't make any sense.
 A single server can also be configured later to add more servers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-19 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037586#comment-14037586
 ] 

Marshall McMullen commented on ZOOKEEPER-1934:
--

[~rgs] - No, we are not using local sessions. 

 Stale data received from sync'd ensemble peer
 -

 Key: ZOOKEEPER-1934
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.5.0
Reporter: Marshall McMullen
 Attachments: node1.log, node2.log, node3.log, node4.log, node5.log


 In our regression testing we encountered an error wherein we were caching a 
 value we read from zookeeper and then experienced session loss. We 
 subsequently got reconnected to a different zookeeper server. When we tried 
 to read the same path from this new zookeeper server we got a stale 
 value.
 Specifically, we are reading /binchanges and originally got back a value of 
 3 from the first server. After we lost connection and reconnected before 
 the session timeout, we then read /binchanges from the new server and got 
 back a value of 2. In our code path we never set this value from 3 to 2. We 
 throw an assertion if the value ever goes backwards. Which is how we caught 
 this error. 
 It's my understanding of the single system image guarantee that this should 
 never be allowed. I realize that the single system image guarantee is still 
 quorum based and it's certainly possible that a minority of the ensemble may 
 have stale data. However, I also believe that each client has to send the 
 highest zxid it's seen as part of its connection request to the server. And 
 if the server it's connecting to has a smaller zxid than the value the client 
 sends, then the connection request should be refused.
 Assuming I have all of that correct, then I'm at a loss for how this 
 happened. 
 The failure happened around Jun  4 08:13:44. Just before that, at June  4 
 08:13:30 there was a round of leader election. During that round of leader 
 election we voted server with id=4 and zxid=0x31c4c. This then led to a 
 new zxid=0x40001. The new leader sends a diff to all the servers 
 including the one we will soon read the stale data from (id=2). Server with 
 ID=2's log files also reflect that as of 08:13:43 it was up to date and 
 current with an UPTODATE message.
 I'm going to attach log files from all 5 ensemble nodes. I also used 
 zktreeutil to dump the database out for the 5 ensemble nodes. I diff'd those, 
 and compared them all for correctness. 1 of the nodes (id=2) has a massively 
 divergent zktreeutil dump from the other 4 nodes even though it received the 
 diff from the new leader.
 In the attachments there are 5 nodes. I will number each log file by its 
 zookeeper id, e.g. node4.log.
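
To make the invariant concrete, a small sketch of the check described above.
ZooKeeper's server is Java; this is only an illustration of the rule, not its
actual code: a server should refuse a session whose client has already seen a
newer zxid than the server has processed.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustration only, not ZooKeeper server code. */
    static bool accept_connect_request(uint64_t client_last_seen_zxid,
                                       uint64_t server_last_processed_zxid)
    {
        /* Refusing here is what keeps a reconnecting client from reading data
         * older than what it has already observed on another server. */
        return client_last_seen_zxid <= server_last_processed_zxid;
    }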



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (ZOOKEEPER-1937) init script needs fixing for ZOOKEEPER-1719

2014-06-12 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen reassigned ZOOKEEPER-1937:


Assignee: Marshall McMullen

 init script needs fixing for ZOOKEEPER-1719
 ---

 Key: ZOOKEEPER-1937
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1937
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.6
 Environment: Linux (Ubuntu 12.04)
Reporter: Nathan Sullivan
Assignee: Marshall McMullen

 ZOOKEEPER-1719 changed the interpreter to bash for zkCli.sh, zkServer.sh and 
 zkEnv.sh, but did not change src/packages/deb/init.d/zookeeper 
 This causes the following failure using /bin/sh
 [...] root@hostname:~# service zookeeper stop
 /etc/init.d/zookeeper: 81: /usr/libexec/zkEnv.sh: Syntax error: ( 
 unexpected (expecting fi)
 Simple fix, change the shebang to #!/bin/bash - tested and works fine.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1937) init script needs fixing for ZOOKEEPER-1719

2014-06-12 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029565#comment-14029565
 ] 

Marshall McMullen commented on ZOOKEEPER-1937:
--

Patch submitted. The one in the rpm directory was actually already using bash, 
but it didn't follow our convention of using /usr/bin/env so I fixed that one 
as well to be consistent.

 init script needs fixing for ZOOKEEPER-1719
 ---

 Key: ZOOKEEPER-1937
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1937
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.6
 Environment: Linux (Ubuntu 12.04)
Reporter: Nathan Sullivan
Assignee: Marshall McMullen
 Attachments: ZOOKEEPER-1719.patch


 ZOOKEEPER-1719 changed the interpreter to bash for zkCli.sh, zkServer.sh and 
 zkEnv.sh, but did not change src/packages/deb/init.d/zookeeper 
 This causes the following failure using /bin/sh
 [...] root@hostname:~# service zookeeper stop
 /etc/init.d/zookeeper: 81: /usr/libexec/zkEnv.sh: Syntax error: ( 
 unexpected (expecting fi)
 Simple fix, change the shebang to #!/bin/bash - tested and works fine.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (ZOOKEEPER-1937) init script needs fixing for ZOOKEEPER-1719

2014-06-12 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-1937:
-

Attachment: ZOOKEEPER-1719.patch

 init script needs fixing for ZOOKEEPER-1719
 ---

 Key: ZOOKEEPER-1937
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1937
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.6
 Environment: Linux (Ubuntu 12.04)
Reporter: Nathan Sullivan
Assignee: Marshall McMullen
 Attachments: ZOOKEEPER-1719.patch


 ZOOKEEPER-1719 changed the interpreter to bash for zkCli.sh, zkServer.sh and 
 zkEnv.sh, but did not change src/packages/deb/init.d/zookeeper 
 This causes the following failure using /bin/sh
 [...] root@hostname:~# service zookeeper stop
 /etc/init.d/zookeeper: 81: /usr/libexec/zkEnv.sh: Syntax error: ( 
 unexpected (expecting fi)
 Simple fix, change the shebang to #!/bin/bash - tested and works fine.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1937) init script needs fixing for ZOOKEEPER-1719

2014-06-12 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029659#comment-14029659
 ] 

Marshall McMullen commented on ZOOKEEPER-1937:
--

No new unit tests added as this only changes the shebang at the top of some 
unused init scripts. The test failure can't possibly be related, but it looks very 
troubling:

 [exec]  [exec] *** glibc detected *** ./zktest-mt: free(): invalid 
pointer: 0x2ba1446d ***


 init script needs fixing for ZOOKEEPER-1719
 ---

 Key: ZOOKEEPER-1937
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1937
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.6
 Environment: Linux (Ubuntu 12.04)
Reporter: Nathan Sullivan
Assignee: Marshall McMullen
 Attachments: ZOOKEEPER-1719.patch


 ZOOKEEPER-1719 changed the interpreter to bash for zkCli.sh, zkServer.sh and 
 zkEnv.sh, but did not change src/packages/deb/init.d/zookeeper 
 This causes the following failure using /bin/sh
 [...] root@hostname:~# service zookeeper stop
 /etc/init.d/zookeeper: 81: /usr/libexec/zkEnv.sh: Syntax error: ( 
 unexpected (expecting fi)
 Simple fix, change the shebang to #!/bin/bash - tested and works fine.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-11 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028178#comment-14028178
 ] 

Marshall McMullen commented on ZOOKEEPER-1934:
--

[~michim] - thanks for looking at this issue. I saw the same code you linked to 
and agree on the intended behavior. The log message in that block of code is 
NOT present. 

We did not see /binchanges update to the correct value of 3. It looked to be 
stuck at 2, which really defies explanation.

 Stale data received from sync'd ensemble peer
 -

 Key: ZOOKEEPER-1934
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.5.0
Reporter: Marshall McMullen
 Attachments: node1.log, node2.log, node3.log, node4.log, node5.log


 In our regression testing we encountered an error wherein we were caching a 
 value we read from zookeeper and then experienced session loss. We 
 subsequently got reconnected to a different zookeeper server. When we tried 
 to read the same path from this new zookeeper server we are getting a stale 
 value.
 Specifically, we are reading /binchanges and originally got back a value of 
 3 from the first server. After we lost connection and reconnected before 
 the session timeout, we then read /binchanges from the new server and got 
 back a value of 2. In our code path we never set this value from 3 to 2. We 
 throw an assertion if the value ever goes backwards. Which is how we caught 
 this error. 
 It's my understanding of the single system image guarantee that this should 
 never be allowed. I realize that the single system image guarantee is still 
 quorum based and it's certainly possible that a minority of the ensemble may 
 have stale data. However, I also believe that each client has to send the 
 highest zxid it's seen as part of its connection request to the server. And 
 if the server it's connecting to has a smaller zxid than the value the client 
 sends, then the connection request should be refused.
 Assuming I have all of that correct, then I'm at a loss for how this 
 happened. 
 The failure happened around Jun  4 08:13:44. Just before that, at June  4 
 08:13:30 there was a round of leader election. During that round of leader 
 election we voted server with id=4 and zxid=0x31c4c. This then led to a 
 new zxid=0x40001. The new leader sends a diff to all the servers 
 including the one we will soon read the stale data from (id=2). Server with 
 ID=2's log files also reflect that as of 08:13:43 it was up to date and 
 current with an UPTODATE message.
 I'm going to attach log files from all 5 ensemble nodes. I also used 
 zktreeutil to dump the database out for the 5 ensemble nodes. I diff'd those, 
 and compared them all for correctness. 1 of the nodes (id=2) has a massively 
 divergent zktreeutil dump than the other 4 nodes even though it received the 
 diff from the new leader.
 In the attachments there are 5 nodes. I will number each log file by it's 
 zookeeper id, e.g. node4.log.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1937) init script needs fixing for ZOOKEEPER-1719

2014-06-10 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027370#comment-14027370
 ] 

Marshall McMullen commented on ZOOKEEPER-1937:
--

[~CpuID] - Yep, looks like the same problem.  I wasn't aware of the file 
src/packages/deb/init.d/zookeeper. But it should probably be fixed in the same 
manner. Do you want to upload a patch? Otherwise I can do so.

 init script needs fixing for ZOOKEEPER-1719
 ---

 Key: ZOOKEEPER-1937
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1937
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.6
 Environment: Linux (Ubuntu 12.04)
Reporter: Nathan Sullivan

 ZOOKEEPER-1719 changed the interpreter to bash for zkCli.sh, zkServer.sh and 
 zkEnv.sh, but did not change src/packages/deb/init.d/zookeeper 
 This causes the following failure using /bin/sh
 [...] root@hostname:~# service zookeeper stop
 /etc/init.d/zookeeper: 81: /usr/libexec/zkEnv.sh: Syntax error: ( 
 unexpected (expecting fi)
 Simple fix, change the shebang to #!/bin/bash - tested and works fine.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-09 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-1934:
-

Affects Version/s: 3.5.0

 Stale data received from sync'd ensemble peer
 -

 Key: ZOOKEEPER-1934
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.5.0
Reporter: Marshall McMullen
 Attachments: node1.log, node2.log, node3.log, node4.log, node5.log


 In our regression testing we encountered an error wherein we were caching a 
 value we read from zookeeper and then experienced session loss. We 
 subsequently got reconnected to a different zookeeper server. When we tried 
 to read the same path from this new zookeeper server we are getting a stale 
 value.
 Specifically, we are reading /binchanges and originally got back a value of 
 3 from the first server. After we lost connection and reconnected before 
 the session timeout, we then read /binchanges from the new server and got 
 back a value of 2. In our code path we never set this value from 3 to 2. We 
 throw an assertion if the value ever goes backwards. Which is how we caught 
 this error. 
 It's my understanding of the single system image guarantee that this should 
 never be allowed. I realize that the single system image guarantee is still 
 quorum based and it's certainly possible that a minority of the ensemble may 
 have stale data. However, I also believe that each client has to send the 
 highest zxid it's seen as part of its connection request to the server. And 
 if the server it's connecting to has a smaller zxid than the value the client 
 sends, then the connection request should be refused.
 Assuming I have all of that correct, then I'm at a loss for how this 
 happened. 
 The failure happened around Jun  4 08:13:44. Just before that, at June  4 
 08:13:30 there was a round of leader election. During that round of leader 
 election we voted server with id=4 and zxid=0x31c4c. This then led to a 
 new zxid=0x40001. The new leader sends a diff to all the servers 
 including the one we will soon read the stale data from (id=2). Server with 
 ID=2's log files also reflect that as of 08:13:43 it was up to date and 
 current with an UPTODATE message.
 I'm going to attach log files from all 5 ensemble nodes. I also used 
 zktreeutil to dump the database out for the 5 ensemble nodes. I diff'd those, 
 and compared them all for correctness. 1 of the nodes (id=2) has a massively 
 divergent zktreeutil dump than the other 4 nodes even though it received the 
 diff from the new leader.
 In the attachments there are 5 nodes. I will number each log file by it's 
 zookeeper id, e.g. node4.log.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-05 Thread Marshall McMullen (JIRA)
Marshall McMullen created ZOOKEEPER-1934:


 Summary: Stale data received from sync'd ensemble peer
 Key: ZOOKEEPER-1934
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Marshall McMullen
 Attachments: node1.log, node2.log, node3.log, node4.log, node5.log

In our regression testing we encountered an error wherein we were caching a 
value we read from zookeeper and then experienced session loss. We subsequently 
got reconnected to a different zookeeper server. When we tried to read the same 
path from this new zookeeper server we are getting a stale value.

Specifically, we are reading /binchanges and originally got back a version of 
4 from the first server. After we lost connection and reconnected before the 
session timeout, we then read /binchanges from the new server and got back a 
value of 3. 

It's my understanding of the single system image guarantee that this should 
never be allowed. I realize that the single system image guarantee is still 
quorum based and it's certainly possible that a minority of the ensemble may 
have stale data. However, I also believe that each client has to send the 
highest zxid it's seen as part of its connection request to the server. And if 
the server it's connecting to has a smaller zxid than the value the client 
sends, then the connection request should be refused.

Assuming I have all of that correct, then I'm at a loss for how this happened. 

The failure happened around Jun  4 08:13:44. Just before that, at June  4 
08:13:30 there was a round of leader election. During that round of leader 
election we voted server with id=4 and zxid=0x31c4c. This then led to a new 
zxid=0x40001. The new leader sends a diff to all the servers including the 
one we will soon read the stale data from (id=2). Server with ID=2's log files 
also reflect that as of 08:13:43 it was up to date and current with an UPTODATE 
message.

I'm going to attach log files from all 5 ensemble nodes. I also used zktreeutil 
to dump the database out for the 5 ensemble nodes. I diff'd those, and compared 
them all for correctness. 1 of the nodes (id=2) has a massively divergent 
zktreeutil dump than the other 4 nodes even though it received the diff from 
the new leader.

In the attachments there are 5 nodes. I will number each log file by it's 
zookeeper id, e.g. node4_zookeeper.log.







--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-05 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-1934:
-

Attachment: node5.log
node4.log
node3.log
node2.log
node1.log

Log files from all 5 ensemble nodes.

 Stale data received from sync'd ensemble peer
 -

 Key: ZOOKEEPER-1934
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Marshall McMullen
 Attachments: node1.log, node2.log, node3.log, node4.log, node5.log


 In our regression testing we encountered an error wherein we were caching a 
 value we read from zookeeper and then experienced session loss. We 
 subsequently got reconnected to a different zookeeper server. When we tried 
 to read the same path from this new zookeeper server we are getting a stale 
 value.
 Specifically, we are reading /binchanges and originally got back a version 
 of 4 from the first server. After we lost connection and reconnected before 
 the session timeout, we then read /binchanges from the new server and got 
 back a value of 3. 
 It's my understanding of the single system image guarantee that this should 
 never be allowed. I realize that the single system image guarantee is still 
 quorum based and it's certainly possible that a minority of the ensemble may 
 have stale data. However, I also believe that each client has to send the 
 highest zxid it's seen as part of its connection request to the server. And 
 if the server it's connecting to has a smaller zxid than the value the client 
 sends, then the connection request should be refused.
 Assuming I have all of that correct, then I'm at a loss for how this 
 happened. 
 The failure happened around Jun  4 08:13:44. Just before that, at June  4 
 08:13:30 there was a round of leader election. During that round of leader 
 election we voted server with id=4 and zxid=0x31c4c. This then led to a 
 new zxid=0x40001. The new leader sends a diff to all the servers 
 including the one we will soon read the stale data from (id=2). Server with 
 ID=2's log files also reflect that as of 08:13:43 it was up to date and 
 current with an UPTODATE message.
 I'm going to attach log files from all 5 ensemble nodes. I also used 
 zktreeutil to dump the database out for the 5 ensemble nodes. I diff'd those, 
 and compared them all for correctness. 1 of the nodes (id=2) has a massively 
 divergent zktreeutil dump than the other 4 nodes even though it received the 
 diff from the new leader.
 In the attachments there are 5 nodes. I will number each log file by it's 
 zookeeper id, e.g. node4_zookeeper.log.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-05 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019389#comment-14019389
 ] 

Marshall McMullen commented on ZOOKEEPER-1934:
--

Diffing the zktreeutil dumps of each server is also interesting. There are a 
few minor differences with local sessions:

diff -a node1.zktree node3.zktree 
8933,8934d8932
 |--[144115323715452941]
 |   
9162,9163d9159
 |   
 |--[72058779056865292]

diff -a node1.zktree node4.zktree 
8933,8934d8932
 |--[144115323715452941]
 |   
9005,9006d9002
 |--[216173168961912851]
 |   
9162,9163d9157
 |   
 |--[72058779056865292]

diff -a node1.zktree node5.zktree 
8933,8934d8932
 |--[144115323715452941]
 |   
9005,9006d9002
 |--[216173168961912851]
 |   
9065,9066d9060
 |--[288230547757793293]
 |   
9162,9163d9155
 |   
 |--[72058779056865292]

Whereas node2 is MASSIVELY different.

In particular, the /binchanges value is different:

|--[binchanges]                        |--[binchanges]
|   |                                  |   |
|   |--[version = 3]                 | |   |--[version = 2]




 Stale data received from sync'd ensemble peer
 -

 Key: ZOOKEEPER-1934
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Marshall McMullen
 Attachments: node1.log, node2.log, node3.log, node4.log, node5.log


 In our regression testing we encountered an error wherein we were caching a 
 value we read from zookeeper and then experienced session loss. We 
 subsequently got reconnected to a different zookeeper server. When we tried 
 to read the same path from this new zookeeper server we are getting a stale 
 value.
 Specifically, we are reading /binchanges and originally got back a version 
 of 4 from the first server. After we lost connection and reconnected before 
 the session timeout, we then read /binchanges from the new server and got 
 back a value of 3. 
 It's my understanding of the single system image guarantee that this should 
 never be allowed. I realize that the single system image guarantee is still 
 quorum based and it's certainly possible that a minority of the ensemble may 
 have stale data. However, I also believe that each client has to send the 
 highest zxid it's seen as part of its connection request to the server. And 
 if the server it's connecting to has a smaller zxid than the value the client 
 sends, then the connection request should be refused.
 Assuming I have all of that correct, then I'm at a loss for how this 
 happened. 
 The failure happened around Jun  4 08:13:44. Just before that, at June  4 
 08:13:30 there was a round of leader election. During that round of leader 
 election we voted server with id=4 and zxid=0x31c4c. This then led to a 
 new zxid=0x40001. The new leader sends a diff to all the servers 
 including the one we will soon read the stale data from (id=2). Server with 
 ID=2's log files also reflect that as of 08:13:43 it was up to date and 
 current with an UPTODATE message.
 I'm going to attach log files from all 5 ensemble nodes. I also used 
 zktreeutil to dump the database out for the 5 ensemble nodes. I diff'd those, 
 and compared them all for correctness. 1 of the nodes (id=2) has a massively 
 divergent zktreeutil dump than the other 4 nodes even though it received the 
 diff from the new leader.
 In the attachments there are 5 nodes. I will number each log file by it's 
 zookeeper id, e.g. node4_zookeeper.log.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-05 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019392#comment-14019392
 ] 

Marshall McMullen commented on ZOOKEEPER-1934:
--

Yet before we grabbed this data, the offending node (nodeid=2, myid=2) stated 
this:

1723 Jun  4 08:13:30 zookeeper - INFO  
[QuorumPeer[myid=2]/10.26.65.47:2181:ZooKeeperServer@156] - Created server with 
tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir 
/sf/data/zoo
1724 Jun  4 08:13:30 zookeeper - INFO  
[QuorumPeer[myid=2]/10.26.65.47:2181:Follower@66] - FOLLOWING - LEADER ELECTION 
TOOK - -1401867542249
1725 Jun  4 08:13:30 zookeeper - WARN  
[QuorumPeer[myid=2]/10.26.65.47:2181:Learner@240] - Unexpected exception, 
tries=0, connecting to /10.26.65.103:2182
1726 Jun  4 08:13:30 localhost at 
org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:232)
1727 Jun  4 08:13:30 localhost at 
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:74)
1728 Jun  4 08:13:30 localhost at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:967)
1729 Jun  4 08:13:30 zookeeper - INFO  
[NIOServerCxnFactory.AcceptThread:/10.26.65.47:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.26.65.103:35987
1730 Jun  4 08:13:30 zookeeper - WARN  [NIOWorkerThread-37:NIOServerCnxn@365] - 
Exception causing close of session 0x0: ZooKeeperServer not running
1731 Jun  4 08:13:30 zookeeper - INFO  [NIOWorkerThread-37:NIOServerCnxn@999] - 
Closed socket connection for client /10.26.65.103:35987 (no session established 
for client)
1732 Jun  4 08:13:31 zookeeper - INFO  
[NIOServerCxnFactory.AcceptThread:/10.26.65.47:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.26.65.103:59764
1733 Jun  4 08:13:31 zookeeper - WARN  [NIOWorkerThread-38:NIOServerCnxn@365] - 
Exception causing close of session 0x0: ZooKeeperServer not running
1734 Jun  4 08:13:31 zookeeper - INFO  [NIOWorkerThread-38:NIOServerCnxn@999] - 
Closed socket connection for client /10.26.65.103:59764 (no session established 
for client)
1735 Jun  4 08:13:31 zookeeper - INFO  
[NIOServerCxnFactory.AcceptThread:/10.26.65.47:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.26.65.103:51005
1736 Jun  4 08:13:31 zookeeper - WARN  [NIOWorkerThread-39:NIOServerCnxn@365] - 
Exception causing close of session 0x0: ZooKeeperServer not running
1737 Jun  4 08:13:31 zookeeper - INFO  [NIOWorkerThread-39:NIOServerCnxn@999] - 
Closed socket connection for client /10.26.65.103:51005 (no session established 
for client)
1738 Jun  4 08:13:31 zookeeper - INFO  
[NIOServerCxnFactory.AcceptThread:/10.26.65.47:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.26.65.3:39628
1739 Jun  4 08:13:31 zookeeper - WARN  [NIOWorkerThread-40:NIOServerCnxn@365] - 
Exception causing close of session 0x0: ZooKeeperServer not running
1740 Jun  4 08:13:31 zookeeper - INFO  [NIOWorkerThread-40:NIOServerCnxn@999] - 
Closed socket connection for client /10.26.65.3:39628 (no session established 
for client)
1741 Jun  4 08:13:31 zookeeper - INFO  
[NIOServerCxnFactory.AcceptThread:/10.26.65.47:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.26.65.3:47705
1742 Jun  4 08:13:31 zookeeper - WARN  [NIOWorkerThread-41:NIOServerCnxn@365] - 
Exception causing close of session 0x0: ZooKeeperServer not running
1743 Jun  4 08:13:31 zookeeper - INFO  [NIOWorkerThread-41:NIOServerCnxn@999] - 
Closed socket connection for client /10.26.65.3:47705 (no session established 
for client)
1744 Jun  4 08:13:31 zookeeper - INFO  
[NIOServerCxnFactory.AcceptThread:/10.26.65.47:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.26.65.3:34353
1745 Jun  4 08:13:31 zookeeper - WARN  [NIOWorkerThread-42:NIOServerCnxn@365] - 
Exception causing close of session 0x0: ZooKeeperServer not running
1746 Jun  4 08:13:31 zookeeper - INFO  [NIOWorkerThread-42:NIOServerCnxn@999] - 
Closed socket connection for client /10.26.65.3:34353 (no session established 
for client)
1747 Jun  4 08:13:31 zookeeper - INFO  
[QuorumPeer[myid=2]/10.26.65.47:2181:Learner@332] - Getting a diff from the 
leader 0x31c4c
1748 Jun  4 08:13:31 zookeeper - INFO  
[QuorumPeer[myid=2]/10.26.65.47:2181:Learner@475] - Learner received NEWLEADER 
message
1749 Jun  4 08:13:31 zookeeper - WARN  
[QuorumPeer[myid=2]/10.26.65.47:2181:QuorumPeer@1271] - 
setLastSeenQuorumVerifier called with stale config 4294967296. Current version: 
4294967296
1750 Jun  4 08:13:31 zookeeper - INFO  
[QuorumPeer[myid=2]/10.26.65.47:2181:FileTxnSnapLog@297] - Snapshotting: 
0x31c4c to /sf/data/zookeeper/10.26.65.47/version-2/snapshot.31c4c
1751 Jun  4 08:13:31 zookeeper - INFO  
[QuorumPeer[myid=2]/10.26.65.47:2181:Learner@460] - Learner received

[jira] [Updated] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-05 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-1934:
-

Description: 
In our regression testing we encountered an error wherein we were caching a 
value we read from zookeeper and then experienced session loss. We subsequently 
got reconnected to a different zookeeper server. When we tried to read the same 
path from this new zookeeper server we are getting a stale value.

Specifically, we are reading /binchanges and originally got back a value of 
4 from the first server. After we lost connection and reconnected before the 
session timeout, we then read /binchanges from the new server and got back a 
value of 3. 

It's my understanding of the single system image guarantee that this should 
never be allowed. I realize that the single system image guarantee is still 
quorum based and it's certainly possible that a minority of the ensemble may 
have stale data. However, I also believe that each client has to send the 
highest zxid it's seen as part of its connection request to the server. And if 
the server it's connecting to has a smaller zxid than the value the client 
sends, then the connection request should be refused.

Assuming I have all of that correct, then I'm at a loss for how this happened. 

The failure happened around Jun  4 08:13:44. Just before that, at June  4 
08:13:30 there was a round of leader election. During that round of leader 
election we voted server with id=4 and zxid=0x31c4c. This then led to a new 
zxid=0x40001. The new leader sends a diff to all the servers including the 
one we will soon read the stale data from (id=2). Server with ID=2's log files 
also reflect that as of 08:13:43 it was up to date and current with an UPTODATE 
message.

I'm going to attach log files from all 5 ensemble nodes. I also used zktreeutil 
to dump the database out for the 5 ensemble nodes. I diff'd those, and compared 
them all for correctness. 1 of the nodes (id=2) has a massively divergent 
zktreeutil dump than the other 4 nodes even though it received the diff from 
the new leader.

In the attachments there are 5 nodes. I will number each log file by it's 
zookeeper id, e.g. node4_zookeeper.log.





  was:
In our regression testing we encountered an error wherein we were caching a 
value we read from zookeeper and then experienced session loss. We subsequently 
got reconnected to a different zookeeper server. When we tried to read the same 
path from this new zookeeper server we are getting a stale value.

Specifically, we are reading /binchanges and originally got back a version of 
4 from the first server. After we lost connection and reconnected before the 
session timeout, we then read /binchanges from the new server and got back a 
value of 3. 

It's my understanding of the single system image guarantee that this should 
never be allowed. I realize that the single system image guarantee is still 
quorum based and it's certainly possible that a minority of the ensemble may 
have stale data. However, I also believe that each client has to send the 
highest zxid it's seen as part of its connection request to the server. And if 
the server it's connecting to has a smaller zxid than the value the client 
sends, then the connection request should be refused.

Assuming I have all of that correct, then I'm at a loss for how this happened. 

The failure happened around Jun  4 08:13:44. Just before that, at June  4 
08:13:30 there was a round of leader election. During that round of leader 
election we voted server with id=4 and zxid=0x31c4c. This then led to a new 
zxid=0x40001. The new leader sends a diff to all the servers including the 
one we will soon read the stale data from (id=2). Server with ID=2's log files 
also reflect that as of 08:13:43 it was up to date and current with an UPTODATE 
message.

I'm going to attach log files from all 5 ensemble nodes. I also used zktreeutil 
to dump the database out for the 5 ensemble nodes. I diff'd those, and compared 
them all for correctness. 1 of the nodes (id=2) has a massively divergent 
zktreeutil dump than the other 4 nodes even though it received the diff from 
the new leader.

In the attachments there are 5 nodes. I will number each log file by it's 
zookeeper id, e.g. node4_zookeeper.log.






 Stale data received from sync'd ensemble peer
 -

 Key: ZOOKEEPER-1934
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Marshall McMullen
 Attachments: node1.log, node2.log, node3.log, node4.log, node5.log


 In our regression testing we encountered an error wherein we were caching a 
 value we read from zookeeper and then experienced session loss. We 
 subsequently got reconnected to a different

[jira] [Updated] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-05 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-1934:
-

Description: 
In our regression testing we encountered an error wherein we were caching a 
value we read from zookeeper and then experienced session loss. We subsequently 
got reconnected to a different zookeeper server. When we tried to read the same 
path from this new zookeeper server we are getting a stale value.

Specifically, we are reading /binchanges and originally got back a value of 
4 from the first server. After we lost connection and reconnected before the 
session timeout, we then read /binchanges from the new server and got back a 
value of 3. 

It's my understanding of the single system image guarantee that this should 
never be allowed. I realize that the single system image guarantee is still 
quorum based and it's certainly possible that a minority of the ensemble may 
have stale data. However, I also believe that each client has to send the 
highest zxid it's seen as part of its connection request to the server. And if 
the server it's connecting to has a smaller zxid than the value the client 
sends, then the connection request should be refused.

Assuming I have all of that correct, then I'm at a loss for how this happened. 

The failure happened around Jun  4 08:13:44. Just before that, at June  4 
08:13:30 there was a round of leader election. During that round of leader 
election we voted server with id=4 and zxid=0x31c4c. This then led to a new 
zxid=0x40001. The new leader sends a diff to all the servers including the 
one we will soon read the stale data from (id=2). Server with ID=2's log files 
also reflect that as of 08:13:43 it was up to date and current with an UPTODATE 
message.

I'm going to attach log files from all 5 ensemble nodes. I also used zktreeutil 
to dump the database out for the 5 ensemble nodes. I diff'd those, and compared 
them all for correctness. 1 of the nodes (id=2) has a massively divergent 
zktreeutil dump than the other 4 nodes even though it received the diff from 
the new leader.

In the attachments there are 5 nodes. I will number each log file by it's 
zookeeper id, e.g. node4.log.





  was:
In our regression testing we encountered an error wherein we were caching a 
value we read from zookeeper and then experienced session loss. We subsequently 
got reconnected to a different zookeeper server. When we tried to read the same 
path from this new zookeeper server we are getting a stale value.

Specifically, we are reading /binchanges and originally got back a value of 
4 from the first server. After we lost connection and reconnected before the 
session timeout, we then read /binchanges from the new server and got back a 
value of 3. 

It's my understanding of the single system image guarantee that this should 
never be allowed. I realize that the single system image guarantee is still 
quorum based and it's certainly possible that a minority of the ensemble may 
have stale data. However, I also believe that each client has to send the 
highest zxid it's seen as part of its connection request to the server. And if 
the server it's connecting to has a smaller zxid than the value the client 
sends, then the connection request should be refused.

Assuming I have all of that correct, then I'm at a loss for how this happened. 

The failure happened around Jun  4 08:13:44. Just before that, at June  4 
08:13:30 there was a round of leader election. During that round of leader 
election we voted server with id=4 and zxid=0x31c4c. This then led to a new 
zxid=0x40001. The new leader sends a diff to all the servers including the 
one we will soon read the stale data from (id=2). Server with ID=2's log files 
also reflect that as of 08:13:43 it was up to date and current with an UPTODATE 
message.

I'm going to attach log files from all 5 ensemble nodes. I also used zktreeutil 
to dump the database out for the 5 ensemble nodes. I diff'd those, and compared 
them all for correctness. 1 of the nodes (id=2) has a massively divergent 
zktreeutil dump than the other 4 nodes even though it received the diff from 
the new leader.

In the attachments there are 5 nodes. I will number each log file by it's 
zookeeper id, e.g. node4_zookeeper.log.






 Stale data received from sync'd ensemble peer
 -

 Key: ZOOKEEPER-1934
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Marshall McMullen
 Attachments: node1.log, node2.log, node3.log, node4.log, node5.log


 In our regression testing we encountered an error wherein we were caching a 
 value we read from zookeeper and then experienced session loss. We 
 subsequently got reconnected to a different zookeeper server

[jira] [Updated] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-05 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-1934:
-

Description: 
In our regression testing we encountered an error wherein we were caching a 
value we read from zookeeper and then experienced session loss. We subsequently 
got reconnected to a different zookeeper server. When we tried to read the same 
path from this new zookeeper server we are getting a stale value.

Specifically, we are reading /binchanges and originally got back a value of 
3 from the first server. After we lost connection and reconnected before the 
session timeout, we then read /binchanges from the new server and got back a 
value of 2. In our code path we never set this value from 3 to 2. We throw an 
assertion if the value ever goes backwards. Which is how we caught this error. 

It's my understanding of the single system image guarantee that this should 
never be allowed. I realize that the single system image guarantee is still 
quorum based and it's certainly possible that a minority of the ensemble may 
have stale data. However, I also believe that each client has to send the 
highest zxid it's seen as part of its connection request to the server. And if 
the server it's connecting to has a smaller zxid than the value the client 
sends, then the connection request should be refused.

Assuming I have all of that correct, then I'm at a loss for how this happened. 

The failure happened around Jun  4 08:13:44. Just before that, at June  4 
08:13:30 there was a round of leader election. During that round of leader 
election we voted server with id=4 and zxid=0x31c4c. This then led to a new 
zxid=0x40001. The new leader sends a diff to all the servers including the 
one we will soon read the stale data from (id=2). Server with ID=2's log files 
also reflect that as of 08:13:43 it was up to date and current with an UPTODATE 
message.

I'm going to attach log files from all 5 ensemble nodes. I also used zktreeutil 
to dump the database out for the 5 ensemble nodes. I diff'd those, and compared 
them all for correctness. 1 of the nodes (id=2) has a massively divergent 
zktreeutil dump than the other 4 nodes even though it received the diff from 
the new leader.

In the attachments there are 5 nodes. I will number each log file by it's 
zookeeper id, e.g. node4.log.





  was:
In our regression testing we encountered an error wherein we were caching a 
value we read from zookeeper and then experienced session loss. We subsequently 
got reconnected to a different zookeeper server. When we tried to read the same 
path from this new zookeeper server we are getting a stale value.

Specifically, we are reading /binchanges and originally got back a value of 
4 from the first server. After we lost connection and reconnected before the 
session timeout, we then read /binchanges from the new server and got back a 
value of 3. 

It's my understanding of the single system image guarantee that this should 
never be allowed. I realize that the single system image guarantee is still 
quorum based and it's certainly possible that a minority of the ensemble may 
have stale data. However, I also believe that each client has to send the 
highest zxid it's seen as part of its connection request to the server. And if 
the server it's connecting to has a smaller zxid than the value the client 
sends, then the connection request should be refused.

Assuming I have all of that correct, then I'm at a loss for how this happened. 

The failure happened around Jun  4 08:13:44. Just before that, at June  4 
08:13:30 there was a round of leader election. During that round of leader 
election we voted server with id=4 and zxid=0x31c4c. This then led to a new 
zxid=0x40001. The new leader sends a diff to all the servers including the 
one we will soon read the stale data from (id=2). Server with ID=2's log files 
also reflect that as of 08:13:43 it was up to date and current with an UPTODATE 
message.

I'm going to attach log files from all 5 ensemble nodes. I also used zktreeutil 
to dump the database out for the 5 ensemble nodes. I diff'd those, and compared 
them all for correctness. 1 of the nodes (id=2) has a massively divergent 
zktreeutil dump than the other 4 nodes even though it received the diff from 
the new leader.

In the attachments there are 5 nodes. I will number each log file by it's 
zookeeper id, e.g. node4.log.






 Stale data received from sync'd ensemble peer
 -

 Key: ZOOKEEPER-1934
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Marshall McMullen
 Attachments: node1.log, node2.log, node3.log, node4.log, node5.log


 In our regression testing we encountered an error wherein we were

Re: [ANNOUNCE] New ZooKeeper committer: Rakesh R

2014-05-16 Thread Marshall McMullen
Congrats Rakesh!


On Fri, May 16, 2014 at 10:49 AM, Patrick Hunt ph...@apache.org wrote:

 The Apache ZooKeeper PMC recently extended committer karma to Rakesh
 and he has accepted. Rakesh has made some great contributions and we
 are looking forward to even more :)

 Congratulations and welcome aboard, Rakesh!

 Patrick



Re: [Release 3.5.0] Any news yet?

2014-04-08 Thread Marshall McMullen
Agree, sounds like a great plan to me as well.  Once we get an alpha
release we can do some internal stress testing against it in our lab to
help give higher confidence of its quality.


On Tue, Apr 8, 2014 at 11:19 PM, Michi Mutsuzaki mi...@cs.stanford.eduwrote:

 Sounds like a great plan.

 There is a patch available for review for flaky startSingleServerTest.

 https://issues.apache.org/jira/browse/ZOOKEEPER-1870

 On Tue, Apr 8, 2014 at 10:12 PM, Patrick Hunt ph...@apache.org wrote:
  I see some great progress in closing out jiras. Kudos all. We
  currently have 12 blockers for 3.5.0, 5 are listed as PA. Let's keep
  plugging away on these.
 
  The CI environment has gone a bit red of late. On my personal servers
  I'm seeing it mostly in startSingleServerTest though, so perhaps
  it's localized. Would be good to nail down the flakey tests. I'll try
  looking into this more (what the flakeys are) and report back. Apache
  CI seems a bit unstable of late (the jenkins env I mean)
 
  I'm thinking we might start creating some Alpha releases, e.g.
  zookeeper-3.5.0-alpha so that we can get the code into folks hands.
  They can try it out and give us feedback. It would be alpha quality
  though, not for production. APIs and new functionality might still
  change in a non-backward compatible way. We could do this even though
  we still have some blockers remaining. Once all the blockers are
  resolved we could move to beta status. The APIs, etc... would then
  be locked. Once things settle out during the beta cycle we could then
  move off beta. We'd only make 3.5 stable once we feel comfortable
  with its quality and after collecting feedback from the community.
  This is basically what we did for 3.4 branch (and similar to what some
  other projects do). What do you think?
 
  Patrick
 
  On Thu, Apr 3, 2014 at 5:33 PM, Michi Mutsuzaki mi...@cs.stanford.edu
 wrote:
  ... there is one more. I just canceled it because the patch needs to be
 rebased.
 
  https://issues.apache.org/jira/browse/ZOOKEEPER-1794
 
  On Thu, Apr 3, 2014 at 5:31 PM, Michi Mutsuzaki mi...@cs.stanford.edu
 wrote:
  There are several large PAs:
 
  https://issues.apache.org/jira/browse/ZOOKEEPER-1172
  https://issues.apache.org/jira/browse/ZOOKEEPER-1346
  https://issues.apache.org/jira/browse/ZOOKEEPER-1607
  https://issues.apache.org/jira/browse/ZOOKEEPER-1907
 
 
  I think I can review ZOOKEEPER-1907 and get it in for 3.5.0, but we
  need shepherds for the other 3 JIRAs. Let me know if anybody has
  cycles to review and check in these patches. Otherwise I'll push them
  out of 3.5.0.
 
  Thanks!
  --Michi
 
  On Thu, Mar 20, 2014 at 1:56 PM, Alexander Shraer shra...@gmail.com
 wrote:
  right - looks like there's a patch there, waiting for review.
 
 
  On Thu, Mar 20, 2014 at 1:52 PM, Raúl Gutiérrez Segalés 
 r...@itevenworks.net
  wrote:
 
  On 20 March 2014 13:51, Raúl Gutiérrez Segalés r...@itevenworks.net
  wrote:
 
   what about https://issues.apache.org/jira/browse/ZOOKEEPER-1807?
  
 
  More context: people trying to use Observers in 3.5.0 without that
 fixed
  will have issues, for sure.
 
 
  -rgs
 
 
 
  
  
   -rgs
  
  
   On 20 March 2014 13:48, Alexander Shraer shra...@gmail.com
 wrote:
  
   thanks Patrick!
  
   regarding dynamic reconfig, IMHO we only have 2 blockers:
   - add JMX support
   (ZOOKEEPER-1659
 https://issues.apache.org/jira/browse/ZOOKEEPER-1659
   )
   - change leader timeout mechanism to give up when there's no
 quorum of
   last
   proposed configuration
   (ZOOKEEPER-1699
 https://issues.apache.org/jira/browse/ZOOKEEPER-1699
   )
  
   any help with either is greatly appreciated.
  
   On Thu, Mar 20, 2014 at 1:32 PM, Patrick Hunt ph...@apache.org
 wrote:
  
I'm resurrecting this thread now that 3.4.6 is out the door. I'm
assuming that we might do a 3.4.7 at some point, but that
 shouldn't
hold up releasing a 3.5.0.
   
We discussed a number of good ideas previously in this thread
 wrt what
we should do for 3.5.0. Any further thoughts?
   
A big part of the planning will be to clean up Jira and
 figuring out
what we need to finish. I'll start looking at that but if
 anyone else
has any ideas of can clean up jiras they are familiar with it
 would be
helpful.
   
Here's the list currently slated for 3.5.0 (258 total!):
http://bit.ly/1ijYAJF
   
There are 8 listed blockers at the moment. Most of which have
 to do
with the introduction of dynamic reconfiguration. Only one jira
 is PA.
   
There are 52 PA jiras currently slated for 3.5.0:
  http://bit.ly/PV21NA
   
Patrick
   
On Wed, Jul 10, 2013 at 1:38 PM, Flavio Junqueira 
   fpjunque...@yahoo.com
wrote:
 Sure, fine with me.

 -Flavio

 On Jul 10, 2013, at 7:31 PM, Mahadev Konar 
 maha...@hortonworks.com
  
wrote:

 It would be good if Flavio wants to try doing the RM. Flavio?

 thanks
 mahadev

 On Wed, Jul 10, 2013 at 10:20 AM, 

[jira] [Commented] (ZOOKEEPER-1167) C api lacks synchronous version of sync() call.

2014-03-12 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13932459#comment-13932459
 ] 

Marshall McMullen commented on ZOOKEEPER-1167:
--

[~michim] - I'm not sure I agree. Ben's comments specifically state that this 
is not strictly required for the consistency protocol ZK provides. But if you 
are communicating through some other mechanism and you want to guarantee those 
two clients are synchronized, then this would be useful. Granted, the 
application layer can provide its own wrapper around zoo_async to provide this 
functionality, so I think the use case is easier integration into higher-level 
clients. That, and consistency, since this is the only API in the C bindings 
without a synchronous variant. I'm still happy to add tests around this and 
also add a Java implementation; I just lost sight of this one.
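
For illustration, a minimal sketch of the kind of application-layer wrapper 
mentioned above, assuming the multithreaded C client; the latch struct and the 
my_zoo_sync() name are hypothetical, not part of the ZooKeeper API:

{code}
#include <pthread.h>
#include <zookeeper/zookeeper.h>

/* Hypothetical latch used to block the caller until the async completion fires. */
struct sync_latch {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             done;
    int             rc;
};

static void sync_completion(int rc, const char *value, const void *data)
{
    struct sync_latch *latch = (struct sync_latch *)data;
    (void)value;
    pthread_mutex_lock(&latch->lock);
    latch->rc = rc;
    latch->done = 1;
    pthread_cond_signal(&latch->cond);
    pthread_mutex_unlock(&latch->lock);
}

/* Blocking sync built on zoo_async(); returns the completion's result code. */
int my_zoo_sync(zhandle_t *zh, const char *path)
{
    struct sync_latch latch = { PTHREAD_MUTEX_INITIALIZER,
                                PTHREAD_COND_INITIALIZER, 0, 0 };
    int rc = zoo_async(zh, path, sync_completion, &latch);
    if (rc != ZOK)
        return rc;
    pthread_mutex_lock(&latch.lock);
    while (!latch.done)
        pthread_cond_wait(&latch.cond, &latch.lock);
    pthread_mutex_unlock(&latch.lock);
    return latch.rc;
}
{code}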

 C api lacks synchronous version of sync() call.
 ---

 Key: ZOOKEEPER-1167
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1167
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.3.3, 3.4.3, 3.5.0
Reporter: Nicholas Harteau
Assignee: Marshall McMullen
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1167.patch


 Reading through the source, the C API implements zoo_async() which is the 
 zookeeper sync() method implemented in the multithreaded/asynchronous C API.  
 It doesn't implement anything equivalent in the non-multithreaded API.
 I'm not sure if this was oversight or intentional, but it means that the 
 non-multithreaded API can't guarantee consistent client views on critical 
 reads.
 The zkperl bindings depend on the synchronous, non-multithreaded API so also 
 can't call sync() currently.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1167) C api lacks synchronous version of sync() call.

2014-03-12 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13932475#comment-13932475
 ] 

Marshall McMullen commented on ZOOKEEPER-1167:
--

[~michim] - thanks. Plus I've already patched our internal version of zookeeper 
so our application doesn't have to do this and would hate to have to maintain 
that forever :). I'll get an updated patch together so we can finish this one 
off.

 C api lacks synchronous version of sync() call.
 ---

 Key: ZOOKEEPER-1167
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1167
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.3.3, 3.4.3, 3.5.0
Reporter: Nicholas Harteau
Assignee: Marshall McMullen
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1167.patch


 Reading through the source, the C API implements zoo_async() which is the 
 zookeeper sync() method implemented in the multithreaded/asynchronous C API.  
 It doesn't implement anything equivalent in the non-multithreaded API.
 I'm not sure if this was oversight or intentional, but it means that the 
 non-multithreaded API can't guarantee consistent client views on critical 
 reads.
 The zkperl bindings depend on the synchronous, non-multithreaded API so also 
 can't call sync() currently.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1167) C api lacks synchronous version of sync() call.

2014-03-12 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13932534#comment-13932534
 ] 

Marshall McMullen commented on ZOOKEEPER-1167:
--

Agreed.

 C api lacks synchronous version of sync() call.
 ---

 Key: ZOOKEEPER-1167
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1167
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.3.3, 3.4.3, 3.5.0
Reporter: Nicholas Harteau
Assignee: Marshall McMullen
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1167.patch


 Reading through the source, the C API implements zoo_async() which is the 
 zookeeper sync() method implemented in the multithreaded/asynchronous C API.  
 It doesn't implement anything equivalent in the non-multithreaded API.
 I'm not sure if this was oversight or intentional, but it means that the 
 non-multithreaded API can't guarantee consistent client views on critical 
 reads.
 The zkperl bindings depend on the synchronous, non-multithreaded API so also 
 can't call sync() currently.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1855) calls to zoo_set_server() fail to flush outstanding request queue.

2014-01-02 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13860961#comment-13860961
 ] 

Marshall McMullen commented on ZOOKEEPER-1855:
--

As a workaround, what happens if the client issues a sync (and waits for it to 
complete) before calling zoo_set_servers?
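
A minimal sketch of that suggestion, assuming the multithreaded C client and 
reusing the blocking-sync idea from the ZOOKEEPER-1167 discussion above 
(my_zoo_sync() is a hypothetical application-level helper, not part of the 
client API):

{code}
/* Hypothetical workaround: force a full round trip so no requests are still
 * in flight when the connection is dropped by the server-list change. */
int set_servers_after_sync(zhandle_t *zh, const char *new_hosts)
{
    int rc = my_zoo_sync(zh, "/");   /* blocking wrapper around zoo_async() */
    if (rc != ZOK)
        return rc;
    return zoo_set_servers(zh, new_hosts);
}
{code}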

 calls to zoo_set_server() fail to flush outstanding request queue.
 --

 Key: ZOOKEEPER-1855
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1855
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Reporter: Dutch T. Meyer
Priority: Minor

 If one calls zoo_set_servers to update with a new server list that does not 
 contain the currently connected server, the client will disconnect.  Fair 
 enough, but any outstanding requests on the set_requests queue aren't 
 completed, so the next completed request from the new server can fail with an 
 out-of-order XID error.
 The disconnect occurs in update_addrs(), when a reconfig is necessary, though 
 it's not quite as easy as just calling cleanup_bufs there, because you could 
 then race the call to dequeue_completion in zookeeper_process and pull NULL 
 entries for a recently completed request
 I don't have a patch for this right now, but I do have a simple repro I can 
 post when time permits.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: [jira] [Commented] (ZOOKEEPER-1057) zookeeper c-client, connection to offline server fails to successfully fallback to second zk host

2013-12-20 Thread Marshall McMullen
The logic of how we connect to servers in trunk (3.5.0) is substantially
different from what was in 3.4.6. Has this bug been seen in 3.4.6 or on trunk?


On Fri, Dec 20, 2013 at 4:14 PM, Flavio Junqueira (JIRA) j...@apache.orgwrote:


 [
 https://issues.apache.org/jira/browse/ZOOKEEPER-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13854652#comment-13854652]

 Flavio Junqueira commented on ZOOKEEPER-1057:
 -

 If this is a change due to reconfig, do we really need to block 3.4.6?

  zookeeper c-client, connection to offline server fails to successfully
 fallback to second zk host
 
 -
 
  Key: ZOOKEEPER-1057
  URL:
 https://issues.apache.org/jira/browse/ZOOKEEPER-1057
  Project: ZooKeeper
   Issue Type: Bug
   Components: c client
 Affects Versions: 3.3.1, 3.3.2, 3.3.3
  Environment: snowdutyrise-lm ~/- uname -a
  Darwin snowdutyrise-lm 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15
 16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386
  also observed on:
  2.6.35-28-server 49-Ubuntu SMP Tue Mar 1 14:55:37 UTC 2011
 Reporter: Woody Anderson
 Assignee: Michi Mutsuzaki
 Priority: Blocker
  Fix For: 3.4.6, 3.5.0
 
  Attachments: ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch
 
 
  Hello, I'm a contributor for the node.js zookeeper module:
 https://github.com/yfinkelstein/node-zookeeper
  i'm using zk 3.3.3 for the purposes of this issue, but i have validated
 it fails on 3.3.1 and 3.3.2
  i'm having an issue when trying to connect when one of my zookeeper
 servers is offline.
  if the first server attempted is online, all is good.
  if the offline server is attempted first, then the client is never able
 to connect to _any_ server.
  inside zookeeper.c a connection loss (-4) is received, the socket is
 closed and buffers are cleaned up, it then attempts the next server in the
 list, creates a new socket (which gets the same fd as the previously closed
 socket) and connecting fails, and it continues to fail seemingly forever.
  The nature of this fail is not that it gets -4 connection loss errors,
 but that zookeeper_interest doesn't find anything going on on the socket
 before the user provided timeout kicks things out. I don't want to have to
 wait 5 minutes, even if i could make myself.
  this is the message that follows the connection loss:
  2011-04-27 23:18:28,355:13485:ZOO_ERROR@handle_socket_error_msg@1530:
 Socket [127.0.0.1:5020] zk retcode=-7, errno=60(Operation timed out):
 connection timed out (exceeded timeout by 3ms)
  2011-04-27 23:18:28,355:13485:ZOO_ERROR@yield@213:
 yield:zookeeper_interest returned error: -7 - operation timeout
  While investigating, i decided to comment out close(zh-fd) in
 handle_error (zookeeper.c#1153)
  now everything works (obviously i'm leaking an fd). Connection the the
 second host works immediately.
  this is the behavior i'm looking for, though i clearly don't want to
 leak the fd, so i'm wondering why the fd re-use is causing this issue.
  close() is not returning an error (i checked even though current code
 assumes success).
  i'm on osx 10.6.7
  i tried adding a setsockopt so_linger (though i didn't want that to be a
 solution), it didn't work.
  full debug traces are included in issue here:
 https://github.com/yfinkelstein/node-zookeeper/issues/6



 --
 This message was sent by Atlassian JIRA
 (v6.1.4#6159)



[jira] [Commented] (ZOOKEEPER-1388) Client side 'PathValidation' is missing for the multi-transaction api.

2013-12-16 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13850111#comment-13850111
 ] 

Marshall McMullen commented on ZOOKEEPER-1388:
--

I'm not too familiar with how the client-side path validation works in the Java 
client code. We don't do anything similar to that in the C client code (that 
I'm aware of). Can someone explain how that is safe? If the client is connected 
to a server that does not have a fully sync'd copy of the database, then the 
client may preemptively fail the multi-op, whereas if it had forwarded the 
entire multi-op to the server it would have succeeded. 

It's really important to understand that the original design we followed with a 
multi-op was to treat it as a transaction (write operation) rather than a read 
operation. My understanding of zab is that, as a transaction/write operation, it 
must be forwarded on to the leader rather than acted on locally, so that the 
leader can broadcast the transaction to the entire ensemble for consideration. 

If the client does any path validation locally, that seems like a violation of 
the zab protocol as I understand it.

Someone feel free to correct me if I am misunderstanding things.
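
For reference, this is roughly how a multi is assembled and shipped through the 
C client: the whole batch travels to the server (and on to the leader) as a 
single request, which is the behavior described above. The paths, value, and 
version number are made up for illustration:

{code}
#include <string.h>
#include <zookeeper/zookeeper.h>

/* Build a two-op transaction: check a guard node's version, then create a
 * child. Nothing is evaluated locally; zoo_multi() submits the whole batch. */
int create_guarded(zhandle_t *zh)
{
    zoo_op_t ops[2];
    zoo_op_result_t results[2];
    char path_buf[128];

    zoo_check_op_init(&ops[0], "/guard", 3 /* expected version */);
    zoo_create_op_init(&ops[1], "/guard/child", "payload", 7,
                       &ZOO_OPEN_ACL_UNSAFE, 0,
                       path_buf, (int)sizeof(path_buf));

    return zoo_multi(zh, 2, ops, results);
}
{code}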

 Client side 'PathValidation' is missing for the multi-transaction api.
 --

 Key: ZOOKEEPER-1388
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1388
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.4.0
Reporter: Rakesh R
Assignee: Rakesh R
 Fix For: 3.4.6, 3.5.0

 Attachments: 0001-ZOOKEEPER-1388-trunk-version.patch, 
 0002-ZOOKEEPER-1388-trunk-version.patch, ZOOKEEPER-1388.patch, 
 ZOOKEEPER-1388.patch, ZOOKEEPER-1388.patch, ZOOKEEPER-1388_branch_3_4.patch


 Multi ops: Op.create(path,..), Op.delete(path, ..), Op.setData(path, ..), 
 Op.check(path, ...) apis are not performing the client side path validation 
 and the call will go to the server side and is throwing exception back to the 
 client. 
 It would be good to provide ZooKeeper client side path validation for the 
 multi transaction apis. Presently its getting err codes from the server, 
 which is also not properly conveying the cause.
 For example: When specified invalid znode path in Op.create, it giving the 
 following exception. This will not be useful to know the actual cause.
 {code}
 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1174)
   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1115)
 {code}



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (ZOOKEEPER-1388) Client side 'PathValidation' is missing for the multi-transaction api.

2013-12-16 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13850157#comment-13850157
 ] 

Marshall McMullen commented on ZOOKEEPER-1388:
--

After gaining much more clarity from [~rakeshr]'s comments, this patch looks 
good to me. +1.

 Client side 'PathValidation' is missing for the multi-transaction api.
 --

 Key: ZOOKEEPER-1388
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1388
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.4.0
Reporter: Rakesh R
Assignee: Rakesh R
 Fix For: 3.4.6, 3.5.0

 Attachments: 0001-ZOOKEEPER-1388-trunk-version.patch, 
 0002-ZOOKEEPER-1388-trunk-version.patch, ZOOKEEPER-1388.patch, 
 ZOOKEEPER-1388.patch, ZOOKEEPER-1388.patch, ZOOKEEPER-1388_branch_3_4.patch


 Multi ops: Op.create(path,..), Op.delete(path, ..), Op.setData(path, ..), 
 Op.check(path, ...) apis are not performing the client side path validation 
 and the call will go to the server side and is throwing exception back to the 
 client. 
 It would be good to provide ZooKeeper client side path validation for the 
 multi transaction apis. Presently its getting err codes from the server, 
 which is also not properly conveying the cause.
 For example: When specified invalid znode path in Op.create, it giving the 
 following exception. This will not be useful to know the actual cause.
 {code}
 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1174)
   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1115)
 {code}



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (ZOOKEEPER-1836) addrvec_next() fails to set next parameter if addrvec_hasnext() returns false

2013-12-12 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13847116#comment-13847116
 ] 

Marshall McMullen commented on ZOOKEEPER-1836:
--

Yes, that is what I intended for this to do. Nice catch. 

It would be great if you could submit a patch. If you can't, I'll look at this 
later this week.
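
For reference, a self-contained sketch of the fix under discussion: writing 
through the caller's pointer (memset) is visible to the caller, whereas 
reassigning the local parameter is not. The addrvec_t here is a simplified 
stand-in for the client's real structure, for illustration only:

{code}
#include <string.h>
#include <sys/socket.h>

/* Simplified stand-in for the C client's addrvec_t. */
typedef struct {
    struct sockaddr_storage *data;
    int count;
    int next;
} addrvec_t;

static int addrvec_hasnext(const addrvec_t *avec)
{
    return avec->next < avec->count;
}

static void addrvec_next(addrvec_t *avec, struct sockaddr_storage *next)
{
    if (!addrvec_hasnext(avec)) {
        if (next) {
            memset(next, 0, sizeof(*next));  /* zero the caller's storage */
        }
        return;
    }
    if (next) {
        *next = avec->data[avec->next];
    }
    avec->next++;
}
{code}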

 addrvec_next() fails to set next parameter if addrvec_hasnext() returns false
 -

 Key: ZOOKEEPER-1836
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1836
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Reporter: Dutch T. Meyer
Priority: Trivial

 There is a relatively innocuous but useless pointer assignment in
 addrvec_next():
 195   void addrvec_next(addrvec_t *avec, struct sockaddr_storage *next)
 
 203   if (!addrvec_hasnext(avec))
 204   {
 205   next = NULL;
 206   return;
 That assignment on (205) has no point, as next is a local variable lost upon 
 function return.  Likely this should be a memset to zero out the actual 
 parameter.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


ZOOKEEPER-1732 workaround?

2013-10-03 Thread Marshall McMullen
We've just hit ZOOKEEPER-1732 (server cannot join an established ensemble
with quorum) and are trying to find a workaround since we do not have that
patch applied. Has anyone had any success working around this issue?
Perhaps restarting all ZK servers to force a new round of leader election?
Any other ideas? Really appreciate any advice...

Thanks!


[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-03 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13785329#comment-13785329
 ] 

Marshall McMullen commented on ZOOKEEPER-1732:
--

We've just run into this issue running tip of trunk 3.5.0 *without* this patch 
applied. Are there any proposed workarounds to this problem? I tried removing 
the stuck node from the ensemble and adding another node in as a replacement 
but it is now hitting the same problem... It can't join the ensemble either. 
I'm considering restarting all zookeeper servers in the hopes that a new round 
of leader election will reset things. Does this sound safe? Are there any other 
alternatives? Really appreciate any help.

 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Assignee: Germán Blanco
 Fix For: 3.4.6, 3.5.0

 Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, 
 ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, 
 ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch


 I have a test in which I do a rolling restart of three ZooKeeper servers and 
 it was failing from time to time.
 I ran the tests in a loop until the failure came out and it seems that at 
 some point one of the servers is unable to join the ensemble formed by the 
 other two.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-03 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13785345#comment-13785345
 ] 

Marshall McMullen commented on ZOOKEEPER-1732:
--

Flavio, that suggestion worked perfectly! Simply restarting the leader caused a 
new round of leader election and things sorted themselves out within a few 
seconds. Thank you so much for such a prompt reply. Love this community! 

 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Assignee: Germán Blanco
Priority: Critical
 Fix For: 3.4.6, 3.5.0

 Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, 
 ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, 
 ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch


 I have a test in which I do a rolling restart of three ZooKeeper servers and 
 it was failing from time to time.
 I ran the tests in a loop until the failure came out and it seems that at 
 some point one of the servers is unable to join the ensemble formed by the 
 other two.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


Re: ZOOKEEPER-1732 workaround?

2013-10-03 Thread Marshall McMullen
Flavio made a suggestion on the jira to simply restart the current leader.
It worked! Within a few seconds everything sorted itself out. Thanks so
much for the amazingly prompt reply. Love this community! :).


On Thu, Oct 3, 2013 at 10:46 AM, Marshall McMullen 
marshall.mcmul...@gmail.com wrote:

 We've just hit ZOOKEEPER-1732 (server cannot join an established ensemble
 with quorum) and are trying to find a workaround since we do not have that
 patch applied. Has anyone had any success working around this issue?
 Perhaps restarting all ZK servers to force a new round of leader election?
 Any other ideas? Really appreciate any advice...

 Thanks!



[jira] [Commented] (ZOOKEEPER-1519) Zookeeper Async calls can reference free()'d memory

2013-10-03 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13785884#comment-13785884
 ] 

Marshall McMullen commented on ZOOKEEPER-1519:
--

I agree with Flavio... All the work I did in the C client followed the same 
contract: the caller owns the memory, not the C client. This is typical of all 
async interfaces I've used (e.g. asio, etc.). 
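
A minimal sketch of that contract with zoo_acreate(): the caller allocates the 
payload, keeps it alive until the completion fires, and releases it there. The 
path and payload are made up for illustration:

{code}
#include <stdlib.h>
#include <string.h>
#include <zookeeper/zookeeper.h>

/* The completion frees the buffer the caller allocated and passed along
 * as the completion's data pointer; the library never owns it. */
static void create_done(int rc, const char *value, const void *data)
{
    (void)rc;
    (void)value;
    free((void *)data);
}

int create_async(zhandle_t *zh)
{
    char *payload = strdup("{\"state\": \"ready\"}");  /* caller-owned */
    if (payload == NULL)
        return ZSYSTEMERROR;

    int rc = zoo_acreate(zh, "/example", payload, (int)strlen(payload),
                         &ZOO_OPEN_ACL_UNSAFE, 0, create_done, payload);
    if (rc != ZOK)
        free(payload);  /* request was never queued; still caller-owned */
    return rc;
}
{code}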

 Zookeeper Async calls can reference free()'d memory
 ---

 Key: ZOOKEEPER-1519
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1519
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.3.3, 3.3.6
 Environment: Ubuntu 11.10, Ubuntu packaged Zookeeper 3.3.3 with some 
 backported fixes.
Reporter: Mark Gius
Assignee: Daniel Lescohier
 Fix For: 3.4.6, 3.5.0

 Attachments: zookeeper-1519.patch


 zoo_acreate() and zoo_aset() take a char * argument for data and prepare a 
 call to zookeeper.  This char * doesn't seem to be duplicated at any point, 
 making it possible that the caller of the asynchronous function might 
 potentially free() the char * argument before the zookeeper library completes 
 its request.  This is unlikely to present a real problem unless the freed 
 memory is re-used before zookeeper consumes it.  I've been unable to 
 reproduce this issue using pure C as a result.
 However, ZKPython is a whole different story.  Consider this snippet:
   ok = zookeeper.acreate(handle, path, json.dumps(value), 
  acl, flags, callback)
   assert ok == zookeeper.OK
 In this snippet, json.dumps() allocates a string which is passed into the 
 acreate().  When acreate() returns, the zookeeper request has been 
 constructed with a pointer to the string allocated by json.dumps().  Also 
 when acreate() returns, that string is now referenced by 0 things (ZKPython 
 doesn't bump the refcount) and the string is eligible for garbage collection 
 and re-use.  The Zookeeper request now has a pointer to dangerous freed 
 memory.
 I've been seeing odd behavior in our development environments for some time 
 now, where it appeared as though two separate JSON payloads had been joined 
 together.  Python has been allocating a new JSON string in the middle of the 
 old string that an incomplete zookeeper async call had not yet processed.
 I am not sure if this is a behavior that should be documented, or if the C 
 binding implementation needs to be updated to create copies of the data 
 payload provided for aset and acreate.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (ZOOKEEPER-1096) Leader communication should listen on specified IP, not wildcard address

2013-09-26 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778511#comment-13778511
 ] 

Marshall McMullen commented on ZOOKEEPER-1096:
--

Thanks Germán and Flavio! Really nice job finishing this one up.

 Leader communication should listen on specified IP, not wildcard address
 

 Key: ZOOKEEPER-1096
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1096
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.3.3, 3.4.0
Reporter: Jared Cantwell
Assignee: Germán Blanco
Priority: Minor
 Fix For: 3.5.0, 3.4.6

 Attachments: ZOOKEEPER-1096_branch3.4.patch, 
 ZOOKEEPER-1096_branch3.4.patch, ZOOKEEPER-1096_branch3.4.patch, 
 ZOOKEEPER-1096.patch, ZOOKEEPER-1096.patch, ZOOKEEPER-1096.patch, 
 ZOOKEEPER-1096.patch, ZOOKEEPER-1096.patch


 Server should specify the local address that is used for leader communication 
 and leader election (and not use the default of listening on all interfaces). 
  This is similar to the clientPortAddress parameter that was added a year 
 ago.  After reviewing the code, we can't think of a reason why only the port 
 would be used with the wildcard interface, when servers are already 
 connecting specifically to that interface anyway.
 I have submitted a patch, but it does not account for all leader election 
 algorithms.
 Probably should have an option to toggle this, for backwards compatibility, 
 although it seems like it would be a bug if this change broke things.
 There is some more information about making it an option here:
 http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-dev/201008.mbox/%3CAANLkTikkT97Djqt3CU=h2+7gnj_4p28hgcxjh345h...@mail.gmail.com%3E

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (ZOOKEEPER-1760) Provide an interface for check version of a node

2013-09-24 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13777135#comment-13777135
 ] 

Marshall McMullen commented on ZOOKEEPER-1760:
--

I agree with Flavio and Benjamin as well. The multi requires a check op so that 
it can detect race conditions between related paths atomically; you can't do 
that at all without a multi, so the standalone use case doesn't make sense to 
me. And as others already said, you can get what you want with Stat. In our own 
local wrapper sitting on top of the ZooKeeper client, we've added all sorts of 
convenience methods like this, e.g. GetVersion, GetNumChildren, etc., which 
are all implemented via a call to stat.
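
A sketch of the kind of wrapper described, using the C client for consistency 
with the rest of this archive; get_version() is a hypothetical convenience 
helper, not a ZooKeeper API:

{code}
#include <zookeeper/zookeeper.h>

/* Hypothetical convenience helper: read a node's version from its Stat. */
int get_version(zhandle_t *zh, const char *path, int32_t *version)
{
    struct Stat stat;
    int rc = zoo_exists(zh, path, 0 /* no watch */, &stat);
    if (rc == ZOK)
        *version = stat.version;
    return rc;
}

/* The version can then feed an atomic guard inside a multi, e.g.
 * zoo_check_op_init(&op, path, version); */
{code}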

 Provide an interface for check version of a node
 

 Key: ZOOKEEPER-1760
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1760
 Project: ZooKeeper
  Issue Type: New Feature
  Components: java client
Reporter: Rakesh R
Assignee: Rakesh R
 Fix For: 3.5.0


 The idea of this JIRA is to discuss the check version interface which is used 
 to see the existence of a node for the specified version. Presently only 
 multi transaction api has this interface, this umbrella JIRA is to make 
 'check version' api part of ZooKeeper# main apis and cli command.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (ZOOKEEPER-1096) Leader communication should listen on specified IP, not wildcard address

2013-09-17 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769565#comment-13769565
 ] 

Marshall McMullen commented on ZOOKEEPER-1096:
--

This latest version looks really good. I especially like that it's configured 
via the configuration file. Nicely done.

+1 from me.

 Leader communication should listen on specified IP, not wildcard address
 

 Key: ZOOKEEPER-1096
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1096
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.3.3, 3.4.0
Reporter: Jared Cantwell
Assignee: Jared Cantwell
Priority: Minor
 Fix For: 3.5.0, 3.4.6

 Attachments: ZOOKEEPER-1096_branch3.4.patch, 
 ZOOKEEPER-1096_branch3.4.patch, ZOOKEEPER-1096.patch, ZOOKEEPER-1096.patch, 
 ZOOKEEPER-1096.patch, ZOOKEEPER-1096.patch


 Server should specify the local address that is used for leader communication 
 and leader election (and not use the default of listening on all interfaces). 
  This is similar to the clientPortAddress parameter that was added a year 
 ago.  After reviewing the code, we can't think of a reason why only the port 
 would be used with the wildcard interface, when servers are already 
 connecting specifically to that interface anyway.
 I have submitted a patch, but it does not account for all leader election 
 algorithms.
 Probably should have an option to toggle this, for backwards compatibility, 
 although it seems like it would be a bug if this change broke things.
 There is some more information about making it an option here:
 http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-dev/201008.mbox/%3CAANLkTikkT97Djqt3CU=h2+7gnj_4p28hgcxjh345h...@mail.gmail.com%3E

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (ZOOKEEPER-1096) Leader communication should listen on specified IP, not wildcard address

2013-09-17 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13770029#comment-13770029
 ] 

Marshall McMullen commented on ZOOKEEPER-1096:
--

Flavio brings up a great point. It seems like most users would want to change 
both FLE and ZAB rather than one or the other separately, so I like the idea of 
making this a single property.

 Leader communication should listen on specified IP, not wildcard address
 

 Key: ZOOKEEPER-1096
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1096
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.3.3, 3.4.0
Reporter: Jared Cantwell
Assignee: Jared Cantwell
Priority: Minor
 Fix For: 3.5.0, 3.4.6

 Attachments: ZOOKEEPER-1096_branch3.4.patch, 
 ZOOKEEPER-1096_branch3.4.patch, ZOOKEEPER-1096.patch, ZOOKEEPER-1096.patch, 
 ZOOKEEPER-1096.patch, ZOOKEEPER-1096.patch


 Server should specify the local address that is used for leader communication 
 and leader election (and not use the default of listening on all interfaces). 
  This is similar to the clientPortAddress parameter that was added a year 
 ago.  After reviewing the code, we can't think of a reason why only the port 
 would be used with the wildcard interface, when servers are already 
 connecting specifically to that interface anyway.
 I have submitted a patch, but it does not account for all leader election 
 algorithms.
 Probably should have an option to toggle this, for backwards compatibility, 
 although it seems like it would be a bug if this change broke things.
 There is some more information about making it an option here:
 http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-dev/201008.mbox/%3CAANLkTikkT97Djqt3CU=h2+7gnj_4p28hgcxjh345h...@mail.gmail.com%3E

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (ZOOKEEPER-1096) Leader communication should listen on specified IP, not wildcard address

2013-09-13 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766511#comment-13766511
 ] 

Marshall McMullen commented on ZOOKEEPER-1096:
--

+1 for using the config file to configure the ports and any related behavior, 
as this matches the way we configure client ports and, IMO, is a lot easier to 
use and deploy at scale than Java properties.
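
For context, a sketch of what the config-file approach looks like in zoo.cfg; 
the quorumListenOnAllIPs property exists in released ZooKeeper, but treat its 
exact relationship to the patch under review here as an assumption:

{code}
# Each server line already carries the address used for quorum (2888) and
# leader election (3888) traffic; the change under discussion binds the
# listeners to that address instead of the wildcard.
server.1=192.168.1.10:2888:3888
server.2=192.168.1.11:2888:3888
server.3=192.168.1.12:2888:3888

# Opt back into listening on all interfaces if needed.
quorumListenOnAllIPs=true
{code}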

 Leader communication should listen on specified IP, not wildcard address
 

 Key: ZOOKEEPER-1096
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1096
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.3.3, 3.4.0
Reporter: Jared Cantwell
Assignee: Jared Cantwell
Priority: Minor
 Fix For: 3.5.0, 3.4.6

 Attachments: ZOOKEEPER-1096_branch3.4.patch, ZOOKEEPER-1096.patch, 
 ZOOKEEPER-1096.patch, ZOOKEEPER-1096.patch


 Server should specify the local address that is used for leader communication 
 and leader election (and not use the default of listening on all interfaces). 
  This is similar to the clientPortAddress parameter that was added a year 
 ago.  After reviewing the code, we can't think of a reason why only the port 
 would be used with the wildcard interface, when servers are already 
 connecting specifically to that interface anyway.
 I have submitted a patch, but it does not account for all leader election 
 algorithms.
 Probably should have an option to toggle this, for backwards compatibility, 
 although it seems like it would be a bug if this change broke things.
 There is some more information about making it an option here:
 http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-dev/201008.mbox/%3CAANLkTikkT97Djqt3CU=h2+7gnj_4p28hgcxjh345h...@mail.gmail.com%3E

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

