Re: [ANNOUNCE] New ZooKeeper committer: Michael Han

2017-01-03 Thread Marshall McMullen
Congrats Michael! Well deserved.

On Tue, Jan 3, 2017 at 1:16 PM, Abraham Fine  wrote:

> Congratulations Michael!
>
> On Tue, Jan 3, 2017, at 11:40, Jordan Zimmerman wrote:
> > Saludos!
> >
> > > On Jan 3, 2017, at 2:29 PM, Patrick Hunt  wrote:
> > >
> > > The Apache ZooKeeper PMC recently extended committer karma to Michael
> and
> > > he has accepted. Michael has made some great contributions and we are
> > > looking forward to even more :)
> > >
> > > Congratulations and welcome aboard, Michael!
> > > Patrick
> >
>


[jira] [Commented] (ZOOKEEPER-2455) unexpected server response ZRUNTIMEINCONSISTENCY

2016-06-29 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355782#comment-15355782
 ] 

Marshall McMullen commented on ZOOKEEPER-2455:
--

Oh, neat! I was not aware of that. Thanks for filling in the gaps for me, Alex.

> unexpected server response ZRUNTIMEINCONSISTENCY
> 
>
> Key: ZOOKEEPER-2455
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2455
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.5.1
>Reporter: pradeep
> Fix For: 3.5.3, 3.6.0
>
>
> Hi Folks,
> I am hitting an error in my C client code. Below is the set of operations 
> I perform:
>   1.  The ZooKeeper client is connected to ZooKeeper server S1 and a new server S2 
> gets added.
>   2.  Monitor the ZooKeeper server config at the client and, on a change of the 
> server config, call zoo_set_servers from the client.
>   3.  The client can issue operations like zoo_get just after the call to 
> zoo_set_servers.
>   4.  I can see that the ZooKeeper thread logs a connection to the new server just 
> after the zoo_get call:
> 2016-04-11 03:46:50,655:1207(0xf26ffb40):ZOO_INFO@check_events@2345: 
> initiated connection
> to server [128.0.0.5:61728]
> 2016-04-11 03:46:50,658:1207(0xf26ffb40):ZOO_INFO@check_events@2397: session 
> establishment
> complete on server [128.0.0.5:61728], sessionId=0x401852c000c, negotiated 
> timeout=2
>   5.  Sometimes I find errors like the one below:
> 2016-04-11 
> 03:46:50,662:1207(0xf26ffb40):ZOO_ERROR@handle_socket_error_msg@2923: Socket 
> [128.0.0.5:61728]
> zk retcode=-2, errno=115(Operation now in progress): unexpected server 
> response: expected
> 0x570b82fa, but received 0x570b82f9
> zoo_get returns (-2), indicating 
> ZRUNTIMEINCONSISTENCY<http://zookeeper.sourcearchive.com/documentation/3.2.2plus-pdfsg3/zookeeper_8h_bb1a0a179f313b2e44ee92369c438a4c.html#bb1a0a179f313b2e44ee92369c438a4c9eabb281ab14c74db3aff9ab456fa7fe>
> What is the issue here? Should I retry the zoo_get operation, or should I wait 
> for the zoo_set_servers call to complete (i.e. wait for the connection 
> establishment notification)?
> Thanks,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2455) unexpected server response ZRUNTIMEINCONSISTENCY

2016-06-29 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355566#comment-15355566
 ] 

Marshall McMullen commented on ZOOKEEPER-2455:
--

I'm confused. I didn't think you could do a dynamic reconfig from a single server. A 
single server is what's called "standalone" mode, whereas three or more puts you into 
"quorum" mode, and you cannot cross between those two stacks. Perhaps there was a 
change made in the reconfig code that I'm not aware of that lets you do this, but I 
don't think so. [~shralex] would be able to say for certain. Are you calling 
zoo_set_servers and giving it a new server that's not part of the ensemble? That 
would certainly cause this problem. Come to think of it, I don't think there's any 
protection against that sort of misuse.
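
For anyone hitting this, a minimal sketch of the "wait for the connection 
establishment notification" approach could look like the following (illustrative 
only; it assumes the multi-threaded C client, the helper names are made up, and 
the race between zoo_set_servers() and the resulting session events is glossed 
over):

{code}
/* Sketch only: block operations until the session is (re)established
 * after zoo_set_servers(). Error handling is mostly elided. */
#include <zookeeper/zookeeper.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int connected = 0;

static void session_watcher(zhandle_t *zh, int type, int state,
                            const char *path, void *ctx)
{
    if (type == ZOO_SESSION_EVENT) {
        pthread_mutex_lock(&lock);
        connected = (state == ZOO_CONNECTED_STATE);
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
    }
}

static void wait_for_connected(void)
{
    pthread_mutex_lock(&lock);
    while (!connected)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    zhandle_t *zh = zookeeper_init("s1:2181", session_watcher, 30000,
                                   NULL, NULL, 0);
    if (zh == NULL)
        return EXIT_FAILURE;
    wait_for_connected();

    /* Ensemble membership changed; hand the client the new list ...   */
    zoo_set_servers(zh, "s1:2181,s2:2181");
    /* ... and wait for the session event before issuing further reads. */
    wait_for_connected();

    char buf[512];
    int len = sizeof(buf);
    zoo_get(zh, "/mynode", 0, buf, &len, NULL);

    zookeeper_close(zh);
    return 0;
}
{code}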

> unexpected server response ZRUNTIMEINCONSISTENCY
> 
>
> Key: ZOOKEEPER-2455
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2455
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.5.1
>Reporter: pradeep
> Fix For: 3.5.3, 3.6.0
>
>
> Hi Folks,
> I am hitting an error in my C client code. Below is the set of operations 
> I perform:
>   1.  The ZooKeeper client is connected to ZooKeeper server S1 and a new server S2 
> gets added.
>   2.  Monitor the ZooKeeper server config at the client and, on a change of the 
> server config, call zoo_set_servers from the client.
>   3.  The client can issue operations like zoo_get just after the call to 
> zoo_set_servers.
>   4.  I can see that the ZooKeeper thread logs a connection to the new server just 
> after the zoo_get call:
> 2016-04-11 03:46:50,655:1207(0xf26ffb40):ZOO_INFO@check_events@2345: 
> initiated connection
> to server [128.0.0.5:61728]
> 2016-04-11 03:46:50,658:1207(0xf26ffb40):ZOO_INFO@check_events@2397: session 
> establishment
> complete on server [128.0.0.5:61728], sessionId=0x401852c000c, negotiated 
> timeout=2
>   5.  Sometimes I find errors like the one below:
> 2016-04-11 
> 03:46:50,662:1207(0xf26ffb40):ZOO_ERROR@handle_socket_error_msg@2923: Socket 
> [128.0.0.5:61728]
> zk retcode=-2, errno=115(Operation now in progress): unexpected server 
> response: expected
> 0x570b82fa, but received 0x570b82f9
> zoo_get returns (-2), indicating 
> ZRUNTIMEINCONSISTENCY<http://zookeeper.sourcearchive.com/documentation/3.2.2plus-pdfsg3/zookeeper_8h_bb1a0a179f313b2e44ee92369c438a4c.html#bb1a0a179f313b2e44ee92369c438a4c9eabb281ab14c74db3aff9ab456fa7fe>
> What is the issue here? Should I retry the zoo_get operation, or should I wait 
> for the zoo_set_servers call to complete (i.e. wait for the connection 
> establishment notification)?
> Thanks,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-1485) client xid overflow is not handled

2016-06-08 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321602#comment-15321602
 ] 

Marshall McMullen commented on ZOOKEEPER-1485:
--

Assigning this to [~makuchta] as he's been working this issue for us.

> client xid overflow is not handled
> --
>
> Key: ZOOKEEPER-1485
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1485
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client, java client
>Affects Versions: 3.4.3, 3.3.5
>Reporter: Michi Mutsuzaki
>Assignee: Martin Kuchta
>
> Both Java and C clients use signed 32-bit int as XIDs. XIDs are assumed to be 
> non-negative, and zookeeper uses some negative values as special XIDs (e.g. 
> -2 for ping, -4 for auth). However, neither Java nor C client ensures the 
> XIDs it generates are non-negative, and the server doesn't reject negative 
> XIDs.
> Pat had some suggestions on how to fix this:
> - (bin-compat) Expire the session when the client sends a negative XID.
> - (bin-incompat) In addition to expiring the session, use 64-bit int for XID 
> so that overflow will practically never happen.
> --Michi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (ZOOKEEPER-1485) client xid overflow is not handled

2016-06-08 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-1485:
-
Assignee: Martin Kuchta  (was: Bruce Gao)

> client xid overflow is not handled
> --
>
> Key: ZOOKEEPER-1485
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1485
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client, java client
>Affects Versions: 3.4.3, 3.3.5
>Reporter: Michi Mutsuzaki
>Assignee: Martin Kuchta
>
> Both Java and C clients use signed 32-bit int as XIDs. XIDs are assumed to be 
> non-negative, and zookeeper uses some negative values as special XIDs (e.g. 
> -2 for ping, -4 for auth). However, neither Java nor C client ensures the 
> XIDs it generates are non-negative, and the server doesn't reject negative 
> XIDs.
> Pat had some suggestions on how to fix this:
> - (bin-compat) Expire the session when the client sends a negative XID.
> - (bin-incompat) In addition to expiring the session, use 64-bit int for XID 
> so that overflow will practically never happen.
> --Michi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Unable to contribute on JIRA

2016-06-08 Thread Marshall McMullen
Yep, it works now. I was able to assign the Jira to Martin without problems.
Again, thanks.

On Wed, Jun 8, 2016 at 4:33 PM, Marshall McMullen <
marshall.mcmul...@gmail.com> wrote:

> Thank you very much for the assistance, Patrick.
>
> On Wed, Jun 8, 2016 at 4:32 PM, Patrick Hunt <ph...@apache.org> wrote:
>
>> I've added Martin as a contributor, give it another try.
>>
>> Patrick
>>
>> On Wed, Jun 8, 2016 at 3:21 PM, Marshall McMullen <
>> marshall.mcmul...@gmail.com> wrote:
>>
>> > That makes sense. I would appreciate it if a committer could change Martin's
>> > role to contributor. Otherwise we'll reach out to the Infra team to get
>> > some assistance on that.
>> >
>> > Thanks!
>> >
>> > On Wed, Jun 8, 2016 at 4:04 PM, Michael Han <h...@cloudera.com> wrote:
>> >
>> > > I think someone (probably only a committer) just needs to give Martin the
>> > > 'contributor' role.
>> > >
>> > > The best way to contact Apache Infra is through their Hipchat channel
>> > > http://www.apache.org/dev/infra-contact
>> > >
>> > > On Wed, Jun 8, 2016 at 3:01 PM, Marshall McMullen <
>> > > marshall.mcmul...@gmail.com> wrote:
>> > >
>> > > > Should Martin contact the "Apache Infrastructure Team" regarding
>> this?
>> > If
>> > > > so, how does he do that?
>> > > >
>> > > > On Wed, Jun 8, 2016 at 4:00 PM, Marshall McMullen <
>> > > > marshall.mcmul...@gmail.com> wrote:
>> > > >
>> > > > > I tried to assign this Jira to him and got an error message back:
>> > > > >
>> > > > > User 'makuchta' cannot be assigned issues.
>> > > > >
>> > > > > On Wed, Jun 8, 2016 at 3:58 PM, Michael Han <h...@cloudera.com>
>> > wrote:
>> > > > >
>> > > > >> Martin,
>> > > > >>
>> > > > >> I had met a similar issue earlier; here is an email I sent earlier to the
>> > > > >> dev list:
>> > > > >>
>> > > > >> >>
>> > > > >> FYI, I met an issue today where I couldn't attach files to a JIRA issue
>> > > > >> with the role of 'contributor'. I contacted the Apache Infrastructure team
>> > > > >> and confirmed that:
>> > > > >>
>> > > > >> - For a given JIRA issue, only the *reporter*, the *assignee*, or a
>> > > > >> *committer* can attach files.
>> > > > >> - A contributor can only attach files to issues that are assigned to
>> > > > >> and/or reported by the contributor.
>> > > > >> - A workaround for a contributor to attach files to any issue is to first
>> > > > >> change the assignee to the contributor, then attach the files, then change
>> > > > >> the assignee back.
>> > > > >> >>
>> > > > >>
>> > > > >> I think someone just needs to assign ZOOKEEPER-2355 to you since you are
>> > > > >> working on it.
>> > > > >>
>> > > > >> On Wed, Jun 8, 2016 at 2:34 PM, Martin Kuchta <
>> > > mar...@martinkuchta.com>
>> > > > >> wrote:
>> > > > >>
>> > > > >> > Hi,
>> > > > >> >
>> > > > >> > Does anyone know if I need to do anything special to have the
>> > > ability
>> > > > to
>> > > > >> > submit attachments and be assigned issues on JIRA? I was
>> recently
>> > > > >> trying to
>> > > > >> > submit a patch for ZOOKEEPER-2355 and realized the option was
>> > > missing
>> > > > >> for
>> > > > >> > me. It's not present on any other ZooKeeper JIRAs that I can
>> see,
>> > > > >> although
>> > > > >> > I can see it on JIRAs from other Apache projects.
>> > > > >> >
>> > > > >> > I was working with Marshall McMullen to get the patch
>> submitted,
>> > and
>> > > > our
>> > > > >> > first thought was that the issue might need to be assigned to
>> me,
>> > > but
>> > > > >> even
>> > > > >> > though he was able to reassign the issue, I was not a valid
>> user
>> > to
>> > > > >> assign
>> > > > >> > it to.
>> > > > >> >
>> > > > >> > My account username is makuchta. I created it almost two weeks
>> ago
>> > > if
>> > > > >> > that's of any relevance.
>> > > > >> >
>> > > > >> >
>> > > > >> > Thanks,
>> > > > >> >
>> > > > >> > Martin
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >> --
>> > > > >> Cheers
>> > > > >> Michael.
>> > > > >>
>> > > > >
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Cheers
>> > > Michael.
>> > >
>> >
>>
>
>


[jira] [Updated] (ZOOKEEPER-2355) Ephemeral node is never deleted if follower fails while reading the proposal packet

2016-06-08 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2355:
-
Assignee: Martin Kuchta  (was: Marshall McMullen)

> Ephemeral node is never deleted if follower fails while reading the proposal 
> packet
> ---
>
> Key: ZOOKEEPER-2355
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum, server
>Reporter: Arshad Mohammad
>Assignee: Martin Kuchta
>Priority: Critical
> Fix For: 3.4.9
>
> Attachments: ZOOKEEPER-2355-01.patch, ZOOKEEPER-2355-02.patch, 
> ZOOKEEPER-2355-03.patch
>
>
> ZooKeeper ephemeral node is never deleted if a follower fails while reading the 
> proposal packet.
> The scenario is as follows:
> # Configure a three-node ZooKeeper cluster, let's say nodes A, B and C; 
> start all, and assume A is the leader and B and C are followers.
> # Connect to any of the servers and create ephemeral node /e1.
> # Close the session; ephemeral node /e1 will go for deletion.
> # While it is receiving the delete proposal, make Follower B fail with a 
> {{SocketTimeoutException}}. We need to do this to reproduce the scenario; 
> in a production environment it happens because of a network fault.
> # Remove the fault and check that the faulted follower is now connected to the 
> quorum.
> # Connect to any of the servers and create the same ephemeral node /e1; creation 
> succeeds.
> # Close the session; ephemeral node /e1 will go for deletion.
> # {color:red}/e1 is not deleted from the faulted Follower B. It should have 
> been deleted, as it was created again with another session.{color}
> # {color:green}/e1 is deleted from Leader A and the other Follower C.{color}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Unable to contribute on JIRA

2016-06-08 Thread Marshall McMullen
Thank you very much for the assistance, Patrick.

On Wed, Jun 8, 2016 at 4:32 PM, Patrick Hunt <ph...@apache.org> wrote:

> I've added Martin as a contributor, give it another try.
>
> Patrick
>
> On Wed, Jun 8, 2016 at 3:21 PM, Marshall McMullen <
> marshall.mcmul...@gmail.com> wrote:
>
> > That makes sense. I would appreciate it if a committer could change Martin's
> > role to contributor. Otherwise we'll reach out to the Infra team to get
> > some assistance on that.
> >
> > Thanks!
> >
> > On Wed, Jun 8, 2016 at 4:04 PM, Michael Han <h...@cloudera.com> wrote:
> >
> > > I think someone (probably only a committer) just needs to give Martin the
> > > 'contributor' role.
> > >
> > > The best way to contact Apache Infra is through their Hipchat channel
> > > http://www.apache.org/dev/infra-contact
> > >
> > > On Wed, Jun 8, 2016 at 3:01 PM, Marshall McMullen <
> > > marshall.mcmul...@gmail.com> wrote:
> > >
> > > > Should Martin contact the "Apache Infrastructure Team" regarding
> this?
> > If
> > > > so, how does he do that?
> > > >
> > > > On Wed, Jun 8, 2016 at 4:00 PM, Marshall McMullen <
> > > > marshall.mcmul...@gmail.com> wrote:
> > > >
> > > > > I tried to assign this Jira to him and got an error message back:
> > > > >
> > > > > User 'makuchta' cannot be assigned issues.
> > > > >
> > > > > On Wed, Jun 8, 2016 at 3:58 PM, Michael Han <h...@cloudera.com>
> > wrote:
> > > > >
> > > > >> Martin,
> > > > >>
> > > > >> I had met a similar issue earlier; here is an email I sent earlier to the
> > > > >> dev list:
> > > > >>
> > > > >> >>
> > > > >> FYI, I met an issue today where I couldn't attach files to a JIRA issue
> > > > >> with the role of 'contributor'. I contacted the Apache Infrastructure team
> > > > >> and confirmed that:
> > > > >>
> > > > >> - For a given JIRA issue, only the *reporter*, the *assignee*, or a
> > > > >> *committer* can attach files.
> > > > >> - A contributor can only attach files to issues that are assigned to
> > > > >> and/or reported by the contributor.
> > > > >> - A workaround for a contributor to attach files to any issue is to first
> > > > >> change the assignee to the contributor, then attach the files, then change
> > > > >> the assignee back.
> > > > >> >>
> > > > >>
> > > > >> I think someone just needs to assign ZOOKEEPER-2355 to you since you are
> > > > >> working on it.
> > > > >>
> > > > >> On Wed, Jun 8, 2016 at 2:34 PM, Martin Kuchta <
> > > mar...@martinkuchta.com>
> > > > >> wrote:
> > > > >>
> > > > >> > Hi,
> > > > >> >
> > > > >> > Does anyone know if I need to do anything special to have the
> > > ability
> > > > to
> > > > >> > submit attachments and be assigned issues on JIRA? I was
> recently
> > > > >> trying to
> > > > >> > submit a patch for ZOOKEEPER-2355 and realized the option was
> > > missing
> > > > >> for
> > > > >> > me. It's not present on any other ZooKeeper JIRAs that I can
> see,
> > > > >> although
> > > > >> > I can see it on JIRAs from other Apache projects.
> > > > >> >
> > > > >> > I was working with Marshall McMullen to get the patch submitted,
> > and
> > > > our
> > > > >> > first thought was that the issue might need to be assigned to
> me,
> > > but
> > > > >> even
> > > > >> > though he was able to reassign the issue, I was not a valid user
> > to
> > > > >> assign
> > > > >> > it to.
> > > > >> >
> > > > >> > My account username is makuchta. I created it almost two weeks
> ago
> > > if
> > > > >> > that's of any relevance.
> > > > >> >
> > > > >> >
> > > > >> > Thanks,
> > > > >> >
> > > > >> > Martin
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Cheers
> > > > >> Michael.
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Cheers
> > > Michael.
> > >
> >
>


Re: Unable to contribute on JIRA

2016-06-08 Thread Marshall McMullen
That makes sense. I would appreciate it if a committer could change Martin's
role to contributor. Otherwise we'll reach out to the Infra team to get
some assistance on that.

Thanks!

On Wed, Jun 8, 2016 at 4:04 PM, Michael Han <h...@cloudera.com> wrote:

> I think someone (probably only a committer) just needs to give Martin the
> 'contributor' role.
>
> The best way to contact Apache Infra is through their Hipchat channel
> http://www.apache.org/dev/infra-contact
>
> On Wed, Jun 8, 2016 at 3:01 PM, Marshall McMullen <
> marshall.mcmul...@gmail.com> wrote:
>
> > Should Martin contact the "Apache Infrastructure Team" regarding this? If
> > so, how does he do that?
> >
> > On Wed, Jun 8, 2016 at 4:00 PM, Marshall McMullen <
> > marshall.mcmul...@gmail.com> wrote:
> >
> > > I tried to assign this Jira to him and got an error message back:
> > >
> > > User 'makuchta' cannot be assigned issues.
> > >
> > > On Wed, Jun 8, 2016 at 3:58 PM, Michael Han <h...@cloudera.com> wrote:
> > >
> > >> Martin,
> > >>
> > >> I had met a similar issue earlier; here is an email I sent earlier to the
> > >> dev list:
> > >>
> > >> >>
> > >> FYI, I met an issue today where I couldn't attach files to a JIRA issue
> > >> with the role of 'contributor'. I contacted the Apache Infrastructure team
> > >> and confirmed that:
> > >>
> > >> - For a given JIRA issue, only the *reporter*, the *assignee*, or a
> > >> *committer* can attach files.
> > >> - A contributor can only attach files to issues that are assigned to
> > >> and/or reported by the contributor.
> > >> - A workaround for a contributor to attach files to any issue is to first
> > >> change the assignee to the contributor, then attach the files, then change
> > >> the assignee back.
> > >> >>
> > >>
> > >> I think someone just needs to assign ZOOKEEPER-2355 to you since you are
> > >> working on it.
> > >>
> > >> On Wed, Jun 8, 2016 at 2:34 PM, Martin Kuchta <
> mar...@martinkuchta.com>
> > >> wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > Does anyone know if I need to do anything special to have the
> ability
> > to
> > >> > submit attachments and be assigned issues on JIRA? I was recently
> > >> trying to
> > >> > submit a patch for ZOOKEEPER-2355 and realized the option was
> missing
> > >> for
> > >> > me. It's not present on any other ZooKeeper JIRAs that I can see,
> > >> although
> > >> > I can see it on JIRAs from other Apache projects.
> > >> >
> > >> > I was working with Marshall McMullen to get the patch submitted, and
> > our
> > >> > first thought was that the issue might need to be assigned to me,
> but
> > >> even
> > >> > though he was able to reassign the issue, I was not a valid user to
> > >> assign
> > >> > it to.
> > >> >
> > >> > My account username is makuchta. I created it almost two weeks ago
> if
> > >> > that's of any relevance.
> > >> >
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Martin
> > >>
> > >>
> > >>
> > >>
> > >> --
> > >> Cheers
> > >> Michael.
> > >>
> > >
> > >
> >
>
>
>
> --
> Cheers
> Michael.
>


Re: Unable to contribute on JIRA

2016-06-08 Thread Marshall McMullen
Should Martin contact the "Apache Infrastructure Team" regarding this? If
so, how does he do that?

On Wed, Jun 8, 2016 at 4:00 PM, Marshall McMullen <
marshall.mcmul...@gmail.com> wrote:

> I tried to assign this Jira to him and got an error message back:
>
> User 'makuchta' cannot be assigned issues.
>
> On Wed, Jun 8, 2016 at 3:58 PM, Michael Han <h...@cloudera.com> wrote:
>
>> Martin,
>>
>> I had met a similar issue earlier; here is an email I sent earlier to the
>> dev list:
>>
>> >>
>> FYI, I met an issue today where I couldn't attach files to a JIRA issue
>> with the role of 'contributor'. I contacted the Apache Infrastructure team
>> and confirmed that:
>>
>> - For a given JIRA issue, only the *reporter*, the *assignee*, or a
>> *committer* can attach files.
>> - A contributor can only attach files to issues that are assigned to
>> and/or reported by the contributor.
>> - A workaround for a contributor to attach files to any issue is to first
>> change the assignee to the contributor, then attach the files, then change
>> the assignee back.
>> >>
>>
>> I think someone just needs to assign ZOOKEEPER-2355 to you since you are
>> working on it.
>>
>> On Wed, Jun 8, 2016 at 2:34 PM, Martin Kuchta <mar...@martinkuchta.com>
>> wrote:
>>
>> > Hi,
>> >
>> > Does anyone know if I need to do anything special to have the ability to
>> > submit attachments and be assigned issues on JIRA? I was recently
>> trying to
>> > submit a patch for ZOOKEEPER-2355 and realized the option was missing
>> for
>> > me. It's not present on any other ZooKeeper JIRAs that I can see,
>> although
>> > I can see it on JIRAs from other Apache projects.
>> >
>> > I was working with Marshall McMullen to get the patch submitted, and our
>> > first thought was that the issue might need to be assigned to me, but
>> even
>> > though he was able to reassign the issue, I was not a valid user to
>> assign
>> > it to.
>> >
>> > My account username is makuchta. I created it almost two weeks ago if
>> > that's of any relevance.
>> >
>> >
>> > Thanks,
>> >
>> > Martin
>>
>>
>>
>>
>> --
>> Cheers
>> Michael.
>>
>
>


Re: Unable to contribute on JIRA

2016-06-08 Thread Marshall McMullen
I tried to assign this Jira to him and got an error message back:

User 'makuchta' cannot be assigned issues.

On Wed, Jun 8, 2016 at 3:58 PM, Michael Han <h...@cloudera.com> wrote:

> Martin,
>
> I had met a similar issue earlier; here is an email I sent earlier to the
> dev list:
>
> >>
> FYI, I met an issue today where I couldn't attach files to a JIRA issue
> with the role of 'contributor'. I contacted the Apache Infrastructure team
> and confirmed that:
>
> - For a given JIRA issue, only the *reporter*, the *assignee*, or a
> *committer* can attach files.
> - A contributor can only attach files to issues that are assigned to
> and/or reported by the contributor.
> - A workaround for a contributor to attach files to any issue is to first
> change the assignee to the contributor, then attach the files, then change
> the assignee back.
> >>
>
> I think someone just needs to assign ZOOKEEPER-2355 to you since you are
> working on it.
>
> On Wed, Jun 8, 2016 at 2:34 PM, Martin Kuchta <mar...@martinkuchta.com>
> wrote:
>
> > Hi,
> >
> > Does anyone know if I need to do anything special to have the ability to
> > submit attachments and be assigned issues on JIRA? I was recently trying
> to
> > submit a patch for ZOOKEEPER-2355 and realized the option was missing for
> > me. It's not present on any other ZooKeeper JIRAs that I can see,
> although
> > I can see it on JIRAs from other Apache projects.
> >
> > I was working with Marshall McMullen to get the patch submitted, and our
> > first thought was that the issue might need to be assigned to me, but
> even
> > though he was able to reassign the issue, I was not a valid user to
> assign
> > it to.
> >
> > My account username is makuchta. I created it almost two weeks ago if
> > that's of any relevance.
> >
> >
> > Thanks,
> >
> > Martin
>
>
>
>
> --
> Cheers
> Michael.
>


[jira] [Comment Edited] (ZOOKEEPER-2355) Ephemeral node is never deleted if follower fails while reading the proposal packet

2016-06-08 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321410#comment-15321410
 ] 

Marshall McMullen edited comment on ZOOKEEPER-2355 at 6/8/16 8:43 PM:
--

[~makuchta] - I'll leave you to investigate the failure reported above.


was (Author: marshall):
@makuchta - I'll leave you to investigate the failure reported above.

> Ephemeral node is never deleted if follower fails while reading the proposal 
> packet
> ---
>
> Key: ZOOKEEPER-2355
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum, server
>Reporter: Arshad Mohammad
>    Assignee: Marshall McMullen
>Priority: Critical
> Fix For: 3.4.9
>
> Attachments: ZOOKEEPER-2355-01.patch, ZOOKEEPER-2355-02.patch, 
> ZOOKEEPER-2355-03.patch
>
>
> ZooKeeper ephemeral node is never deleted if a follower fails while reading the 
> proposal packet.
> The scenario is as follows:
> # Configure a three-node ZooKeeper cluster, let's say nodes A, B and C; 
> start all, and assume A is the leader and B and C are followers.
> # Connect to any of the servers and create ephemeral node /e1.
> # Close the session; ephemeral node /e1 will go for deletion.
> # While it is receiving the delete proposal, make Follower B fail with a 
> {{SocketTimeoutException}}. We need to do this to reproduce the scenario; 
> in a production environment it happens because of a network fault.
> # Remove the fault and check that the faulted follower is now connected to the 
> quorum.
> # Connect to any of the servers and create the same ephemeral node /e1; creation 
> succeeds.
> # Close the session; ephemeral node /e1 will go for deletion.
> # {color:red}/e1 is not deleted from the faulted Follower B. It should have 
> been deleted, as it was created again with another session.{color}
> # {color:green}/e1 is deleted from Leader A and the other Follower C.{color}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2355) Ephemeral node is never deleted if follower fails while reading the proposal packet

2016-06-08 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15321410#comment-15321410
 ] 

Marshall McMullen commented on ZOOKEEPER-2355:
--

@makuchta - I'll leave you to investigate the failure reported above.

> Ephemeral node is never deleted if follower fails while reading the proposal 
> packet
> ---
>
> Key: ZOOKEEPER-2355
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum, server
>Reporter: Arshad Mohammad
>    Assignee: Marshall McMullen
>Priority: Critical
> Fix For: 3.4.9
>
> Attachments: ZOOKEEPER-2355-01.patch, ZOOKEEPER-2355-02.patch, 
> ZOOKEEPER-2355-03.patch
>
>
> ZooKeeper ephemeral node is never deleted if a follower fails while reading the 
> proposal packet.
> The scenario is as follows:
> # Configure a three-node ZooKeeper cluster, let's say nodes A, B and C; 
> start all, and assume A is the leader and B and C are followers.
> # Connect to any of the servers and create ephemeral node /e1.
> # Close the session; ephemeral node /e1 will go for deletion.
> # While it is receiving the delete proposal, make Follower B fail with a 
> {{SocketTimeoutException}}. We need to do this to reproduce the scenario; 
> in a production environment it happens because of a network fault.
> # Remove the fault and check that the faulted follower is now connected to the 
> quorum.
> # Connect to any of the servers and create the same ephemeral node /e1; creation 
> succeeds.
> # Close the session; ephemeral node /e1 will go for deletion.
> # {color:red}/e1 is not deleted from the faulted Follower B. It should have 
> been deleted, as it was created again with another session.{color}
> # {color:green}/e1 is deleted from Leader A and the other Follower C.{color}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (ZOOKEEPER-2355) Ephemeral node is never deleted if follower fails while reading the proposal packet

2016-06-08 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2355:
-
Attachment: ZOOKEEPER-2355-03.patch

Updated patch with Martin's proposed solution.

> Ephemeral node is never deleted if follower fails while reading the proposal 
> packet
> ---
>
> Key: ZOOKEEPER-2355
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum, server
>Reporter: Arshad Mohammad
>    Assignee: Marshall McMullen
>Priority: Critical
> Fix For: 3.4.9
>
> Attachments: ZOOKEEPER-2355-01.patch, ZOOKEEPER-2355-02.patch, 
> ZOOKEEPER-2355-03.patch
>
>
> ZooKeeper ephemeral node is never deleted if a follower fails while reading the 
> proposal packet.
> The scenario is as follows:
> # Configure a three-node ZooKeeper cluster, let's say nodes A, B and C; 
> start all, and assume A is the leader and B and C are followers.
> # Connect to any of the servers and create ephemeral node /e1.
> # Close the session; ephemeral node /e1 will go for deletion.
> # While it is receiving the delete proposal, make Follower B fail with a 
> {{SocketTimeoutException}}. We need to do this to reproduce the scenario; 
> in a production environment it happens because of a network fault.
> # Remove the fault and check that the faulted follower is now connected to the 
> quorum.
> # Connect to any of the servers and create the same ephemeral node /e1; creation 
> succeeds.
> # Close the session; ephemeral node /e1 will go for deletion.
> # {color:red}/e1 is not deleted from the faulted Follower B. It should have 
> been deleted, as it was created again with another session.{color}
> # {color:green}/e1 is deleted from Leader A and the other Follower C.{color}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (ZOOKEEPER-2355) Ephemeral node is never deleted if follower fails while reading the proposal packet

2016-06-08 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen reassigned ZOOKEEPER-2355:


Assignee: Marshall McMullen  (was: Arshad Mohammad)

> Ephemeral node is never deleted if follower fails while reading the proposal 
> packet
> ---
>
> Key: ZOOKEEPER-2355
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum, server
>Reporter: Arshad Mohammad
>    Assignee: Marshall McMullen
>Priority: Critical
> Fix For: 3.4.9
>
> Attachments: ZOOKEEPER-2355-01.patch, ZOOKEEPER-2355-02.patch
>
>
> ZooKeeper ephemeral node is never deleted if a follower fails while reading the 
> proposal packet.
> The scenario is as follows:
> # Configure a three-node ZooKeeper cluster, let's say nodes A, B and C; 
> start all, and assume A is the leader and B and C are followers.
> # Connect to any of the servers and create ephemeral node /e1.
> # Close the session; ephemeral node /e1 will go for deletion.
> # While it is receiving the delete proposal, make Follower B fail with a 
> {{SocketTimeoutException}}. We need to do this to reproduce the scenario; 
> in a production environment it happens because of a network fault.
> # Remove the fault and check that the faulted follower is now connected to the 
> quorum.
> # Connect to any of the servers and create the same ephemeral node /e1; creation 
> succeeds.
> # Close the session; ephemeral node /e1 will go for deletion.
> # {color:red}/e1 is not deleted from the faulted Follower B. It should have 
> been deleted, as it was created again with another session.{color}
> # {color:green}/e1 is deleted from Leader A and the other Follower C.{color}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-1485) client xid overflow is not handled

2016-05-31 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15307987#comment-15307987
 ] 

Marshall McMullen commented on ZOOKEEPER-1485:
--

[~fpj] - I agree we should fix ZOOKEEPER-22. Does it make sense to fix this 
case first and then come back to ZOOKEEPER-22? We should handle overflow safely 
either way, and in that regard I think ZOOKEEPER-22 would be good follow-on work 
to do after this one.

I think the issue that [~makuchta] brought up with regard to closing the session 
comes down to not understanding how the client reacts to having the session closed.

> client xid overflow is not handled
> --
>
> Key: ZOOKEEPER-1485
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1485
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client, java client
>Affects Versions: 3.4.3, 3.3.5
>Reporter: Michi Mutsuzaki
>Assignee: Bruce Gao
>
> Both Java and C clients use signed 32-bit int as XIDs. XIDs are assumed to be 
> non-negative, and zookeeper uses some negative values as special XIDs (e.g. 
> -2 for ping, -4 for auth). However, neither Java nor C client ensures the 
> XIDs it generates are non-negative, and the server doesn't reject negative 
> XIDs.
> Pat had some suggestions on how to fix this:
> - (bin-compat) Expire the session when the client sends a negative XID.
> - (bin-incompat) In addition to expiring the session, use 64-bit int for XID 
> so that overflow will practically never happen.
> --Michi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-1485) client xid overflow is not handled

2016-05-27 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304564#comment-15304564
 ] 

Marshall McMullen commented on ZOOKEEPER-1485:
--

[~fanster.z], [~fpj] or [~michim] - any of you have any thoughts on this?

> client xid overflow is not handled
> --
>
> Key: ZOOKEEPER-1485
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1485
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client, java client
>Affects Versions: 3.4.3, 3.3.5
>Reporter: Michi Mutsuzaki
>Assignee: Bruce Gao
>
> Both Java and C clients use signed 32-bit int as XIDs. XIDs are assumed to be 
> non-negative, and zookeeper uses some negative values as special XIDs (e.g. 
> -2 for ping, -4 for auth). However, neither Java nor C client ensures the 
> XIDs it generates are non-negative, and the server doesn't reject negative 
> XIDs.
> Pat had some suggestions on how to fix this:
> - (bin-compat) Expire the session when the client sends a negative XID.
> - (bin-incompat) In addition to expiring the session, use 64-bit int for XID 
> so that overflow will practically never happen.
> --Michi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2152) Intermittent failure in TestReconfig.cc

2016-05-27 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304556#comment-15304556
 ] 

Marshall McMullen commented on ZOOKEEPER-2152:
--

[~makuchta] - This intermittent test failure and the thoughts folks had on it 
may interest you, since I think you're seeing this as well.

> Intermittent failure in TestReconfig.cc
> ---
>
> Key: ZOOKEEPER-2152
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2152
> Project: ZooKeeper
>  Issue Type: Sub-task
>  Components: c client
>Reporter: Michi Mutsuzaki
>Assignee: Michael Han
>  Labels: reconfiguration
> Fix For: 3.6.0
>
>
> I'm seeing this failure in the c client test once in a while:
> {noformat}
> [exec] 
> /home/jenkins/jenkins-slave/workspace/ZooKeeper-trunk/trunk/src/c/tests/TestReconfig.cc:474:
>  Assertion: assertion failed [Expression: found != string::npos, 
> 10.10.10.4:2004 not in newComing list]
> {noformat}
> https://builds.apache.org/job/ZooKeeper-trunk/2640/console



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-1485) client xid overflow is not handled

2016-05-26 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303532#comment-15303532
 ] 

Marshall McMullen commented on ZOOKEEPER-1485:
--

I think that [~makuchta] is right on this one as well. If he's right, and the 
only purpose of the C client xid is to match operations submitted to the server 
with the responses that come back, then the simplest and most correct thing to do 
here seems to be the following:

1. In get_xid(), initialize xid to 0 rather than time(0). Starting at zero instead 
of the time since the epoch gives us as much runway as possible before we wrap.

2. As Martin suggests, inside get_xid(), if we would overflow INT32_MAX, simply 
wrap back to 0. I don't think there's any risk of collisions here, since that 
gives us the maximum number of operations before wrapping. The odds of an 
in-flight operation from 2,147,483,647 operations ago still lingering around or 
causing any confusion seem beyond unlikely IMO.

The nice thing about this is we don't have to make any changes to the server or 
worry about compatibility.
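
Concretely, (1) and (2) together could look something like this sketch 
(illustrative only, not the actual get_xid() in zookeeper.c, which would also 
need to remain thread-safe, e.g. via an atomic fetch-and-add):

{code}
#include <stdint.h>

/* Sketch of the proposal: start the counter at 0 and wrap it back to 0
 * before it can ever go negative, so the client never emits one of the
 * reserved negative XIDs (ping, auth, ...). Synchronization is omitted. */
static int32_t next_xid = 0;

static int32_t get_xid(void)
{
    int32_t xid = next_xid;
    next_xid = (next_xid == INT32_MAX) ? 0 : next_xid + 1;
    return xid;
}
{code}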

[~phunt] what do you think?

> client xid overflow is not handled
> --
>
> Key: ZOOKEEPER-1485
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1485
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client, java client
>Affects Versions: 3.4.3, 3.3.5
>Reporter: Michi Mutsuzaki
>Assignee: Bruce Gao
>
> Both Java and C clients use signed 32-bit int as XIDs. XIDs are assumed to be 
> non-negative, and zookeeper uses some negative values as special XIDs (e.g. 
> -2 for ping, -4 for auth). However, neither Java nor C client ensures the 
> XIDs it generates are non-negative, and the server doesn't reject negative 
> XIDs.
> Pat had some suggestions on how to fix this:
> - (bin-compat) Expire the session when the client sends a negative XID.
> - (bin-incompat) In addition to expiring the session, use 64-bit int for XID 
> so that overflow will practically never happen.
> --Michi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (ZOOKEEPER-2318) segfault in auth_completion_func

2016-05-26 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen resolved ZOOKEEPER-2318.
--
Resolution: Duplicate

> segfault in auth_completion_func
> 
>
> Key: ZOOKEEPER-2318
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2318
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.5.0
>    Reporter: Marshall McMullen
>
> We have seen some sporadic issues with unexplained segfaults inside 
> auth_completion_func. The interesting thing is we are not using any auth 
> mechanism at all. This happened against this version of the code:
> svn.apache.org/repos/asf/zookeeper/trunk@1547702
> Here's the stacktrace we are seeing:
> {code}
> Thread 1 (Thread 0x7f21d13ff700 ? (LWP 5230)):
> #0  0x7f21efff42f0 in auth_completion_func (rc=0, zh=0x7f21e7470800) at 
> src/zookeeper.c:1696
> #1  0x7f21efff7898 in zookeeper_process (zh=0x7f21e7470800, events=2) at 
> src/zookeeper.c:2708
> #2  0x7f21f0006583 in do_io (v=0x7f21e7470800) at src/mt_adaptor.c:440
> #3  0x7f21eeab7e9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #4  0x7f21ed1803fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #5  0x in ?? ()
> {code}
> The offending line in our case is:
> 1696LOG_INFO(LOGCALLBACK(zh), "Authentication scheme %s 
> succeeded", zh->auth_h.auth->scheme);
> It must be the case that zh->auth_h.auth is NULL for this to happen since the 
> code path returns if zh is NULL.
> Interesting log messages around this time:
> {code}
> Socket [10.170.243.7:2181] zk retcode=-2, errno=115(Operation now in 
> progress): unexpected server response: expected 0xfff9, but received 
> 0xfff8
> Priming connection to [10.170.243.4:2181]: last_zxid=0x370eb4d
> initiated connection to server [10.170.243.4:2181]
> Oct 13 12:03:21.273384 zookeeper - INFO  
> [NIOServerCxnFactory.AcceptThread:/10.170.243.4:2181:NIOServerCnxnFactory$AcceptThread@296]
>  - Accepted socket connection from /10.170.243.4:48523
> Oct 13 12:03:21.274321 zookeeper - WARN  
> [NIOWorkerThread-24:ZooKeeperServer@822] - Connection request from old client 
> /10.170.243.4:48523; will be dropped if server is in r-o mode
> Oct 13 12:03:21.274452 zookeeper - INFO  
> [NIOWorkerThread-24:ZooKeeperServer@869] - Client attempting to renew session 
> 0x311596d004a at /10.170.243.4:48523; client last zxid is 0x30370eb4d; 
> server last zxid is 0x30370eb4d
> Oct 13 12:03:21.274584 zookeeper - INFO  [NIOWorkerThread-24:Learner@115] - 
> Revalidating client: 0x311596d004a
> session establishment complete on server [10.170.243.4:2181], 
> sessionId=0x311596d004a, negotiated timeout=2
> Oct 13 12:03:21.275693 zookeeper - INFO  
> [QuorumPeer[myid=1]/10.170.243.4:2181:ZooKeeperServer@611] - Established 
> session 0x311596d004a with negotiated timeout 2 for client 
> /10.170.243.4:48523
> Oct 13 12:03:24.229590 zookeeper - WARN  
> [NIOWorkerThread-8:NIOServerCnxn@361] - Unable to read additional data from 
> client sessionid 0x311596d004a, likely client has closed socket
> Oct 13 12:03:24.230018 zookeeper - INFO  
> [NIOWorkerThread-8:NIOServerCnxn@999] - Closed socket connection for client 
> /10.170.243.4:48523 which had sessionid 0x311596d004a
> Oct 13 12:03:24.230257 zookeeper - WARN  
> [NIOWorkerThread-19:NIOServerCnxn@361] - Unable to read additional data from 
> client sessionid 0x12743aa0001, likely client has closed socket
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2318) segfault in auth_completion_func

2016-05-26 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303521#comment-15303521
 ] 

Marshall McMullen commented on ZOOKEEPER-2318:
--

I agree with [~makuchta]: this looks identical to ZOOKEEPER-1485. The tell-tale 
is that right before this error, in every occurrence we've seen, we see this 
super-important indicator of ZOOKEEPER-1485:

{code}
Socket [10.170.243.7:2181] zk retcode=-2, errno=115(Operation now in progress): 
unexpected server response: expected 0xfff9, but received 0xfff8
{code}

I'll close this as a duplicate of ZOOKEEPER-1485. Nice sleuthing on this one, 
[~makuchta].

> segfault in auth_completion_func
> 
>
> Key: ZOOKEEPER-2318
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2318
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.5.0
>    Reporter: Marshall McMullen
>
> We have seen some sporadic issues with unexplained segfaults inside 
> auth_completion_func. The interesting thing is we are not using any auth 
> mechanism at all. This happened against this version of the code:
> svn.apache.org/repos/asf/zookeeper/trunk@1547702
> Here's the stacktrace we are seeing:
> {code}
> Thread 1 (Thread 0x7f21d13ff700 ? (LWP 5230)):
> #0  0x7f21efff42f0 in auth_completion_func (rc=0, zh=0x7f21e7470800) at 
> src/zookeeper.c:1696
> #1  0x7f21efff7898 in zookeeper_process (zh=0x7f21e7470800, events=2) at 
> src/zookeeper.c:2708
> #2  0x7f21f0006583 in do_io (v=0x7f21e7470800) at src/mt_adaptor.c:440
> #3  0x7f21eeab7e9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #4  0x7f21ed1803fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #5  0x in ?? ()
> {code}
> The offending line in our case is:
> 1696LOG_INFO(LOGCALLBACK(zh), "Authentication scheme %s 
> succeeded", zh->auth_h.auth->scheme);
> It must be the case that zh->auth_h.auth is NULL for this to happen since the 
> code path returns if zh is NULL.
> Interesting log messages around this time:
> {code}
> Socket [10.170.243.7:2181] zk retcode=-2, errno=115(Operation now in 
> progress): unexpected server response: expected 0xfff9, but received 
> 0xfff8
> Priming connection to [10.170.243.4:2181]: last_zxid=0x370eb4d
> initiated connection to server [10.170.243.4:2181]
> Oct 13 12:03:21.273384 zookeeper - INFO  
> [NIOServerCxnFactory.AcceptThread:/10.170.243.4:2181:NIOServerCnxnFactory$AcceptThread@296]
>  - Accepted socket connection from /10.170.243.4:48523
> Oct 13 12:03:21.274321 zookeeper - WARN  
> [NIOWorkerThread-24:ZooKeeperServer@822] - Connection request from old client 
> /10.170.243.4:48523; will be dropped if server is in r-o mode
> Oct 13 12:03:21.274452 zookeeper - INFO  
> [NIOWorkerThread-24:ZooKeeperServer@869] - Client attempting to renew session 
> 0x311596d004a at /10.170.243.4:48523; client last zxid is 0x30370eb4d; 
> server last zxid is 0x30370eb4d
> Oct 13 12:03:21.274584 zookeeper - INFO  [NIOWorkerThread-24:Learner@115] - 
> Revalidating client: 0x311596d004a
> session establishment complete on server [10.170.243.4:2181], 
> sessionId=0x311596d004a, negotiated timeout=2
> Oct 13 12:03:21.275693 zookeeper - INFO  
> [QuorumPeer[myid=1]/10.170.243.4:2181:ZooKeeperServer@611] - Established 
> session 0x311596d004a with negotiated timeout 2 for client 
> /10.170.243.4:48523
> Oct 13 12:03:24.229590 zookeeper - WARN  
> [NIOWorkerThread-8:NIOServerCnxn@361] - Unable to read additional data from 
> client sessionid 0x311596d004a, likely client has closed socket
> Oct 13 12:03:24.230018 zookeeper - INFO  
> [NIOWorkerThread-8:NIOServerCnxn@999] - Closed socket connection for client 
> /10.170.243.4:48523 which had sessionid 0x311596d004a
> Oct 13 12:03:24.230257 zookeeper - WARN  
> [NIOWorkerThread-19:NIOServerCnxn@361] - Unable to read additional data from 
> client sessionid 0x12743aa0001, likely client has closed socket
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2152) Intermittent failure in TestReconfig.cc

2016-05-05 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272510#comment-15272510
 ] 

Marshall McMullen commented on ZOOKEEPER-2152:
--

[~shralex] and [~hanm] - I've been so unbearably swamped at work the last 6 
months that I've not been able to come up for air at all. I'm happy to help 
advise and review changes on this but don't have the bandwidth to commit to 
working on this myself in the near term. I'm hoping things will quiet down for 
me at work so I can start contributing more here as there are so many things 
I'd like to do! Thanks guys!

> Intermittent failure in TestReconfig.cc
> ---
>
> Key: ZOOKEEPER-2152
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2152
> Project: ZooKeeper
>  Issue Type: Sub-task
>  Components: c client
>Reporter: Michi Mutsuzaki
>Assignee: Michael Han
>  Labels: reconfiguration
> Fix For: 3.6.0
>
>
> I'm seeing this failure in the c client test once in a while:
> {noformat}
> [exec] 
> /home/jenkins/jenkins-slave/workspace/ZooKeeper-trunk/trunk/src/c/tests/TestReconfig.cc:474:
>  Assertion: assertion failed [Expression: found != string::npos, 
> 10.10.10.4:2004 not in newComing list]
> {noformat}
> https://builds.apache.org/job/ZooKeeper-trunk/2640/console



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2355) ZooKeeper ephemeral node is never deleted if follower fail while reading the proposal packet

2016-01-18 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105914#comment-15105914
 ] 

Marshall McMullen commented on ZOOKEEPER-2355:
--

I wonder if this is the same issue described in 
https://issues.apache.org/jira/browse/ZOOKEEPER-2145

> ZooKeeper ephemeral node is never deleted if follower fail while reading the 
> proposal packet
> 
>
> Key: ZOOKEEPER-2355
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum, server
>Reporter: Arshad Mohammad
>Assignee: Arshad Mohammad
>Priority: Critical
> Attachments: ZOOKEEPER-2355-01.patch
>
>
> ZooKeeper ephemeral node is never deleted if a follower fails while reading the 
> proposal packet.
> The scenario is as follows:
> # Configure a three-node ZooKeeper cluster, let's say nodes A, B and C; 
> start all, and assume A is the leader and B and C are followers.
> # Connect to any of the servers and create ephemeral node /e1.
> # Close the session; ephemeral node /e1 will go for deletion.
> # While it is receiving the delete proposal, make Follower B fail with a 
> {{SocketTimeoutException}}. We need to do this to reproduce the scenario; 
> in a production environment it happens because of a network fault.
> # Remove the fault and check that the faulted follower is now connected to the 
> quorum.
> # Connect to any of the servers and create the same ephemeral node /e1; creation 
> succeeds.
> # Close the session; ephemeral node /e1 will go for deletion.
> # {color:red}/e1 is not deleted from the faulted Follower B. It should have 
> been deleted, as it was created again with another session.{color}
> # {color:green}/e1 is deleted from Leader A and the other Follower C.{color}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2318) segfault in auth_completion_func

2016-01-06 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085877#comment-15085877
 ] 

Marshall McMullen commented on ZOOKEEPER-2318:
--

Anyone else seeing this? We haven't updated our internal ZooKeeper version in 
quite a while, so it's possible this is fixed in newer versions.

> segfault in auth_completion_func
> 
>
> Key: ZOOKEEPER-2318
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2318
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.5.0
>    Reporter: Marshall McMullen
>
> We have seen some sporadic issues with unexplained segfaults inside 
> auth_completion_func. The interesting thing is we are not using any auth 
> mechanism at all. This happened against this version of the code:
> svn.apache.org/repos/asf/zookeeper/trunk@1547702
> Here's the stacktrace we are seeing:
> {code}
> Thread 1 (Thread 0x7f21d13ff700 ? (LWP 5230)):
> #0  0x7f21efff42f0 in auth_completion_func (rc=0, zh=0x7f21e7470800) at 
> src/zookeeper.c:1696
> #1  0x7f21efff7898 in zookeeper_process (zh=0x7f21e7470800, events=2) at 
> src/zookeeper.c:2708
> #2  0x7f21f0006583 in do_io (v=0x7f21e7470800) at src/mt_adaptor.c:440
> #3  0x7f21eeab7e9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #4  0x7f21ed1803fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #5  0x in ?? ()
> {code}
> The offending line in our case is:
> 1696LOG_INFO(LOGCALLBACK(zh), "Authentication scheme %s 
> succeeded", zh->auth_h.auth->scheme);
> It must be the case that zh->auth_h.auth is NULL for this to happen since the 
> code path returns if zh is NULL.
> Interesting log messages around this time:
> {code}
> Socket [10.170.243.7:2181] zk retcode=-2, errno=115(Operation now in 
> progress): unexpected server response: expected 0xfff9, but received 
> 0xfff8
> Priming connection to [10.170.243.4:2181]: last_zxid=0x370eb4d
> initiated connection to server [10.170.243.4:2181]
> Oct 13 12:03:21.273384 zookeeper - INFO  
> [NIOServerCxnFactory.AcceptThread:/10.170.243.4:2181:NIOServerCnxnFactory$AcceptThread@296]
>  - Accepted socket connection from /10.170.243.4:48523
> Oct 13 12:03:21.274321 zookeeper - WARN  
> [NIOWorkerThread-24:ZooKeeperServer@822] - Connection request from old client 
> /10.170.243.4:48523; will be dropped if server is in r-o mode
> Oct 13 12:03:21.274452 zookeeper - INFO  
> [NIOWorkerThread-24:ZooKeeperServer@869] - Client attempting to renew session 
> 0x311596d004a at /10.170.243.4:48523; client last zxid is 0x30370eb4d; 
> server last zxid is 0x30370eb4d
> Oct 13 12:03:21.274584 zookeeper - INFO  [NIOWorkerThread-24:Learner@115] - 
> Revalidating client: 0x311596d004a
> session establishment complete on server [10.170.243.4:2181], 
> sessionId=0x311596d004a, negotiated timeout=2
> Oct 13 12:03:21.275693 zookeeper - INFO  
> [QuorumPeer[myid=1]/10.170.243.4:2181:ZooKeeperServer@611] - Established 
> session 0x311596d004a with negotiated timeout 2 for client 
> /10.170.243.4:48523
> Oct 13 12:03:24.229590 zookeeper - WARN  
> [NIOWorkerThread-8:NIOServerCnxn@361] - Unable to read additional data from 
> client sessionid 0x311596d004a, likely client has closed socket
> Oct 13 12:03:24.230018 zookeeper - INFO  
> [NIOWorkerThread-8:NIOServerCnxn@999] - Closed socket connection for client 
> /10.170.243.4:48523 which had sessionid 0x311596d004a
> Oct 13 12:03:24.230257 zookeeper - WARN  
> [NIOWorkerThread-19:NIOServerCnxn@361] - Unable to read additional data from 
> client sessionid 0x12743aa0001, likely client has closed socket
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2311) assert in setup_random

2015-12-05 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15043537#comment-15043537
 ] 

Marshall McMullen commented on ZOOKEEPER-2311:
--

[~rgs] - Yes, I agree. The short read is still a problem. I think the EBADF is 
actually a bug in our application, not in ZooKeeper. So unless I discover 
otherwise, I think we should ignore the EBADF for now. I'll open a separate 
JIRA if I find it's a real issue.

I will regenerate this patch, though, because I didn't create it properly the 
first time.
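
For reference, a short-read-safe version of the seed setup could look something 
like the sketch below (illustrative only, not the attached patch):

{code}
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch only: read the whole seed, retrying on short reads and EINTR,
 * and fall back to getpid() if /dev/urandom cannot be read in full. */
static void setup_random_sketch(void)
{
    int seed = 0;
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd == -1) {
        seed = getpid();
    } else {
        char *p = (char *)&seed;
        size_t left = sizeof(seed);
        while (left > 0) {
            ssize_t rc = read(fd, p, left);
            if (rc > 0) {
                p += rc;
                left -= (size_t)rc;
            } else if (rc == -1 && errno == EINTR) {
                continue;          /* interrupted: retry the read */
            } else {
                break;             /* EOF or a hard error */
            }
        }
        close(fd);
        if (left != 0)
            seed = getpid();       /* couldn't read a full seed */
    }
    srandom(seed);
    srand48(seed);
}
{code}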

> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>    Reporter: Marshall McMullen
> Attachments: ZOOKEEPER-2311.patch
>
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);
>  533 if (fd == -1) {
>  534 seed = getpid();
>  535 } else {
>  536 int rc = read(fd, &seed, sizeof(seed));
>  537 assert(rc == sizeof(seed));
>  538 close(fd);
>  539 }
>  540 srandom(seed);
>  541 srand48(seed);
>  542 #endif
> {code}
> The core files show:
> Program terminated with signal 6, Aborted.
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x7f9ff6652e42 in __assert_fail () from 
> /lib/x86_64-linux-gnu/libc.so.6
> #4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
> #5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
> hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
> avec=0x7f9fd87fab60) at src/zookeeper.c:730
> #6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
> src/zookeeper.c:801
> #7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
> fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
> src/zookeeper.c:1980
> #8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
> #9  0x7f9ff804de9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #11 0x in ?? ()
> I'm not sure what the underlying cause of this is... But POSIX always allows 
> for a short read(2), and any program MUST check for short reads... 
> Has anyone else encountered this issue? We are seeing it rather frequently 
> which is concerning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-12-05 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Attachment: ZOOKEEPER-2311.patch

Updated patch to be generated from the right directory this time.

> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>    Reporter: Marshall McMullen
> Attachments: ZOOKEEPER-2311.patch, ZOOKEEPER-2311.patch
>
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);
>  533 if (fd == -1) {
>  534 seed = getpid();
>  535 } else {
>  536 int rc = read(fd, &seed, sizeof(seed));
>  537 assert(rc == sizeof(seed));
>  538 close(fd);
>  539 }
>  540 srandom(seed);
>  541 srand48(seed);
>  542 #endif
> {code}
> The core files show:
> Program terminated with signal 6, Aborted.
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x7f9ff6652e42 in __assert_fail () from 
> /lib/x86_64-linux-gnu/libc.so.6
> #4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
> #5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
> hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
> avec=0x7f9fd87fab60) at src/zookeeper.c:730
> #6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
> src/zookeeper.c:801
> #7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
> fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
> src/zookeeper.c:1980
> #8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
> #9  0x7f9ff804de9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #11 0x in ?? ()
> I'm not sure what the underlying cause of this is... But POSIX always allows 
> for a short read(2), and any program MUST check for short reads... 
> Has anyone else encountered this issue? We are seeing it rather frequently 
> which is concerning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2311) assert in setup_random

2015-12-01 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034904#comment-15034904
 ] 

Marshall McMullen commented on ZOOKEEPER-2311:
--

I got another reproduction of this and this time captured a core file. And I was 
wrong originally. It's not a short read that is causing this. Instead the read is 
failing with a return code of -1 and errno is set to EBADF. The manpage for 
read(2) indicates this can only happen when:

{code}
   EBADF  fd is not a valid file descriptor or is not open for reading.
{code}

But we specifically opened it 2 lines of code above that and checked to ensure 
it wasn't -1. 

In the core file I also see that the fd is valid:

{code}
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
476 in src/zookeeper.c
(gdb) print errno
$3 = 9
(gdb) print fd
$4 = 140
(gdb) print seed
$5 = 32671
{code}

It's odd that seed has something in it. That could mean we read _something_, 
but it could also be because this code never initialized seed to zero and it's 
got whatever garbage was on the stack.

The only other thing that's very curious here is that I think when this happens 
it coincides with a call to zookeeper_close. But this is a local stack variable 
so I can't fathom how that could cause this failure scenario.
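
That said, even though the fd variable itself is local, the descriptor number is 
process-wide, so another thread closing that number (for example something racing 
with zookeeper_close) would make our read() fail with EBADF even though open() had 
just succeeded. A purely illustrative, self-contained snippet of that shape (not 
our code, just the pattern):

{code}
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd == -1)
        return 1;

    /* Simulate another thread closing the same descriptor number
     * between our open() and read(). */
    close(fd);

    int seed = 0;
    ssize_t rc = read(fd, &seed, sizeof(seed));
    printf("rc=%zd errno=%d (%s)\n", rc, errno, strerror(errno)); /* EBADF */
    return 0;
}
{code}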

I'll keep digging.

> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>    Reporter: Marshall McMullen
> Attachments: ZOOKEEPER-2311.patch
>
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);
>  533 if (fd == -1) {
>  534 seed = getpid();
>  535 } else {
>  536 int rc = read(fd, &seed, sizeof(seed));
>  537 assert(rc == sizeof(seed));
>  538 close(fd);
>  539 }
>  540 srandom(seed);
>  541 srand48(seed);
>  542 #endif
> {code}
> The core files show:
> Program terminated with signal 6, Aborted.
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x7f9ff6652e42 in __assert_fail () from 
> /lib/x86_64-linux-gnu/libc.so.6
> #4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
> #5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
> hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
> avec=0x7f9fd87fab60) at src/zookeeper.c:730
> #6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
> src/zookeeper.c:801
> #7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
> fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
> src/zookeeper.c:1980
> #8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
> #9  0x7f9ff804de9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #11 0x in ?? ()
> I'm not sure what the underlying cause of this is... But POSIX always allows 
> for a short read(2), and any program MUST check for short reads... 
> Has anyone else encountered this issue? We are seeing it rather frequently 
> which is concerning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2311) assert in setup_random

2015-11-30 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032536#comment-15032536
 ] 

Marshall McMullen commented on ZOOKEEPER-2311:
--

A very specific LKML thread related to this exact behavior: 
https://lkml.org/lkml/2005/1/13/485

This email thread indicates that there is in general an assumption that reading 
from /dev/urandom will never result in a short read. In actuality, in the face 
of signals, that's not really guaranteed. As with any call to read(2), the 
caller must handle short reads properly. 
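
For reference, a hardened setup_random along those lines might look roughly like 
this (just a sketch, not the attached patch): keep reading until sizeof(seed) 
bytes have arrived, retry on EINTR, and fall back to getpid() if /dev/urandom 
cannot be read.

{code}
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

static void setup_random(void)
{
#ifndef _WIN32
    int seed = getpid();                        /* fallback seed */
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd != -1) {
        size_t got = 0;
        while (got < sizeof(seed)) {
            ssize_t rc = read(fd, (char *)&seed + got, sizeof(seed) - got);
            if (rc > 0) {
                got += (size_t)rc;              /* short read: keep reading */
            } else if (rc == -1 && errno == EINTR) {
                continue;                       /* interrupted by a signal: retry */
            } else {
                seed = getpid();                /* EOF or hard error: fall back */
                break;
            }
        }
        close(fd);
    }
    srandom(seed);
    srand48(seed);
#endif
}
{code}

The getpid() fallback matches what the existing code already does when open() 
fails, so behavior only changes in the error paths.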

> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>    Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);
>  533 if (fd == -1) {
>  534 seed = getpid();
>  535 } else {
>  536 int rc = read(fd, &seed, sizeof(seed));
>  537 assert(rc == sizeof(seed));
>  538 close(fd);
>  539 }
>  540 srandom(seed);
>  541 srand48(seed);
>  542 #endif
> {code}
> The core files show:
> Program terminated with signal 6, Aborted.
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x7f9ff6652e42 in __assert_fail () from 
> /lib/x86_64-linux-gnu/libc.so.6
> #4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
> #5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
> hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
> avec=0x7f9fd87fab60) at src/zookeeper.c:730
> #6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
> src/zookeeper.c:801
> #7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
> fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
> src/zookeeper.c:1980
> #8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
> #9  0x7f9ff804de9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #11 0x in ?? ()
> I'm not sure what the underlying cause of this is... But POSIX always allows 
> for a short read(2), and any program MUST check for short reads... 
> Has anyone else encountered this issue? We are seeing it rather frequently 
> which is concerning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-30 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Attachment: ZOOKEEPER-2311.patch

> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>    Reporter: Marshall McMullen
> Attachments: ZOOKEEPER-2311.patch
>
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);
>  533 if (fd == -1) {
>  534 seed = getpid();
>  535 } else {
>  536 int rc = read(fd, &seed, sizeof(seed));
>  537 assert(rc == sizeof(seed));
>  538 close(fd);
>  539 }
>  540 srandom(seed);
>  541 srand48(seed);
>  542 #endif
> {code}
> The core files show:
> Program terminated with signal 6, Aborted.
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x7f9ff6652e42 in __assert_fail () from 
> /lib/x86_64-linux-gnu/libc.so.6
> #4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
> #5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
> hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
> avec=0x7f9fd87fab60) at src/zookeeper.c:730
> #6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
> src/zookeeper.c:801
> #7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
> fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
> src/zookeeper.c:1980
> #8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
> #9  0x7f9ff804de9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #11 0x in ?? ()
> I'm not sure what the underlying cause of this is... But POSIX always allows 
> for a short read(2), and any program MUST check for short reads... 
> Has anyone else encountered this issue? We are seeing it rather frequently 
> which is concerning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2311) assert in setup_random

2015-11-30 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032765#comment-15032765
 ] 

Marshall McMullen commented on ZOOKEEPER-2311:
--

Uploaded patch to harden setup_random against short reads from /dev/urandom per 
LKML thread indicating this is a valid non-error path.

> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>    Reporter: Marshall McMullen
> Attachments: ZOOKEEPER-2311.patch
>
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);
>  533 if (fd == -1) {
>  534 seed = getpid();
>  535 } else {
>  536 int rc = read(fd, &seed, sizeof(seed));
>  537 assert(rc == sizeof(seed));
>  538 close(fd);
>  539 }
>  540 srandom(seed);
>  541 srand48(seed);
>  542 #endif
> {code}
> The core files show:
> Program terminated with signal 6, Aborted.
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x7f9ff6652e42 in __assert_fail () from 
> /lib/x86_64-linux-gnu/libc.so.6
> #4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
> #5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
> hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
> avec=0x7f9fd87fab60) at src/zookeeper.c:730
> #6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
> src/zookeeper.c:801
> #7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
> fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
> src/zookeeper.c:1980
> #8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
> #9  0x7f9ff804de9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #11 0x in ?? ()
> I'm not sure what the underlying cause of this is... But POSIX always allows 
> for a short read(2), and any program MUST check for short reads... 
> Has anyone else encountered this issue? We are seeing it rather frequently 
> which is concerning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (ZOOKEEPER-2318) segfault in auth_completion_func

2015-11-09 Thread Marshall McMullen (JIRA)
Marshall McMullen created ZOOKEEPER-2318:


 Summary: segfault in auth_completion_func
 Key: ZOOKEEPER-2318
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2318
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.5.0
Reporter: Marshall McMullen


We have seen some sporadic issues with unexplained segfaults inside 
auth_completion_func. The interesting thing is we are not using any auth 
mechanism at all. This happened against this version of the code:

svn.apache.org/repos/asf/zookeeper/trunk@1547702

Here's the stacktrace we are seeing:

{code}
Thread 1 (Thread 0x7f21d13ff700 ? (LWP 5230)):
#0  0x7f21efff42f0 in auth_completion_func (rc=0, zh=0x7f21e7470800) at 
src/zookeeper.c:1696
#1  0x7f21efff7898 in zookeeper_process (zh=0x7f21e7470800, events=2) at 
src/zookeeper.c:2708
#2  0x7f21f0006583 in do_io (v=0x7f21e7470800) at src/mt_adaptor.c:440
#3  0x7f21eeab7e9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#4  0x7f21ed1803fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x in ?? ()
{code}

The offending line in our case is:

1696    LOG_INFO(LOGCALLBACK(zh), "Authentication scheme %s succeeded", 
zh->auth_h.auth->scheme);

It must be the case that zh->auth_h.auth is NULL for this to happen since the 
code path returns if zh is NULL.
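
Guarding the auth list before the dereference would at least avoid the crash. To 
make the shape concrete, here is a tiny self-contained mock (fake_handle and 
log_auth_success are made-up names that only mirror the layout of the real 
structures):

{code}
#include <stdio.h>

/* Made-up types that only mirror the shape of the real zhandle_t fields. */
struct auth_info      { const char *scheme; };
struct auth_list_head { struct auth_info *auth; };
struct fake_handle    { struct auth_list_head auth_h; };

static void log_auth_success(struct fake_handle *zh)
{
    /* Check the auth list as well as the handle before dereferencing,
     * since no auth may ever have been added. */
    if (zh == NULL || zh->auth_h.auth == NULL)
        return;
    printf("Authentication scheme %s succeeded\n", zh->auth_h.auth->scheme);
}

int main(void)
{
    struct fake_handle h = { { NULL } };   /* no auth mechanism in use */
    log_auth_success(&h);                  /* safely logs nothing */
    return 0;
}
{code}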

Interesting log messages around this time:

{code}
Socket [10.170.243.7:2181] zk retcode=-2, errno=115(Operation now in progress): 
unexpected server response: expected 0xfff9, but received 0xfff8
Priming connection to [10.170.243.4:2181]: last_zxid=0x370eb4d
initiated connection to server [10.170.243.4:2181]
Oct 13 12:03:21.273384 zookeeper - INFO  
[NIOServerCxnFactory.AcceptThread:/10.170.243.4:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.170.243.4:48523
Oct 13 12:03:21.274321 zookeeper - WARN  
[NIOWorkerThread-24:ZooKeeperServer@822] - Connection request from old client 
/10.170.243.4:48523; will be dropped if server is in r-o mode
Oct 13 12:03:21.274452 zookeeper - INFO  
[NIOWorkerThread-24:ZooKeeperServer@869] - Client attempting to renew session 
0x311596d004a at /10.170.243.4:48523; client last zxid is 0x30370eb4d; 
server last zxid is 0x30370eb4d
Oct 13 12:03:21.274584 zookeeper - INFO  [NIOWorkerThread-24:Learner@115] - 
Revalidating client: 0x311596d004a
session establishment complete on server [10.170.243.4:2181], 
sessionId=0x311596d004a, negotiated timeout=2
Oct 13 12:03:21.275693 zookeeper - INFO  
[QuorumPeer[myid=1]/10.170.243.4:2181:ZooKeeperServer@611] - Established 
session 0x311596d004a with negotiated timeout 2 for client 
/10.170.243.4:48523
Oct 13 12:03:24.229590 zookeeper - WARN  [NIOWorkerThread-8:NIOServerCnxn@361] 
- Unable to read additional data from client sessionid 0x311596d004a, 
likely client has closed socket
Oct 13 12:03:24.230018 zookeeper - INFO  [NIOWorkerThread-8:NIOServerCnxn@999] 
- Closed socket connection for client /10.170.243.4:48523 which had sessionid 
0x311596d004a
Oct 13 12:03:24.230257 zookeeper - WARN  [NIOWorkerThread-19:NIOServerCnxn@361] 
- Unable to read additional data from client sessionid 0x12743aa0001, 
likely client has closed socket
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

{{monospaced}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

{{monospaced}}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif

{{monospaced}}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {{monospaced}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

{code}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

{{monospaced}}
Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()
{{monospaced}}

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

{code}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

{monospaced}
Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()
{monospaced}

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
> 

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:


 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif


The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif


The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> 
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);
>  

[jira] [Commented] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985978#comment-14985978
 ] 

Marshall McMullen commented on ZOOKEEPER-2311:
--

Another interesting link related to this:

https://bugzilla.kernel.org/show_bug.cgi?id=80981

> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>    Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);
>  533 if (fd == -1) {
>  534 seed = getpid();
>  535 } else {
>  536 int rc = read(fd, &seed, sizeof(seed));
>  537 assert(rc == sizeof(seed));
>  538 close(fd);
>  539 }
>  540 srandom(seed);
>  541 srand48(seed);
>  542 #endif
> The core files show:
> Program terminated with signal 6, Aborted.
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x7f9ff6652e42 in __assert_fail () from 
> /lib/x86_64-linux-gnu/libc.so.6
> #4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
> #5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
> hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
> avec=0x7f9fd87fab60) at src/zookeeper.c:730
> #6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
> src/zookeeper.c:801
> #7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
> fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
> src/zookeeper.c:1980
> #8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
> #9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
> (this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
> #10 0x7f9ff804de9a in start_thread () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> #11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #12 0x in ?? ()
> I'm not sure what the underlying cause of this is... But POSIX always allows 
> for a short read(2), and any program MUST check for short reads... 
> Has anyone else encountered this issue? We are seeing it rather frequently 
> which is concerning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)
Marshall McMullen created ZOOKEEPER-2311:


 Summary: assert in setup_random
 Key: ZOOKEEPER-2311
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Reporter: Marshall McMullen


We've started seeing an assert failing inside setup_random at line 537:

 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif


The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

{monospaced}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{monospaced}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom"

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

{code}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#10 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

{code}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);
>  533 if (fd == -1) {
>  534 seed = getpid();
>  535 } else {
>  536  

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

{code}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

{monospaced}
Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()
{monospaced}

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

{code}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

{code}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

{code}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

{{monospaced}}
Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()
{{monospaced}}

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

{code|borderStyle=solid}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code|borderStyle=solid}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int f

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

{code}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

{code|borderStyle=solid}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
{code}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {code}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int f

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

{{monospaced}}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif

{{monospaced}}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:


 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif


The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {{monospaced}}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open(&

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

{{monospaced}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

{
{monospaced}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/dev/urandom", O_RDONLY);

[jira] [Updated] (ZOOKEEPER-2311) assert in setup_random

2015-11-02 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-2311:
-
Description: 
We've started seeing an assert failing inside setup_random at line 537:

{
{monospaced}
 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif
}

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.

  was:
We've started seeing an assert failing inside setup_random at line 537:

 528 static void setup_random()
 529 {
 530 #ifndef _WIN32  // TODO: better seed
 531 int seed;
 532 int fd = open("/dev/urandom", O_RDONLY);
 533 if (fd == -1) {
 534 seed = getpid();
 535 } else {
 536 int rc = read(fd, &seed, sizeof(seed));
 537 assert(rc == sizeof(seed));
 538 close(fd);
 539 }
 540 srandom(seed);
 541 srand48(seed);
 542 #endif

The core files show:

Program terminated with signal 6, Aborted.
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x7f9ff665a0d5 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f9ff665d83b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f9ff6652d9e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x7f9ff6652e42 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x7f9ff8e4070a in setup_random () at src/zookeeper.c:476
#5  0x7f9ff8e40d76 in resolve_hosts (zh=0x7f9fe14de400, 
hosts_in=0x7f9fd700f400 "10.26.200.6:2181,10.26.200.7:2181,10.26.200.8:2181", 
avec=0x7f9fd87fab60) at src/zookeeper.c:730
#6  0x7f9ff8e40e87 in update_addrs (zh=0x7f9fe14de400) at 
src/zookeeper.c:801
#7  0x7f9ff8e44176 in zookeeper_interest (zh=0x7f9fe14de400, 
fd=0x7f9fd87fac4c, interest=0x7f9fd87fac50, tv=0x7f9fd87fac80) at 
src/zookeeper.c:1980
#8  0x7f9ff8e553f5 in do_io (v=0x7f9fe14de400) at src/mt_adaptor.c:379
#9  0x020170ac in solidfire::ThreadBacktraces::LaunchThread 
(this=0x7f9ff0c8d500, args=) at shared/ThreadBacktraces.cpp:497
#10 0x7f9ff804de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#11 0x7f9ff671738d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x in ?? ()

I'm not sure what the underlying cause of this is... But POSIX always allows 
for a short read(2), and any program MUST check for short reads... 

Has anyone else encountered this issue? We are seeing it rather frequently 
which is concerning.


> assert in setup_random
> --
>
> Key: ZOOKEEPER-2311
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2311
> Project: ZooKeeper
>  Issue Type: Bug
>      Components: c client
>Reporter: Marshall McMullen
>
> We've started seeing an assert failing inside setup_random at line 537:
> {
> {monospaced}
>  528 static void setup_random()
>  529 {
>  530 #ifndef _WIN32  // TODO: better seed
>  531 int seed;
>  532 int fd = open("/de

[jira] [Commented] (ZOOKEEPER-2145) Node can be seen but not deleted

2015-06-16 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589343#comment-14589343
 ] 

Marshall McMullen commented on ZOOKEEPER-2145:
--

Has anyone had a chance to investigate this issue yet?

 Node can be seen but not deleted
 

 Key: ZOOKEEPER-2145
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2145
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.6
Reporter: Frans Lawaetz

 I have a three-server ensemble that appears to be working fine in every 
 respect but for the fact that I can ls or get a znode but can not rmr it.
 [zk: localhost:2181(CONNECTED) 0] get 
 /accumulo/9354e975-7e2a-4207-8c7b-5d36c0e7765d/masters/goal_state
 CLEAN_STOP
 cZxid = 0x15
 ctime = Fri Feb 20 13:37:59 CST 2015
 mZxid = 0x72
 mtime = Fri Feb 20 13:38:05 CST 2015
 pZxid = 0x15
 cversion = 0
 dataVersion = 2
 aclVersion = 0
 ephemeralOwner = 0x0
 dataLength = 10
 numChildren = 0
 [zk: localhost:2181(CONNECTED) 1] rmr 
 /accumulo/9354e975-7e2a-4207-8c7b-5d36c0e7765d/masters/goal_state
 Node does not exist: 
 /accumulo/9354e975-7e2a-4207-8c7b-5d36c0e7765d/masters/goal_state
 I have run a 'stat' against all three servers and they seem properly 
 structured with a leader and two followers.  An md5sum of all zoo.cfg shows 
 them to be identical.  
 The problem seems localized to the accumulo/935 directory as I can create 
 and delete znodes outside of that path fine but not inside of it.
 For example:
 [zk: localhost:2181(CONNECTED) 12] create 
 /accumulo/9354e975-7e2a-4207-8c7b-5d36c0e7765d/fubar asdf
 Node does not exist: /accumulo/9354e975-7e2a-4207-8c7b-5d36c0e7765d/fubar
 [zk: localhost:2181(CONNECTED) 13] create /accumulo/fubar asdf
 Created /accumulo/fubar
 [zk: localhost:2181(CONNECTED) 14] ls /accumulo/fubar
 []
 [zk: localhost:2181(CONNECTED) 15] rmr /accumulo/fubar
 [zk: localhost:2181(CONNECTED) 16]
 Here is my zoo.cfg:
 tickTime=2000
 initLimit=10
 syncLimit=15
 dataDir=/data/extera/zkeeper/data
 clientPort=2181
  maxClientCnxns=300
 autopurge.snapRetainCount=10
 autopurge.purgeInterval=1
 server.1=cdf61:2888:3888
 server.2=cdf62:2888:3888
 server.3=cdf63:2888:3888



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2163) Introduce new ZNode type: container

2015-06-01 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568532#comment-14568532
 ] 

Marshall McMullen commented on ZOOKEEPER-2163:
--

~shralex I would be happy to look into this. I probably won't be able to get to 
this until early next week though. But looking through this bug report it seems 
completely unrelated to ZOOKEEPER-2163. Perhaps we should just open a separate 
Jira to track the unstable TestConfig test? In any event, I'll add this to my 
list of things to look into.

 Introduce new ZNode type: container
 ---

 Key: ZOOKEEPER-2163
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2163
 Project: ZooKeeper
  Issue Type: New Feature
  Components: c client, java client, server
Affects Versions: 3.5.0
Reporter: Jordan Zimmerman
Assignee: Jordan Zimmerman
 Fix For: 3.6.0

 Attachments: zookeeper-2163.10.patch, zookeeper-2163.11.patch, 
 zookeeper-2163.12.patch, zookeeper-2163.13.patch, zookeeper-2163.3.patch, 
 zookeeper-2163.5.patch, zookeeper-2163.6.patch, zookeeper-2163.7.patch, 
 zookeeper-2163.8.patch, zookeeper-2163.9.patch


 BACKGROUND
 
 A recurring problem for ZooKeeper users is garbage collection of parent 
 nodes. Many recipes (e.g. locks, leaders, etc.) call for the creation of a 
 parent node under which participants create sequential nodes. When the 
 participant is done, it deletes its node. In practice, the ZooKeeper tree 
 begins to fill up with orphaned parent nodes that are no longer needed. The 
 ZooKeeper APIs don’t provide a way to clean these. Over time, ZooKeeper can 
 become unstable due to the number of these nodes.
 CURRENT SOLUTIONS
 ===
 Apache Curator has a workaround solution for this by providing the Reaper 
 class which runs in the background looking for orphaned parent nodes and 
 deleting them. This isn’t ideal and it would be better if ZooKeeper supported 
 this directly.
 PROPOSAL
 =
 ZOOKEEPER-723 and ZOOKEEPER-834 have been proposed to allow EPHEMERAL nodes 
 to contain child nodes. This is not optimum as EPHEMERALs are tied to a 
 session and the general use case of parent nodes is for PERSISTENT nodes. 
 This proposal adds a new node type, CONTAINER. A CONTAINER node is the same 
 as a PERSISTENT node with the additional property that when its last child is 
 deleted, it is deleted (and CONTAINER nodes recursively up the tree are 
 deleted if empty).
 CANONICAL USAGE
 
 {code}
 while ( true) { // or some reasonable limit
 try {
 zk.create(path, ...);
 break;
 } catch ( KeeperException.NoNodeException e ) {
 try {
 zk.createContainer(containerPath, ...);
 } catch ( KeeperException.NodeExistsException ignore) {
}
 }
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2163) Introduce new ZNode type: container

2015-06-01 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568530#comment-14568530
 ] 

Marshall McMullen commented on ZOOKEEPER-2163:
--

~shralex I would be happy to look into this. I probably won't be able to get to 
this until early next week though. But looking through this bug report it seems 
completely unrelated to ZOOKEEPER-2163. Perhaps we should just open a separate 
Jira to track the unstable TestConfig test? In any event, I'll add this to my 
list of things to look into.

 Introduce new ZNode type: container
 ---

 Key: ZOOKEEPER-2163
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2163
 Project: ZooKeeper
  Issue Type: New Feature
  Components: c client, java client, server
Affects Versions: 3.5.0
Reporter: Jordan Zimmerman
Assignee: Jordan Zimmerman
 Fix For: 3.6.0

 Attachments: zookeeper-2163.10.patch, zookeeper-2163.11.patch, 
 zookeeper-2163.12.patch, zookeeper-2163.13.patch, zookeeper-2163.3.patch, 
 zookeeper-2163.5.patch, zookeeper-2163.6.patch, zookeeper-2163.7.patch, 
 zookeeper-2163.8.patch, zookeeper-2163.9.patch


 BACKGROUND
 
 A recurring problem for ZooKeeper users is garbage collection of parent 
 nodes. Many recipes (e.g. locks, leaders, etc.) call for the creation of a 
 parent node under which participants create sequential nodes. When the 
 participant is done, it deletes its node. In practice, the ZooKeeper tree 
 begins to fill up with orphaned parent nodes that are no longer needed. The 
 ZooKeeper APIs don’t provide a way to clean these. Over time, ZooKeeper can 
 become unstable due to the number of these nodes.
 CURRENT SOLUTIONS
 ===
 Apache Curator has a workaround solution for this by providing the Reaper 
 class which runs in the background looking for orphaned parent nodes and 
 deleting them. This isn’t ideal and it would be better if ZooKeeper supported 
 this directly.
 PROPOSAL
 =
 ZOOKEEPER-723 and ZOOKEEPER-834 have been proposed to allow EPHEMERAL nodes 
 to contain child nodes. This is not optimum as EPHEMERALs are tied to a 
 session and the general use case of parent nodes is for PERSISTENT nodes. 
 This proposal adds a new node type, CONTAINER. A CONTAINER node is the same 
 as a PERSISTENT node with the additional property that when its last child is 
 deleted, it is deleted (and CONTAINER nodes recursively up the tree are 
 deleted if empty).
 CANONICAL USAGE
 
 {code}
 while ( true) { // or some reasonable limit
 try {
 zk.create(path, ...);
 break;
 } catch ( KeeperException.NoNodeException e ) {
 try {
 zk.createContainer(containerPath, ...);
 } catch ( KeeperException.NodeExistsException ignore) {
}
 }
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Changing sync() to need quorum ack

2015-03-10 Thread Marshall McMullen
+1. This is how we believed sync was implemented already. Getting these
semantics correct would be very important for us.
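
For context, the pattern these semantics protect is "sync then read": the client flushes the channel to the leader and only reads once the sync completion fires. A minimal C-client sketch of that pattern (illustrative only; the callback and function names are made up, and error handling is omitted):

{code}
#include <zookeeper/zookeeper.h>
#include <stdio.h>

static void read_done(int rc, const char *value, int value_len,
                      const struct Stat *stat, const void *data)
{
    if (rc == ZOK)
        printf("read %d bytes after sync\n", value_len);
}

static void sync_done(int rc, const char *path, const void *data)
{
    zhandle_t *zh = (zhandle_t *)data;
    /* Only read after the sync completes, so the read reflects every
     * write committed before the sync was issued. */
    if (rc == ZOK)
        zoo_aget(zh, path, 0 /* no watch */, read_done, NULL);
}

static void sync_then_read(zhandle_t *zh, const char *path)
{
    /* zoo_async() is the C binding's sync(): it flushes the channel
     * between this client's server and the leader. */
    zoo_async(zh, path, sync_done, zh);
}
{code}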
On Mar 10, 2015 2:57 AM, Flavio Junqueira fpjunque...@yahoo.com.invalid
wrote:

 For one thing, this should clean up the mess that we had to do in the code
 to have sync() the way it is, since it was neither a regular read nor a regular
 quorum write. I don't know why you say that it changes the behavior. It
 changes the internal behavior, but the expected behavior exposed through
 the API call remains the same, so no user should care about it, it doesn't
 break any code.

 -Flavio

  On 10 Mar 2015, at 03:31, Hongchao Deng hd...@cloudera.com wrote:
 
  Hi all,
 
  I recently worked on fixing flaky test -- testPortChange(), which is
  related to ZOOKEEPER-2000.
 
  This is what I have figured out:
 
  * Server (1) and (2) were followers, (3) was the leader.
  * client connected to (1), did a reconfig().
  * (1) and (2) formed a quorum, reconfig was successful, and returned.
  * (3) still thinks he's the leader, so using LeaderZooKeeperServer.
  * client connected to (3) did a sync(), and the sync didn't go through a
  quorum. THE CLIENT WHO DID SYNC() GETS WRONG BEHAVIOR. There's a split
  brain here for sync().
  * Then (3) gradually moves to the new quorum config.
 
  I'm proposing to change sync() to need quorum acks. I've privately talked
  with my friend Xiang Li who's working on etcd. He previously had similar
  experience and finally changed sync to go through quorum.
 
  Since this change affects the behavior of sync(), I'm asking in public if
  there's any concern/assumption? Let's discuss it here.
 
  Best,
  --
  *- Hongchao Deng*
  *Software Engineer*




One ensemble node shows massive number of 'Outstanding' requests

2015-02-17 Thread Marshall McMullen
Greetings,

We saw an issue recently that I've never seen before and am hoping I can
get some clarity on what may cause this and whether it's a known issue. We
had a 5 node ensemble and were unable to connect to one of the ZooKeeper
instances.  When trying to connect with zkCli it would timeout. When I
connected via telnet and issued the srvr four letter word, I was surprised
to see that this one server reported a massive number of 'Outstanding'
requests. I'd never seen that really be anything other than 0 before. On
the ZK dev guide it says:

outstanding is the number of queued requests, this increases when the
server is under load and is receiving more sustained requests than it can
process, ie the request queue. I looked at all the ZK servers in my
ensemble:

for ip in 101 102 103 104 105; do echo srvr | nc 172.21.20.${ip} 2181 |
grep Outstanding; done
Outstanding: 0
Outstanding: 0
Outstanding: 0
Outstanding: 0
Outstanding: 18876

I eventually killed ZK on the affected server and everything corrected
itself and Outstanding went to zero and I was able to connect again.

Is this something anyone's familiar with? I have logs if it would be
helpful.

Thanks!


Re: Review Request 30573: ZOOKEEPER-1366: Zookeeper should be tolerant of clock adjustments

2015-02-05 Thread Marshall McMullen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30573/#review71222
---



src/java/main/org/apache/zookeeper/common/Time.java
https://reviews.apache.org/r/30573/#comment116890

I *REALLY* like the addition of the Time class. Nice abstraction layer.



src/java/main/org/apache/zookeeper/common/Time.java
https://reviews.apache.org/r/30573/#comment116891

Can you please format the body of this method like we normally do so it's 
not all on one line?



src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java
https://reviews.apache.org/r/30573/#comment116893

Is it worth changing callers of ZooKeeperServer.java's getTime to instead 
call into the new Time.currentWallTime for increased clarity? Or is that a LOT 
of refactoring? I confess I didn't look.


- Marshall McMullen


On Feb. 5, 2015, 12:37 a.m., Hongchao Deng wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/30573/
 ---
 
 (Updated Feb. 5, 2015, 12:37 a.m.)
 
 
 Review request for zookeeper.
 
 
 Repository: zookeeper-git
 
 
 Description
 ---
 
 Zookeeper should be tolerant of clock adjustments
 
 
 Diffs
 -
 
   src/java/main/org/apache/zookeeper/ClientCnxn.java 
 c85cc8d1b6dae0c0d0850d758420fb31a8dd1dcc 
   src/java/main/org/apache/zookeeper/ClientCnxnSocket.java 
 16cb9120686bf982b4c68a0172600d23b6119042 
   src/java/main/org/apache/zookeeper/Login.java 
 6d248ab37a0a6b11358f5f3adc9dc363b1a9c73b 
   src/java/main/org/apache/zookeeper/Shell.java 
 62169d797a7a103d921634c4676fffea878def51 
   src/java/main/org/apache/zookeeper/ZKUtil.java 
 4713a08a934175c2b297f69740e204c7288c078c 
   src/java/main/org/apache/zookeeper/common/Time.java PRE-CREATION 
   src/java/main/org/apache/zookeeper/server/ConnectionBean.java 
 917aacfdcdcd50576029faab65ca98b89cfb2df9 
   src/java/main/org/apache/zookeeper/server/ExpiryQueue.java 
 a037bf49235e386cc20ee68633ec162b1db013d1 
   src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java 
 a97be4a5452006fbd85d355c0dcb16276cbf1c59 
   src/java/main/org/apache/zookeeper/server/RateLogger.java 
 fc951cf5147bedbf1786ff1047a1e1a5fd7f5121 
   src/java/main/org/apache/zookeeper/server/Request.java 
 ee01dcfa63784a9dd380f91d768e1b3f28b9cce9 
   src/java/main/org/apache/zookeeper/server/ServerStats.java 
 c3246293e409d863412144ed76b2a91ca1ac98f2 
   src/java/main/org/apache/zookeeper/server/SessionTrackerImpl.java 
 0c2c042e276c557a86f47d7ab5333e6860e12bd9 
   src/java/main/org/apache/zookeeper/server/WorkerService.java 
 c55ff48f92e5e3ae7783ad5be0262a5d9899c521 
   src/java/main/org/apache/zookeeper/server/ZKDatabase.java 
 f336049f0afb7b539460223b4903d323e2558aed 
   src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java 
 30a0ed390bb7473ddb36757da97bc7d5f4281887 
   
 src/java/main/org/apache/zookeeper/server/quorum/AuthFastLeaderElection.java 
 6cd0af88292d9cb89652f1c6d2a80ec2726b5b6a 
   src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java 
 dfe692f4889a11b8a8eb3a4cbbd150ed5cac6a9f 
   src/java/main/org/apache/zookeeper/server/quorum/Follower.java 
 6dbb0b22a4e0658a6b04629e6efdf1ac722375e5 
   src/java/main/org/apache/zookeeper/server/quorum/Leader.java 
 20589045752a7ba4ae9c9090055a4fcbe86a8eda 
   
 src/java/main/org/apache/zookeeper/server/quorum/LearnerSnapshotThrottler.java
  97b48915321aab6ea31bd7db8fe1197165507feb 
   src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java 
 388ceeb45bd18c7cb8f0766a96ebd4a54a9e76de 
   src/java/systest/org/apache/zookeeper/test/system/GenerateLoad.java 
 4092c760f2cc4eda410ac6125e58ec399d1a6ca4 
   src/java/systest/org/apache/zookeeper/test/system/InstanceManager.java 
 809fa4819eed61aee3fcee1b5641ec85b967d479 
   src/java/systest/org/apache/zookeeper/test/system/SimpleSysTest.java 
 9cdf4d912a29e8a5341e4a9700fd07e1eeb015f3 
   src/java/test/org/apache/zookeeper/common/TimeTest.java PRE-CREATION 
   src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java 
 9abe47910f5d73195c57e9f33d9d2150a4861141 
   src/java/test/org/apache/zookeeper/test/ClientBase.java 
 a6229b50b4a4486b443daa6b3b92ac4ab5cf94cb 
   src/java/test/org/apache/zookeeper/test/ClientHammerTest.java 
 b807dbb0f4350b29190b5d5862c418de84a168c5 
   src/java/test/org/apache/zookeeper/test/CnxManagerTest.java 
 563c77c41c86c692edfd95ea48d397bc25154d26 
   src/java/test/org/apache/zookeeper/test/LoadFromLogTest.java 
 ab84146f58e8f97ef24517703c30ef6015a71c84 
   src/java/test/org/apache/zookeeper/test/ReadOnlyModeTest.java 
 0579858659cec892aee3fa4362d0c55d175d87a7 
   src/java/test/org/apache/zookeeper/test/StaticHostProviderTest.java 
 bf1dcef7fbca91fee6128096e8413013fa11e0e0 
   src/java/test/org/apache/zookeeper

[jira] [Commented] (ZOOKEEPER-1366) Zookeeper should be tolerant of clock adjustments

2015-02-05 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307646#comment-14307646
 ] 

Marshall McMullen commented on ZOOKEEPER-1366:
--

Latest version looks great to me. 

 Zookeeper should be tolerant of clock adjustments
 -

 Key: ZOOKEEPER-1366
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1366
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Ted Dunning
Assignee: Hongchao Deng
Priority: Critical
 Fix For: 3.5.1

 Attachments: ZOOKEEPER-1366-3.3.3.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, 
 zookeeper-3.4.5-ZK1366-SC01.patch


 If you want to wreak havoc on a ZK based system just do [date -s +1hour] 
 and watch the mayhem as all sessions expire at once.
 This shouldn't happen.  Zookeeper could easily handle elapsed times as 
 elapsed times rather than as differences between absolute times.  The 
 absolute times are subject to adjustment when the clock is set while a timer 
 is not subject to this problem.  In Java, System.currentTimeMillis() gives 
 you absolute time while System.nanoTime() gives you time based on a timer 
 from an arbitrary epoch.
 I have done this and have been running tests now for some tens of minutes 
 with no failures.  I will set up a test machine to redo the build again on 
 Ubuntu and post a patch here for discussion.
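
 A minimal C sketch of the monotonic-clock idea described above (illustrative
 only; the actual patch introduces a Java-side Time abstraction rather than
 this exact code):

{code}
#include <stdint.h>
#include <time.h>

/* Interval measurement that is immune to wall-clock adjustments:
 * CLOCK_MONOTONIC keeps ticking steadily when the system date is set. */
static int64_t monotonic_millis(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Usage: record t0 = monotonic_millis() when a session is touched and
 * compare against monotonic_millis() - t0, never against wall time. */
{code}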



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-1366) Zookeeper should be tolerant of clock adjustments

2015-02-04 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304763#comment-14304763
 ] 

Marshall McMullen commented on ZOOKEEPER-1366:
--

[~hdeng] - I will be happy to help review this tomorrow. It's important to us 
to pick up this fix as well so I'd love to see this rolled into the 3.5 
release. I'll make sure to review this and add comments to the review tomorrow.

 Zookeeper should be tolerant of clock adjustments
 -

 Key: ZOOKEEPER-1366
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1366
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Ted Dunning
Assignee: Hongchao Deng
Priority: Critical
 Fix For: 3.5.1

 Attachments: ZOOKEEPER-1366-3.3.3.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, 
 ZOOKEEPER-1366.patch, ZOOKEEPER-1366.patch, zookeeper-3.4.5-ZK1366-SC01.patch


 If you want to wreak havoc on a ZK based system just do [date -s +1hour] 
 and watch the mayhem as all sessions expire at once.
 This shouldn't happen.  Zookeeper could easily handle elapsed times as 
 elapsed times rather than as differences between absolute times.  The 
 absolute times are subject to adjustment when the clock is set while a timer 
 is not subject to this problem.  In Java, System.currentTimeMillis() gives 
 you absolute time while System.nanoTime() gives you time based on a timer 
 from an arbitrary epoch.
 I have done this and have been running tests now for some tens of minutes 
 with no failures.  I will set up a test machine to redo the build again on 
 Ubuntu and post a patch here for discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2052) Unable to delete a node when the node has no children

2014-10-14 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171988#comment-14171988
 ] 

Marshall McMullen commented on ZOOKEEPER-2052:
--

I'm going to go look over the final version of this patch on RB, but I think 
you guys have absolutely nailed this problem. I wish I could give some useful 
insight into why it was originally implemented this way but I think it was just 
an oversight on our part. The particular use case of deleting a multi with 
intermixed ephemeral nodes is one we would never have encountered or tested 
against and thus I probably just didn't think of that... Anyhow, great find.
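
For readers skimming the archive, the kind of multi-op delete under discussion looks roughly like this from the C client (illustrative sketch; the parent path comes from the report, the child names are hypothetical, and a real caller would build the op list from the actual children):

{code}
#include <zookeeper/zookeeper.h>

/* Delete the remaining children and the parent atomically: either all
 * three deletes commit or none do. */
static int delete_parent_and_children(zhandle_t *zh)
{
    zoo_op_t ops[3];
    zoo_op_result_t results[3];

    /* In real code these paths come from zoo_get_children(). */
    zoo_delete_op_init(&ops[0], "/metadata/resources/child-a", -1);
    zoo_delete_op_init(&ops[1], "/metadata/resources/child-b", -1);
    zoo_delete_op_init(&ops[2], "/metadata/resources", -1);

    return zoo_multi(zh, 3, ops, results);   /* version -1 = any version */
}
{code}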

 Unable to delete a node when the node has no children
 -

 Key: ZOOKEEPER-2052
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2052
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.6, 3.5.0
 Environment: Red Hat Enterprise Linux 6.1 x86_64, standalone or 3 
 node ensemble (v3.4.6), 2 Java clients (v3.4.6)
Reporter: Yip Ng
Assignee: Hongchao Deng
 Fix For: 3.4.7, 3.5.1, 3.6.0

 Attachments: ZOOKEEPER-2052-v2.patch, 
 ZOOKEEPER-2052-v3-release.patch, ZOOKEEPER-2052-v3.patch, 
 ZOOKEEPER-2052-v4.patch, ZOOKEEPER-2052.patch, ZOOKEEPER-2052.patch, 
 ZOOKEEPER-2052.patch, test-jenkins.patch, zookeeper.log


 We stumbled upon a ZooKeeper bug where a node with no children cannot be 
 removed on our 3 node ZooKeeper ensemble or standalone ZooKeeper on Red Hat 
 Enterprise Linux x86_64 environment.  Here is an example scenario/setup:
 o Standalone ZooKeeper or 3 node ensemble (v3.4.6)
 o 2 Java clients (v3.4.6)
   - Client A creates a persistent node (e.g.:  /metadata/resources)
   - Client B creates ephemeral nodes under this persistent node 
 o Client A attempts to remove the /metadata/resources node via multi op  
delete but fails since there are children
 o Client B's session expired, all the ephemeral nodes are removed
 o Client A attempts to recursively remove /metadata/resources node via 
multi op, this is expected to succeed but got the following exception:
   org.apache.zookeeper.KeeperException$NotEmptyException: 
  KeeperErrorCode = Directory not empty
(Note that Client B is the only client that creates these ephemeral nodes)
 o After this, we use zkCli.sh to inspect the problematic node but the 
 zkCli.sh shows the /metadata/resources node indeed have no children but it 
 will not allow /metadata/resources node to get deleted.  (shown below)
 [zk: localhost:2181(CONNECTED) 0] ls /
 [zookeeper, metadata]
 [zk: localhost:2181(CONNECTED) 1] ls /metadata
 [resources]
 [zk: localhost:2181(CONNECTED) 2] get /metadata/resources
 null
 cZxid = 0x3
 ctime = Wed Oct 01 22:04:11 PDT 2014
 mZxid = 0x3
 mtime = Wed Oct 01 22:04:11 PDT 2014
 pZxid = 0x9
 cversion = 2
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x0
 dataLength = 0
 numChildren = 0
 [zk: localhost:2181(CONNECTED) 3] delete /metadata/resources
 Node not empty: /metadata/resources
 [zk: localhost:2181(CONNECTED) 4] get /metadata/resources   
 null
 cZxid = 0x3
 ctime = Wed Oct 01 22:04:11 PDT 2014
 mZxid = 0x3
 mtime = Wed Oct 01 22:04:11 PDT 2014
 pZxid = 0x9
 cversion = 2
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x0
 dataLength = 0
 numChildren = 0
 o The only ways to remove this node is to either:
a) Restart the ZooKeeper server
b) set data to /metadata/resources then followed by a subsequent delete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 26437: ZooKeeper-2052

2014-10-14 Thread Marshall McMullen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26437/#review56658
---



src/java/main/org/apache/zookeeper/server/PrepRequestProcessor.java
https://reviews.apache.org/r/26437/#comment97061

Thanks for adding this comment here.



src/java/test/org/apache/zookeeper/server/PrepRequestProcessorTest.java
https://reviews.apache.org/r/26437/#comment97062

Really good additional tests. Nice job.


- Marshall McMullen


On Oct. 8, 2014, 9:18 p.m., Hongchao Deng wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26437/
 ---
 
 (Updated Oct. 8, 2014, 9:18 p.m.)
 
 
 Review request for zookeeper.
 
 
 Repository: zookeeper-git
 
 
 Description
 ---
 
 ZooKeeper-2052
 
 
 Diffs
 -
 
   src/java/main/org/apache/zookeeper/server/PrepRequestProcessor.java 8542790 
   src/java/test/org/apache/zookeeper/server/PrepRequestProcessorTest.java 
 8caf419 
   src/java/test/org/apache/zookeeper/test/ClientBase.java a6229b5 
   src/java/test/org/apache/zookeeper/test/MultiTransactionTest.java a573180 
 
 Diff: https://reviews.apache.org/r/26437/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Hongchao Deng
 




[jira] [Commented] (ZOOKEEPER-2052) Unable to delete a node when the node has no children

2014-10-14 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171996#comment-14171996
 ] 

Marshall McMullen commented on ZOOKEEPER-2052:
--

I reviewed the RB and the changes look solid to me. +1 from me.

 Unable to delete a node when the node has no children
 -

 Key: ZOOKEEPER-2052
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2052
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.6, 3.5.0
 Environment: Red Hat Enterprise Linux 6.1 x86_64, standalone or 3 
 node ensemble (v3.4.6), 2 Java clients (v3.4.6)
Reporter: Yip Ng
Assignee: Hongchao Deng
 Fix For: 3.4.7, 3.5.1, 3.6.0

 Attachments: ZOOKEEPER-2052-v2.patch, 
 ZOOKEEPER-2052-v3-release.patch, ZOOKEEPER-2052-v3.patch, 
 ZOOKEEPER-2052-v4.patch, ZOOKEEPER-2052.patch, ZOOKEEPER-2052.patch, 
 ZOOKEEPER-2052.patch, test-jenkins.patch, zookeeper.log


 We stumbled upon a ZooKeeper bug where a node with no children cannot be 
 removed on our 3 node ZooKeeper ensemble or standalone ZooKeeper on Red Hat 
 Enterprise Linux x86_64 environment.  Here is an example scenario/setup:
 o Standalone ZooKeeper or 3 node ensemble (v3.4.6)
 o 2 Java clients (v3.4.6)
   - Client A creates a persistent node (e.g.:  /metadata/resources)
   - Client B creates ephemeral nodes under this persistent node 
 o Client A attempts to remove the /metadata/resources node via multi op  
delete but fails since there are children
 o Client B's session expired, all the ephemeral nodes are removed
 o Client A attempts to recursively remove /metadata/resources node via 
multi op, this is expected to succeed but got the following exception:
   org.apache.zookeeper.KeeperException$NotEmptyException: 
  KeeperErrorCode = Directory not empty
(Note that Client B is the only client that creates these ephemeral nodes)
 o After this, we use zkCli.sh to inspect the problematic node but the 
 zkCli.sh shows the /metadata/resources node indeed have no children but it 
 will not allow /metadata/resources node to get deleted.  (shown below)
 [zk: localhost:2181(CONNECTED) 0] ls /
 [zookeeper, metadata]
 [zk: localhost:2181(CONNECTED) 1] ls /metadata
 [resources]
 [zk: localhost:2181(CONNECTED) 2] get /metadata/resources
 null
 cZxid = 0x3
 ctime = Wed Oct 01 22:04:11 PDT 2014
 mZxid = 0x3
 mtime = Wed Oct 01 22:04:11 PDT 2014
 pZxid = 0x9
 cversion = 2
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x0
 dataLength = 0
 numChildren = 0
 [zk: localhost:2181(CONNECTED) 3] delete /metadata/resources
 Node not empty: /metadata/resources
 [zk: localhost:2181(CONNECTED) 4] get /metadata/resources   
 null
 cZxid = 0x3
 ctime = Wed Oct 01 22:04:11 PDT 2014
 mZxid = 0x3
 mtime = Wed Oct 01 22:04:11 PDT 2014
 pZxid = 0x9
 cversion = 2
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x0
 dataLength = 0
 numChildren = 0
 o The only ways to remove this node is to either:
a) Restart the ZooKeeper server
b) set data to /metadata/resources then followed by a subsequent delete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 26437: ZooKeeper-2052

2014-10-14 Thread Marshall McMullen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26437/#review56660
---

Ship it!


Ship It!

- Marshall McMullen


On Oct. 8, 2014, 9:18 p.m., Hongchao Deng wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26437/
 ---
 
 (Updated Oct. 8, 2014, 9:18 p.m.)
 
 
 Review request for zookeeper.
 
 
 Repository: zookeeper-git
 
 
 Description
 ---
 
 ZooKeeper-2052
 
 
 Diffs
 -
 
   src/java/main/org/apache/zookeeper/server/PrepRequestProcessor.java 8542790 
   src/java/test/org/apache/zookeeper/server/PrepRequestProcessorTest.java 
 8caf419 
   src/java/test/org/apache/zookeeper/test/ClientBase.java a6229b5 
   src/java/test/org/apache/zookeeper/test/MultiTransactionTest.java a573180 
 
 Diff: https://reviews.apache.org/r/26437/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Hongchao Deng
 




[jira] [Commented] (ZOOKEEPER-2052) Unable to delete a node when the node has no children

2014-10-10 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166509#comment-14166509
 ] 

Marshall McMullen commented on ZOOKEEPER-2052:
--

I'm just seeing this jira for the first time as well. It looks like a really 
fantastic find and definitely very concerning if the issue is indeed as you 
describe. I'm pretty swamped at work at present so it may take me a few days 
before I'll have a chance to dig into this but I'll be very happy to do so... 
Will update when I've had a chance to digest this issue and comment on it.

 Unable to delete a node when the node has no children
 -

 Key: ZOOKEEPER-2052
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2052
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.6, 3.5.0
 Environment: Red Hat Enterprise Linux 6.1 x86_64, standalone or 3 
 node ensemble (v3.4.6), 2 Java clients (v3.4.6)
Reporter: Yip Ng
Assignee: Hongchao Deng
 Attachments: ZOOKEEPER-2052-v2.patch, 
 ZOOKEEPER-2052-v3-release.patch, ZOOKEEPER-2052-v3.patch, 
 ZOOKEEPER-2052-v4.patch, ZOOKEEPER-2052.patch, ZOOKEEPER-2052.patch, 
 ZOOKEEPER-2052.patch, test-jenkins.patch, zookeeper.log


 We stumbled upon a ZooKeeper bug where a node with no children cannot be 
 removed on our 3 node ZooKeeper ensemble or standalone ZooKeeper on Red Hat 
 Enterprise Linux x86_64 environment.  Here is an example scenario/setup:
 o Standalone ZooKeeper or 3 node ensemble (v3.4.6)
 o 2 Java clients (v3.4.6)
   - Client A creates a persistent node (e.g.:  /metadata/resources)
   - Client B creates ephemeral nodes under this persistent node 
 o Client A attempts to remove the /metadata/resources node via multi op  
delete but fails since there are children
 o Client B's session expired, all the ephemeral nodes are removed
 o Client A attempts to recursively remove the /metadata/resources node via 
multi op; this is expected to succeed but fails with the following exception:
   org.apache.zookeeper.KeeperException$NotEmptyException: 
  KeeperErrorCode = Directory not empty
(Note that Client B is the only client that creates these ephemeral nodes)
 o After this, we use zkCli.sh to inspect the problematic node; 
 zkCli.sh shows the /metadata/resources node indeed has no children, but it 
 will not allow the /metadata/resources node to be deleted (shown below):
 [zk: localhost:2181(CONNECTED) 0] ls /
 [zookeeper, metadata]
 [zk: localhost:2181(CONNECTED) 1] ls /metadata
 [resources]
 [zk: localhost:2181(CONNECTED) 2] get /metadata/resources
 null
 cZxid = 0x3
 ctime = Wed Oct 01 22:04:11 PDT 2014
 mZxid = 0x3
 mtime = Wed Oct 01 22:04:11 PDT 2014
 pZxid = 0x9
 cversion = 2
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x0
 dataLength = 0
 numChildren = 0
 [zk: localhost:2181(CONNECTED) 3] delete /metadata/resources
 Node not empty: /metadata/resources
 [zk: localhost:2181(CONNECTED) 4] get /metadata/resources   
 null
 cZxid = 0x3
 ctime = Wed Oct 01 22:04:11 PDT 2014
 mZxid = 0x3
 mtime = Wed Oct 01 22:04:11 PDT 2014
 pZxid = 0x9
 cversion = 2
 dataVersion = 0
 aclVersion = 0
 ephemeralOwner = 0x0
 dataLength = 0
 numChildren = 0
 o The only ways to remove this node are to either:
a) Restart the ZooKeeper server, or
b) set data on /metadata/resources and then issue the delete.
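
For illustration, a minimal C client sketch of the kind of multi-op delete that
hits this error once the ephemeral children are gone. The path and error
handling are illustrative only (not the reporter's code); it assumes the
standard zoo_multi transaction API.

    #include <stdio.h>
    #include <zookeeper/zookeeper.h>

    /* Attempt the parent delete as a single-op transaction, the way a
     * recursive delete typically issues it once the children are gone. */
    static int delete_resources_txn(zhandle_t *zh)
    {
        zoo_op_t ops[1];
        zoo_op_result_t results[1];

        zoo_delete_op_init(&ops[0], "/metadata/resources", -1 /* any version */);

        int rc = zoo_multi(zh, 1, ops, results);
        if (rc == ZNOTEMPTY) {
            /* The surprising outcome described above: numChildren is 0,
             * yet the server still rejects the delete as "not empty". */
            fprintf(stderr, "multi delete failed: ZNOTEMPTY\n");
        }
        return rc;
    }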



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-1636) c-client crash when zoo_amulti failed

2014-09-25 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147804#comment-14147804
 ] 

Marshall McMullen commented on ZOOKEEPER-1636:
--

Fantastic find, patch and unit tests. Looks like great hardening around this 
code path to me. Nice job.

 c-client crash when zoo_amulti failed 
 --

 Key: ZOOKEEPER-1636
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1636
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.4.3
Reporter: Thawan Kooburat
Assignee: Thawan Kooburat
Priority: Critical
 Fix For: 3.4.7, 3.5.1

 Attachments: ZOOKEEPER-1636.patch, ZOOKEEPER-1636.patch, 
 ZOOKEEPER-1636.patch, ZOOKEEPER-1636.patch, ZOOKEEPER-1636.patch


 deserialize_response for multi operation don't handle the case where the 
 server fail to send back response. (Eg. when multi packet is too large) 
 c-client will try to process completion of all sub-request as if the 
 operation is successful and will eventually cause SIGSEGV
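
A hedged sketch of the defensive pattern this hardening is about: check the
completion's rc before trusting any per-op result from zoo_amulti. The callback
and context wiring below are illustrative, not the actual patch.

    #include <stdio.h>
    #include <zookeeper/zookeeper.h>

    /* Completion for zoo_amulti. If the server failed to send back a valid
     * multi response (e.g. the multi packet was too large), rc is non-ZOK
     * and the per-op results must not be treated as successes. */
    static void multi_completion(int rc, const void *data)
    {
        const zoo_op_result_t *results = (const zoo_op_result_t *)data;
        if (rc != ZOK) {
            fprintf(stderr, "multi failed before results were valid: %d\n", rc);
            return;  /* do not inspect results[i].err here */
        }
        (void)results;  /* safe to walk individual sub-op results only on ZOK */
    }

    static int submit_multi(zhandle_t *zh, zoo_op_t *ops, int count,
                            zoo_op_result_t *results)
    {
        /* results must stay allocated until multi_completion has run. */
        return zoo_amulti(zh, count, ops, results, multi_completion, results);
    }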



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2016) Automate client-side rebalancing

2014-08-21 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105825#comment-14105825
 ] 

Marshall McMullen commented on ZOOKEEPER-2016:
--

[~shralex] - I agree this sounds useful but only if it is something we can 
opt-in for. Lots of application code which sits on top of the C bindings may 
prefer to have more direct control over this than having it automatically 
rebalance for them. 

 Automate client-side rebalancing
 

 Key: ZOOKEEPER-2016
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2016
 Project: ZooKeeper
  Issue Type: Improvement
Reporter: Hongchao Deng

 ZOOKEEPER-1355 introduced client-side rebalancing, which is implemented in 
 both the C and Java client libraries. However, it requires the client to 
 detect a configuration change and call updateServerList with the new 
 connection string (see the reconfig manual). It may be better if the client just 
 indicates that it is interested in this feature when creating a ZK handle, and 
 we'll detect configuration changes and invoke updateServerList for it 
 under the hood.
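
For context, the manual flow this proposal would automate looks roughly like
the following in the C bindings. The sketch assumes the application already
watches the configuration znode and derives a client connection string from
it; on_config_change is a hypothetical application callback, not a library API.

    #include <zookeeper/zookeeper.h>

    /* Hypothetical application callback, invoked when the app notices that
     * the stored configuration changed (e.g. via a watch on /zookeeper/config).
     * new_hosts is a "host:port,host:port,..." string derived from the new
     * configuration. */
    static void on_config_change(zhandle_t *zh, const char *new_hosts)
    {
        /* zoo_set_servers triggers the probabilistic client-side rebalancing
         * from ZOOKEEPER-1355; the handle may drop its current connection and
         * move to another server to keep load evenly spread. */
        int rc = zoo_set_servers(zh, new_hosts);
        if (rc != ZOK) {
            /* keep the old server list and retry later */
        }
    }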



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1994) Backup config files.

2014-08-01 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14082700#comment-14082700
 ] 

Marshall McMullen commented on ZOOKEEPER-1994:
--

I strongly agree with Alex on this as well. I would like them to be named using 
zxid as well. As Alex explained, that is much safer from a consistency point of 
view and much easier to correlate to the reconfiguration as well as different 
replicas. +1 from me.

 Backup config files.
 

 Key: ZOOKEEPER-1994
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1994
 Project: ZooKeeper
  Issue Type: Improvement
Affects Versions: 3.5.0
Reporter: Hongchao Deng
Assignee: Hongchao Deng
 Fix For: 3.5.0


 We should create a backup file for a static or dynamic configuration file 
 before changing the file. 
 Since the static file is changed at most twice (once when removing the 
 ensemble definitions, at which point a dynamic file doesn't exist yet, and 
 once when removing clientPort information) it's probably fine to back up the 
 static file independently from the dynamic file. 
 To track backup history:
 Option 1: we could have a .bakXX extension for backup where XX is a sequence 
 number. 
 Option 2: have the configuration version be part of the file name for dynamic 
 configuration files (instead of in the file like now). Such as 
 zoo_replicated1.cfg.dynamic.100 then on reconfiguration simply create a 
 new dynamic file (with new version) and update the link in the static file to 
 point to the new dynamic one.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1998) C library calls getaddrinfo unconditionally from zookeeper_interest

2014-07-29 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078237#comment-14078237
 ] 

Marshall McMullen commented on ZOOKEEPER-1998:
--

[~rgs] - yep, you're right. I added that code as part of ZOOKEEPER-107 working 
with [~shralex]. But if I recall correctly, the original code also 
unconditionally called resolve_hosts. Though I'd have to go look at the 
original code to confirm that. I'm guessing you've done that already and that 
it did not do that? 

Do you have thoughts on how we could avoid this? I suppose we could easily just 
check if the addrvec is the same and, if it is, bypass resolving the hosts. 
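
A rough sketch of that guard (field names such as last_hosts and
reconnect_pending are illustrative, not zookeeper.c's actual internals): only
resolve when the host string changed or a (re)connect is pending.

    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative state, not the real client structures. */
    struct resolve_state {
        char *last_hosts;        /* host string we last resolved */
        bool  reconnect_pending; /* set when the socket must be re-established */
    };

    static bool should_resolve(struct resolve_state *st, const char *hosts)
    {
        if (st->reconnect_pending)
            return true;                     /* always resolve when (re)connecting */
        if (st->last_hosts == NULL || strcmp(st->last_hosts, hosts) != 0) {
            free(st->last_hosts);
            st->last_hosts = strdup(hosts);  /* remember what we resolved */
            return true;                     /* list changed: hit DNS once */
        }
        return false;  /* same list, still connected: skip getaddrinfo */
    }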

 C library calls getaddrinfo unconditionally from zookeeper_interest
 ---

 Key: ZOOKEEPER-1998
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1998
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.5.0
Reporter: Raul Gutierrez Segales
Assignee: Raul Gutierrez Segales
Priority: Critical
 Fix For: 3.5.0


 (commented this on ZOOKEEPER-338)
 I've just noticed that we call getaddrinfo from zookeeper_interest... on 
 every call. So from zookeeper_interest we always call update_addrs:
 https://github.com/apache/zookeeper/blob/trunk/src/c/src/zookeeper.c#L2082
 which in turn unconditionally calls resolve_hosts:
 https://github.com/apache/zookeeper/blob/trunk/src/c/src/zookeeper.c#L787
 which does the unconditional calls to getaddrinfo:
 https://github.com/apache/zookeeper/blob/trunk/src/c/src/zookeeper.c#L648
 We should fix this since it'll make 3.5.0 slower for people relying on DNS. I 
 think this happened as part of ZOOKEEPER-107, in which the list of servers 
 can be updated. 
 cc: [~shralex], [~phunt], [~fpj]



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1998) C library calls getaddrinfo unconditionally from zookeeper_interest

2014-07-29 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078406#comment-14078406
 ] 

Marshall McMullen commented on ZOOKEEPER-1998:
--

[~rgs] - Looking at the 3.4 code I agree with you. It seems like we should only 
do the lookup when we are connecting.

 C library calls getaddrinfo unconditionally from zookeeper_interest
 ---

 Key: ZOOKEEPER-1998
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1998
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.5.0
Reporter: Raul Gutierrez Segales
Assignee: Raul Gutierrez Segales
Priority: Critical
 Fix For: 3.5.0


 (commented this on ZOOKEEPER-338)
 I've just noticed that we call getaddrinfo from zookeeper_interest... on 
 every call. So from zookeeper_interest we always call update_addrs:
 https://github.com/apache/zookeeper/blob/trunk/src/c/src/zookeeper.c#L2082
 which in turn unconditionally calls resolve_hosts:
 https://github.com/apache/zookeeper/blob/trunk/src/c/src/zookeeper.c#L787
 which does the unconditional calls to getaddrinfo:
 https://github.com/apache/zookeeper/blob/trunk/src/c/src/zookeeper.c#L648
 We should fix this since it'll make 3.5.0 slower for people relying on DNS. I 
 think this happened as part of ZOOKEEPER-107, in which the list of servers 
 can be updated. 
 cc: [~shralex], [~phunt], [~fpj]



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1997) Why is there a standalone mode

2014-07-28 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076767#comment-14076767
 ] 

Marshall McMullen commented on ZOOKEEPER-1997:
--

With reconfig you still cannot grow from standalone to quorum mode. There are 
many, many use cases for the standalone mode -- most notably for embedded unit 
tests or for non-HA clusters which are used for simulations or test environments 
where we don't need quorum mode.

 Why is there a standalone mode
 --

 Key: ZOOKEEPER-1997
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1997
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Hongchao Deng

 It seems there is a special standalone mode.
 With the coming of reconfig, this doesn't make any sense.
 A single server can also be configured later to add more servers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-19 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037586#comment-14037586
 ] 

Marshall McMullen commented on ZOOKEEPER-1934:
--

[~rgs] - No, we are not using local sessions. 

 Stale data received from sync'd ensemble peer
 -

 Key: ZOOKEEPER-1934
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.5.0
Reporter: Marshall McMullen
 Attachments: node1.log, node2.log, node3.log, node4.log, node5.log


 In our regression testing we encountered an error wherein we were caching a 
 value we read from zookeeper and then experienced session loss. We 
 subsequently got reconnected to a different zookeeper server. When we tried 
 to read the same path from this new zookeeper server we got a stale 
 value.
 Specifically, we are reading /binchanges and originally got back a value of 
 3 from the first server. After we lost connection and reconnected before 
 the session timeout, we then read /binchanges from the new server and got 
 back a value of 2. In our code path we never set this value from 3 to 2. We 
 throw an assertion if the value ever goes backwards. Which is how we caught 
 this error. 
 It's my understanding of the single system image guarantee that this should 
 never be allowed. I realize that the single system image guarantee is still 
 quorum based and it's certainly possible that a minority of the ensemble may 
 have stale data. However, I also believe that each client has to send the 
 highest zxid it's seen as part of its connection request to the server. And 
 if the server it's connecting to has a smaller zxid than the value the client 
 sends, then the connection request should be refused.
 Assuming I have all of that correct, then I'm at a loss for how this 
 happened. 
 The failure happened around Jun  4 08:13:44. Just before that, at June  4 
 08:13:30 there was a round of leader election. During that round of leader 
 election we voted server with id=4 and zxid=0x31c4c. This then led to a 
 new zxid=0x40001. The new leader sends a diff to all the servers 
 including the one we will soon read the stale data from (id=2). Server with 
 ID=2's log files also reflect that as of 08:13:43 it was up to date and 
 current with an UPTODATE message.
 I'm going to attach log files from all 5 ensemble nodes. I also used 
 zktreeutil to dump the database out for the 5 ensemble nodes. I diff'd those, 
 and compared them all for correctness. 1 of the nodes (id=2) has a massively 
 divergent zktreeutil dump from the other 4 nodes even though it received the 
 diff from the new leader.
 In the attachments there are 5 nodes. I will number each log file by its 
 zookeeper id, e.g. node4.log.
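
To make the invariant concrete, a small sketch of the check described above.
ZooKeeper's server is Java; this is only an illustration of the rule, not its
actual code: a server should refuse a session whose client has already seen a
newer zxid than the server has processed.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustration only, not ZooKeeper server code. */
    static bool accept_connect_request(uint64_t client_last_seen_zxid,
                                       uint64_t server_last_processed_zxid)
    {
        /* Refusing here is what keeps a reconnecting client from reading data
         * older than what it has already observed on another server. */
        return client_last_seen_zxid <= server_last_processed_zxid;
    }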



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (ZOOKEEPER-1937) init script needs fixing for ZOOKEEPER-1719

2014-06-12 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen reassigned ZOOKEEPER-1937:


Assignee: Marshall McMullen

 init script needs fixing for ZOOKEEPER-1719
 ---

 Key: ZOOKEEPER-1937
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1937
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.6
 Environment: Linux (Ubuntu 12.04)
Reporter: Nathan Sullivan
Assignee: Marshall McMullen

 ZOOKEEPER-1719 changed the interpreter to bash for zkCli.sh, zkServer.sh and 
 zkEnv.sh, but did not change src/packages/deb/init.d/zookeeper 
 This causes the following failure using /bin/sh
 [...] root@hostname:~# service zookeeper stop
 /etc/init.d/zookeeper: 81: /usr/libexec/zkEnv.sh: Syntax error: ( 
 unexpected (expecting fi)
 Simple fix, change the shebang to #!/bin/bash - tested and works fine.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1937) init script needs fixing for ZOOKEEPER-1719

2014-06-12 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029565#comment-14029565
 ] 

Marshall McMullen commented on ZOOKEEPER-1937:
--

Patch submitted. The one in the rpm directory was actually already using bash, 
but it didn't follow our convention of using /usr/bin/env so I fixed that one 
as well to be consistent.

 init script needs fixing for ZOOKEEPER-1719
 ---

 Key: ZOOKEEPER-1937
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1937
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.6
 Environment: Linux (Ubuntu 12.04)
Reporter: Nathan Sullivan
Assignee: Marshall McMullen
 Attachments: ZOOKEEPER-1719.patch


 ZOOKEEPER-1719 changed the interpreter to bash for zkCli.sh, zkServer.sh and 
 zkEnv.sh, but did not change src/packages/deb/init.d/zookeeper 
 This causes the following failure using /bin/sh
 [...] root@hostname:~# service zookeeper stop
 /etc/init.d/zookeeper: 81: /usr/libexec/zkEnv.sh: Syntax error: ( 
 unexpected (expecting fi)
 Simple fix, change the shebang to #!/bin/bash - tested and works fine.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (ZOOKEEPER-1937) init script needs fixing for ZOOKEEPER-1719

2014-06-12 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-1937:
-

Attachment: ZOOKEEPER-1719.patch

 init script needs fixing for ZOOKEEPER-1719
 ---

 Key: ZOOKEEPER-1937
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1937
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.6
 Environment: Linux (Ubuntu 12.04)
Reporter: Nathan Sullivan
Assignee: Marshall McMullen
 Attachments: ZOOKEEPER-1719.patch


 ZOOKEEPER-1719 changed the interpreter to bash for zkCli.sh, zkServer.sh and 
 zkEnv.sh, but did not change src/packages/deb/init.d/zookeeper 
 This causes the following failure using /bin/sh
 [...] root@hostname:~# service zookeeper stop
 /etc/init.d/zookeeper: 81: /usr/libexec/zkEnv.sh: Syntax error: ( 
 unexpected (expecting fi)
 Simple fix, change the shebang to #!/bin/bash - tested and works fine.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1937) init script needs fixing for ZOOKEEPER-1719

2014-06-12 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029659#comment-14029659
 ] 

Marshall McMullen commented on ZOOKEEPER-1937:
--

No new unit tests added as this only changes the shebang at the top of some 
unused init scripts. The test failure can't possibly be related, but it looks very 
troubling:

 [exec]  [exec] *** glibc detected *** ./zktest-mt: free(): invalid 
pointer: 0x2ba1446d ***


 init script needs fixing for ZOOKEEPER-1719
 ---

 Key: ZOOKEEPER-1937
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1937
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.6
 Environment: Linux (Ubuntu 12.04)
Reporter: Nathan Sullivan
Assignee: Marshall McMullen
 Attachments: ZOOKEEPER-1719.patch


 ZOOKEEPER-1719 changed the interpreter to bash for zkCli.sh, zkServer.sh and 
 zkEnv.sh, but did not change src/packages/deb/init.d/zookeeper 
 This causes the following failure using /bin/sh
 [...] root@hostname:~# service zookeeper stop
 /etc/init.d/zookeeper: 81: /usr/libexec/zkEnv.sh: Syntax error: ( 
 unexpected (expecting fi)
 Simple fix, change the shebang to #!/bin/bash - tested and works fine.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-11 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028178#comment-14028178
 ] 

Marshall McMullen commented on ZOOKEEPER-1934:
--

[~michim] - thanks for looking at this issue. I saw the same code you linked to 
and agree on the intended behavior. The log message in that block of code is 
NOT present. 

We did not see /binchanges update to the correct value of 3. It looked to be 
stuck at 2, which really defies explanation.

 Stale data received from sync'd ensemble peer
 -

 Key: ZOOKEEPER-1934
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.5.0
Reporter: Marshall McMullen
 Attachments: node1.log, node2.log, node3.log, node4.log, node5.log


 In our regression testing we encountered an error wherein we were caching a 
 value we read from zookeeper and then experienced session loss. We 
 subsequently got reconnected to a different zookeeper server. When we tried 
 to read the same path from this new zookeeper server we are getting a stale 
 value.
 Specifically, we are reading /binchanges and originally got back a value of 
 3 from the first server. After we lost connection and reconnected before 
 the session timeout, we then read /binchanges from the new server and got 
 back a value of 2. In our code path we never set this value from 3 to 2. We 
 throw an assertion if the value ever goes backwards. Which is how we caught 
 this error. 
 It's my understanding of the single system image guarantee that this should 
 never be allowed. I realize that the single system image guarantee is still 
 quorum based and it's certainly possible that a minority of the ensemble may 
 have stale data. However, I also believe that each client has to send the 
 highest zxid it's seen as part of its connection request to the server. And 
 if the server it's connecting to has a smaller zxid than the value the client 
 sends, then the connection request should be refused.
 Assuming I have all of that correct, then I'm at a loss for how this 
 happened. 
 The failure happened around Jun  4 08:13:44. Just before that, at June  4 
 08:13:30 there was a round of leader election. During that round of leader 
 election we voted server with id=4 and zxid=0x31c4c. This then led to a 
 new zxid=0x40001. The new leader sends a diff to all the servers 
 including the one we will soon read the stale data from (id=2). Server with 
 ID=2's log files also reflect that as of 08:13:43 it was up to date and 
 current with an UPTODATE message.
 I'm going to attach log files from all 5 ensemble nodes. I also used 
 zktreeutil to dump the database out for the 5 ensemble nodes. I diff'd those, 
 and compared them all for correctness. 1 of the nodes (id=2) has a massively 
 divergent zktreeutil dump than the other 4 nodes even though it received the 
 diff from the new leader.
 In the attachments there are 5 nodes. I will number each log file by it's 
 zookeeper id, e.g. node4.log.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1937) init script needs fixing for ZOOKEEPER-1719

2014-06-10 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027370#comment-14027370
 ] 

Marshall McMullen commented on ZOOKEEPER-1937:
--

[~CpuID] - Yep, looks like the same problem.  I wasn't aware of the file 
src/packages/deb/init.d/zookeeper. But it should probably be fixed in the same 
manner. Do you want to upload a patch? Otherwise I can do so.

 init script needs fixing for ZOOKEEPER-1719
 ---

 Key: ZOOKEEPER-1937
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1937
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.6
 Environment: Linux (Ubuntu 12.04)
Reporter: Nathan Sullivan

 ZOOKEEPER-1719 changed the interpreter to bash for zkCli.sh, zkServer.sh and 
 zkEnv.sh, but did not change src/packages/deb/init.d/zookeeper 
 This causes the following failure using /bin/sh
 [...] root@hostname:~# service zookeeper stop
 /etc/init.d/zookeeper: 81: /usr/libexec/zkEnv.sh: Syntax error: ( 
 unexpected (expecting fi)
 Simple fix, change the shebang to #!/bin/bash - tested and works fine.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-09 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-1934:
-

Affects Version/s: 3.5.0

 Stale data received from sync'd ensemble peer
 -

 Key: ZOOKEEPER-1934
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.5.0
Reporter: Marshall McMullen
 Attachments: node1.log, node2.log, node3.log, node4.log, node5.log


 In our regression testing we encountered an error wherein we were caching a 
 value we read from zookeeper and then experienced session loss. We 
 subsequently got reconnected to a different zookeeper server. When we tried 
 to read the same path from this new zookeeper server we are getting a stale 
 value.
 Specifically, we are reading /binchanges and originally got back a value of 
 3 from the first server. After we lost connection and reconnected before 
 the session timeout, we then read /binchanges from the new server and got 
 back a value of 2. In our code path we never set this value from 3 to 2. We 
 throw an assertion if the value ever goes backwards. Which is how we caught 
 this error. 
 It's my understanding of the single system image guarantee that this should 
 never be allowed. I realize that the single system image guarantee is still 
 quorum based and it's certainly possible that a minority of the ensemble may 
 have stale data. However, I also believe that each client has to send the 
 highest zxid it's seen as part of its connection request to the server. And 
 if the server it's connecting to has a smaller zxid than the value the client 
 sends, then the connection request should be refused.
 Assuming I have all of that correct, then I'm at a loss for how this 
 happened. 
 The failure happened around Jun  4 08:13:44. Just before that, at June  4 
 08:13:30 there was a round of leader election. During that round of leader 
 election we voted server with id=4 and zxid=0x31c4c. This then led to a 
 new zxid=0x40001. The new leader sends a diff to all the servers 
 including the one we will soon read the stale data from (id=2). Server with 
 ID=2's log files also reflect that as of 08:13:43 it was up to date and 
 current with an UPTODATE message.
 I'm going to attach log files from all 5 ensemble nodes. I also used 
 zktreeutil to dump the database out for the 5 ensemble nodes. I diff'd those, 
 and compared them all for correctness. 1 of the nodes (id=2) has a massively 
 divergent zktreeutil dump than the other 4 nodes even though it received the 
 diff from the new leader.
 In the attachments there are 5 nodes. I will number each log file by it's 
 zookeeper id, e.g. node4.log.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-05 Thread Marshall McMullen (JIRA)
Marshall McMullen created ZOOKEEPER-1934:


 Summary: Stale data received from sync'd ensemble peer
 Key: ZOOKEEPER-1934
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Marshall McMullen
 Attachments: node1.log, node2.log, node3.log, node4.log, node5.log

In our regression testing we encountered an error wherein we were caching a 
value we read from zookeeper and then experienced session loss. We subsequently 
got reconnected to a different zookeeper server. When we tried to read the same 
path from this new zookeeper server we are getting a stale value.

Specifically, we are reading /binchanges and originally got back a version of 
4 from the first server. After we lost connection and reconnected before the 
session timeout, we then read /binchanges from the new server and got back a 
value of 3. 

It's my understanding of the single system image guarantee that this should 
never be allowed. I realize that the single system image guarantee is still 
quorum based and it's certainly possible that a minority of the ensemble may 
have stale data. However, I also believe that each client has to send the 
highest zxid it's seen as part of its connection request to the server. And if 
the server it's connecting to has a smaller zxid than the value the client 
sends, then the connection request should be refused.

Assuming I have all of that correct, then I'm at a loss for how this happened. 

The failure happened around Jun  4 08:13:44. Just before that, at June  4 
08:13:30 there was a round of leader election. During that round of leader 
election we voted server with id=4 and zxid=0x31c4c. This then led to a new 
zxid=0x40001. The new leader sends a diff to all the servers including the 
one we will soon read the stale data from (id=2). Server with ID=2's log files 
also reflect that as of 08:13:43 it was up to date and current with an UPTODATE 
message.

I'm going to attach log files from all 5 ensemble nodes. I also used zktreeutil 
to dump the database out for the 5 ensemble nodes. I diff'd those, and compared 
them all for correctness. 1 of the nodes (id=2) has a massively divergent 
zktreeutil dump than the other 4 nodes even though it received the diff from 
the new leader.

In the attachments there are 5 nodes. I will number each log file by it's 
zookeeper id, e.g. node4_zookeeper.log.







--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-05 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-1934:
-

Attachment: node5.log
node4.log
node3.log
node2.log
node1.log

Log files from all 5 ensemble nodes.

 Stale data received from sync'd ensemble peer
 -

 Key: ZOOKEEPER-1934
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Marshall McMullen
 Attachments: node1.log, node2.log, node3.log, node4.log, node5.log


 In our regression testing we encountered an error wherein we were caching a 
 value we read from zookeeper and then experienced session loss. We 
 subsequently got reconnected to a different zookeeper server. When we tried 
 to read the same path from this new zookeeper server we are getting a stale 
 value.
 Specifically, we are reading /binchanges and originally got back a version 
 of 4 from the first server. After we lost connection and reconnected before 
 the session timeout, we then read /binchanges from the new server and got 
 back a value of 3. 
 It's my understanding of the single system image guarantee that this should 
 never be allowed. I realize that the single system image guarantee is still 
 quorum based and it's certainly possible that a minority of the ensemble may 
 have stale data. However, I also believe that each client has to send the 
 highest zxid it's seen as part of its connection request to the server. And 
 if the server it's connecting to has a smaller zxid than the value the client 
 sends, then the connection request should be refused.
 Assuming I have all of that correct, then I'm at a loss for how this 
 happened. 
 The failure happened around Jun  4 08:13:44. Just before that, at June  4 
 08:13:30 there was a round of leader election. During that round of leader 
 election we voted server with id=4 and zxid=0x31c4c. This then led to a 
 new zxid=0x40001. The new leader sends a diff to all the servers 
 including the one we will soon read the stale data from (id=2). Server with 
 ID=2's log files also reflect that as of 08:13:43 it was up to date and 
 current with an UPTODATE message.
 I'm going to attach log files from all 5 ensemble nodes. I also used 
 zktreeutil to dump the database out for the 5 ensemble nodes. I diff'd those, 
 and compared them all for correctness. 1 of the nodes (id=2) has a massively 
 divergent zktreeutil dump than the other 4 nodes even though it received the 
 diff from the new leader.
 In the attachments there are 5 nodes. I will number each log file by it's 
 zookeeper id, e.g. node4_zookeeper.log.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-05 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019389#comment-14019389
 ] 

Marshall McMullen commented on ZOOKEEPER-1934:
--

Diffing the zktreeutil dumps of each server is also interesting. There are a 
few minor differences with local sessions:

diff -a node1.zktree node3.zktree 
8933,8934d8932
 |--[144115323715452941]
 |   
9162,9163d9159
 |   
 |--[72058779056865292]

diff -a node1.zktree node4.zktree 
8933,8934d8932
 |--[144115323715452941]
 |   
9005,9006d9002
 |--[216173168961912851]
 |   
9162,9163d9157
 |   
 |--[72058779056865292]

diff -a node1.zktree node5.zktree 
8933,8934d8932
 |--[144115323715452941]
 |   
9005,9006d9002
 |--[216173168961912851]
 |   
9065,9066d9060
 |--[288230547757793293]
 |   
9162,9163d9155
 |   
 |--[72058779056865292]

Whereas node2 is MASSIVELY different.

In particular, the /binchanges value is different:

|--[binchanges]                        |--[binchanges]
|   |                                  |   |
|   |--[version = 3]                 | |   |--[version = 2]




 Stale data received from sync'd ensemble peer
 -

 Key: ZOOKEEPER-1934
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Marshall McMullen
 Attachments: node1.log, node2.log, node3.log, node4.log, node5.log


 In our regression testing we encountered an error wherein we were caching a 
 value we read from zookeeper and then experienced session loss. We 
 subsequently got reconnected to a different zookeeper server. When we tried 
 to read the same path from this new zookeeper server we are getting a stale 
 value.
 Specifically, we are reading /binchanges and originally got back a version 
 of 4 from the first server. After we lost connection and reconnected before 
 the session timeout, we then read /binchanges from the new server and got 
 back a value of 3. 
 It's my understanding of the single system image guarantee that this should 
 never be allowed. I realize that the single system image guarantee is still 
 quorum based and it's certainly possible that a minority of the ensemble may 
 have stale data. However, I also believe that each client has to send the 
 highest zxid it's seen as part of its connection request to the server. And 
 if the server it's connecting to has a smaller zxid than the value the client 
 sends, then the connection request should be refused.
 Assuming I have all of that correct, then I'm at a loss for how this 
 happened. 
 The failure happened around Jun  4 08:13:44. Just before that, at June  4 
 08:13:30 there was a round of leader election. During that round of leader 
 election we voted server with id=4 and zxid=0x31c4c. This then led to a 
 new zxid=0x40001. The new leader sends a diff to all the servers 
 including the one we will soon read the stale data from (id=2). Server with 
 ID=2's log files also reflect that as of 08:13:43 it was up to date and 
 current with an UPTODATE message.
 I'm going to attach log files from all 5 ensemble nodes. I also used 
 zktreeutil to dump the database out for the 5 ensemble nodes. I diff'd those, 
 and compared them all for correctness. 1 of the nodes (id=2) has a massively 
 divergent zktreeutil dump than the other 4 nodes even though it received the 
 diff from the new leader.
 In the attachments there are 5 nodes. I will number each log file by it's 
 zookeeper id, e.g. node4_zookeeper.log.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-05 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019392#comment-14019392
 ] 

Marshall McMullen commented on ZOOKEEPER-1934:
--

Yet before we grabbed this data, the offending node (nodeid=2, myid=2) stated 
this:

1723 Jun  4 08:13:30 zookeeper - INFO  
[QuorumPeer[myid=2]/10.26.65.47:2181:ZooKeeperServer@156] - Created server with 
tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir 
/sf/data/zoo
1724 Jun  4 08:13:30 zookeeper - INFO  
[QuorumPeer[myid=2]/10.26.65.47:2181:Follower@66] - FOLLOWING - LEADER ELECTION 
TOOK - -1401867542249
1725 Jun  4 08:13:30 zookeeper - WARN  
[QuorumPeer[myid=2]/10.26.65.47:2181:Learner@240] - Unexpected exception, 
tries=0, connecting to /10.26.65.103:2182
1726 Jun  4 08:13:30 localhost at 
org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:232)
1727 Jun  4 08:13:30 localhost at 
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:74)
1728 Jun  4 08:13:30 localhost at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:967)
1729 Jun  4 08:13:30 zookeeper - INFO  
[NIOServerCxnFactory.AcceptThread:/10.26.65.47:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.26.65.103:35987
1730 Jun  4 08:13:30 zookeeper - WARN  [NIOWorkerThread-37:NIOServerCnxn@365] - 
Exception causing close of session 0x0: ZooKeeperServer not running
1731 Jun  4 08:13:30 zookeeper - INFO  [NIOWorkerThread-37:NIOServerCnxn@999] - 
Closed socket connection for client /10.26.65.103:35987 (no session established 
for client)
1732 Jun  4 08:13:31 zookeeper - INFO  
[NIOServerCxnFactory.AcceptThread:/10.26.65.47:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.26.65.103:59764
1733 Jun  4 08:13:31 zookeeper - WARN  [NIOWorkerThread-38:NIOServerCnxn@365] - 
Exception causing close of session 0x0: ZooKeeperServer not running
1734 Jun  4 08:13:31 zookeeper - INFO  [NIOWorkerThread-38:NIOServerCnxn@999] - 
Closed socket connection for client /10.26.65.103:59764 (no session established 
for client)
1735 Jun  4 08:13:31 zookeeper - INFO  
[NIOServerCxnFactory.AcceptThread:/10.26.65.47:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.26.65.103:51005
1736 Jun  4 08:13:31 zookeeper - WARN  [NIOWorkerThread-39:NIOServerCnxn@365] - 
Exception causing close of session 0x0: ZooKeeperServer not running
1737 Jun  4 08:13:31 zookeeper - INFO  [NIOWorkerThread-39:NIOServerCnxn@999] - 
Closed socket connection for client /10.26.65.103:51005 (no session established 
for client)
1738 Jun  4 08:13:31 zookeeper - INFO  
[NIOServerCxnFactory.AcceptThread:/10.26.65.47:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.26.65.3:39628
1739 Jun  4 08:13:31 zookeeper - WARN  [NIOWorkerThread-40:NIOServerCnxn@365] - 
Exception causing close of session 0x0: ZooKeeperServer not running
1740 Jun  4 08:13:31 zookeeper - INFO  [NIOWorkerThread-40:NIOServerCnxn@999] - 
Closed socket connection for client /10.26.65.3:39628 (no session established 
for client)
1741 Jun  4 08:13:31 zookeeper - INFO  
[NIOServerCxnFactory.AcceptThread:/10.26.65.47:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.26.65.3:47705
1742 Jun  4 08:13:31 zookeeper - WARN  [NIOWorkerThread-41:NIOServerCnxn@365] - 
Exception causing close of session 0x0: ZooKeeperServer not running
1743 Jun  4 08:13:31 zookeeper - INFO  [NIOWorkerThread-41:NIOServerCnxn@999] - 
Closed socket connection for client /10.26.65.3:47705 (no session established 
for client)
1744 Jun  4 08:13:31 zookeeper - INFO  
[NIOServerCxnFactory.AcceptThread:/10.26.65.47:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.26.65.3:34353
1745 Jun  4 08:13:31 zookeeper - WARN  [NIOWorkerThread-42:NIOServerCnxn@365] - 
Exception causing close of session 0x0: ZooKeeperServer not running
1746 Jun  4 08:13:31 zookeeper - INFO  [NIOWorkerThread-42:NIOServerCnxn@999] - 
Closed socket connection for client /10.26.65.3:34353 (no session established 
for client)
1747 Jun  4 08:13:31 zookeeper - INFO  
[QuorumPeer[myid=2]/10.26.65.47:2181:Learner@332] - Getting a diff from the 
leader 0x31c4c
1748 Jun  4 08:13:31 zookeeper - INFO  
[QuorumPeer[myid=2]/10.26.65.47:2181:Learner@475] - Learner received NEWLEADER 
message
1749 Jun  4 08:13:31 zookeeper - WARN  
[QuorumPeer[myid=2]/10.26.65.47:2181:QuorumPeer@1271] - 
setLastSeenQuorumVerifier called with stale config 4294967296. Current version: 
4294967296
1750 Jun  4 08:13:31 zookeeper - INFO  
[QuorumPeer[myid=2]/10.26.65.47:2181:FileTxnSnapLog@297] - Snapshotting: 
0x31c4c to /sf/data/zookeeper/10.26.65.47/version-2/snapshot.31c4c
1751 Jun  4 08:13:31 zookeeper - INFO  
[QuorumPeer[myid=2]/10.26.65.47:2181:Learner@460] - Learner received

[jira] [Updated] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-05 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-1934:
-

Description: 
In our regression testing we encountered an error wherein we were caching a 
value we read from zookeeper and then experienced session loss. We subsequently 
got reconnected to a different zookeeper server. When we tried to read the same 
path from this new zookeeper server we are getting a stale value.

Specifically, we are reading /binchanges and originally got back a value of 
4 from the first server. After we lost connection and reconnected before the 
session timeout, we then read /binchanges from the new server and got back a 
value of 3. 

It's my understanding of the single system image guarantee that this should 
never be allowed. I realize that the single system image guarantee is still 
quorum based and it's certainly possible that a minority of the ensemble may 
have stale data. However, I also believe that each client has to send the 
highest zxid it's seen as part of its connection request to the server. And if 
the server it's connecting to has a smaller zxid than the value the client 
sends, then the connection request should be refused.

Assuming I have all of that correct, then I'm at a loss for how this happened. 

The failure happened around Jun  4 08:13:44. Just before that, at June  4 
08:13:30 there was a round of leader election. During that round of leader 
election we voted server with id=4 and zxid=0x31c4c. This then led to a new 
zxid=0x40001. The new leader sends a diff to all the servers including the 
one we will soon read the stale data from (id=2). Server with ID=2's log files 
also reflect that as of 08:13:43 it was up to date and current with an UPTODATE 
message.

I'm going to attach log files from all 5 ensemble nodes. I also used zktreeutil 
to dump the database out for the 5 ensemble nodes. I diff'd those, and compared 
them all for correctness. 1 of the nodes (id=2) has a massively divergent 
zktreeutil dump than the other 4 nodes even though it received the diff from 
the new leader.

In the attachments there are 5 nodes. I will number each log file by it's 
zookeeper id, e.g. node4_zookeeper.log.





  was:
In our regression testing we encountered an error wherein we were caching a 
value we read from zookeeper and then experienced session loss. We subsequently 
got reconnected to a different zookeeper server. When we tried to read the same 
path from this new zookeeper server we are getting a stale value.

Specifically, we are reading /binchanges and originally got back a version of 
4 from the first server. After we lost connection and reconnected before the 
session timeout, we then read /binchanges from the new server and got back a 
value of 3. 

It's my understanding of the single system image guarantee that this should 
never be allowed. I realize that the single system image guarantee is still 
quorum based and it's certainly possible that a minority of the ensemble may 
have stale data. However, I also believe that each client has to send the 
highest zxid it's seen as part of its connection request to the server. And if 
the server it's connecting to has a smaller zxid than the value the client 
sends, then the connection request should be refused.

Assuming I have all of that correct, then I'm at a loss for how this happened. 

The failure happened around Jun  4 08:13:44. Just before that, at June  4 
08:13:30 there was a round of leader election. During that round of leader 
election we voted server with id=4 and zxid=0x31c4c. This then led to a new 
zxid=0x40001. The new leader sends a diff to all the servers including the 
one we will soon read the stale data from (id=2). Server with ID=2's log files 
also reflect that as of 08:13:43 it was up to date and current with an UPTODATE 
message.

I'm going to attach log files from all 5 ensemble nodes. I also used zktreeutil 
to dump the database out for the 5 ensemble nodes. I diff'd those, and compared 
them all for correctness. 1 of the nodes (id=2) has a massively divergent 
zktreeutil dump than the other 4 nodes even though it received the diff from 
the new leader.

In the attachments there are 5 nodes. I will number each log file by it's 
zookeeper id, e.g. node4_zookeeper.log.






 Stale data received from sync'd ensemble peer
 -

 Key: ZOOKEEPER-1934
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Marshall McMullen
 Attachments: node1.log, node2.log, node3.log, node4.log, node5.log


 In our regression testing we encountered an error wherein we were caching a 
 value we read from zookeeper and then experienced session loss. We 
 subsequently got reconnected to a different

[jira] [Updated] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-05 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-1934:
-

Description: 
In our regression testing we encountered an error wherein we were caching a 
value we read from zookeeper and then experienced session loss. We subsequently 
got reconnected to a different zookeeper server. When we tried to read the same 
path from this new zookeeper server we are getting a stale value.

Specifically, we are reading /binchanges and originally got back a value of 
4 from the first server. After we lost connection and reconnected before the 
session timeout, we then read /binchanges from the new server and got back a 
value of 3. 

It's my understanding of the single system image guarantee that this should 
never be allowed. I realize that the single system image guarantee is still 
quorum based and it's certainly possible that a minority of the ensemble may 
have stale data. However, I also believe that each client has to send the 
highest zxid it's seen as part of its connection request to the server. And if 
the server it's connecting to has a smaller zxid than the value the client 
sends, then the connection request should be refused.

Assuming I have all of that correct, then I'm at a loss for how this happened. 

The failure happened around Jun  4 08:13:44. Just before that, at June  4 
08:13:30 there was a round of leader election. During that round of leader 
election we voted server with id=4 and zxid=0x31c4c. This then led to a new 
zxid=0x40001. The new leader sends a diff to all the servers including the 
one we will soon read the stale data from (id=2). Server with ID=2's log files 
also reflect that as of 08:13:43 it was up to date and current with an UPTODATE 
message.

I'm going to attach log files from all 5 ensemble nodes. I also used zktreeutil 
to dump the database out for the 5 ensemble nodes. I diff'd those, and compared 
them all for correctness. 1 of the nodes (id=2) has a massively divergent 
zktreeutil dump than the other 4 nodes even though it received the diff from 
the new leader.

In the attachments there are 5 nodes. I will number each log file by it's 
zookeeper id, e.g. node4.log.





  was:
In our regression testing we encountered an error wherein we were caching a 
value we read from zookeeper and then experienced session loss. We subsequently 
got reconnected to a different zookeeper server. When we tried to read the same 
path from this new zookeeper server we are getting a stale value.

Specifically, we are reading /binchanges and originally got back a value of 
4 from the first server. After we lost connection and reconnected before the 
session timeout, we then read /binchanges from the new server and got back a 
value of 3. 

It's my understanding of the single system image guarantee that this should 
never be allowed. I realize that the single system image guarantee is still 
quorum based and it's certainly possible that a minority of the ensemble may 
have stale data. However, I also believe that each client has to send the 
highest zxid it's seen as part of its connection request to the server. And if 
the server it's connecting to has a smaller zxid than the value the client 
sends, then the connection request should be refused.

Assuming I have all of that correct, then I'm at a loss for how this happened. 

The failure happened around Jun  4 08:13:44. Just before that, at June  4 
08:13:30 there was a round of leader election. During that round of leader 
election we voted server with id=4 and zxid=0x31c4c. This then led to a new 
zxid=0x40001. The new leader sends a diff to all the servers including the 
one we will soon read the stale data from (id=2). Server with ID=2's log files 
also reflect that as of 08:13:43 it was up to date and current with an UPTODATE 
message.

I'm going to attach log files from all 5 ensemble nodes. I also used zktreeutil 
to dump the database out for the 5 ensemble nodes. I diff'd those, and compared 
them all for correctness. 1 of the nodes (id=2) has a massively divergent 
zktreeutil dump than the other 4 nodes even though it received the diff from 
the new leader.

In the attachments there are 5 nodes. I will number each log file by it's 
zookeeper id, e.g. node4_zookeeper.log.






 Stale data received from sync'd ensemble peer
 -

 Key: ZOOKEEPER-1934
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Marshall McMullen
 Attachments: node1.log, node2.log, node3.log, node4.log, node5.log


 In our regression testing we encountered an error wherein we were caching a 
 value we read from zookeeper and then experienced session loss. We 
 subsequently got reconnected to a different zookeeper server

[jira] [Updated] (ZOOKEEPER-1934) Stale data received from sync'd ensemble peer

2014-06-05 Thread Marshall McMullen (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marshall McMullen updated ZOOKEEPER-1934:
-

Description: 
In our regression testing we encountered an error wherein we were caching a 
value we read from zookeeper and then experienced session loss. We subsequently 
got reconnected to a different zookeeper server. When we tried to read the same 
path from this new zookeeper server we are getting a stale value.

Specifically, we are reading /binchanges and originally got back a value of 
3 from the first server. After we lost connection and reconnected before the 
session timeout, we then read /binchanges from the new server and got back a 
value of 2. In our code path we never set this value from 3 to 2. We throw an 
assertion if the value ever goes backwards. Which is how we caught this error. 

It's my understanding of the single system image guarantee that this should 
never be allowed. I realize that the single system image guarantee is still 
quorum based and it's certainly possible that a minority of the ensemble may 
have stale data. However, I also believe that each client has to send the 
highest zxid it's seen as part of its connection request to the server. And if 
the server it's connecting to has a smaller zxid than the value the client 
sends, then the connection request should be refused.

Assuming I have all of that correct, then I'm at a loss for how this happened. 

The failure happened around Jun  4 08:13:44. Just before that, at June  4 
08:13:30 there was a round of leader election. During that round of leader 
election we voted server with id=4 and zxid=0x31c4c. This then led to a new 
zxid=0x40001. The new leader sends a diff to all the servers including the 
one we will soon read the stale data from (id=2). Server with ID=2's log files 
also reflect that as of 08:13:43 it was up to date and current with an UPTODATE 
message.

I'm going to attach log files from all 5 ensemble nodes. I also used zktreeutil 
to dump the database out for the 5 ensemble nodes. I diff'd those, and compared 
them all for correctness. 1 of the nodes (id=2) has a massively divergent 
zktreeutil dump than the other 4 nodes even though it received the diff from 
the new leader.

In the attachments there are 5 nodes. I will number each log file by it's 
zookeeper id, e.g. node4.log.





  was:
In our regression testing we encountered an error wherein we were caching a 
value we read from zookeeper and then experienced session loss. We subsequently 
got reconnected to a different zookeeper server. When we tried to read the same 
path from this new zookeeper server we are getting a stale value.

Specifically, we are reading /binchanges and originally got back a value of 
4 from the first server. After we lost connection and reconnected before the 
session timeout, we then read /binchanges from the new server and got back a 
value of 3. 

It's my understanding of the single system image guarantee that this should 
never be allowed. I realize that the single system image guarantee is still 
quorum based and it's certainly possible that a minority of the ensemble may 
have stale data. However, I also believe that each client has to send the 
highest zxid it's seen as part of its connection request to the server. And if 
the server it's connecting to has a smaller zxid than the value the client 
sends, then the connection request should be refused.

Assuming I have all of that correct, then I'm at a loss for how this happened. 

The failure happened around Jun  4 08:13:44. Just before that, at June  4 
08:13:30 there was a round of leader election. During that round of leader 
election we voted server with id=4 and zxid=0x31c4c. This then led to a new 
zxid=0x40001. The new leader sends a diff to all the servers including the 
one we will soon read the stale data from (id=2). Server with ID=2's log files 
also reflect that as of 08:13:43 it was up to date and current with an UPTODATE 
message.

I'm going to attach log files from all 5 ensemble nodes. I also used zktreeutil 
to dump the database out for the 5 ensemble nodes. I diff'd those, and compared 
them all for correctness. 1 of the nodes (id=2) has a massively divergent 
zktreeutil dump than the other 4 nodes even though it received the diff from 
the new leader.

In the attachments there are 5 nodes. I will number each log file by it's 
zookeeper id, e.g. node4.log.






 Stale data received from sync'd ensemble peer
 -

 Key: ZOOKEEPER-1934
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1934
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Marshall McMullen
 Attachments: node1.log, node2.log, node3.log, node4.log, node5.log


 In our regression testing we encountered an error wherein we were

Re: [ANNOUNCE] New ZooKeeper committer: Rakesh R

2014-05-16 Thread Marshall McMullen
Congrats Rakesh!


On Fri, May 16, 2014 at 10:49 AM, Patrick Hunt ph...@apache.org wrote:

 The Apache ZooKeeper PMC recently extended committer karma to Rakesh
 and he has accepted. Rakesh has made some great contributions and we
 are looking forward to even more :)

 Congratulations and welcome aboard, Rakesh!

 Patrick



Re: [Release 3.5.0] Any news yet?

2014-04-08 Thread Marshall McMullen
Agree, sounds like a great plan to me as well.  Once we get an alpha
release we can do some internal stress testing against it in our lab to
help give higher confidence of its quality.


On Tue, Apr 8, 2014 at 11:19 PM, Michi Mutsuzaki mi...@cs.stanford.eduwrote:

 Sounds like a great plan.

 There is a patch available for review for flaky startSingleServerTest.

 https://issues.apache.org/jira/browse/ZOOKEEPER-1870

 On Tue, Apr 8, 2014 at 10:12 PM, Patrick Hunt ph...@apache.org wrote:
  I see some great progress in closing out jiras. Kudos all. We
  currently have 12 blockers for 3.5.0, 5 are listed as PA. Let's keep
  plugging away on these.
 
  The CI environment has gone a bit red of late. On my personal servers
  I'm seeing it mostly in startSingleServerTest though, so perhaps
  it's localized. Would be good to nail down the flakey tests. I'll try
  looking into this more (what the flakeys are) and report back. Apache
  CI seems a bit unstable of late (the jenkins env I mean)
 
  I'm thinking we might start creating some Alpha releases, e.g.
  zookeeper-3.5.0-alpha so that we can get the code into folks hands.
  They can try it out and give us feedback. It would be alpha quality
  though, not for production. APIs and new functionality might still
  change in a non-backward compatible way. We could do this even though
  we still have some blockers remaining. Once all the blockers are
  resolved we could move to beta status. The APIs, etc... would then
  be locked. Once things settle out during the beta cycle we could then
  move off beta. We'd only make 3.5 stable once we feel comfortable
  with its quality and after collecting feedback from the community.
  This is basically what we did for 3.4 branch (and similar to what some
  other projects do). What do you think?
 
  Patrick
 
  On Thu, Apr 3, 2014 at 5:33 PM, Michi Mutsuzaki mi...@cs.stanford.edu
 wrote:
  ... there is one more. I just canceled it because the patch needs to be
 rebased.
 
  https://issues.apache.org/jira/browse/ZOOKEEPER-1794
 
  On Thu, Apr 3, 2014 at 5:31 PM, Michi Mutsuzaki mi...@cs.stanford.edu
 wrote:
  There are several large PAs:
 
  https://issues.apache.org/jira/browse/ZOOKEEPER-1172
  https://issues.apache.org/jira/browse/ZOOKEEPER-1346
  https://issues.apache.org/jira/browse/ZOOKEEPER-1607
  https://issues.apache.org/jira/browse/ZOOKEEPER-1907
 
 
  I think I can review ZOOKEEPER-1907 and get it in for 3.5.0, but we
  need shepherds for the other 3 JIRAs. Let me know if anybody has
  cycles to review and check in these patches. Otherwise I'll push them
  out of 3.5.0.
 
  Thanks!
  --Michi
 
  On Thu, Mar 20, 2014 at 1:56 PM, Alexander Shraer shra...@gmail.com
 wrote:
  right - looks like there's a patch there, waiting for review.
 
 
  On Thu, Mar 20, 2014 at 1:52 PM, Raúl Gutiérrez Segalés 
 r...@itevenworks.net
  wrote:
 
  On 20 March 2014 13:51, Raúl Gutiérrez Segalés r...@itevenworks.net
  wrote:
 
   what about https://issues.apache.org/jira/browse/ZOOKEEPER-1807?
  
 
  More context: people trying to use Observers in 3.5.0 without that
 fixed
  will have issues, for sure.
 
 
  -rgs
 
 
 
  
  
   -rgs
  
  
   On 20 March 2014 13:48, Alexander Shraer shra...@gmail.com
 wrote:
  
   thanks Patrick!
  
   regarding dynamic reconfig, IMHO we only have 2 blockers:
   - add JMX support
   (ZOOKEEPER-1659
 https://issues.apache.org/jira/browse/ZOOKEEPER-1659
   )
   - change leader timeout mechanism to give up when there's no
 quorum of
   last
   proposed configuration
   (ZOOKEEPER-1699
 https://issues.apache.org/jira/browse/ZOOKEEPER-1699
   )
  
   any help with either is greatly appreciated.
  
   On Thu, Mar 20, 2014 at 1:32 PM, Patrick Hunt ph...@apache.org
 wrote:
  
I'm resurrecting this thread now that 3.4.6 is out the door. I'm
assuming that we might do a 3.4.7 at some point, but that
 shouldn't
hold up releasing a 3.5.0.
   
We discussed a number of good ideas previously in this thread
 wrt what
we should do for 3.5.0. Any further thoughts?
   
A big part of the planning will be to clean up Jira and
 figuring out
what we need to finish. I'll start looking at that but if
 anyone else
has any ideas of can clean up jiras they are familiar with it
 would be
helpful.
   
Here's the list currently slated for 3.5.0 (258 total!):
http://bit.ly/1ijYAJF
   
There are 8 listed blockers at the moment. Most of which have
 to do
with the introduction of dynamic reconfiguration. Only one jira
 is PA.
   
There are 52 PA jiras currently slated for 3.5.0:
  http://bit.ly/PV21NA
   
Patrick
   
On Wed, Jul 10, 2013 at 1:38 PM, Flavio Junqueira 
   fpjunque...@yahoo.com
wrote:
 Sure, fine with me.

 -Flavio

 On Jul 10, 2013, at 7:31 PM, Mahadev Konar 
 maha...@hortonworks.com
  
wrote:

 It would be good if Flavio wants to try doing the RM. Flavio?

 thanks
 mahadev

 On Wed, Jul 10, 2013 at 10:20 AM, 

[jira] [Commented] (ZOOKEEPER-1167) C api lacks synchronous version of sync() call.

2014-03-12 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13932459#comment-13932459
 ] 

Marshall McMullen commented on ZOOKEEPER-1167:
--

[~michim] - I'm not sure I agree. Ben's comments specifically state that this 
is not strictly required for the consistency protocol ZK provides. But if you 
are communicating through some other mechanism and you want to guarantee those 
two clients are synchronized, then this would be useful. Granted, the 
application layer can provide its own wrapper around zoo_async to provide this 
functionality, so I think the use case is easier integration into higher-level 
clients. That, and consistency, since this is the only API in the C bindings 
without a synchronous variant. I'm still happy to add tests around this and 
also add a Java implementation; I just lost sight of this one.
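
For illustration, a minimal sketch of the kind of application-layer wrapper 
mentioned above, assuming the multithreaded C client; the latch struct and the 
my_zoo_sync() name are hypothetical, not part of the ZooKeeper API:

{code}
#include <pthread.h>
#include <zookeeper/zookeeper.h>

/* Hypothetical latch used to block the caller until the async completion fires. */
struct sync_latch {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             done;
    int             rc;
};

static void sync_completion(int rc, const char *value, const void *data)
{
    struct sync_latch *latch = (struct sync_latch *)data;
    (void)value;
    pthread_mutex_lock(&latch->lock);
    latch->rc = rc;
    latch->done = 1;
    pthread_cond_signal(&latch->cond);
    pthread_mutex_unlock(&latch->lock);
}

/* Blocking sync built on zoo_async(); returns the completion's result code. */
int my_zoo_sync(zhandle_t *zh, const char *path)
{
    struct sync_latch latch = { PTHREAD_MUTEX_INITIALIZER,
                                PTHREAD_COND_INITIALIZER, 0, 0 };
    int rc = zoo_async(zh, path, sync_completion, &latch);
    if (rc != ZOK)
        return rc;
    pthread_mutex_lock(&latch.lock);
    while (!latch.done)
        pthread_cond_wait(&latch.cond, &latch.lock);
    pthread_mutex_unlock(&latch.lock);
    return latch.rc;
}
{code}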

 C api lacks synchronous version of sync() call.
 ---

 Key: ZOOKEEPER-1167
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1167
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.3.3, 3.4.3, 3.5.0
Reporter: Nicholas Harteau
Assignee: Marshall McMullen
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1167.patch


 Reading through the source, the C API implements zoo_async() which is the 
 zookeeper sync() method implemented in the multithreaded/asynchronous C API.  
 It doesn't implement anything equivalent in the non-multithreaded API.
 I'm not sure if this was oversight or intentional, but it means that the 
 non-multithreaded API can't guarantee consistent client views on critical 
 reads.
 The zkperl bindings depend on the synchronous, non-multithreaded API so also 
 can't call sync() currently.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1167) C api lacks synchronous version of sync() call.

2014-03-12 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13932475#comment-13932475
 ] 

Marshall McMullen commented on ZOOKEEPER-1167:
--

[~michim] - thanks. Plus I've already patched our internal version of zookeeper 
so our application doesn't have to do this and would hate to have to maintain 
that forever :). I'll get an updated patch together so we can finish this one 
off.

 C api lacks synchronous version of sync() call.
 ---

 Key: ZOOKEEPER-1167
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1167
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.3.3, 3.4.3, 3.5.0
Reporter: Nicholas Harteau
Assignee: Marshall McMullen
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1167.patch


 Reading through the source, the C API implements zoo_async() which is the 
 zookeeper sync() method implemented in the multithreaded/asynchronous C API.  
 It doesn't implement anything equivalent in the non-multithreaded API.
 I'm not sure if this was oversight or intentional, but it means that the 
 non-multithreaded API can't guarantee consistent client views on critical 
 reads.
 The zkperl bindings depend on the synchronous, non-multithreaded API so also 
 can't call sync() currently.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1167) C api lacks synchronous version of sync() call.

2014-03-12 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13932534#comment-13932534
 ] 

Marshall McMullen commented on ZOOKEEPER-1167:
--

Agreed.

 C api lacks synchronous version of sync() call.
 ---

 Key: ZOOKEEPER-1167
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1167
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.3.3, 3.4.3, 3.5.0
Reporter: Nicholas Harteau
Assignee: Marshall McMullen
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1167.patch


 Reading through the source, the C API implements zoo_async() which is the 
 zookeeper sync() method implemented in the multithreaded/asynchronous C API.  
 It doesn't implement anything equivalent in the non-multithreaded API.
 I'm not sure if this was oversight or intentional, but it means that the 
 non-multithreaded API can't guarantee consistent client views on critical 
 reads.
 The zkperl bindings depend on the synchronous, non-multithreaded API so also 
 can't call sync() currently.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1855) calls to zoo_set_server() fail to flush outstanding request queue.

2014-01-02 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13860961#comment-13860961
 ] 

Marshall McMullen commented on ZOOKEEPER-1855:
--

As a workaround, what happens if the client issues a sync (and waits for it to 
complete) before calling zoo_set_servers?
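
A minimal sketch of that suggestion, assuming the multithreaded C client and 
reusing the blocking-sync idea from the ZOOKEEPER-1167 discussion above 
(my_zoo_sync() is a hypothetical application-level helper, not part of the 
client API):

{code}
/* Hypothetical workaround: force a full round trip so no requests are still
 * in flight when the connection is dropped by the server-list change. */
int set_servers_after_sync(zhandle_t *zh, const char *new_hosts)
{
    int rc = my_zoo_sync(zh, "/");   /* blocking wrapper around zoo_async() */
    if (rc != ZOK)
        return rc;
    return zoo_set_servers(zh, new_hosts);
}
{code}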

 calls to zoo_set_server() fail to flush outstanding request queue.
 --

 Key: ZOOKEEPER-1855
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1855
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Reporter: Dutch T. Meyer
Priority: Minor

 If one calls zoo_set_servers to update with a new server list that does not 
 contain the currently connected server, the client will disconnect.  Fair 
 enough, but any outstanding requests on the set_requests queue aren't 
 completed, so the next completed request from the new server can fail with an 
 out-of-order XID error.
 The disconnect occurs in update_addrs(), when a reconfig is necessary, though 
 it's not quite as easy as just calling cleanup_bufs there, because you could 
 then race the call to dequeue_completion in zookeeper_process and pull NULL 
 entries for a recently completed request
 I don't have a patch for this right now, but I do have a simple repro I can 
 post when time permits.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: [jira] [Commented] (ZOOKEEPER-1057) zookeeper c-client, connection to offline server fails to successfully fallback to second zk host

2013-12-20 Thread Marshall McMullen
The logic of how we connect to servers in trunk (3.5.0) is substantially
different from what was in 3.4.6. Has this bug been seen in 3.4.6 or on trunk?


On Fri, Dec 20, 2013 at 4:14 PM, Flavio Junqueira (JIRA) j...@apache.orgwrote:


 [
 https://issues.apache.org/jira/browse/ZOOKEEPER-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13854652#comment-13854652]

 Flavio Junqueira commented on ZOOKEEPER-1057:
 -

 If this is a change due to reconfig, do we really need to block 3.4.6?

  zookeeper c-client, connection to offline server fails to successfully
 fallback to second zk host
 
 -
 
  Key: ZOOKEEPER-1057
  URL:
 https://issues.apache.org/jira/browse/ZOOKEEPER-1057
  Project: ZooKeeper
   Issue Type: Bug
   Components: c client
 Affects Versions: 3.3.1, 3.3.2, 3.3.3
  Environment: snowdutyrise-lm ~/- uname -a
  Darwin snowdutyrise-lm 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15
 16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386
  also observed on:
  2.6.35-28-server 49-Ubuntu SMP Tue Mar 1 14:55:37 UTC 2011
 Reporter: Woody Anderson
 Assignee: Michi Mutsuzaki
 Priority: Blocker
  Fix For: 3.4.6, 3.5.0
 
  Attachments: ZOOKEEPER-1057.patch, ZOOKEEPER-1057.patch
 
 
  Hello, I'm a contributor for the node.js zookeeper module:
 https://github.com/yfinkelstein/node-zookeeper
  i'm using zk 3.3.3 for the purposes of this issue, but i have validated
 it fails on 3.3.1 and 3.3.2
  i'm having an issue when trying to connect when one of my zookeeper
 servers is offline.
  if the first server attempted is online, all is good.
  if the offline server is attempted first, then the client is never able
 to connect to _any_ server.
  inside zookeeper.c a connection loss (-4) is received, the socket is
 closed and buffers are cleaned up, it then attempts the next server in the
 list, creates a new socket (which gets the same fd as the previously closed
 socket) and connecting fails, and it continues to fail seemingly forever.
  The nature of this fail is not that it gets -4 connection loss errors,
 but that zookeeper_interest doesn't find anything going on on the socket
 before the user provided timeout kicks things out. I don't want to have to
 wait 5 minutes, even if i could make myself.
  this is the message that follows the connection loss:
  2011-04-27 23:18:28,355:13485:ZOO_ERROR@handle_socket_error_msg@1530:
 Socket [127.0.0.1:5020] zk retcode=-7, errno=60(Operation timed out):
 connection timed out (exceeded timeout by 3ms)
  2011-04-27 23:18:28,355:13485:ZOO_ERROR@yield@213:
 yield:zookeeper_interest returned error: -7 - operation timeout
  While investigating, i decided to comment out close(zh-fd) in
 handle_error (zookeeper.c#1153)
  now everything works (obviously i'm leaking an fd). Connection the the
 second host works immediately.
  this is the behavior i'm looking for, though i clearly don't want to
 leak the fd, so i'm wondering why the fd re-use is causing this issue.
  close() is not returning an error (i checked even though current code
 assumes success).
  i'm on osx 10.6.7
  i tried adding a setsockopt so_linger (though i didn't want that to be a
 solution), it didn't work.
  full debug traces are included in issue here:
 https://github.com/yfinkelstein/node-zookeeper/issues/6



 --
 This message was sent by Atlassian JIRA
 (v6.1.4#6159)



[jira] [Commented] (ZOOKEEPER-1388) Client side 'PathValidation' is missing for the multi-transaction api.

2013-12-16 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13850111#comment-13850111
 ] 

Marshall McMullen commented on ZOOKEEPER-1388:
--

I'm not too familiar with how the client-side path validation works in the Java 
client code. We don't do anything similar to that in the C client code (that 
I'm aware of). Can someone explain how that is safe? If the client is connected 
to a server that does not have a fully sync'd copy of the database, then the 
client may preemptively fail the multi-op, whereas if it had forwarded the 
entire multi-op to the server it would have succeeded. 

It's really important to understand that the original design we followed with a 
multi-op was to treat it as a transaction (write operation) rather than a read 
operation. My understanding of zab is that, as a transaction/write operation, it 
must be forwarded on to the leader rather than acted on locally, so that the 
leader can broadcast the transaction to the entire ensemble for consideration. 

If the client does any path validation locally, that seems like a violation of 
the zab protocol as I understand it.

Someone feel free to correct me if I am misunderstanding things.
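
For reference, this is roughly how a multi is assembled and shipped through the 
C client: the whole batch travels to the server (and on to the leader) as a 
single request, which is the behavior described above. The paths, value, and 
version number are made up for illustration:

{code}
#include <string.h>
#include <zookeeper/zookeeper.h>

/* Build a two-op transaction: check a guard node's version, then create a
 * child. Nothing is evaluated locally; zoo_multi() submits the whole batch. */
int create_guarded(zhandle_t *zh)
{
    zoo_op_t ops[2];
    zoo_op_result_t results[2];
    char path_buf[128];

    zoo_check_op_init(&ops[0], "/guard", 3 /* expected version */);
    zoo_create_op_init(&ops[1], "/guard/child", "payload", 7,
                       &ZOO_OPEN_ACL_UNSAFE, 0,
                       path_buf, (int)sizeof(path_buf));

    return zoo_multi(zh, 2, ops, results);
}
{code}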

 Client side 'PathValidation' is missing for the multi-transaction api.
 --

 Key: ZOOKEEPER-1388
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1388
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.4.0
Reporter: Rakesh R
Assignee: Rakesh R
 Fix For: 3.4.6, 3.5.0

 Attachments: 0001-ZOOKEEPER-1388-trunk-version.patch, 
 0002-ZOOKEEPER-1388-trunk-version.patch, ZOOKEEPER-1388.patch, 
 ZOOKEEPER-1388.patch, ZOOKEEPER-1388.patch, ZOOKEEPER-1388_branch_3_4.patch


 Multi ops: Op.create(path,..), Op.delete(path, ..), Op.setData(path, ..), 
 Op.check(path, ...) apis are not performing the client side path validation 
 and the call will go to the server side and is throwing exception back to the 
 client. 
 It would be good to provide ZooKeeper client side path validation for the 
 multi transaction apis. Presently its getting err codes from the server, 
 which is also not properly conveying the cause.
 For example: When specified invalid znode path in Op.create, it giving the 
 following exception. This will not be useful to know the actual cause.
 {code}
 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1174)
   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1115)
 {code}



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (ZOOKEEPER-1388) Client side 'PathValidation' is missing for the multi-transaction api.

2013-12-16 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13850157#comment-13850157
 ] 

Marshall McMullen commented on ZOOKEEPER-1388:
--

After gaining much more clarity from [~rakeshr]'s comments, this patch looks 
good to me. +1.

 Client side 'PathValidation' is missing for the multi-transaction api.
 --

 Key: ZOOKEEPER-1388
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1388
 Project: ZooKeeper
  Issue Type: Bug
  Components: java client
Affects Versions: 3.4.0
Reporter: Rakesh R
Assignee: Rakesh R
 Fix For: 3.4.6, 3.5.0

 Attachments: 0001-ZOOKEEPER-1388-trunk-version.patch, 
 0002-ZOOKEEPER-1388-trunk-version.patch, ZOOKEEPER-1388.patch, 
 ZOOKEEPER-1388.patch, ZOOKEEPER-1388.patch, ZOOKEEPER-1388_branch_3_4.patch


 Multi ops: Op.create(path,..), Op.delete(path, ..), Op.setData(path, ..), 
 Op.check(path, ...) apis are not performing the client side path validation 
 and the call will go to the server side and is throwing exception back to the 
 client. 
 It would be good to provide ZooKeeper client side path validation for the 
 multi transaction apis. Presently its getting err codes from the server, 
 which is also not properly conveying the cause.
 For example: When specified invalid znode path in Op.create, it giving the 
 following exception. This will not be useful to know the actual cause.
 {code}
 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1174)
   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1115)
 {code}



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (ZOOKEEPER-1836) addrvec_next() fails to set next parameter if addrvec_hasnext() returns false

2013-12-12 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13847116#comment-13847116
 ] 

Marshall McMullen commented on ZOOKEEPER-1836:
--

Yes, that is what I intended for this to do. Nice catch. 

It would be great if you could submit a patch. If you can't, I'll look at this 
later this week.
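
For reference, a self-contained sketch of the fix under discussion: writing 
through the caller's pointer (memset) is visible to the caller, whereas 
reassigning the local parameter is not. The addrvec_t here is a simplified 
stand-in for the client's real structure, for illustration only:

{code}
#include <string.h>
#include <sys/socket.h>

/* Simplified stand-in for the C client's addrvec_t. */
typedef struct {
    struct sockaddr_storage *data;
    int count;
    int next;
} addrvec_t;

static int addrvec_hasnext(const addrvec_t *avec)
{
    return avec->next < avec->count;
}

static void addrvec_next(addrvec_t *avec, struct sockaddr_storage *next)
{
    if (!addrvec_hasnext(avec)) {
        if (next) {
            memset(next, 0, sizeof(*next));  /* zero the caller's storage */
        }
        return;
    }
    if (next) {
        *next = avec->data[avec->next];
    }
    avec->next++;
}
{code}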

 addrvec_next() fails to set next parameter if addrvec_hasnext() returns false
 -

 Key: ZOOKEEPER-1836
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1836
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Reporter: Dutch T. Meyer
Priority: Trivial

 There is a relatively innocuous but useless pointer assignment in
 addrvec_next():
 195   void addrvec_next(addrvec_t *avec, struct sockaddr_storage *next)
 
 203   if (!addrvec_hasnext(avec))
 204   {
 205   next = NULL;
 206   return;
 That assignment on (205) has no point, as next is a local variable lost upon 
 function return.  Likely this should be a memset to zero out the actual 
 parameter.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


ZOOKEEPER-1732 workaround?

2013-10-03 Thread Marshall McMullen
We've just hit ZOOKEEPER-1732 (server cannot join an established ensemble
with quorum) and are trying to find a workaround since we do not have that
patch applied. Has anyone had any success working around this issue?
Perhaps restarting all ZK servers to force a new round of leader election?
Any other ideas? Really appreciate any advice...

Thanks!


[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-03 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13785329#comment-13785329
 ] 

Marshall McMullen commented on ZOOKEEPER-1732:
--

We've just run into this issue running tip of trunk 3.5.0 *without* this patch 
applied. Are there any proposed workarounds to this problem? I tried removing 
the stuck node from the ensemble and adding another node in as a replacement 
but it is now hitting the same problem... It can't join the ensemble either. 
I'm considering restarting all zookeeper servers in the hopes that a new round 
of leader election will reset things. Does this sound safe? Are there any other 
alternatives? Really appreciate any help.

 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Assignee: Germán Blanco
 Fix For: 3.4.6, 3.5.0

 Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, 
 ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, 
 ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch


 I have a test in which I do a rolling restart of three ZooKeeper servers and 
 it was failing from time to time.
 I ran the tests in a loop until the failure came out and it seems that at 
 some point one of the servers is unable to join the ensemble formed by the 
 other two.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-03 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13785345#comment-13785345
 ] 

Marshall McMullen commented on ZOOKEEPER-1732:
--

Flavio, that suggestion worked perfectly! Simply restarting the leader caused a 
new round of leader election and things sorted themselves out within a few 
seconds. Thank you so much for such a prompt reply. Love this community! 

 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Assignee: Germán Blanco
Priority: Critical
 Fix For: 3.4.6, 3.5.0

 Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, 
 ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, 
 ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch


 I have a test in which I do a rolling restart of three ZooKeeper servers and 
 it was failing from time to time.
 I ran the tests in a loop until the failure came out and it seems that at 
 some point one of the servers is unable to join the ensemble formed by the 
 other two.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


Re: ZOOKEEPER-1732 workaround?

2013-10-03 Thread Marshall McMullen
Flavio made a suggestion on the jira to simply restart the current leader.
It worked! Within a few seconds everything sorted itself out. Thanks so
much for the amazingly prompt reply. Love this community! :).


On Thu, Oct 3, 2013 at 10:46 AM, Marshall McMullen 
marshall.mcmul...@gmail.com wrote:

 We've just hit ZOOKEEPER-1732 (server cannot join an established ensemble
 with quorum) and are trying to find a workaround since we do not have that
 patch applied. Has anyone had any success working around this issue?
 Perhaps restarting all ZK servers to force a new round of leader election?
 Any other ideas? Really appreciate any advice...

 Thanks!



[jira] [Commented] (ZOOKEEPER-1519) Zookeeper Async calls can reference free()'d memory

2013-10-03 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13785884#comment-13785884
 ] 

Marshall McMullen commented on ZOOKEEPER-1519:
--

I agree with Flavio... All the work I did in the C client followed the same 
contract: the caller owns the memory, not the C client. This is typical of all 
async interfaces I've used (e.g. asio, etc.). 
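
A minimal sketch of that contract with zoo_acreate(): the caller allocates the 
payload, keeps it alive until the completion fires, and releases it there. The 
path and payload are made up for illustration:

{code}
#include <stdlib.h>
#include <string.h>
#include <zookeeper/zookeeper.h>

/* The completion frees the buffer the caller allocated and passed along
 * as the completion's data pointer; the library never owns it. */
static void create_done(int rc, const char *value, const void *data)
{
    (void)rc;
    (void)value;
    free((void *)data);
}

int create_async(zhandle_t *zh)
{
    char *payload = strdup("{\"state\": \"ready\"}");  /* caller-owned */
    if (payload == NULL)
        return ZSYSTEMERROR;

    int rc = zoo_acreate(zh, "/example", payload, (int)strlen(payload),
                         &ZOO_OPEN_ACL_UNSAFE, 0, create_done, payload);
    if (rc != ZOK)
        free(payload);  /* request was never queued; still caller-owned */
    return rc;
}
{code}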

 Zookeeper Async calls can reference free()'d memory
 ---

 Key: ZOOKEEPER-1519
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1519
 Project: ZooKeeper
  Issue Type: Bug
  Components: c client
Affects Versions: 3.3.3, 3.3.6
 Environment: Ubuntu 11.10, Ubuntu packaged Zookeeper 3.3.3 with some 
 backported fixes.
Reporter: Mark Gius
Assignee: Daniel Lescohier
 Fix For: 3.4.6, 3.5.0

 Attachments: zookeeper-1519.patch


 zoo_acreate() and zoo_aset() take a char * argument for data and prepare a 
 call to zookeeper.  This char * doesn't seem to be duplicated at any point, 
 making it possible that the caller of the asynchronous function might 
 potentially free() the char * argument before the zookeeper library completes 
 its request.  This is unlikely to present a real problem unless the freed 
 memory is re-used before zookeeper consumes it.  I've been unable to 
 reproduce this issue using pure C as a result.
 However, ZKPython is a whole different story.  Consider this snippet:
   ok = zookeeper.acreate(handle, path, json.dumps(value), 
  acl, flags, callback)
   assert ok == zookeeper.OK
 In this snippet, json.dumps() allocates a string which is passed into the 
 acreate().  When acreate() returns, the zookeeper request has been 
 constructed with a pointer to the string allocated by json.dumps().  Also 
 when acreate() returns, that string is now referenced by 0 things (ZKPython 
 doesn't bump the refcount) and the string is eligible for garbage collection 
 and re-use.  The Zookeeper request now has a pointer to dangerous freed 
 memory.
 I've been seeing odd behavior in our development environments for some time 
 now, where it appeared as though two separate JSON payloads had been joined 
 together.  Python has been allocating a new JSON string in the middle of the 
 old string that an incomplete zookeeper async call had not yet processed.
 I am not sure if this is a behavior that should be documented, or if the C 
 binding implementation needs to be updated to create copies of the data 
 payload provided for aset and acreate.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (ZOOKEEPER-1096) Leader communication should listen on specified IP, not wildcard address

2013-09-26 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778511#comment-13778511
 ] 

Marshall McMullen commented on ZOOKEEPER-1096:
--

Thanks Germán and Flavio! Really nice job finishing this one up.

 Leader communication should listen on specified IP, not wildcard address
 

 Key: ZOOKEEPER-1096
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1096
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.3.3, 3.4.0
Reporter: Jared Cantwell
Assignee: Germán Blanco
Priority: Minor
 Fix For: 3.5.0, 3.4.6

 Attachments: ZOOKEEPER-1096_branch3.4.patch, 
 ZOOKEEPER-1096_branch3.4.patch, ZOOKEEPER-1096_branch3.4.patch, 
 ZOOKEEPER-1096.patch, ZOOKEEPER-1096.patch, ZOOKEEPER-1096.patch, 
 ZOOKEEPER-1096.patch, ZOOKEEPER-1096.patch


 Server should specify the local address that is used for leader communication 
 and leader election (and not use the default of listening on all interfaces). 
  This is similar to the clientPortAddress parameter that was added a year 
 ago.  After reviewing the code, we can't think of a reason why only the port 
 would be used with the wildcard interface, when servers are already 
 connecting specifically to that interface anyway.
 I have submitted a patch, but it does not account for all leader election 
 algorithms.
 Probably should have an option to toggle this, for backwards compatibility, 
 although it seems like it would be a bug if this change broke things.
 There is some more information about making it an option here:
 http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-dev/201008.mbox/%3CAANLkTikkT97Djqt3CU=h2+7gnj_4p28hgcxjh345h...@mail.gmail.com%3E

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (ZOOKEEPER-1760) Provide an interface for check version of a node

2013-09-24 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13777135#comment-13777135
 ] 

Marshall McMullen commented on ZOOKEEPER-1760:
--

I agree with Flavio and Benjamin as well. The multi requires a check op so that 
it can detect race conditions between related paths atomically; you can't do 
that at all without a multi, so the standalone use case doesn't make sense to 
me. And as others already said, you can get what you want with Stat. In our own 
local wrapper sitting on top of the ZooKeeper client, we've added all sorts of 
convenience methods like this, e.g. GetVersion, GetNumChildren, etc., which 
are all implemented via a call to stat.
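
A sketch of the kind of wrapper described, using the C client for consistency 
with the rest of this archive; get_version() is a hypothetical convenience 
helper, not a ZooKeeper API:

{code}
#include <zookeeper/zookeeper.h>

/* Hypothetical convenience helper: read a node's version from its Stat. */
int get_version(zhandle_t *zh, const char *path, int32_t *version)
{
    struct Stat stat;
    int rc = zoo_exists(zh, path, 0 /* no watch */, &stat);
    if (rc == ZOK)
        *version = stat.version;
    return rc;
}

/* The version can then feed an atomic guard inside a multi, e.g.
 * zoo_check_op_init(&op, path, version); */
{code}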

 Provide an interface for check version of a node
 

 Key: ZOOKEEPER-1760
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1760
 Project: ZooKeeper
  Issue Type: New Feature
  Components: java client
Reporter: Rakesh R
Assignee: Rakesh R
 Fix For: 3.5.0


 The idea of this JIRA is to discuss the check version interface which is used 
 to see the existence of a node for the specified version. Presently only 
 multi transaction api has this interface, this umbrella JIRA is to make 
 'check version' api part of ZooKeeper# main apis and cli command.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (ZOOKEEPER-1096) Leader communication should listen on specified IP, not wildcard address

2013-09-17 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769565#comment-13769565
 ] 

Marshall McMullen commented on ZOOKEEPER-1096:
--

This latest version looks really good. I especially like that it's configured 
via the configuration file. Nicely done.

+1 from me.

 Leader communication should listen on specified IP, not wildcard address
 

 Key: ZOOKEEPER-1096
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1096
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.3.3, 3.4.0
Reporter: Jared Cantwell
Assignee: Jared Cantwell
Priority: Minor
 Fix For: 3.5.0, 3.4.6

 Attachments: ZOOKEEPER-1096_branch3.4.patch, 
 ZOOKEEPER-1096_branch3.4.patch, ZOOKEEPER-1096.patch, ZOOKEEPER-1096.patch, 
 ZOOKEEPER-1096.patch, ZOOKEEPER-1096.patch


 Server should specify the local address that is used for leader communication 
 and leader election (and not use the default of listening on all interfaces). 
  This is similar to the clientPortAddress parameter that was added a year 
 ago.  After reviewing the code, we can't think of a reason why only the port 
 would be used with the wildcard interface, when servers are already 
 connecting specifically to that interface anyway.
 I have submitted a patch, but it does not account for all leader election 
 algorithms.
 Probably should have an option to toggle this, for backwards compatibility, 
 although it seems like it would be a bug if this change broke things.
 There is some more information about making it an option here:
 http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-dev/201008.mbox/%3CAANLkTikkT97Djqt3CU=h2+7gnj_4p28hgcxjh345h...@mail.gmail.com%3E

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (ZOOKEEPER-1096) Leader communication should listen on specified IP, not wildcard address

2013-09-17 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13770029#comment-13770029
 ] 

Marshall McMullen commented on ZOOKEEPER-1096:
--

Flavio brings up a great point. It seems like most users would want to change 
both FLE and ZAB rather than one or the other separately, so I like the idea of 
making this a single property.

 Leader communication should listen on specified IP, not wildcard address
 

 Key: ZOOKEEPER-1096
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1096
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.3.3, 3.4.0
Reporter: Jared Cantwell
Assignee: Jared Cantwell
Priority: Minor
 Fix For: 3.5.0, 3.4.6

 Attachments: ZOOKEEPER-1096_branch3.4.patch, 
 ZOOKEEPER-1096_branch3.4.patch, ZOOKEEPER-1096.patch, ZOOKEEPER-1096.patch, 
 ZOOKEEPER-1096.patch, ZOOKEEPER-1096.patch


 Server should specify the local address that is used for leader communication 
 and leader election (and not use the default of listening on all interfaces). 
  This is similar to the clientPortAddress parameter that was added a year 
 ago.  After reviewing the code, we can't think of a reason why only the port 
 would be used with the wildcard interface, when servers are already 
 connecting specifically to that interface anyway.
 I have submitted a patch, but it does not account for all leader election 
 algorithms.
 Probably should have an option to toggle this, for backwards compatibility, 
 although it seems like it would be a bug if this change broke things.
 There is some more information about making it an option here:
 http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-dev/201008.mbox/%3CAANLkTikkT97Djqt3CU=h2+7gnj_4p28hgcxjh345h...@mail.gmail.com%3E

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (ZOOKEEPER-1096) Leader communication should listen on specified IP, not wildcard address

2013-09-13 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766511#comment-13766511
 ] 

Marshall McMullen commented on ZOOKEEPER-1096:
--

+1 for using the config file to configure the ports and any related behavior, 
as this matches the way we configure client ports and, IMO, is a lot easier to 
use and deploy at scale than Java properties.
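
For context, a sketch of what the config-file approach looks like in zoo.cfg; 
the quorumListenOnAllIPs property exists in released ZooKeeper, but treat its 
exact relationship to the patch under review here as an assumption:

{code}
# Each server line already carries the address used for quorum (2888) and
# leader election (3888) traffic; the change under discussion binds the
# listeners to that address instead of the wildcard.
server.1=192.168.1.10:2888:3888
server.2=192.168.1.11:2888:3888
server.3=192.168.1.12:2888:3888

# Opt back into listening on all interfaces if needed.
quorumListenOnAllIPs=true
{code}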

 Leader communication should listen on specified IP, not wildcard address
 

 Key: ZOOKEEPER-1096
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1096
 Project: ZooKeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.3.3, 3.4.0
Reporter: Jared Cantwell
Assignee: Jared Cantwell
Priority: Minor
 Fix For: 3.5.0, 3.4.6

 Attachments: ZOOKEEPER-1096_branch3.4.patch, ZOOKEEPER-1096.patch, 
 ZOOKEEPER-1096.patch, ZOOKEEPER-1096.patch


 Server should specify the local address that is used for leader communication 
 and leader election (and not use the default of listening on all interfaces). 
  This is similar to the clientPortAddress parameter that was added a year 
 ago.  After reviewing the code, we can't think of a reason why only the port 
 would be used with the wildcard interface, when servers are already 
 connecting specifically to that interface anyway.
 I have submitted a patch, but it does not account for all leader election 
 algorithms.
 Probably should have an option to toggle this, for backwards compatibility, 
 although it seems like it would be a bug if this change broke things.
 There is some more information about making it an option here:
 http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-dev/201008.mbox/%3CAANLkTikkT97Djqt3CU=h2+7gnj_4p28hgcxjh345h...@mail.gmail.com%3E

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

