Code0x58 commented on issue #2919: Launch PEX on Kubernetes fails
URL: https://github.com/apache/incubator-heron/issues/2919#issuecomment-399708245

I found I had to increase the `minikube` resources and wait for everything to come up before heron deployed properly, so startup time and a lack of resources on my laptop/network were a problem to start with. I also got the two issues:

1. _Failed to get physical plan for topology ExclamationTopology_ - this is mentioned in the [troubleshooting guide](https://apache.github.io/incubator-heron/docs/getting-started-troubleshooting/#2-why-does-the-topology-launch-successfully-but-fail-to-start), but that wasn't helpful in this case - it only helps if there was an executor failure (and it doesn't mention that `~/.herondata/` is on the worker). I think the issue happens when the workers (topology pods) aren't up yet, whether because of start times, resources, or executor failures. With a bit of poking around, it looks like this happens when trying to activate the topology while it is in the UNKNOWN status. I guess this issue is less apparent on beefier setups. The troubleshooting documentation could do with being updated, as could the docs for using Kubernetes/minikube and what to look out for with resource issues. It would be nice if the CLI gave more feedback when there are resource issues, and if it waited around for executors to come up.

2. _Caused by: org.apache.distributedlog.exceptions.WriteException: Write rejected because stream xxxxxxxxxx-cristobal-tag-0-7116362347360204918.tar.gz has encountered an error : writer has been closed due to error_ comes up for me with the larger PEX too. I think the topology is supposed to be uploaded to ZooKeeper by BookKeeper, but it is dying because the ZooKeeper client session times out (which I saw as a warning in BookKeeper's logs), which would explain why smaller PEX files are more likely to succeed. I suspect increasing the [tick duration](https://zookeeper.apache.org/doc/r3.4.8/zookeeperAdmin.html) will help, especially with minikube, where your machine is going to be under a lot of load. This would be something to document, or to avoid by tweaking the deploy files to include an extended tick. I haven't tested this, but am pretty sure that is the case.

It feels like the moral of the story is "have a beefy AF setup so you are less likely to encounter issues", time for a Dell Precision?
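
As a rough illustration of why a longer tick might help with the session timeouts in issue 2: the ZooKeeper admin guide says the server bounds client session timeouts by `minSessionTimeout` and `maxSessionTimeout`, which default to 2 and 20 times `tickTime`. The sketch below just does that arithmetic. The tick values and the 60 second requested timeout are made-up numbers for illustration, not taken from the Heron deploy files, and the function names are hypothetical.

```python
def session_timeout_bounds(tick_ms, min_ms=None, max_ms=None):
    """Return (min, max) session timeout in ms for a given tickTime.

    Per the ZooKeeper admin guide, minSessionTimeout defaults to
    2 * tickTime and maxSessionTimeout to 20 * tickTime.
    """
    lo = 2 * tick_ms if min_ms is None else min_ms
    hi = 20 * tick_ms if max_ms is None else max_ms
    return lo, hi


def negotiated_timeout(requested_ms, tick_ms):
    """Clamp the timeout a client asks for into the server's bounds."""
    lo, hi = session_timeout_bounds(tick_ms)
    return max(lo, min(requested_ms, hi))


if __name__ == "__main__":
    # Illustrative values only: a 2000 ms tick versus an extended 6000 ms tick,
    # with a client that asks for a 60000 ms session timeout.
    for tick in (2000, 6000):
        bounds = session_timeout_bounds(tick)
        granted = negotiated_timeout(60000, tick)
        print(f"tickTime={tick} bounds={bounds} granted={granted}")
```

With a small tick the server caps the session timeout well below what the client asked for, so a slow upload on a loaded machine has less headroom before the session expires. Whether this is what actually happens in the minikube deployment is untested.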
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
