Code0x58 commented on issue #2919: Launch PEX on Kubernetes fails
URL: 
https://github.com/apache/incubator-heron/issues/2919#issuecomment-399708245
 
 
   I found I had to up the `minikube` resources, and wait for everything to 
come up before heron was deployed properly, so issues with time+lacking 
resources on my laptop/network were a problem to start with.
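
A minimal sketch of the "wait for everything to come up" check, assuming `kubectl` is on the PATH and pointed at the minikube cluster. The helper names `pods_not_ready` and `fetch_pods` are made up for illustration and are not part of Heron; the code only reads the standard `Ready` condition from `kubectl get pods -o json`.

```python
import json
import subprocess


def pods_not_ready(pod_list):
    """Return the names of pods whose Ready condition is not "True"."""
    waiting = []
    for pod in pod_list.get("items", []):
        conditions = pod.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            waiting.append(pod["metadata"]["name"])
    return waiting


def fetch_pods(namespace="default"):
    """Fetch the pod list as parsed JSON (hypothetical helper, needs kubectl)."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "--namespace", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

Polling `pods_not_ready(fetch_pods())` until it returns an empty list before submitting a topology would approximate what I ended up doing by hand.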
   
   I also got the two issues:
    1. _Failed to get physical plan for topology ExclamationTopology_ - this is 
mentioned in the [troubleshooting 
guide](https://apache.github.io/incubator-heron/docs/getting-started-troubleshooting/#2-why-does-the-topology-launch-successfully-but-fail-to-start)
 but that wasn't helpful in this case - it only helps if there was an executor 
failure (it doesn't mention `~/.herondata/` is on the worker). I think the 
issue happens when the workers (topology pods) aren't up yet either because of 
start times/resources/executor failures. With a bit of poking around, it looks 
like this happens when trying to activate the topology while it is in the 
UNKNOWN status. I guess this issue is less apparent on beefier setups.
    The documentation on troubleshooting could do with being updated, as well 
as docs for using Kubernetes/minikube and what to look out for with resource 
issues. It would be nice if the CLI gave more feedback when there are 
resource issues, and if it waited around for executors to come up.
   
    2. _Caused by: org.apache.distributedlog.exceptions.WriteException: Write 
rejected because stream xxxxxxxxxx-cristobal-tag-0-7116362347360204918.tar.gz 
has encountered an error : writer has been closed due to error_ comes up for me 
with the larger PEX too. I think the topology is supposed to be uploaded to 
ZooKeeper by BookKeeper, but it is dying due to the ZooKeeper client session 
timing out (which I saw as a warning in BookKeeper's logs), which explains why 
smaller ones are more likely to succeed. I suspect increasing the [tick 
duration](https://zookeeper.apache.org/doc/r3.4.8/zookeeperAdmin.html) will 
help, especially when using minikube when your machine is going to be under a 
lot of load.
   This would be something to document, or try avoiding by tweaking the deploy 
files to include an extended tick. I haven't tested this, but am pretty sure 
that is the case.
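
To show why a longer tick could help: per the ZooKeeper admin guide, the server clamps a client's requested session timeout between `minSessionTimeout` (default 2 x `tickTime`) and `maxSessionTimeout` (default 20 x `tickTime`). A rough sketch of that arithmetic follows; the function names and the example numbers are illustrative assumptions, not values taken from the Heron deploy files.

```python
def session_timeout_bounds(tick_time_ms, min_ms=None, max_ms=None):
    """Return (min, max) session timeout in ms, using ZooKeeper's defaults."""
    low = 2 * tick_time_ms if min_ms is None else min_ms
    high = 20 * tick_time_ms if max_ms is None else max_ms
    return low, high


def negotiated_timeout(requested_ms, tick_time_ms):
    """Clamp the client's requested timeout into the server's allowed range."""
    low, high = session_timeout_bounds(tick_time_ms)
    return max(low, min(requested_ms, high))


# Assumed example: a client asking for 60 s against a 2 s tick is capped at 40 s,
# while a 6 s tick would allow the full 60 s.
print(negotiated_timeout(60_000, 2_000))
print(negotiated_timeout(60_000, 6_000))
```

So with default bounds, raising `tickTime` raises the ceiling on how long a stalled client can go quiet before its session expires, which is the effect I would expect to matter on a loaded laptop.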
   
   It feels like the moral of the story is "have a beefy AF setup so you are 
less likely to encounter issues", time for a Dell Precision?
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
