churromorales commented on PR #13156:
URL: https://github.com/apache/druid/pull/13156#issuecomment-1291145564
@gianm I was testing the MM-less patch on top of the MSQ work you did. I ran
a test ingestion and the tasks hang forever. After a bit of debugging, here
is what is happening. When I launch a controller with one worker, I get this
exception:
```
2022-10-25T20:32:25,413 ERROR [ServiceClientFactory-0] com.google.common.util.concurrent.ExecutionList - RuntimeException while executing runnable com.google.common.util.concurrent.Futures$4@7f14c4aa with executor com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService@138e191f
java.lang.NullPointerException: host
    at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:229) ~[guava-16.0.1.jar:?]
    at org.apache.druid.rpc.ServiceLocation.<init>(ServiceLocation.java:39) ~[druid-server-24.0.0-6.jar:24.0.0-6]
    at org.apache.druid.rpc.indexing.SpecificTaskServiceLocator$1.onSuccess(SpecificTaskServiceLocator.java:137) ~[druid-server-24.0.0-6.jar:24.0.0-6]
    at org.apache.druid.rpc.indexing.SpecificTaskServiceLocator$1.onSuccess(SpecificTaskServiceLocator.java:113) ~[druid-server-24.0.0-6.jar:24.0.0-6]
    at com.google.common.util.concurrent.Futures$4.run(Futures.java:1181) ~[guava-16.0.1.jar:?]
    at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.1.jar:?]
    at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) ~[guava-16.0.1.jar:?]
    at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) ~[guava-16.0.1.jar:?]
    at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:185) ~[guava-16.0.1.jar:?]
    at com.google.common.util.concurrent.Futures$ChainingListenableFuture$1.run(Futures.java:872) ~[guava-16.0.1.jar:?]
    at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.1.jar:?]
    at com.google.common.util.concurrent.Futures$ImmediateFuture.addListener(Futures.java:102) ~[guava-16.0.1.jar:?]
    at com.google.common.util.concurrent.Futures$ChainingListenableFuture.run(Futures.java:868) ~[guava-16.0.1.jar:?]
    at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.1.jar:?]
    at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) ~[guava-16.0.1.jar:?]
    at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) ~[guava-16.0.1.jar:?]
    at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:185) ~[guava-16.0.1.jar:?]
    at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:53) ~[guava-16.0.1.jar:?]
    at org.apache.druid.rpc.ServiceClientImpl$1.onSuccess(ServiceClientImpl.java:194) ~[druid-server-24.0.0-6.jar:24.0.0-6]
    at org.apache.druid.rpc.ServiceClientImpl$1.onSuccess(ServiceClientImpl.java:168) ~[druid-server-24.0.0-6.jar:24.0.0-6]
    at com.google.common.util.concurrent.Futures$4.run(Futures.java:1181) ~[guava-16.0.1.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]
```
In the MiddleManager-less (MM-less) patch for k8s, I added setup and teardown
functions around every task: before an AbstractTask runs, it announces its
own location, and on teardown it updates its status, etc.
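A minimal sketch of that setup/run/teardown lifecycle, for illustration only.
`SketchTask`, `setup()`, and `cleanup()` are hypothetical names, not the
actual AbstractTask API from the patch; the point is only the ordering, with
the location announced before the task body runs and the status reported in
a `finally` block.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a task base class; NOT Druid's AbstractTask.
abstract class SketchTask {
    // Records lifecycle steps so the ordering can be inspected.
    final List<String> events = new ArrayList<>();

    // Assumed hook: announce where the task can be reached (pod host/port).
    void setup() {
        events.add("announce-location");
    }

    // The task's real work; returns a status string.
    abstract String runTask() throws Exception;

    // Assumed hook: report the final status once the task is done.
    void cleanup(String status) {
        events.add("update-status:" + status);
    }

    // Wraps runTask() so setup always precedes it and cleanup always follows.
    final String run() {
        setup();
        String status = "FAILED";
        try {
            status = runTask();
        } catch (Exception e) {
            events.add("error:" + e.getMessage());
        } finally {
            cleanup(status);
        }
        return status;
    }
}

class LifecycleDemo {
    public static void main(String[] args) {
        SketchTask task = new SketchTask() {
            @Override
            String runTask() {
                events.add("run");
                return "SUCCESS";
            }
        };
        System.out.println(task.run() + " " + task.events);
    }
}
```

Until `setup()` has run, anything that asks for this task's location only
sees a placeholder, which is where the problem below comes from.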
From what I see here, we do have a taskStatus (the task gets launched), but
the location has not been announced yet. In k8s we don't know the location
until the pod comes up and the service is available to take requests. The MSQ
patch doesn't wait for the location; it assumes the location is already
known. The MM-less patch needs it to wait.
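One way to picture "waiting for the location" is a bounded poll, sketched
below. This is an assumption about what a fix could look like, not how the
MSQ code resolves task locations; `LocationWaiter`, `awaitLocation`, and the
`lookup` supplier are hypothetical names, and the location is reduced to a
plain string.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper; not part of Druid.
final class LocationWaiter {
    // Calls `lookup` until it yields a location or the timeout elapses.
    // An empty result means the location was never announced in time.
    static Optional<String> awaitLocation(
            Supplier<Optional<String>> lookup, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            Optional<String> location = lookup.get();
            if (location.isPresent()) {
                return location;
            }
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty();
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt and give up waiting.
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        // Simulates a pod whose location becomes known on the third lookup.
        AtomicInteger calls = new AtomicInteger();
        Optional<String> found = awaitLocation(
                () -> calls.incrementAndGet() < 3
                        ? Optional.<String>empty()
                        : Optional.of("10.0.0.5:8100"),
                1000, 10);
        System.out.println(found + " after " + calls.get() + " lookups");
    }
}
```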
TL;DR: it's a race. We try to get the controller's location before the
controller announces it. The TaskLocation is `unknown` until the task's
runTask() method is invoked, so the host is still null, and the non-null
precondition on host in the ServiceLocation constructor throws, which causes
everything to die.
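The failure can be reproduced in isolation with the sketch below. The classes
are hypothetical stand-ins, not Druid's TaskLocation or ServiceLocation; the
sketch assumes the `unknown` location carries a null host, which is what the
`NullPointerException: host` in the trace suggests. `toServiceLocation` shows
one possible guard, which treats `unknown` as "not available yet" instead of
constructing a location from it.

```java
import java.util.Objects;
import java.util.Optional;

// Hypothetical stand-in for a task location; the unknown value has no host.
final class SketchTaskLocation {
    final String host;
    final int port;

    SketchTaskLocation(String host, int port) {
        this.host = host;
        this.port = port;
    }

    static SketchTaskLocation unknown() {
        return new SketchTaskLocation(null, -1);
    }

    boolean isUnknown() {
        return host == null;
    }
}

// Hypothetical stand-in for a service location that rejects a null host.
final class SketchServiceLocation {
    final String host;
    final int port;

    SketchServiceLocation(String host, int port) {
        // Throws NullPointerException with message "host" when host is null.
        this.host = Objects.requireNonNull(host, "host");
        this.port = port;
    }
}

final class LocationGuard {
    // Maps an unknown (or missing) task location to "not available yet".
    static Optional<SketchServiceLocation> toServiceLocation(SketchTaskLocation loc) {
        if (loc == null || loc.isUnknown()) {
            return Optional.empty();
        }
        return Optional.of(new SketchServiceLocation(loc.host, loc.port));
    }

    public static void main(String[] args) {
        SketchTaskLocation notYetAnnounced = SketchTaskLocation.unknown();
        try {
            // Unguarded conversion: this is the shape of the failure above.
            new SketchServiceLocation(notYetAnnounced.host, notYetAnnounced.port);
        } catch (NullPointerException e) {
            System.out.println("unguarded: NullPointerException: " + e.getMessage());
        }
        System.out.println("guarded present: "
                + toServiceLocation(notYetAnnounced).isPresent());
    }
}
```

With a guard like this, the caller can retry later rather than failing
inside a future callback.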
Do you have any advice on how we can make these two coexist? This is the only
blocker I see; everything else works as it did before. I can't figure out a
clean way to do it, and I don't fully understand the MSQ patch, so I thought
you might have a solution.
Thank you
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]