Apache Pinot Daily Email Digest (2021-09-07)

Pinot Slack Email Digest Tue, 07 Sep 2021 19:00:41 -0700

#general

@vutruongxuan99: @vutruongxuan99 has joined the channel
@dadelcas: @dadelcas has joined the channel
@dadelcas: hello everyone :slightly_smiling_face:
@grace.lu: Hello Pinot experts, I wonder if anyone here running Pinot on k8s in production has suggestions for a Pinot disaster recovery plan covering k8s cluster downtime. Assume we are in an environment with multiple k8s clusters running. Which of the following would you recommend to make Pinot resilient to a k8s cluster-level outage or maintenance: 1. Setting up one Pinot cluster across multiple k8s environments, with each of them holding one replica of the data. --- (not sure if it is feasible or easy to do) 2. Setting up fully replicated redundant Pinot clusters in different k8s environments, also replicating the data ingestion and anything we did in the main cluster. --- (seems costly) 3. Only setting up Pinot in one k8s cluster and, in the case of a k8s cluster outage, rebuilding the server, controller, and broker in another healthy k8s cluster and letting it pick up the old state from kafka, zookeeper, s3, etc. --- (How hard is it for a newly built Pinot cluster to inherit and resume the old state?) Any experience with handling this in a prod environment is much appreciated :pray::skin-tone-2:. Thanks in advance!
@mayanks: You could have Pinot deployment across availability zones? What's your cloud provider?
@xiangfu0: the current Pinot k8s deployment is one Pinot cluster per k8s cluster, which means you will have N Pinot clusters in your N k8s clusters. This is like a fully replicated, all-active setup.
@xiangfu0: I would say use 2 replicas per k8s cluster and put a load balancer on top of all the Pinot clusters
@xiangfu0: btw, what’s the availability of your k8s? if it’s high enough, you can just run one Pinot cluster on one k8s cluster.
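The all-active idea above (several independent Pinot clusters behind one entry point) can be illustrated with a minimal client-side failover sketch. This swaps a real load balancer for a simple "try each broker in order" loop, which is not what the thread proposes verbatim. The broker URLs and the helper names `post_sql` and `query_with_failover` are hypothetical, and the sketch assumes each broker accepts SQL as a JSON POST to `/query/sql`.

```python
import json
import urllib.error
import urllib.request


def post_sql(broker_url, sql, timeout=5.0):
    """POST a SQL query to one broker (assumes a /query/sql JSON endpoint)."""
    req = urllib.request.Request(
        broker_url.rstrip("/") + "/query/sql",
        data=json.dumps({"sql": sql}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def query_with_failover(broker_urls, sql, post=post_sql):
    """Try each cluster's broker in order; return the first successful response."""
    last_error = None
    for url in broker_urls:
        try:
            return post(url, sql)
        except (urllib.error.URLError, OSError, ValueError) as exc:
            # Unreachable broker or unparsable response: move on to the next cluster.
            last_error = exc
    raise RuntimeError("no broker reachable") from last_error
```

A caller would pass one broker URL per k8s cluster, for example `query_with_failover(["http://pinot-a.example:8099", "http://pinot-b.example:8099"], "SELECT COUNT(*) FROM myTable")`, where the hosts and table name are placeholders. This only works if every cluster ingests the same data, as in option 2 of the question.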
@sina.tamizi: @sina.tamizi has joined the channel

#random

@vutruongxuan99: @vutruongxuan99 has joined the channel
@dadelcas: @dadelcas has joined the channel
@sina.tamizi: @sina.tamizi has joined the channel

#troubleshooting

@vutruongxuan99: @vutruongxuan99 has joined the channel
@nadeemsadim: I can see that the *pinot-zookeeper disk usage is high* .. what could be the root cause, since metadata can't be 65 GB out of the 95 GB provisioned with only a few million records in the table .. is it database indexing causing some external views to be stored on the zookeeper disk? @mayanks @xiangfu0 @jackie.jxt @ssubrama @g.kishore
@mrpringle: You need to set up a zookeeper clean-up job to remove old snapshots.
@mrpringle:
@mayanks: Thanks @mrpringle
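For the snapshot clean-up mentioned above, ZooKeeper has built-in autopurge settings (`autopurge.snapRetainCount` and `autopurge.purgeInterval` in `zoo.cfg`), which are normally the right tool. The sketch below is only a dry-run illustration of the retention idea: given file names from the data directory, it lists which snapshots fall outside the newest `retain`. The helper name `snapshots_to_purge` is hypothetical, the `snapshot.<hex zxid>` naming is an assumption about the data directory layout, and the sketch deliberately ignores transaction logs and deletes nothing.

```python
def snapshots_to_purge(filenames, retain=3):
    """Return snapshot file names older than the newest `retain` snapshots.

    Assumes files are named snapshot.<zxid in hex>; anything else is ignored.
    """
    snaps = []
    for name in filenames:
        prefix, _, suffix = name.partition(".")
        if prefix != "snapshot":
            continue  # not a snapshot file (e.g. a transaction log)
        try:
            zxid = int(suffix, 16)
        except ValueError:
            continue  # suffix is not a hex zxid, so skip it
        snaps.append((zxid, name))
    snaps.sort()  # oldest zxid first
    cutoff = max(len(snaps) - max(retain, 0), 0)
    return [name for _, name in snaps[:cutoff]]
```

Running it against a directory listing (for example `snapshots_to_purge(os.listdir(data_dir))`, with `data_dir` pointing at the snapshot directory) shows how much an autopurge run would be expected to reclaim before enabling it.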
@mrpringle: I'm trying to use tenant tags with the kafka low level consumer to split consuming vs. consumed partitions across servers. However, the offline servers don't seem to be getting any segments. Are there additional steps needed to get this to work? I'm also using the upsert functionality.
@mayanks: Consumed segments are still part of the RT table. Offline is what is pushed from the offline ingestion flow
@mayanks: You probably want to check out
@npawar: Are you trying to use this?
@zsolt: We are running Pinot in Kubernetes, and noticed that the servers are considered ready too early, before the server has finished starting. This causes the StatefulSet rolling restart to restart multiple servers simultaneously, making segments inaccessible. Should the server API `/health` endpoint be used for readiness probing?
@mayanks: Broker routes the query to a server for only segments that are online.
@zsolt: In our case 7 out of 8 servers were restarting at the same time
@mayanks: Are you using replica groups? If so, you could do one replica at a time?
@zsolt: we are not using it
@zsolt: and we are doing helm upgrades for config changes, so it's not done manually
@mayanks: @xiangfu0 Any suggestions? IIRC, there are deployments that have hooks that wait for some time (x minutes) before reporting healthy? cc: @jackie.jxt
@jackie.jxt: Which version of pinot are you running? How do you shut down the servers? We need to ensure the shut down hook is called when shutting down the servers
@zsolt: Running 0.7.1 with the helm chart from the repo. When we do a helm upgrade (e.g. last time I configured s3 retries for the Servers), the pods are restarted by the StatefulSet controller, using the default *RollingUpdate* strategy. The controller waits for the restarted pod to be Ready, then proceeds to restart the next one. The standard Kubernetes termination is SIGTERM, followed by SIGKILL after 30s if the process has not terminated.
@zsolt: In the chart the Brokers have the /health readiness probe, that's why I'm wondering why the Servers don't have it set.
@jackie.jxt: Here is a fix for adding the shutdown hook for the server:
@jackie.jxt: Seems it is not included in `0.8.0`, so you need to try either the current master or wait for the next release
@jackie.jxt: Adding @xiangfu0 to take a look as well
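The readiness concern in this thread comes down to "do not move on to the next pod until this one answers healthy". A minimal sketch of that polling loop follows, assuming the server exposes the `/health` endpoint mentioned above and returns HTTP 200 once it is ready. The function names, the retry settings, and any URL passed in are hypothetical, and this is not what the helm chart does.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url, timeout=2.0):
    """Return True only if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as not ready.
        return False


def wait_until_healthy(url, attempts=30, delay=1.0, probe=is_healthy, sleep=time.sleep):
    """Poll the health endpoint up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False
```

In Kubernetes itself the same check is usually expressed as an HTTP readiness probe on the pod spec rather than an external script. The loop above is just a way to test by hand whether a restarted server reports healthy before its segments are actually loaded.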
@dadelcas: @dadelcas has joined the channel
@dadelcas: Hi everyone, I have an issue with a k8s deployment. Basically controllers are discovered twice: once via headless service and one more time via regular service. The one discovered through the regular services is always reported as "failed" as there is no ZK entry with the FQDN of the service. Is there any way to fix this?
@xiangfu0: headless svc is used for internal pod discovery, e.g. pinot-controller-0, pinot-server-2 …
@xiangfu0: zk svc/headless-svc are there but not exposed externally
@dadelcas: I understand, but I don't know why Helix is discovering one controller per service. This doesn't happen with the brokers or any other node type, just the controllers
@dadelcas: and I don't know whether this will have side effects
@dadelcas: I'd like to have controllers reported correctly
@xiangfu0: from helix side, each controller will register itself
@xiangfu0: the svc side might be the deep store or vip config?
@dadelcas: I'm not sure if I follow you
@dadelcas: I have deployed the helm chart with the defaults, so the FS is the node HD
@dadelcas: but I don't know how this could have anything to do with the controllers showing twice
@xiangfu0: hmm, where do you see the controller showing twice
@xiangfu0: in k8s, in the logs, or in helix?
@dadelcas: in the /instances API and therefore in the UI as well
@dadelcas: Helix manager returns 2 records for the controller as per my first message
@dadelcas: there is a single controller POD
@xiangfu0: can you paste a screen shot?
@xiangfu0: we will check on that
@dadelcas: let me see if I can pull a screenshot, I may need to blur some details though.. bear with me
@xiangfu0: sure, in general each Pinot pod registers itself, and that holds in the k8s world too: the fqdn is just its pod name, and svc names are external, so they shouldn’t be counted here
@dadelcas:
@dadelcas: I agree, the fact of using kubernetes should not change anything here
@dadelcas: Unfortunately I'm still too new to the code and I can't find a solution myself
@dadelcas: let me know if that helps
@xiangfu0: hmm, seems something goes wrong, I’ll check
@dadelcas: cheers!
@gqian3: Hi team, recently we experienced a Pinot server out-of-memory issue in a 4-server Pinot cluster when issuing a single query, select distinct id, over an entire table with only 1 billion records. We had to manually restart the Pinot server pods to recover. Is this normal? Is there some index or Pinot configuration we can add to this id column or to the Pinot cluster to prevent it from bringing down the entire cluster?
@ken: We ran into a similar issue on 0.6 when doing a `DISTINCTCOUNT` on a high cardinality field. Only the broker was jammed, so restarting that process fixed things. From the stack trace, it seemed like the problem was caused by very large responses from the server processes causing blockage at the broker network layer. I don’t think there was a way to prevent this from happening, at least with that version.
@sina.tamizi: @sina.tamizi has joined the channel

#pinot-dev

@mosiac: Hello, I'm looking into writing a custom kafka connector (used for connecting to a proprietary kafka service). What would be the best way of doing this while still being able to pull changes from the public repo? My first solution was to duplicate the pinot-kafka-2.0 module and adapt that, but this results in a lot of duplicate files, and if the original module changes I'll have to reimplement those changes in mine. My changes aren't that big: only KafkaPartitionLevelConnectionHandler and KafkaPartitionLevelStreamConfig, plus a version bump for the kafka client to 2.5
@mayanks: Maybe extend the existing impl and override whatever needs to be overridden?
@dadelcas: @dadelcas has joined the channel

#getting-started

@luisfernandez: when we insert data into pinot how is replication achieved? is it when a segment is completed that we make this data available to other nodes?
@kulbir.nijjer: Depends on the type of table - realtime vs. offline. For realtime, as many servers (consumers) as the replication factor start consuming data in parallel from the streaming source. Whenever a segment is completed, the controller gets notified; it picks one of the replica servers to commit the segment and also updates the segment store. For offline tables, since the segment is already generated earlier, replication simply controls which servers from the pool host the offline segment, and that is decided by the Controller. More details are defined here: as well as offline/batch data flow.
@kangren.chia: if I wish to use the native java client but can only have my broker/controller exposed outside of the cluster, is my only option to use `ConnectionFactory.fromHostList(brokerUrl)`? I'm not all that familiar with ZK and I don't see a way in the API to retrieve broker addresses from the zookeeper category of APIs exposed by the controller
@xiangfu0: you need to expose broker externally then use broker list to query

#releases

@dadelcas: @dadelcas has joined the channel