#general
@sunhee.bigdata: Hi :-) I am new to Pinot. I am trying to test ACLs. I want to set a table ACL on a user. I checked that the controller and broker have ACL configs, but whenever I add a table or change a table's ACL, do I have to restart the controller/broker?
@mayanks: A restart is needed only when you have to modify the controller/broker/server configs (e.g. add a new user).
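For reference, a minimal sketch of the kind of per-user table ACL being discussed, using Pinot's basic-auth access control on the controller. The property names follow the basic-auth docs, but treat the exact keys, user names, and table name as assumptions to verify against your Pinot version:

```properties
# controller.conf -- basic-auth access control (keys assumed from the basic-auth docs)
controller.admin.access.control.factory.class=org.apache.pinot.controller.api.access.BasicAuthAccessControlFactory
controller.admin.access.control.principals=admin,user
controller.admin.access.control.principals.admin.password=verysecret
controller.admin.access.control.principals.user.password=secret
# restrict "user" to a specific table, read-only
controller.admin.access.control.principals.user.tables=myTable
controller.admin.access.control.principals.user.permissions=READ
```

Editing these properties counts as a controller config change, which is the restart case described above.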
@apte.kaivalya: Hello, I want to run the RealtimeProvisioningHelper. Where can I find a sample completed segment?
@mark.needham: It's not too tricky to create one. We have a recipe here showing how to create one from JSON files -
@apte.kaivalya: Thank you, I will take a look
@ssubrama: The helper also auto-generates data given the schema. @moradi.sajjad
@ssubrama: @apte.kaivalya just run the RealtimeProvisioningHelper with `--help` and you will get the options to auto-generate data
@apte.kaivalya: Awesome, thanks @ssubrama
@moradi.sajjad: As Subbu mentioned, if you don't have a segment in hand, you can provide a schema that is decorated with data characteristics. The tool then generates a segment based on the provided characteristics behind the scenes and proceeds with the realtime analysis. To use data generation mode, instead of the `-sampleCompletedSegmentDir` parameter, you need to provide the `-schemaWithMetadataFile` parameter.
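A hypothetical invocation along those lines (only `--help`, `-sampleCompletedSegmentDir`, and `-schemaWithMetadataFile` come from the messages above; the table-config flag and paths are assumptions, so check `--help` on your version first):

```bash
# list all options for your Pinot version
bin/pinot-admin.sh RealtimeProvisioningHelper --help

# data-generation mode: pass a schema decorated with data characteristics
# instead of a sample completed segment
bin/pinot-admin.sh RealtimeProvisioningHelper \
    -tableConfigFile /path/to/table-config.json \
    -schemaWithMetadataFile /path/to/schema-with-metadata.json
```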
@ssubrama: I have updated the document to reflect this
#troubleshooting
@diogo.baeder: Hi folks! I'm having an issue which is, my queries are timing out because they run past the default timeout set for queries to Pinot; I know that it's possible to use for example `option(timeoutMs=60000)` after a table name to increase the timeouts, but the problem is, I'm using the SQLAlchemy library in my Python project and haven't yet found a way to make it compile that into the query. Is there some other way to increase the timeouts on a per-query basis? Something like a `SET TIMEOUT=60000` query I can execute before my normal query?
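One rough workaround sketch, given the per-query `option(...)` syntax mentioned above: send the statement as a raw `text()` query so SQLAlchemy's compiler never has to know about the option. The connection URL, table, and columns are placeholders, and whether `pinotdb` passes the option string through untouched is an assumption to verify:

```python
from sqlalchemy import create_engine, text

# placeholder broker/controller endpoints for a pinotdb SQLAlchemy connection
engine = create_engine(
    "pinot://localhost:8099/query/sql?controller=http://localhost:9000/"
)

# append the per-query option to a raw text() query, bypassing the expression compiler
query = text("SELECT ts, userId FROM myTable WHERE ts > 0 option(timeoutMs=60000)")

with engine.connect() as conn:
    rows = conn.execute(query).fetchall()
```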
@mayanks: I’d first try to improve the query latency before increasing timeout
@mayanks: Is there a set of expensive queries that always time out even when there's no load on the cluster? If so, can you share the query and response metadata?
@diogo.baeder: Ideally that's what I want to do, but I need to increase the timeout as a stop-gap for now, otherwise I'll have to spend days or weeks until I can get my latency down, because more serious changes will probably have to be made
@diogo.baeder: What I need to do is fetch a truckload of data, up to 100M rows, and then aggregate it inside my application (because the aggregation logic is more complex than what Pinot is able to do on its own), and because of the amount of data Pinot is timing out
@mayanks: I think there are broker-level timeout configs you can set. However I am worried increasing the timeout may adversely impact the cluster, because fewer queries will run in parallel (due to the expensive queries), and that may cause more backlog
@mayanks: What’s your read qps
@diogo.baeder: My read QPS is really low; there will hardly ever be multiple people running queries at the same time, so a bit of extra latency is not a big issue for us. Even if it takes like 1 minute to run.
@diogo.baeder: But I'll try to increase the broker timeout config then, I think it might be enough for now. Later on I want to run some database recommendations so that I can find out how I can optimize it further, but this is for the future, I can't work on that now.
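For reference, the broker-level timeout mentioned above is a single broker config property (the value here is just an example):

```properties
# broker.conf: raise the default query timeout to 60s
pinot.broker.timeoutMs=60000
```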
@diogo.baeder: (At the moment I'm trying to wrap up my stuff so that I can finally have some analytics UI available for our internal users)
@diogo.baeder: Folks, if I have Pinot running on an EKS cluster, with 2 servers running as pods, is it safe to just restart them? Should I do anything before doing that? I'm asking because I'm getting them marked as "Dead" in the controller UI
@diogo.baeder: They died because of OOM, maybe because of caching, I don't know
@npawar: it is safe to restart them.
@npawar: would be good to capture the logs to identify why the OOM happened, though
@diogo.baeder: Probably because of a mistake I made: during the development of an app that uses it, I mistakenly organized the query in a way that ended up discarding the constraints, especially the `timestamp` ones, so it ended up fetching some of the columns all the way from the beginning
@diogo.baeder: Sorry to spam you guys here, but yet another question: if I see something like this in the broker logs: `requestId=14,table=<redacted>,timeMs=545,docs=259503/9327428,entries=3080570/1038012,segments` does this mean that the query took 545ms to yield a result? Or does it just mean that the broker processed the query in that time and then sent the data queries to the servers? I'm asking this because getting all the data into my application (+ SQLAlchemy processing time) took about 40s, so I'm wondering where all that time is being spent... (I might just do some profiling on my side, but I'm asking here because I want to have a better understanding of the logs I get from Pinot)
@mayanks: Total end-to-end latency as seen by the broker is 545ms
@mayanks: Is there complex JSON de-serialization happening on the client side?
@diogo.baeder: Ah, got it! Yeah, I did some profiling; there are some parts of the `pinotdb` library that could be improved, I think, but also a lot of inefficient processing on my side too - I first convert timestamps into `datetime` objects and then do the aggregation on them, when I should actually be doing the inverse: first aggregating and only then converting to `datetime`. The reason I do this is that it's for analysing user sessions on our website, where each session is a chunk of requests no more than 30min apart, and since I didn't find any function in Pinot that could do this sort of aggregation I'm doing it in Python. But I reckon I could do much better than this.
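A minimal sketch of the "aggregate first, convert to `datetime` later" idea for the 30-minute session split described above (column layout and epoch-millisecond timestamps are assumptions):

```python
SESSION_GAP_MS = 30 * 60 * 1000  # 30 minutes, matching the session definition above

def split_into_sessions(timestamps_ms):
    """Group epoch-millisecond timestamps into sessions where consecutive
    requests are no more than 30 minutes apart."""
    sessions, current = [], []
    for ts in sorted(timestamps_ms):
        if current and ts - current[-1] > SESSION_GAP_MS:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

# convert to datetime (or do anything else expensive) only after aggregating:
# sessions = split_into_sessions(row[0] for row in rows)
```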
@diogo.baeder: But if you know if it's possible to do something like this on the Pinot side (broker perhaps), I'd happily favor that instead :slightly_smiling_face:
@mayanks: I am not fully sure I understand the requirement, but you can always do datetime transforms on the Pinot side.
@diogo.baeder: Yeah, I know, I already do that, but it transforms the data into a string, which I then convert into a Python `datetime.datetime` instance (which is not a string). But don't worry :slightly_smiling_face: By the way, I just found a quick and dirty, but somewhat reliable, way to cut 10s off those 40s just by accessing some internals of `pinotdb` :smile:
@diogo.baeder: There's a problem in the library: when iterating over the results it gets from the Broker API, it keeps popping the first element of a `list`, which is inefficient in Python. I'll try to improve that in the library soon if I find the time; there are other collections in Python that would be more appropriate.
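A tiny illustration of that inefficiency and the usual fix, purely illustrative and not the actual `pinotdb` code:

```python
from collections import deque

rows = list(range(50_000))

# the pattern described above: popping the head of a list is O(n) per pop,
# so draining the whole result set is O(n^2)
buf = list(rows)
while buf:
    row = buf.pop(0)

# deque.popleft() is O(1), so draining is O(n)
buf = deque(rows)
while buf:
    row = buf.popleft()
```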