#general


@ken: Question about hierarchical aggregations. In Elasticsearch we can (for example) group by day, and then sub-group in each day by some attribute, and sum a metric, and get the top 10 results (for that sum of that metric) per day. It doesn’t seem possible to do this with Pinot, but wanted to confirm, as my SQL skills are pretty rusty, thanks!
  @g.kishore: order by within a sub group is not possible in Pinot. If you need this, it requires writing a special aggregation function
  @g.kishore: Do you have the elastic search query handy?
  @ken: Checking…
  @ken: Here’s one that buckets by gender, sub-buckets by state, aggregate medicare payments. I’d have to poke around more to find one that limits the states to the top N.
  @ken: ```
{
  "size": 0,
  "aggs": {
    "by_gender": {
      "terms": { "field": "nppes_provider_gender", "size": 20 },
      "aggs": {
        "by_state": {
          "terms": { "field": "nppes_provider_state", "size": 20 },
          "aggs": {
            "total_payment": {
              "sum": { "field": "medicare_payment_amt" }
            }
          }
        }
      }
    }
  }
}
```
  @ken: And as you might be able to tell, the JSON representation for aggregation queries in Elasticsearch is… not so clear.
  @g.kishore: reading
  @ken: I took a quick look at . So assuming I implemented a custom function that supported hierarchical grouping with a limit on sub-buckets, what would the PQL query look like? `select sum(metric) from mytable group by mycustomfunction(gender, state, <max states returned>)`?
  @g.kishore: it would be more like this
  @g.kishore: `select gender, myCustomFunction(metric, state, <max_states>) from mytable group by gender`
  @g.kishore: but in SQL terms this needs OVER (PARTITION BY) support, which is not supported in Pinot today
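For readers of the archive: in engines that do support window functions, the top-N-sub-groups-per-group query discussed above looks roughly like this. A sketch using SQLite from Python; the table and column names are invented to mirror the Elasticsearch example, not taken from any real schema:

```python
import sqlite3

# Toy table mirroring the Elasticsearch example above
# (table/column names are made up for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payments (gender TEXT, state TEXT, medicare_payment_amt REAL);
INSERT INTO payments VALUES
  ('F', 'CA', 100), ('F', 'CA', 50),
  ('F', 'NY', 120),
  ('F', 'TX', 30),
  ('M', 'NY', 500), ('M', 'NY', 1),
  ('M', 'CA', 10);
""")

# Top 2 states per gender by summed payments, using
# ROW_NUMBER() OVER (PARTITION BY ...) -- the construct Pinot lacked
# at the time of this discussion.
rows = conn.execute("""
SELECT gender, state, total_payment FROM (
  SELECT gender, state, total_payment,
         ROW_NUMBER() OVER (
           PARTITION BY gender
           ORDER BY total_payment DESC
         ) AS rn
  FROM (SELECT gender, state, SUM(medicare_payment_amt) AS total_payment
        FROM payments GROUP BY gender, state)
)
WHERE rn <= 2
ORDER BY gender, total_payment DESC
""").fetchall()

print(rows)
# [('F', 'CA', 150.0), ('F', 'NY', 120.0), ('M', 'NY', 501.0), ('M', 'CA', 10.0)]
```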
  @g.kishore: can you file a GitHub issue? this is a useful feature
  @ken: An issue for limiting a sub-group, or support for OVER?
  @ken: And when you say “in sql language…” do you mean it could work with PQL, or it’s a no-go currently no matter what?
  @ken: @g.kishore - heading out for a bit, many thanks for the useful info. I’ll follow up with a GitHub issue once I hear back, thanks again!
  @g.kishore: issue for limiting sub-group
  @ken: Got it. So to wrap up, is there currently any way to implement what I need (via a custom function), without adding OVER() support to Pinot?
  @g.kishore: yes, you can write a custom aggregation function
  @g.kishore:
  @ken: OK, now I understand what you meant by “but in sql language this is needs over…”. You were talking about the non-UDF approach. Cool, thanks again, I’ll file that issue about limiting sub-groups.
  @g.kishore: will be good to add a generic UDF here which takes a column(s) and another aggregation function
  @ken: So something like `subGroup(function, limit, columns…)`?
  @g.kishore: yes
  @ken: Any existing code that remotely resembles this?
  @g.kishore: you can see AvgAggregationFunction (simple), or the theta-sketch one, a more complex version which might resemble the multi-argument case
  @g.kishore: ```DistinctCountThetaSketchAggregationFunction```
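For the archive, here is a sketch of the *semantics* the proposed `subGroup(function, limit, columns…)` UDF would implement, written in plain Python rather than as a real Pinot `AggregationFunction`. All names here are hypothetical; the Pinot implementation would follow the `AggregationFunction` interface mentioned above:

```python
from collections import defaultdict

def sub_group(rows, group_col, sub_col, metric_col, agg=sum, limit=10):
    """Illustrates the hypothetical subGroup() UDF: for each value of
    group_col, aggregate metric_col per sub_col, then keep only the
    top-`limit` sub-groups by aggregated value."""
    buckets = defaultdict(lambda: defaultdict(list))
    for row in rows:
        buckets[row[group_col]][row[sub_col]].append(row[metric_col])
    return {
        group: sorted(
            ((sub, agg(vals)) for sub, vals in subs.items()),
            key=lambda kv: kv[1], reverse=True,
        )[:limit]
        for group, subs in buckets.items()
    }

rows = [
    {"gender": "F", "state": "CA", "payment": 100},
    {"gender": "F", "state": "CA", "payment": 50},
    {"gender": "F", "state": "NY", "payment": 120},
    {"gender": "F", "state": "TX", "payment": 30},
    {"gender": "M", "state": "NY", "payment": 500},
    {"gender": "M", "state": "CA", "payment": 10},
]
result = sub_group(rows, "gender", "state", "payment", agg=sum, limit=2)
print(result)
# {'F': [('CA', 150), ('NY', 120)], 'M': [('NY', 500), ('CA', 10)]}
```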
@pabraham.usa: hello, I am sending Kafka JSON messages with the following structure, consumed by Pinot: `{data: string, timestamp: yyyy-mm-dd hh:mm:ss:ms zone, attra: string, mainattr: {attra: string, attrb: string, attrc: string}}`. Would like to check whether the "timeColumnName" should match the "timestamp" from the Kafka message. If it has to match, how do I specify the date format? ```"segmentsConfig": { "timeColumnName": "mergedTimeMillis", "timeType": "MILLISECONDS",```
  @fx19880617: you can create the field with simpledateformat() or have a transform function during ingestion to convert it to epoch value ()
  @pabraham.usa: great let me check this , Thanks @fx19880617
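For the archive, a sketch of what the two suggestions above can look like. The column names are taken from the question, but the exact `SimpleDateFormat` pattern and the config layout vary by Pinot version, so treat this as illustrative rather than authoritative. Option 1, declaring the string time column directly in the schema:

```json
"dateTimeFieldSpecs": [
  {
    "name": "timestamp",
    "dataType": "STRING",
    "format": "1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS Z",
    "granularity": "1:MILLISECONDS"
  }
]
```

Option 2, converting to epoch millis with an ingestion transform in the table config (the pattern string is an assumption about the incoming data):

```json
"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "mergedTimeMillis",
      "transformFunction": "fromDateTime(\"timestamp\", 'yyyy-MM-dd HH:mm:ss.SSS Z')"
    }
  ]
}
```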
@mboyanna: @mboyanna has joined the channel
@monisb: @monisb has joined the channel

#random


@mboyanna: @mboyanna has joined the channel
@monisb: @monisb has joined the channel

#troubleshooting


@elon.azoulay: Hi, is there a way to throttle Pinot ingestion? Our pinot server pods were in a crash loop due to a spike in ingestion; we resolved it by scaling up, but we were thinking creating backpressure would be more acceptable. lmk if there is a way to do this. btw, pinot rebalance is amazing, saved us again, no data loss:) Thanks!
  @mayanks: @moradi.sajjad Is working on throttling RT ingestion.
  @elon.azoulay: thats great!
@moradi.sajjad: @moradi.sajjad has joined the channel
@tanmay.movva: Hello, I was using the realtime provisioning tool with the following sample data ```
Size of segment directory = 239.4mb
Number of documents = 3529197
retention = 30 days
ingestion rate = 1000
numPartitions = 1
Table Replicas = 1
``` I am running this by building pinot from source and the results are quite surprising.
  @tanmay.movva:
  @tanmay.movva: 81g of server memory per host looked suspicious, and upon looking into the code I observed that it considers the segment directories to be present in memory completely. That is ```active memory per host = Number of active (consuming + completed/retained) segments * Size of the segment directory```
  @tanmay.movva: From what I understand, these segment directories should be present on the server’s disk, with the indexes mmap’d into memory, so that at query time segments can be paged in on demand. Please correct me if I am wrong.
  @tanmay.movva: Can anyone please help me out here? Let me know if I have to raise an issue on github for this. Thanks!!
  @g.kishore: That formula does not seem right @npawar ^^
  @ssubrama: The formula is correct. Pinot maps all segments into memory. Since you have 30d retention, the code expects that at any time, all of the 30d will be used for queries. If this is not the case, you can choose to reduce the ratio of mapped vs active memory by increasing the total amount of memory available, in the command line argument. (Btw, if you cut/paste the entire output it helps better).
  @ssubrama: The more paging you have, the higher the latency is likely to be. Consuming segments are read-write memory, so mapping them to disk can cause paging as rows are ingested, since writes happen. To improve performance, you may want to map the memory for consuming segments on a `tmpfs` file system. Completed segments do not take writes, so you can tune their mapping with as much memory as you would like, to the point where it does not affect your latency.
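Illustratively, plugging the thread’s sample inputs into the formula quoted above gives a rough back-of-the-envelope mapped-memory estimate. This is only a sketch of that one formula; the actual provisioning tool accounts for more (segment resizing, hosts, consuming-segment formats), so its 81g figure comes out differently:

```python
import math

# Back-of-the-envelope for: active memory per host ~= num active segments * segment size.
# Inputs are the sample numbers from the thread above.
segment_size_mb = 239.4
docs_per_segment = 3_529_197
ingestion_rate = 1000          # docs/sec
retention_days = 30

docs_per_day = ingestion_rate * 86_400
active_segments = math.ceil(docs_per_day * retention_days / docs_per_segment)
mapped_memory_gb = active_segments * segment_size_mb / 1024

print(active_segments, round(mapped_memory_gb, 1))
```

Since all retained segments are mmap’d, the whole 30-day retention window contributes to the mapped footprint, which is why the tool’s per-host memory numbers look so large.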

#discuss-validation


@pabraham.usa: @pabraham.usa has joined the channel
