<h3><u>#general</u></h3><br><strong>@fra.costa: </strong>@fra.costa has joined the channel<br><h3><u>#random</u></h3><br><strong>@fra.costa: </strong>@fra.costa has joined the channel<br><h3><u>#troubleshooting</u></h3><br><strong>@g.kishore: </strong>How long does it take?<br><strong>@g.kishore: </strong>You can create another column for hour and add a star-tree index if you want something really fast<br><strong>@pradeepgv42: </strong>it takes close to 30 seconds for weeks' worth of data, but I am being a bit miserly on cpu (doubling it reduced the latency by half; 16 cores). Yeah, will probably add an additional column or cache on the client side.<br><strong>@pradeepgv42: </strong>thanks<br><strong>@g.kishore: </strong>A range index on the time column might also help, but not sure how much, since the data is already time partitioned<br><strong>@pradeepgv42: </strong>yeah, true, a star tree with (date, hour) might be easier and faster it seems<br><strong>@fx19880617: </strong>how many groups do you have here?<br><strong>@pradeepgv42: </strong>For time range queries it should be less than ~1000, for other group-by queries ~10000<br><strong>@elon.azoulay: </strong>qq - are there any performance differences between dimensionFields and metricFields? We had a user create a table with only dimensionFields and star tree indexes, etc. and didn't see any problems. Should the fields that were aggregated on (i.e. sum__<column>) have been metric fields?<br><strong>@elon.azoulay: </strong>In the docs it just says metric columns "typically" appear in aggregations, but in this case dimension columns appear in the aggregations - and we don't see any issue with that. Are there any caveats to aggregating on dimension columns or grouping by metric columns?<br><h3><u>#dhill-date-seg</u></h3><br><strong>@npawar: </strong>@npawar has left the channel<br><h3><u>#pinot-dev</u></h3><br><strong>@tingchen: </strong><!here> I am working on the release notes for the coming Pinot 0.5 release. Here is a doc containing all the PRs merged since 0.4 for your reference <https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMc9VK8AZw4xfCWnhVjqO8F2-2BFy089M5oqAvPMy33xfxW8FTDbYaPQ9CuX3FIc-2Fzc79-2F7m9BH8sqPl1NqCth-2F5m-2FE0GMJIzirtdLmh5jmRmBATYP4_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTyBjIMQak4eVPIffkZZmgj8-2Frt-2BwMSGCsK9bzLbvOMJB2uv-2B2ak0HDgoEKzmxVjvXibYT8O39NGTEfRq9HVbHMLbr-2BpPgGXrO84YMLjB-2BgrNYxM-2FqtPoIQbyEWlLPewZTPPRwAbwFRAz0IpIsSd5-2F-2BbB5zQHKgauNIluyydHli1MC8eZdJLMtyrRizueEqYjQY-3D>. Can you please email me at <mailto:[email protected]|[email protected]> or reply to a similar email thread on the dev@ list if you have any of the following in the PR list?<br><strong>@tingchen: </strong>• Important features added in the new release. • Any deprecated configuration or external methods, noting that these may be removed later. • Any configuration/methods that were deprecated earlier and have been removed in this release. • Any special notes for upgrading from previous releases (e.g. for a zero-downtime upgrade, please upgrade to release X before this one). • Other special notes<br><strong>@g.kishore: </strong>do git log oneline<br><strong>@g.kishore: </strong>```git log --pretty=oneline```<br><strong>@tingchen: </strong>that will make the PR list look better, I guess. You can also search your id in the 50-page doc to see the PRs you have submitted.<br>
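For anyone pulling together their own list for the release notes, a minimal sketch of the git log approach mentioned above; the `release-0.4.0` tag name and the author pattern are assumptions, so substitute whatever matches your checkout and git identity:

```
# one line per commit since the 0.4 release (assumed tag name)
git log --pretty=oneline release-0.4.0..HEAD

# narrow the list to your own commits
git log --pretty=oneline --author="<your-git-id>" release-0.4.0..HEAD
```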
<h3><u>#release-certifier</u></h3><br><strong>@hara.acharya: </strong>have created the initial framework and the lib with a cluster-creation method, and also have logging enabled<br><h3><u>#pinot-docs</u></h3><br><strong>@g.kishore: </strong>@mayanks<br><strong>@mayanks: </strong>@mayanks has joined the channel<br><strong>@g.kishore: </strong>@ssubrama<br><strong>@ssubrama: </strong>@ssubrama has joined the channel<br><strong>@g.kishore: </strong>@steotia<br><strong>@steotia: </strong>@steotia has joined the channel<br><strong>@g.kishore: </strong>Hi @steotia @mayanks @ssubrama, Kartik is working on the docs and adding a configuration reference for the controller, server and broker<br><strong>@g.kishore: </strong>he needs some help with some of the properties<br><strong>@g.kishore: </strong><https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMdTeAXadp8BL3QinSdRtJdp60AJx08J5D0h7qkwrxysZi1MZSkbl4JFUtJ8U-2FoWTVgoCFqcP56mjeZC9WiZvM7c-3DimPZ_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTyBjIMQak4eVPIffkZZmgj8UjRhdu-2F-2FlbPntuoa0O1ngPAWB-2FbTNsnpmKM-2F65z-2Fk4n1lWdYkYu7-2B2ZyQiA-2B-2BwzpSoDWXex7S-2BKdqBlI1Ffcp5tbd5Jsa44k1s-2BFVo1f08OqOS2XBAdkveB4-2FC5IXuOwrF8SyFOMD9aJAc167Qg830edisYHC9pD3r8eZmd8J2I-3D> <https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMdTeAXadp8BL3QinSdRtJdp60AJx08J5D0h7qkwrxysZ9HoeYMvqwNg0Ruekl7GXWt3W-2FZIOHPR0DQRlVGHAHAg-3D-DeQ_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTyBjIMQak4eVPIffkZZmgj8Npd6RC5y7ON1J-2BXn2v-2B1eZtkwczv39osctV9gXKSlTgbOI6A-2B3sPdE72k5tCvrNZsq7uKBXy-2BcHh3iZUCdepWR3xgaC09QVgs-2BBLiNzGLcetQf6QDIrnJlb09c0IaWlCLUhWuaxO-2Bkrn9NNa-2BVWbc0qHopISd3RtMbwIuqi0iLs-3D> <https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMdTeAXadp8BL3QinSdRtJdp60AJx08J5D0h7qkwrxysZucSma-2FCJEuYTtSasVHSlu3BV7wymQPbmxUsk7N5Fl8Q-3DzRXC_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTyBjIMQak4eVPIffkZZmgj8jKSjgHtepRam2LxZLLwRcaY-2Ffh3t2lFxCxse2c3ajnSWFBnUgvNL-2BKOAAzBk42y5DeXBNAbrhpNFGaWOEx10SiWIoLgYwX82wvOHltSfOk6VS0CRvqpnPYOgSROIEmXLCAYmQLhOJB0LyP2or6hqDghPoFLMqAHU6fsr-2B0tWWuE-3D><br><strong>@g.kishore: </strong>can someone help with the description?<br><strong>@g.kishore: </strong>@npawar<br><strong>@npawar: </strong>@npawar has joined the channel<br><h3><u>#multiple_streams</u></h3><br><strong>@anthonypt87: </strong>@anthonypt87 has joined the channel<br><strong>@g.kishore: </strong>@g.kishore has joined the channel<br><strong>@g.kishore: </strong>Can you provide more info on ingest mode/rate?<br><strong>@g.kishore: </strong>Also, what kind of queries do you plan to run<br><strong>@g.kishore: </strong>Also some info on how the data is getting generated<br><strong>@anthonypt87: </strong>• Let’s say we had order events from both sources with order id, store_id, org_id, user name, total, date (orgs consist of many stores) • Fast data source: we’re receiving it in realtime as orders are happening (~1 event per second) • Accurate data source: we get it in bunches, cronned daily (this can be on the order of ~10000 at the beginning of the day) • Common queries we want are total orders by store_id or org_id in date ranges like 1 day, 7 days, 30 days.
Get an order by order id. We have on the order of 1000s of organizations and 10000s of stores<br><strong>@anthonypt87: </strong>thanks again kishore for taking the time to help!<br><strong>@g.kishore: </strong>I see<br><strong>@g.kishore: </strong>This should be easy<br><strong>@g.kishore: </strong>You can have a concept of event time in your schema<br><strong>@g.kishore: </strong>When you push daily data to the offline table, Pinot will automatically use that<br><strong>@g.kishore: </strong>This is how hybrid tables work in Pinot<br><strong>@surendra: </strong>@surendra has joined the channel<br><strong>@anthonypt87: </strong>interesting. I have to read more about this<br><strong>@g.kishore: </strong>See this page in the docs<br><strong>@anthonypt87: </strong>does pinot just choose data in one data source over the other?<br><strong>@g.kishore: </strong><https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMdTeAXadp8BL3QinSdRtJdpzMyfof-2FnbcthTx3PKzMZINIDBWitJubpzb4wHRHhZoBqjPgernI234COh-2BttwZkaHCMzHiuUI9GGbIGpqTMc8STl48r2lJvReZ8fJO5meS5Wkh3CLZO4I8DjoJ0ofmPw-3DdiSM_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTyBjIMQak4eVPIffkZZmgj8pNfJEZRy21UM04E49toCzEwEdVL0m-2F1fsWiHPLjmQ1jZBAVZbu-2FIN3EMDq4UdQl-2BohjUbJ4n-2BrR2WrN14ctSLNURcJdhl2MnFHk2ZhnqnESYoMn7roRG8l9i0IOLEUa8r8fFPYVogbSM0scmy8XAnMwyYZVg0Vbn92dYpKlfbjg-3D><br><strong>@g.kishore: </strong>Yes, it picks data in the offline table over the real-time table<br><strong>@anthonypt87: </strong>oh interesting<br><strong>@anthonypt87: </strong>this actually may be exactly what we want<br><strong>@anthonypt87: </strong>adding my colleague to this room. @fra.costa ^<br><strong>@fra.costa: </strong>@fra.costa has joined the channel<br><strong>@anthonypt87: </strong>thanks so much for steering us in this direction kishore!<br><strong>@anthonypt87: </strong>this is just hypothetical/educational, but I was wondering how we can handle this if the data actually wasn’t “offline”<br><strong>@anthonypt87: </strong>and we had two realtime streams?<br><strong>@anthonypt87: </strong>where we preferred one over the other<br><strong>@g.kishore: </strong>It’s hard without upsert<br><strong>@g.kishore: </strong>You can write a separate flink job<br><strong>@g.kishore: </strong>That merges the streams using a local store<br><strong>@g.kishore: </strong>And writes the events to another Kafka topic<br><strong>@anthonypt87: </strong>it’s tricky because we would still want the “fast data” first<br><strong>@anthonypt87: </strong>if it’s available<br><strong>@anthonypt87: </strong>are you talking about some sort of windowing?<br><strong>@g.kishore: </strong>Yeah, that will work<br><strong>@g.kishore: </strong>You maintain the order ids that have already been emitted<br><strong>@g.kishore: </strong>Another option is dedup during ingestion based on order id<br><strong>@anthonypt87: </strong>would that delay data, since we’d have to wait until the “accurate” data source is available?<br><strong>@g.kishore: </strong>No<br><strong>@g.kishore: </strong>You write data as it comes in<br><strong>@g.kishore: </strong>You just check if it already exists<br><strong>@g.kishore: </strong>You need another source to maintain the list of order ids already ingested in the last X days<br><strong>@anthonypt87: </strong>hmm, just to make sure I understand. say you had fast data first and then accurate data come in for an order w/ id “1”<br><strong>@anthonypt87: </strong>fast data would come in and in your flink job you’d check the local data store. it doesn’t exist, so you would store it in the local intermediate data store and then also write it out to your main one that serves your queries<br>
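To make the flow sketched above concrete, here is a minimal example of the kind of Flink dedup step Kishore describes: emit the first record seen per order id, skip later duplicates, and forward the result to the merged Kafka topic's sink. The `Order` type, field names, and TTL are illustrative assumptions, not the actual pipeline:

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Assumed event type; the real schema also carries store_id, org_id, total, date, etc.
class Order {
    public String orderId;
    public double total;
}

// Emits an order the first time its id is seen and drops later duplicates.
// The keyed state plays the role of the "local intermediate data store" above.
public class OrderDedupFunction extends KeyedProcessFunction<String, Order, Order> {
    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Boolean> desc =
                new ValueStateDescriptor<>("seen-order-ids", Boolean.class);
        // Expire ids after X days so the state does not grow without bound.
        desc.enableTimeToLive(StateTtlConfig.newBuilder(Time.days(30)).build());
        seen = getRuntimeContext().getState(desc);
    }

    @Override
    public void processElement(Order order, Context ctx, Collector<Order> out) throws Exception {
        if (seen.value() == null) {
            seen.update(true);
            out.collect(order); // forward to the sink for the merged Kafka topic
        }
        // else: this order id was already emitted, skip it
    }
}

// Usage sketch:
// fastStream.union(accurateStream).keyBy(o -> o.orderId).process(new OrderDedupFunction())
```

As the thread notes next, plain skip-the-duplicate is not enough when the accurate record should win; that is where upsert or the batch merge discussed later comes in.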
<strong>@anthonypt87: </strong>what do you do when accurate data comes in later? you check if it exists in the local data store and it does. what’s the next step?<br><strong>@g.kishore: </strong>Skip it<br><strong>@anthonypt87: </strong>ah, but I prefer data from the accurate data store<br><strong>@g.kishore: </strong>Why?<br><strong>@g.kishore: </strong>It has more fields or it’s really more accurate?<br><strong>@anthonypt87: </strong>the accurate data may have more accurate totals<br><strong>@anthonypt87: </strong>both<br><strong>@g.kishore: </strong>Got it<br><strong>@g.kishore: </strong>My bad, I missed that part<br><strong>@anthonypt87: </strong>no worries!<br><strong>@g.kishore: </strong>Then you need upsert<br><strong>@anthonypt87: </strong>yikes. would the hybrid table still be a viable solution?<br><strong>@anthonypt87: </strong>it looks like if we set up the hybrid tables to operate over time periods, when it switches segments from online -> offline, it just uses the offline data based on the time period<br><strong>@anthonypt87: </strong>so we’d lose information only found in the realtime segment (since the sets may not overlap)<br><strong>@g.kishore: </strong>Yes, the hybrid solution will work<br><strong>@g.kishore: </strong>Right<br><strong>@anthonypt87: </strong>since we care about the superset of orders across both segments, will it still work even if we push the offline segment to replace the recent time period?<br><strong>@anthonypt87: </strong>(i.e. essentially what we want is deduping on order id)<br><strong>@vallamsetty: </strong>@vallamsetty has joined the channel<br><strong>@anthonypt87: </strong>just continuing the discussion from yesterday, how do we see hybrid tables working in this case? e.g. we want all orders in the past 30 days (some orders are in the offline system, some in the online, and some in both)<br><strong>@g.kishore: </strong>the accurate data you get at the end of the day is a superset of the fast data, right?<br><strong>@anthonypt87: </strong>it is not, unfortunately<br><strong>@anthonypt87: </strong>we get the accurate data from a separate source, and it may not include all the orders in the fast data<br><strong>@g.kishore: </strong>ah<br><strong>@g.kishore: </strong>ok, then you need to have a job that merges the fast data segments and the accurate source<br><strong>@g.kishore: </strong>the fast data segments are available in the deep store as they get flushed<br><strong>@g.kishore: </strong>the hybrid concept will still work, it’s just that you have to do some additional work while generating the batch segment<br><strong>@anthonypt87: </strong>oooh I see<br><strong>@anthonypt87: </strong>so it’s two parts<br>
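A hypothetical sketch of the "additional work while generating the batch segment" described just above: merge the fast-data rows read back from the flushed realtime segments with the accurate daily drop, letting the accurate row win when both carry the same order id. The `Order` record and method names are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DailyOrderMerge {
    // Minimal stand-in for the real event schema.
    record Order(String orderId, String storeId, String orgId, double total, String date) {}

    // Output: one row per order id, with the accurate row preferred on conflict,
    // ready to be built into the daily offline segment.
    static List<Order> merge(List<Order> fastData, List<Order> accurateData) {
        Map<String, Order> byOrderId = new LinkedHashMap<>();
        for (Order o : fastData) {
            byOrderId.put(o.orderId(), o);
        }
        for (Order o : accurateData) {
            byOrderId.put(o.orderId(), o); // overwrites the fast-data row
        }
        return new ArrayList<>(byOrderId.values());
    }
}
```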
<strong>@anthonypt87: </strong><br><strong>@anthonypt87: </strong>something like this. haha sorry, literally back of the envelope<br><strong>@anthonypt87: </strong>I think that’ll work<br><strong>@anthonypt87: </strong>thanks kishore :muscle:<br><strong>@g.kishore: </strong>conceptually yes<br><strong>@g.kishore: </strong>but let me draw a better diagram<br><strong>@g.kishore: </strong>hopefully :wink:<br><strong>@g.kishore: </strong><br><strong>@anthonypt87: </strong>haha thanks kishore. Lucidchart?<br><strong>@g.kishore: </strong><https://u17000708.ct.sendgrid.net/ls/click?upn=iSrCRfgZvz-2BV64a3Rv7HYagbhNPz9W4U6hVglMW76xY-3DDT4__vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTyBjIMQak4eVPIffkZZmgj8h1hjXw1rAKD9YH-2FBp3eKQR9cqkIXOpCM9V1pwjiEJOhnFH3bw7YyeVd8gAZb-2FmgyOdrK4fsjpsD-2BbIWmOMn8HhirwJ5G8fREo9IVxhwiidBC5BN0rH-2FjZPY6xz-2BzntRYsPji-2BX9KCxVrZz6IZJ-2FgtdRmH0rGPAAcDKvjvH7ttn4-3D><br><strong>@fra.costa: </strong>@g.kishore is deleting rows (or, more generally, segments) something that Pinot supports? I found this config parameter for the retention of deleted segments, but I can’t quite work out operationally how you go about doing it.
In other words, is there a way I can delete a set of rows between two points in time?<br><strong>@g.kishore: </strong>We don’t support that<br><strong>@g.kishore: </strong>You can delete at segment granularity<br><strong>@fra.costa: </strong>Anthony pointed out that the offline rows win over the live ones; my question about deletes was really to address that at serving time, and it seems there is no need to delete<br><strong>@fra.costa: </strong>thanks for your response!<br><strong>@g.kishore: </strong>ah, got it<br><strong>@anthonypt87: </strong>@elon.azoulay as an FYI ^<br><strong>@elon.azoulay: </strong>@elon.azoulay has joined the channel<br>
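For reference on the segment-granularity deletes and retention mentioned above, a hedged sketch; the REST path, host, port, table and segment names are assumptions from memory, so verify against your controller's Swagger UI and the Pinot docs:

```
# drop a single segment of an offline table via the controller REST API
curl -X DELETE "http://<controller-host>:9000/segments/myTable_OFFLINE/<segmentName>"

# or let old segments age out automatically with time-based retention in the
# table config's segmentsConfig:
#   "retentionTimeUnit": "DAYS",
#   "retentionTimeValue": "30"
```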
