<h3><u>#general</u></h3><br><strong>@fra.costa: </strong>@fra.costa has joined the channel<br><h3><u>#random</u></h3><br><strong>@fra.costa: </strong>@fra.costa has joined the channel<br><h3><u>#troubleshooting</u></h3><br><strong>@g.kishore: </strong>How long does it take?<br><strong>@g.kishore: </strong>You can create another column for hour and add a star-tree index if you want something really fast<br><strong>@pradeepgv42: </strong>it takes close to 30 seconds for weeks' worth of data, but I am being a bit miserly on cpu (doubling it reduced the latency by half; 16 cores). Yeah, will probably add an additional column or cache on the client side.<br><strong>@pradeepgv42: </strong>thanks<br><strong>@g.kishore: </strong>A range index on the time column might also help, but not sure how much, since the data is already time partitioned<br><strong>@pradeepgv42: </strong>yeah, true, a star tree with (date, hour) might be easier and faster it seems<br><strong>@fx19880617: </strong>how many groups do you have here?<br><strong>@pradeepgv42: </strong>For time range queries it should be less than ~1000, for other group-by queries ~10000<br><strong>@elon.azoulay: </strong>qq - are there any performance differences between dimensionFields and metricFields? We had a user create a table with only dimensionFields and star tree indexes, etc. and didn't see any problems. Should the fields that were aggregated on (i.e. sum__<column>) have been metric fields?<br><strong>@elon.azoulay: </strong>In the docs it just says metric columns "typically" appear in aggregations, but in this case dimension columns appear in the aggregations - and we don't see any issue with that. Are there any caveats to aggregating on dimension columns or grouping by metric columns?<br><h3><u>#dhill-date-seg</u></h3><br><strong>@npawar: </strong>@npawar has left the channel<br><h3><u>#pinot-dev</u></h3><br><strong>@tingchen: </strong><!here> I am working on the release notes for the coming Pinot 0.5 release. Here is a doc containing all the PRs merged since 0.4 for your reference <https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMc9VK8AZw4xfCWnhVjqO8F2-2BFy089M5oqAvPMy33xfxW8FTDbYaPQ9CuX3FIc-2Fzc79-2F7m9BH8sqPl1NqCth-2F5m-2FE0GMJIzirtdLmh5jmRmBATYP4_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTyBjIMQak4eVPIffkZZmgj8-2Frt-2BwMSGCsK9bzLbvOMJB2uv-2B2ak0HDgoEKzmxVjvXibYT8O39NGTEfRq9HVbHMLbr-2BpPgGXrO84YMLjB-2BgrNYxM-2FqtPoIQbyEWlLPewZTPPRwAbwFRAz0IpIsSd5-2F-2BbB5zQHKgauNIluyydHli1MC8eZdJLMtyrRizueEqYjQY-3D>. Can you please email me at <mailto:[email protected]|[email protected]> or reply to a similar email thread on the dev@ list if you have any of the following in the PR list?<br><strong>@tingchen: </strong>• Important features added in the new release. • Any deprecated configuration or external methods, noting that these may be removed later. • Any configuration/methods that were deprecated earlier and have been removed in this release. • Any special notes for upgrading from previous releases (e.g. for a zero-downtime upgrade, please upgrade to release X before this one). • Other special notes<br><strong>@g.kishore: </strong>do git log oneline<br><strong>@g.kishore: </strong>```git log --pretty=oneline```<br><strong>@tingchen: </strong>that will make the PR list look better, I guess. You can also search your id in the 50-page doc to see the PRs you have submitted.<br>
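For anyone pulling together their own list for the release notes, a minimal sketch of the git log approach mentioned above; the `release-0.4.0` tag name and the author pattern are assumptions, so substitute whatever matches your checkout and git identity:

```
# one line per commit since the 0.4 release (assumed tag name)
git log --pretty=oneline release-0.4.0..HEAD

# narrow the list to your own commits
git log --pretty=oneline --author="<your-git-id>" release-0.4.0..HEAD
```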
<h3><u>#release-certifier</u></h3><br><strong>@hara.acharya: </strong>have created the initial framework and the lib with a cluster-creation method, and also have logging enabled<br><h3><u>#pinot-docs</u></h3><br><strong>@g.kishore: </strong>@mayanks<br><strong>@mayanks: </strong>@mayanks has joined the channel<br><strong>@g.kishore: </strong>@ssubrama<br><strong>@ssubrama: </strong>@ssubrama has joined the channel<br><strong>@g.kishore: </strong>@steotia<br><strong>@steotia: </strong>@steotia has joined the channel<br><strong>@g.kishore: </strong>Hi @steotia @mayanks @ssubrama, Kartik is working on the docs and adding a configuration reference for the controller, server and broker<br><strong>@g.kishore: </strong>he needs some help with some of the properties<br><strong>@g.kishore: </strong><https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMdTeAXadp8BL3QinSdRtJdp60AJx08J5D0h7qkwrxysZi1MZSkbl4JFUtJ8U-2FoWTVgoCFqcP56mjeZC9WiZvM7c-3DimPZ_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTyBjIMQak4eVPIffkZZmgj8UjRhdu-2F-2FlbPntuoa0O1ngPAWB-2FbTNsnpmKM-2F65z-2Fk4n1lWdYkYu7-2B2ZyQiA-2B-2BwzpSoDWXex7S-2BKdqBlI1Ffcp5tbd5Jsa44k1s-2BFVo1f08OqOS2XBAdkveB4-2FC5IXuOwrF8SyFOMD9aJAc167Qg830edisYHC9pD3r8eZmd8J2I-3D> <https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMdTeAXadp8BL3QinSdRtJdp60AJx08J5D0h7qkwrxysZ9HoeYMvqwNg0Ruekl7GXWt3W-2FZIOHPR0DQRlVGHAHAg-3D-DeQ_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTyBjIMQak4eVPIffkZZmgj8Npd6RC5y7ON1J-2BXn2v-2B1eZtkwczv39osctV9gXKSlTgbOI6A-2B3sPdE72k5tCvrNZsq7uKBXy-2BcHh3iZUCdepWR3xgaC09QVgs-2BBLiNzGLcetQf6QDIrnJlb09c0IaWlCLUhWuaxO-2Bkrn9NNa-2BVWbc0qHopISd3RtMbwIuqi0iLs-3D> <https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMdTeAXadp8BL3QinSdRtJdp60AJx08J5D0h7qkwrxysZucSma-2FCJEuYTtSasVHSlu3BV7wymQPbmxUsk7N5Fl8Q-3DzRXC_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTyBjIMQak4eVPIffkZZmgj8jKSjgHtepRam2LxZLLwRcaY-2Ffh3t2lFxCxse2c3ajnSWFBnUgvNL-2BKOAAzBk42y5DeXBNAbrhpNFGaWOEx10SiWIoLgYwX82wvOHltSfOk6VS0CRvqpnPYOgSROIEmXLCAYmQLhOJB0LyP2or6hqDghPoFLMqAHU6fsr-2B0tWWuE-3D><br><strong>@g.kishore: </strong>can someone help with the description?<br><strong>@g.kishore: </strong>@npawar<br><strong>@npawar: </strong>@npawar has joined the channel<br><h3><u>#multiple_streams</u></h3><br><strong>@anthonypt87: </strong>@anthonypt87 has joined the channel<br><strong>@g.kishore: </strong>@g.kishore has joined the channel<br><strong>@g.kishore: </strong>Can you provide more info on ingest mode/rate?<br><strong>@g.kishore: </strong>Also, what kind of queries do you plan to run<br><strong>@g.kishore: </strong>Also some info on how the data is getting generated<br><strong>@anthonypt87: </strong>• Let’s say we had order events from both sources with order id, store_id, org_id, user name, total, date (orgs consist of many stores) • Fast data source: we’re receiving it in realtime as orders are happening (~1 event per second) • Accurate data source: we get it in bunches, cronned daily (this can be on the order of ~10000 at the beginning of the day) • Common queries we want are total orders by store_id or org_id in date ranges like 1 day, 7 days, 30 days.
Get an order by order id. We have on the order of 1000s of organizations and 10000s of stores<br><strong>@anthonypt87: </strong>thanks again kishore for taking the time to help!<br><strong>@g.kishore: </strong>I see<br><strong>@g.kishore: </strong>This should be easy<br><strong>@g.kishore: </strong>You can have a concept of event time in your schema<br><strong>@g.kishore: </strong>When you push daily data to the offline table, Pinot will automatically use that<br><strong>@g.kishore: </strong>This is how hybrid tables work in Pinot<br><strong>@surendra: </strong>@surendra has joined the channel<br><strong>@anthonypt87: </strong>interesting. I have to read more about this<br><strong>@g.kishore: </strong>See this page in the docs<br><strong>@anthonypt87: </strong>does pinot just choose data in one data source over the other?<br><strong>@g.kishore: </strong><https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMdTeAXadp8BL3QinSdRtJdpzMyfof-2FnbcthTx3PKzMZINIDBWitJubpzb4wHRHhZoBqjPgernI234COh-2BttwZkaHCMzHiuUI9GGbIGpqTMc8STl48r2lJvReZ8fJO5meS5Wkh3CLZO4I8DjoJ0ofmPw-3DdiSM_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTyBjIMQak4eVPIffkZZmgj8pNfJEZRy21UM04E49toCzEwEdVL0m-2F1fsWiHPLjmQ1jZBAVZbu-2FIN3EMDq4UdQl-2BohjUbJ4n-2BrR2WrN14ctSLNURcJdhl2MnFHk2ZhnqnESYoMn7roRG8l9i0IOLEUa8r8fFPYVogbSM0scmy8XAnMwyYZVg0Vbn92dYpKlfbjg-3D><br><strong>@g.kishore: </strong>Yes, it picks data in the offline table over the real-time table<br><strong>@anthonypt87: </strong>oh interesting<br><strong>@anthonypt87: </strong>this actually may be exactly what we want<br><strong>@anthonypt87: </strong>adding my colleague to this room. @fra.costa ^<br><strong>@fra.costa: </strong>@fra.costa has joined the channel<br><strong>@anthonypt87: </strong>thanks so much for steering us in this direction kishore!<br><strong>@anthonypt87: </strong>this is just hypothetical/educational, but I was wondering how we can handle this if the data actually wasn’t “offline”<br><strong>@anthonypt87: </strong>and we had two realtime streams?<br><strong>@anthonypt87: </strong>where we preferred one over the other<br><strong>@g.kishore: </strong>It’s hard without upsert<br><strong>@g.kishore: </strong>You can write a separate flink job<br><strong>@g.kishore: </strong>That merges the streams using a local store<br><strong>@g.kishore: </strong>And writes the events to another Kafka topic<br><strong>@anthonypt87: </strong>it’s tricky because we would still want the “fast data” first<br><strong>@anthonypt87: </strong>if it’s available<br><strong>@anthonypt87: </strong>are you talking about some sort of windowing?<br><strong>@g.kishore: </strong>Yeah, that will work<br><strong>@g.kishore: </strong>You maintain the order ids that have already been emitted<br><strong>@g.kishore: </strong>Another option is dedup during ingestion based on order id<br><strong>@anthonypt87: </strong>would that delay data, since we’d have to wait until the “accurate” data source is available?<br><strong>@g.kishore: </strong>No<br><strong>@g.kishore: </strong>You write data as it comes in<br><strong>@g.kishore: </strong>You just check if it already exists<br><strong>@g.kishore: </strong>You need another source to maintain the list of order ids already ingested in the last X days<br><strong>@anthonypt87: </strong>hmm, just to make sure I understand. say you had fast data first and then accurate data come in for an order w/ id “1”<br><strong>@anthonypt87: </strong>fast data would come in and in your flink job you’d check the local data store. it doesn’t exist, so you would store it in the local intermediate data store and then also write it out to your main one that serves your queries<br>
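To make the flow sketched above concrete, here is a minimal example of the kind of Flink dedup step Kishore describes: emit the first record seen per order id, skip later duplicates, and forward the result to the merged Kafka topic's sink. The `Order` type, field names, and TTL are illustrative assumptions, not the actual pipeline:

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Assumed event type; the real schema also carries store_id, org_id, total, date, etc.
class Order {
    public String orderId;
    public double total;
}

// Emits an order the first time its id is seen and drops later duplicates.
// The keyed state plays the role of the "local intermediate data store" above.
public class OrderDedupFunction extends KeyedProcessFunction<String, Order, Order> {
    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Boolean> desc =
                new ValueStateDescriptor<>("seen-order-ids", Boolean.class);
        // Expire ids after X days so the state does not grow without bound.
        desc.enableTimeToLive(StateTtlConfig.newBuilder(Time.days(30)).build());
        seen = getRuntimeContext().getState(desc);
    }

    @Override
    public void processElement(Order order, Context ctx, Collector<Order> out) throws Exception {
        if (seen.value() == null) {
            seen.update(true);
            out.collect(order); // forward to the sink for the merged Kafka topic
        }
        // else: this order id was already emitted, skip it
    }
}

// Usage sketch:
// fastStream.union(accurateStream).keyBy(o -> o.orderId).process(new OrderDedupFunction())
```

As the thread notes next, plain skip-the-duplicate is not enough when the accurate record should win; that is where upsert or the batch merge discussed later comes in.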
<strong>@anthonypt87: </strong>what do you do when accurate data comes in later? you check if it exists in the local data store and it does. what’s the next step?<br><strong>@g.kishore: </strong>Skip it<br><strong>@anthonypt87: </strong>ah, but I prefer data from the accurate data store<br><strong>@g.kishore: </strong>Why?<br><strong>@g.kishore: </strong>It has more fields or it’s really more accurate?<br><strong>@anthonypt87: </strong>the accurate data may have more accurate totals<br><strong>@anthonypt87: </strong>both<br><strong>@g.kishore: </strong>Got it<br><strong>@g.kishore: </strong>My bad, I missed that part<br><strong>@anthonypt87: </strong>no worries!<br><strong>@g.kishore: </strong>Then you need upsert<br><strong>@anthonypt87: </strong>yikes. would the hybrid table still be a viable solution?<br><strong>@anthonypt87: </strong>it looks like if we set up the hybrid tables to operate over time periods, when it switches segments from online -> offline, it just uses the offline data based on the time period<br><strong>@anthonypt87: </strong>so we’d lose information only found in the realtime segment (since the sets may not overlap)<br><strong>@g.kishore: </strong>Yes, the hybrid solution will work<br><strong>@g.kishore: </strong>Right<br><strong>@anthonypt87: </strong>since we care about the superset of orders across both segments, will it still work even if we push the offline segment to replace the recent time period?<br><strong>@anthonypt87: </strong>(i.e. essentially what we want is deduping on order id)<br><strong>@vallamsetty: </strong>@vallamsetty has joined the channel<br><strong>@anthonypt87: </strong>just continuing the discussion from yesterday, how do we see hybrid tables working in this case? e.g. we want all orders in the past 30 days (some orders are in the offline system, some in the online, and some in both)<br><strong>@g.kishore: </strong>the accurate data you get at the end of the day is a superset of the fast data, right?<br><strong>@anthonypt87: </strong>it is not, unfortunately<br><strong>@anthonypt87: </strong>we get the accurate data from a separate source, and it may not include all the orders in the fast data<br><strong>@g.kishore: </strong>ah<br><strong>@g.kishore: </strong>ok, then you need to have a job that merges the fast data segments and the accurate source<br><strong>@g.kishore: </strong>the fast data segments are available in the deep store as they get flushed<br><strong>@g.kishore: </strong>the hybrid concept will still work, it’s just that you have to do some additional work while generating the batch segment<br><strong>@anthonypt87: </strong>oooh I see<br><strong>@anthonypt87: </strong>so it’s two parts<br>
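A hypothetical sketch of the "additional work while generating the batch segment" described just above: merge the fast-data rows read back from the flushed realtime segments with the accurate daily drop, letting the accurate row win when both carry the same order id. The `Order` record and method names are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DailyOrderMerge {
    // Minimal stand-in for the real event schema.
    record Order(String orderId, String storeId, String orgId, double total, String date) {}

    // Output: one row per order id, with the accurate row preferred on conflict,
    // ready to be built into the daily offline segment.
    static List<Order> merge(List<Order> fastData, List<Order> accurateData) {
        Map<String, Order> byOrderId = new LinkedHashMap<>();
        for (Order o : fastData) {
            byOrderId.put(o.orderId(), o);
        }
        for (Order o : accurateData) {
            byOrderId.put(o.orderId(), o); // overwrites the fast-data row
        }
        return new ArrayList<>(byOrderId.values());
    }
}
```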
<strong>@anthonypt87: </strong><br><strong>@anthonypt87: </strong>something like this. haha sorry, literally back of the envelope<br><strong>@anthonypt87: </strong>I think that’ll work<br><strong>@anthonypt87: </strong>thanks kishore :muscle:<br><strong>@g.kishore: </strong>conceptually yes<br><strong>@g.kishore: </strong>but let me draw a better diagram<br><strong>@g.kishore: </strong>hopefully :wink:<br><strong>@g.kishore: </strong><br><strong>@anthonypt87: </strong>haha thanks kishore. Lucidchart?<br><strong>@g.kishore: </strong><https://u17000708.ct.sendgrid.net/ls/click?upn=iSrCRfgZvz-2BV64a3Rv7HYagbhNPz9W4U6hVglMW76xY-3DDT4__vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTyBjIMQak4eVPIffkZZmgj8h1hjXw1rAKD9YH-2FBp3eKQR9cqkIXOpCM9V1pwjiEJOhnFH3bw7YyeVd8gAZb-2FmgyOdrK4fsjpsD-2BbIWmOMn8HhirwJ5G8fREo9IVxhwiidBC5BN0rH-2FjZPY6xz-2BzntRYsPji-2BX9KCxVrZz6IZJ-2FgtdRmH0rGPAAcDKvjvH7ttn4-3D><br><strong>@fra.costa: </strong>@g.kishore is deleting rows (or, more generally, segments) something that Pinot supports? I found this config parameter for the retention of deleted segments, but I can’t quite work out operationally how you go about doing it.
In other words, is there a way I can delete a set of rows between two points in time?<br><strong>@g.kishore: </strong>We don’t support that<br><strong>@g.kishore: </strong>You can delete at segment granularity<br><strong>@fra.costa: </strong>Anthony pointed out that the offline rows win over the live ones; my question about deletes was really to address that at serving time, and it seems there is no need to delete<br><strong>@fra.costa: </strong>thanks for your response!<br><strong>@g.kishore: </strong>ah, got it<br><strong>@anthonypt87: </strong>@elon.azoulay as an FYI ^<br><strong>@elon.azoulay: </strong>@elon.azoulay has joined the channel<br>
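For reference on the segment-granularity deletes and retention mentioned above, a hedged sketch; the REST path, host, port, table and segment names are assumptions from memory, so verify against your controller's Swagger UI and the Pinot docs:

```
# drop a single segment of an offline table via the controller REST API
curl -X DELETE "http://<controller-host>:9000/segments/myTable_OFFLINE/<segmentName>"

# or let old segments age out automatically with time-based retention in the
# table config's segmentsConfig:
#   "retentionTimeUnit": "DAYS",
#   "retentionTimeValue": "30"
```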
