<h3><u>#general</u></h3><br><strong>@nanda.yugandhar_slack: 
</strong>@nanda.yugandhar_slack has joined the channel<br><strong>@mayanks: 
</strong>Hi Apache Pinot Community, friendly reminder to sign up for the Apache 
Pinot virtual meetup on Sept 2. Hope to see you there: 
<https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMUHWLijFWt2bRMjF3IjDMnO1Cs7Q9Cn51P3DCI-2B9Pbi5rKLtHHCRdecXufBGYkLeUg-3D-3Dwoyn_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTwSToyFB4D-2BzUO01y0DnWaLmsFjqNAB-2FhCfMx11BNe1VfcSBfb5SBm03SGgfOM5qs8-2F-2BHvoIHAp8Iy-2F6EbycW-2FrCmY11THitHeoLtReMubUyry1DRsQF-2FmbUIaz2HoTguQDcDNu2krw6FwSwHuWW28PSVCLcsgsAdxquYefSnA-2FrVM5TA4pclkLQw5A1LCVedA-3D><br><strong>@nanda.yugandhar_slack:
 </strong>Does it support any exactly-once guarantees like 
Druid?<br><strong>@mailtobuchi: </strong>Hi, do we publish the latest jars from 
master to maven or only released ones? Needn’t be maven central but any other 
repo.<br><strong>@g.kishore: </strong>only release versions are published to 
maven<br><strong>@g.kishore: </strong>you can always publish other versions to 
your internal 
artifactory<br><h3><u>#random</u></h3><br><strong>@nanda.yugandhar_slack: 
</strong>@nanda.yugandhar_slack has joined the 
channel<br><h3><u>#feat-presto-connector</u></h3><br><strong>@hiboss1: 
</strong>@hiboss1 has joined the 
channel<br><h3><u>#pql-2-calcite</u></h3><br><strong>@hiboss1: 
</strong>@hiboss1 has joined the 
channel<br><h3><u>#troubleshooting</u></h3><br><strong>@hiboss1: 
</strong>@hiboss1 has joined the channel<br><strong>@yash.agarwal: 
</strong>What segment config and segment-naming strategy should I use to support 
the following data ingestion:
```Data size: 3 Years
Daily Raw Orc Size: 4GB
Daily Segment Counts: 11
Data Full Refresh: Weekly
Data Append: Daily```<br><strong>@pradeepgv42: </strong>QQ w.r.t. the way the 
broker processes queries for hybrid tables.
I see in the docs that a query gets split between offline and realtime based on a 
timestamp.

My takeaway from this is that data has to be continuous in the offline table for 
all the previous days. Is it possible to make the broker query the offline table 
only for a window of time instead?

The use case is when we have some issues/holes in the data (let's say the whole of 
yesterday): we want to be able to fix the segments only in that time range, 
instead of maintaining an offline table altogether. Are there alternative ways to fix 
this?<br><h3><u>#inconsistent-perf</u></h3><br><strong>@mailtobuchi: 
</strong>@mailtobuchi has joined the 
channel<br><h3><u>#pinot-0-5-0-release</u></h3><br><strong>@g.kishore: 
</strong>where are we on the release?<br><strong>@jackie.jxt: </strong>There 
are several new features and optimizations recently committed (e.g. 
post-aggregation and HAVING, exact distinct count, transform/filtering during 
ingestion). Wondering if we can include them in this 
release<br><strong>@ssubrama: </strong>The release is a moving target, and the 
more we start including, the more we want to include. It is almost as much work 
to just create a new release after this, with all the 
optimizations<br><strong>@g.kishore: </strong>agree with Subbu, 0.5.0 already 
has a lot of features<br><strong>@g.kishore: </strong>let's start on 0.6.0 after 
the meetup<br><strong>@g.kishore: </strong>the release process itself seems to take 
a month for some reason<br><strong>@g.kishore: </strong>we need to bring it 
down<br><h3><u>#roadmap</u></h3><br><strong>@nanda.yugandhar_slack: 
</strong>@nanda.yugandhar_slack has joined the 
channel<br><strong>@nanda.yugandhar_slack: </strong>@nanda.yugandhar_slack set 
the channel purpose: This is for discussing roadmap of 
Pinot<br><strong>@g.kishore: </strong>@g.kishore has joined the 
channel<br><strong>@nanda.yugandhar_slack: </strong>1. Does Pinot have multi-cloud 
plans? Meaning: using S3, GCS, or Azure Blob Storage as deep storage and then running 
local clusters in each data centre?<br><strong>@nanda.yugandhar_slack: 
</strong>2. We are planning to use it in production; when can we expect 
v1.x?<br><strong>@nanda.yugandhar_slack: </strong>3. In Druid, the segment-loading 
algorithm is much more efficient; are we planning to implement the same 
soon?<br><strong>@g.kishore: </strong>1. Yes, that's how we run it at 
LinkedIn<br><strong>@g.kishore: </strong>2. you can do that today, the batch 
ingestion job takes multiple cluster coordinates as 
input<br><strong>@g.kishore: </strong>3. can you elaborate on "segment loading 
algorithm is much more efficient in Druid"<br><strong>@nanda.yugandhar_slack: 
</strong>So, while assigning segments to servers, Pinot finds the server with the 
"least number of segments", right?<br><strong>@nanda.yugandhar_slack: 
</strong>However, at Metamarkets they used 
<https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMc5wQ1jEBdV6-2B6n1O0hRzSLZY5XIolhAKH6RcrbdUFPOHiHDkAOodzKsnkC18vASkfwKxVZts534WyA-2FaIlEHGNjzMvzsUHDJ9GQfV3erlwWp-PS_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTwSToyFB4D-2BzUO01y0DnWaLgcmrux0NBtaCGQ8K-2Bx6E9-2BO3yvw4bUpobshbXCXZnqUAjcXp6BfM1YzGtLaP6it57NC8Za1a-2Fc-2BEGvfAaVD8iY-2Bs8GXpav-2FGNoEkoYz4KJm3BJyK4UXiNr0XALYkTxLqz6isKOJ16unB3V8mayMkSrtLOvR42UvXgTLjqQwF3wM-3D>,
 and used some cost-based load balancing!<br><strong>@nanda.yugandhar_slack: 
</strong>What about this one?<br><strong>@nanda.yugandhar_slack: </strong>4. 
Are we planning to do any accounting, like how many resources are used by a 
particular team?<br><strong>@g.kishore: </strong>ah, got it. there are other 
algorithms in Helix we can use for segment 
assignment<br><strong>@nanda.yugandhar_slack: </strong>And this was found to give 
30-40% performance gains!<br><strong>@g.kishore: </strong>interesting, 
will read it and see if it makes sense.<br><strong>@g.kishore: </strong>we had 
an intern who played with placement strategies and we didn't really see a big 
difference<br><strong>@nanda.yugandhar_slack: </strong>5. We are at the stage 
of choosing a platform. I consider these things important for a data platform 
(we are treating it as "infrastructure for intelligence"). Do you see that we are 
missing something?<br><strong>@g.kishore: </strong>so cost, efficiency and 
scalability are not important?<br><strong>@g.kishore: </strong>others look 
good<br><strong>@nanda.yugandhar_slack: </strong>Optimising takes care of cost 
and efficiency! In the details of each part, we are considering those 
factors!<br><strong>@g.kishore: </strong>looks like you have it covered 
then<br><strong>@g.kishore: 
</strong>is this analytics on sensor data?<br><strong>@nanda.yugandhar_slack: 
</strong>it's common infra for our company. First I documented our company 
requirements, then checking which one will do!<br><strong>@g.kishore: 
</strong>which company?<br><strong>@nanda.yugandhar_slack: </strong>it's a 
few-months-old startup, nurture.farm!<br><strong>@g.kishore: </strong>nice 
name<br><strong>@nanda.yugandhar_slack: </strong>For example, from a query 
perspective we got the following!<br><strong>@g.kishore: </strong>very nice 
work!<br><strong>@g.kishore: </strong>for joins in Pinot, we use Presto/Pinot 
connector with push down<br><strong>@g.kishore: 
</strong><https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMfqYsxUxY3tMgaREaFijmtA2t4OTJPSnmtRwiFyyArHF1Dm9g3qwQwBkvW7Agej6LDe7Z4GDYavFmLgmWnewPow-3DkF9C_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTwSToyFB4D-2BzUO01y0DnWaLhNSAy9ihq7-2Fw3Xjh3V5OgVEg4qXTQf4qrVhr2qyQ2u8ReQiw-2BDe3mWC6FJx1dGJEUAI1Khmxat0OCq1awnZCU1AQrlXo9w-2BLgwJbvzL9whZBAhugGZ34RORq09sZFEDqNearXFkwmg-2FPDcRyO5SRAvN1yGgtvJEbB3guJPr-2B04s-3D><br><strong>@g.kishore:
 </strong>we plan to focus on Pinot being the fastest for a single 
table<br><strong>@g.kishore: </strong>high ingestion rate, many dimensions, 
high-throughput queries, low latency -&gt; this is the exact scenario for which we 
built Pinot<br><strong>@g.kishore: </strong>I would not worry a lot about the 
segment-loading strategy for now, until you find an issue<br><strong>@g.kishore: 
</strong>these things look good on paper but hard to solve all use cases using 
one algorithm<br><strong>@g.kishore: </strong>The nice part of Pinot is that this 
strategy is configurable on a per-table basis<br><strong>@g.kishore: </strong>so you 
can change it any time and invoke rebalance<br><strong>@nanda.yugandhar_slack: 
</strong>Yeah, Oh Good to hear that configuration is per 
table<br><strong>@nanda.yugandhar_slack: </strong>okay<br><strong>@g.kishore: 
</strong>and segments will move around<br><strong>@g.kishore: </strong>we do 
have advanced assignment strategies based on tenants, tags, partitions and 
replicas<br><strong>@g.kishore: </strong>&gt; Yeah, Oh Good to hear that 
configuration is per table
Yes, it was hard to have one algo solve all the use 
cases<br><strong>@nanda.yugandhar_slack: </strong>ok, any rough timeline for a 
stable version (I understand these projects have a lot of uncertainty in timelines)? 
Since we are planning to use it in production, I'm just 
curious<br><strong>@g.kishore: </strong>this is the code that runs at 
LinkedIn<br><strong>@g.kishore: </strong>so it is stable<br><strong>@g.kishore: 
</strong>incubator is more of Apache 
terminology<br><strong>@nanda.yugandhar_slack: </strong>okay! By the way, which 
visualisation tool is preferred? Is Apache Superset 
good?<br><strong>@g.kishore: </strong>yes, most of them use that. It's a great 
start<br><strong>@g.kishore: </strong>most companies end up building their own 
at some point<br><strong>@nanda.yugandhar_slack: </strong>okay, thanks! are you 
planning to move to Java 11 with ZGC? At this latency level, GC could impact 
the &gt;p95<br><strong>@nanda.yugandhar_slack: </strong>Many people faced it 
with Cassandra, so we relied on ScyllaDB<br><strong>@g.kishore: </strong>we 
haven't looked into it yet. that's good to know<br><strong>@g.kishore: 
</strong>this is at nurture farm?<br><strong>@nanda.yugandhar_slack: 
</strong>At Ola<br><strong>@nanda.yugandhar_slack: </strong>Even in 
nurture.farm we avoid Cassandra DB!<br><strong>@g.kishore: </strong>do you have a 
detailed analysis on ZGC?<br><strong>@g.kishore: </strong>we are debugging 
something with leanplum on p99<br><strong>@g.kishore: </strong>and suspecting 
GC<br><strong>@nanda.yugandhar_slack: </strong>But we liked that it is in Java; 
over a period of time we want many developers to work on the code 
base!<br><strong>@nanda.yugandhar_slack: </strong>I will find the analysis for ZGC 
and post it. For the last one year I worked on a very closed 
system!<br><strong>@nanda.yugandhar_slack: </strong>I missed these 
links!<br><strong>@g.kishore: </strong>no worries, reading about 
it.<br><strong>@nanda.yugandhar_slack: </strong>But the idea is it won't reduce 
any burden on the GC, but will reduce the average "stop the world" time to a great 
extent. If I remember well, for large heaps it'll help even 
more!<br><strong>@nanda.yugandhar_slack: </strong>Has anybody used Pinot for a 
logs use case? Even though the docs say it can be used, we don't find any 
articles on it! We want to use it for matching comments (from users) as 
well!<br><strong>@g.kishore: </strong>yes, we haven't done anything 
specifically for that but we are seeing folks starting to use it for the logs use 
case<br><strong>@g.kishore: </strong>it's not Pinot's strength but we have 
primitives like the Text Index that can be enabled on specific 
columns<br><strong>@g.kishore: 
</strong><https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMa1aAdGoOdRoyIvGAevnwx4MmluqK5BbqVUbYmCA2ouJu-2BJL3R5oHgSYeAZ4notlClrvQC8W-2F72FD1n4B40-2FDJg2Zl8bvTQi1muXr33QC7ETCeT-2FnOLxpRLM5b55-2BtAsiA-3D-3D3PQ7_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTwSToyFB4D-2BzUO01y0DnWaLCmp2SXEgcnaJJGMUdZ9St9MGiU30xUrWA1rXXTM-2FWsSCgZlr7PklhqTbtceK3ZhBafdV6QHb-2BfHfHoFzFyqsYCjx46JxZ3Ce9B3fEoXrDSymBITcX3T1My0mZPmL04ZIEpfWu-2B2D3hu60sJWpu3aQ-2B5Vsjx9xlu4AvNV0a5oBEM-3D><br><strong>@nanda.yugandhar_slack:
 </strong>Thanks for the reading!<br><strong>@g.kishore: </strong>I can put you 
in touch with them if you need more info<br><strong>@nanda.yugandhar_slack: 
</strong>That would definitely help!<br><strong>@nanda.yugandhar_slack: 
</strong>Thanks for your time today!<br><strong>@g.kishore: </strong>when do 
you want to talk to them?<br><strong>@nanda.yugandhar_slack: </strong>Their 
scale and learnings from it, especially in terms of manageability, given that 
they are using it with Lucene (another moving 
part)<br><strong>@nanda.yugandhar_slack: </strong>It looks like there is some 
inherent inefficiency: for AND queries they are fetching documents separately 
for each query and doing joins of documents, which is pretty bad for 
scalability! I want it in such a way that the full query is executed on the 
table segment itself and then returns the result!<br><strong>@g.kishore: 
</strong>Lucene is a temporary thing; we plan to replace it 
soon<br><strong>@g.kishore: </strong>the only part we need is the fuzzy 
matching part<br>
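As a rough sketch of the Text Index primitive discussed above: it is enabled per column in the table config and queried with `TEXT_MATCH`. The `comments` column and `userFeedback` table names here are hypothetical, and the exact keys are worth checking against the Pinot docs for your version:
```{
  "fieldConfigList": [
    {
      "name": "comments",
      "encodingType": "RAW",
      "indexType": "TEXT"
    }
  ],
  "tableIndexConfig": {
    "noDictionaryColumns": ["comments"]
  }
}```
A query can then use Lucene-style search expressions against that column:
```SELECT * FROM userFeedback
WHERE TEXT_MATCH(comments, '"slow query" AND timeout')```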
