<h3><u>#general</u></h3><br><strong>@katneniravikiran: </strong>Hi All,
I am new to Pinot. I am trying to ingest/upload a 7GB Lineitem TPCH table into Pinot. The entire file is getting uploaded as a single segment.
Does Pinot support any configuration to specify a segmentation column (a column based on which segments get created from the ingested file/data)?
When I explicitly split the file into multiple files, multiple segments get created. Does Pinot expect pre-segmented data to be
ingested/uploaded?<br><strong>@jeevireddy: </strong>@jeevireddy has joined the 
channel<br><strong>@andrew: </strong>@andrew has joined the 
channel<br><strong>@andrew: </strong>Hi! I’m evaluating Pinot for the following 
use case and want to know whether it’s a good fit, and any best practices to help
achieve it.

• Ingest events for ~1B total users at ~100k/second
• Run aggregation queries on events filtered on *individual user IDs* at 
~10k/second, each query completing in &lt; 100ms
What I understand is that the data is organized primarily by time (segments) 
and secondarily (within a segment) by indexes. In this case, I tried sorting by 
user ID. To query for a particular user ID, it seems that each segment must be 
queried, since the data is not consolidated by user. The runtime would be O(s 
log n) where s is the number of segments in a particular timeframe and n is the 
number of events per segment.
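
For concreteness, here is a sketch of the kind of per-user aggregation I have in mind (table and column names are hypothetical):
```SELECT COUNT(*) AS event_count, SUM(eventValue) AS total_value
FROM user_events
WHERE userId = 12345
  AND eventTimeMillis BETWEEN 1596499200000 AND 1597104000000```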

Thus, it seems that Pinot may not scale when there are tens/hundreds of 
thousands of segments and may not be a good fit here. However, this use case 
seems similar to the use cases at LinkedIn, such as the “who’s viewed your
profile” feature, which also would operate on events for individual users.

Is my understanding correct, and is there anything I’m missing here? Would 
appreciate any thoughts or resources you could point me to. 
Thanks!<br><strong>@anthonypt87: </strong>Hi Team! First post! We’re evaluating Pinot for our use case and wanted to get your thoughts on whether it’s a good fit and/or best practices to make it happen.

The main complication we’re running into is that we feel we may need to be able to mutate our data, which may not be a good fit for Pinot (maybe this can be avoided with some smarter data modeling or some future
<https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMdw-2FcF2F51LKj8tTo0R9Alb88ZzqtDxq8uh1VXj5wNJDD-2FhOUp9hc9O5JFjQYNiyd72Milnf5AiFbpyEnUEOHM7RM9Lpvy8FslGClZj46bb1GZvFaFMq-2F0H29DcTre07Yg-3D-3DXn_l_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTxRmISAXL1yNbgD-2FDqVrcfbUl8j6B1msIR6iS6q8DudDUHnFZkpw1NJLxLt9M33o9Oni5eXHNWCoi2OtXOxKoVUCGIcRMQPCGyo9TF1l0J7uZa0DzjFxob4WGuGJ9vyCgE-2FCIF4REkSkboM6netD-2FKPSqqWO-2FH4owrdGfeTqcinQd49oKe02iYITx1c613zQJI-3D>?).
We’re attracted to Pinot because of its ability to perform fast aggregations and to reduce the engineering cost of having to do things like pre-cubing data.

• In particular we have two streams of order data (e.g. you can imagine booking 
details like total price in $, an order id, account id, user name, date, etc) 
that are flowing into our system. 
• The two streams (let’s call them “Fast Stream” and “Accurate Stream”) of 
order data may overlap (i.e. the Fast Stream and the Accurate Stream may both 
have order info for “order 1” but Fast Stream may be the only one that has 
“order 2” or Accurate Stream may be the only one that has "order 3")
• Ideally we want to merge these streams together such that wherever they overlap, we use the data from the Accurate Stream instead, because it has richer user details and more accurate reporting of price.
We want to be able to quickly get things like time-based aggregate totals by account ID. Is there a good way to model this, given that we have two data sources we want to merge?
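
To illustrate the merge semantics we’re after (generic SQL, not necessarily something Pinot can run as-is), here’s a sketch that assumes both streams land in one hypothetical `orders` table with a `source` column marking which stream each row came from:
```SELECT account_id,
       SUM(total_price) AS total_price
FROM (
  -- keep one row per order_id, preferring the Accurate Stream when both streams have the order
  SELECT order_id,
         account_id,
         total_price,
         ROW_NUMBER() OVER (
           PARTITION BY order_id
           ORDER BY CASE WHEN source = 'accurate' THEN 0 ELSE 1 END
         ) AS rn
  FROM orders
) deduped
WHERE rn = 1
GROUP BY account_id```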

Thanks so much for your 
help!<br><h3><u>#random</u></h3><br><strong>@jeevireddy: </strong>@jeevireddy 
has joined the channel<br><strong>@andrew: </strong>@andrew has joined the 
channel<br><h3><u>#troubleshooting</u></h3><br><strong>@elon.azoulay: 
</strong>Yep, under heavy CPU load. There were no GC issues, just latency on
the liveness check.<br><strong>@quietgolfer: </strong>If I try to fix historic 
data with Pinot, at what point would the new data start being served?  E.g. after
the whole batch ingestion job completes?  Or is it after each segment gets 
uploaded?

I'm curious if I can get into a spot where inconsistent data is served if I use 
a single batch ingestion job.  Does it depend on the ingestion route (standalone, Spark, Hive)?<br><strong>@g.kishore: 
</strong><https://u17000708.ct.sendgrid.net/ls/click?upn=1BiFF0-2FtVRazUn1cLzaiMdTeAXadp8BL3QinSdRtJdpzMyfof-2FnbcthTx3PKzMZIvTvz0ZlGzjfnWuiLO3kB-2FQ-3D-3Daq4B_vGLQYiKGfBLXsUt3KGBrxeq6BCTMpPOLROqAvDqBeTxRmISAXL1yNbgD-2FDqVrcfbUl4lbQLwg7Bjt0qad9fw-2BGHAd5aZIcG05wXfr1HfkPx8KldVIRDDd21XV4RyZfuzl8y4pnOyU4KxZyFTaggd2sWu-2BO0s6wwWrA90DWoEgV6PS8oBZ1GOYPxR-2B8hZw10HtGlTyIFnvt8WQYTQHWrLFX0IOWN3NbPAUcu15ag01x4-3D><br><strong>@g.kishore:
 </strong>any time a segment is pushed the maxTime is 
re-calculated<br><strong>@g.kishore: </strong>then the broker rewrites
the query into two different queries<br><strong>@g.kishore: </strong>```select sum(metric) from table_REALTIME where date &gt;= time_boundary
select sum(metric) from table_OFFLINE where date &lt; time_boundary```<br><strong>@quietgolfer: </strong>Ah sorry.  Let's say I already
have offline data for a previous date and I onboard a new customer who wants to 
backfill for that date range.  Without modifying the segment name structure in 
Pinot, what happens if I run a data ingestion job for that 
date?<br><strong>@npawar: </strong>`E.g. after the whole batch ingestion job 
completes? Or is it after each segment gets uploaded?` - you will see new data 
after each segment gets uploaded.<br><strong>@quietgolfer: </strong>Cool.  I'm 
assuming this can cause temporary removal or duplicate metric 
data.<br><strong>@npawar: </strong>yes. the only way to avoid that for now is to push a single segment, but that may not always be
practical<br><strong>@g.kishore: </strong>another option is to partition the 
data based on customerId so that impact is minimal<br><strong>@g.kishore: 
</strong>you can go as far as creating a segment per day per customer but that 
will result in a big skew<br><strong>@quietgolfer: </strong>Yea.  I figured 
that's something we could do longer term.  I was also curious about having 
other virtual tables that reference most of the same segments but can be used 
as a transition.<br><strong>@pradeepgv42: </strong>Hi, when we run a query individually we see latencies on the order of 3 seconds, but when we run, for example, the same query in parallel, we see latencies on the order of 7-10 seconds for the third query.
Do the queries run serially? I have one broker and 2 servers as part of my setup; also, would adding more servers help parallelize the work better and improve query latencies?<br><strong>@pradeepgv42: </strong>QQ, is there a
way to optimize/improve queries of the following format (basically time-series
queries)?
```SELECT DATETIMECONVERT(timestampMillis, '1:MILLISECONDS:EPOCH', 
'1:MILLISECONDS:EPOCH', '1:HOURS'), count(*) as count_0 
FROM table 
WHERE   timestampMillis &lt; 1597104168752 and &lt;some filters&gt;
GROUP BY DATETIMECONVERT(timestampMillis, '1:MILLISECONDS:EPOCH', 
'1:MILLISECONDS:EPOCH', '1:HOURS') 
ORDER BY count(*) desc
LIMIT 100```
numEntriesScannedInFilter: 18325029
numEntriesScannedPostFilter: 10665158
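
For comparison, a sketch of the same query against a hypothetical ingestion-time derived column `timestampHours` (epoch hours), which would avoid evaluating DATETIMECONVERT per row at query time:
```SELECT timestampHours, count(*) as count_0
FROM table
WHERE timestampMillis &lt; 1597104168752 and &lt;some filters&gt;
GROUP BY timestampHours
ORDER BY count(*) desc
LIMIT 100```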

I guess caching on the client side is a simple way for us to decrease the latency; wondering if there are any
alternatives.<br><h3><u>#release-certifier</u></h3><br><strong>@ssubrama: 
</strong>We can always build it a little at a time and get the functionality we 
desire as we go along.<br><strong>@yupeng: </strong>@yupeng has joined the 
channel<br>
