#general
@thomas.griffiths: @thomas.griffiths has joined the channel
@adrian.f.cole: suggestion to whoever is in build eng. nix the random JDKs?
@adrian.f.cole: I'd recommend reducing clutter by not having tests on JREs besides 1.8, 11 (LTS) and latest (15). Java 9 was a mess for example, and the other interim JREs just give a reason for build distraction
@fx19880617: did you observe anything on the other Java versions? Btw, I’m 100% with you that Java 9 is a mess :wink:
@chitturikiran15: @chitturikiran15 has joined the channel
#random
@thomas.griffiths: @thomas.griffiths has joined the channel
@chitturikiran15: @chitturikiran15 has joined the channel
#feat-presto-connector
@canerbalci: @canerbalci has joined the channel
#troubleshooting
@canerbalci: @canerbalci has joined the channel
#lp-pinot-poc
@andrew: @fx19880617 that’s promising! how would i access the image via Helm?
@fx19880617: you can directly update the image tag in the helm script
@fx19880617: ```
image:
  repository: apachepinot/pinot
  tag: latest
  pullPolicy: IfNotPresent
```
@fx19880617: here you can change `tag` to `0.6.0-SNAPSHOT-1126cac9b-20201002-jdk11`
@divcesar: @divcesar has joined the channel
#pql-sql-regression
@steotia: With PQL, it is 394
@g.kishore: hmm, thats crazy
@g.kishore: this is a hybrid table?
@g.kishore: or offline only
@steotia: wondering how this even happened. We always optimize the BrokerRequest and use the filter from the broker request
@g.kishore: do you have the response from both?
@steotia: so it should have used the N-ary filter tree as opposed to PinotQuery's binary filter tree
@g.kishore: I see
@g.kishore: so the filter tree looks nested
@steotia: but the optimizer should have packed all children inside 1 AND
@steotia: the only way engine is using a different filter tree is if it uses the one from PinotQuery
@steotia: which is not optimized
@g.kishore: got it
@steotia: ```
// NOTE: Always use BrokerRequest to generate filter because BrokerRequestOptimizer only optimizes the BrokerRequest
// but not the PinotQuery.
FilterContext filter = null;
FilterQueryTree root = RequestUtils.generateFilterQueryTree(brokerRequest);
if (root != null) {
  filter = QueryContextConverterUtils.getFilter(root);
}
```
@g.kishore: check with Jackie
@steotia: @jackie.jxt recently added support to optimize PinotQuery but the server has always used BrokerRequest
@g.kishore: even without flattening, it's wrong to scan 200+ million entries
@g.kishore: ```organizationId != -9223372036854775808```
@g.kishore: this is the leaf and its getting evaluated
@mayanks: I recall that sql filter tree for `a = 1 and b = 2 and c = 3` looks like `(a = 1 and (b = 2 and (c = 3)))`
@g.kishore: yeah
@mayanks: Yeah, so if `orgId != xxx` is lowest level, then it will get evaluated
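For illustration, the two filter-tree shapes being contrasted for `a = 1 AND b = 2 AND c = 3`, sketched as JSON (shape only, not Pinot's actual filter-tree serialization; the predicate strings are placeholders):
```
{
  "AND": [
    "a = 1",
    { "AND": [
        "b = 2",
        { "AND": [ "c = 3" ] }
    ] }
  ]
}
```
After flattening into a single N-ary AND:
```
{
  "AND": [ "a = 1", "b = 2", "c = 3" ]
}
```
In the nested binary form, a predicate like `organizationId != -9223372036854775808` can end up at the deepest level and get evaluated on its own, as noted above.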
@g.kishore: this was not the case before
@mayanks: You mean PQL, or SQL before?
@g.kishore: both
@mayanks: SQL was like this a couple of months ago at least
@g.kishore: I think there is something wrong in AND operator as well
@mayanks: AND operator perhaps does the opto only for children at same level
@mayanks: opto of feeding output docId of predicate as input to another
@g.kishore: no, I remember handling all these things
@mayanks: ok
@steotia: ```
// OPTIMIZER is used to flatten certain queries with filtering optimization.
// The reason is that SQL parser will parse the structure into a binary tree mode.
// PQL parser will flat the case of multiple children under AND/OR.
// After optimization, both BrokerRequests from PQL and SQL should look the same and be easier to compare.
```
@steotia: as long as all children are being arranged under one AND in a non-binary tree fashion, the arrangement of child operations during execution should stay the same, right?
@mayanks: Jackie put a PR to flatten (but it may have been for theta-sketches only)
@g.kishore: let me summarize
@g.kishore: 1. query optimizing (flattening) is a good thing and we need that
2. At the same time, the AND operator is evaluating one side of the child completely before evaluating the other side
@g.kishore: 2 is a big bug
@steotia: 2 should not impact this, right -- execution is the same for PQL and SQL
@g.kishore: yes, PQL is hiding it because of the query rewrite
@steotia: yes, so we need to make sure first that compatibility is there from query compilation point of view
@steotia: if the queries are not compiled in the same manner, then same execution performance is anyway not guaranteed
@g.kishore: agree, lets fix 1 first
@mayanks: Yes
@g.kishore: but want to call out that 2 is also important
@steotia: agreed
@mayanks: So is always flattening the fix?
@g.kishore: for now, yes
@mayanks: Ok
@mayanks: @steotia let's validate and hot-fix prod
@steotia: so, in this case, I am seeing the following trees:
SQL BrokerRequest: filterQuery root AND, 8 leaf operators under it as children
PQL BrokerRequest: filterQuery root AND, 3 leaf operators + 1 AND operator as children; the child AND has 5 leaf operators
@steotia: This is actually weird since I thought the whole difference is due to the fact that SQL is not getting flattened and PQL is flattened. However, in this case SQL is flattened
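For clarity, the two observed shapes from the message above, sketched as JSON (shape only; "leaf1" through "leaf8" are placeholders for the eight leaf predicates):
```
{
  "SQL BrokerRequest filterQuery": {
    "AND": ["leaf1", "leaf2", "leaf3", "leaf4", "leaf5", "leaf6", "leaf7", "leaf8"]
  },
  "PQL BrokerRequest filterQuery": {
    "AND": ["leaf1", "leaf2", "leaf3", { "AND": ["leaf4", "leaf5", "leaf6", "leaf7", "leaf8"] }]
  }
}
```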
@mayanks: sure you are not swapping them?
@mayanks: if so, yeah this is indeed confusing
@g.kishore: so we need to look into 2 then
@mayanks: The only recent change to the AND operator was in June
@mayanks: and in 2018 before that
@mayanks: so should be easy to find
@g.kishore: i remember having discussion with intern on this
@mayanks: It is actually Jackie's change: #5444
@mayanks: Intern seems to not have touched this file `AndFilterOperator`
@g.kishore: ok
@steotia: I think we may have a different problem
@steotia: This is the test code: ```
@Test
public void testLSI() {
  String sql = "SELECT SUM(headcount),organizationId,monthsAgo FROM salesAccountsTest WHERE (organizationGeoEntityIds = 102478259 AND organizationSize BETWEEN 1 AND 10 AND geoEntityIds = 103644278 AND industryId = 57 AND sectorId = 16) AND duplicatePositionMemberId = -1 AND monthsAgo IN (0) AND organizationId != -9223372036854775808 GROUP BY organizationId,monthsAgo LIMIT 50000";
  String pql = "SELECT SUM(headcount),organizationId,monthsAgo FROM salesAccountsTest WHERE (organizationGeoEntityIds = 102478259 AND organizationSize BETWEEN 1 AND 10 AND geoEntityIds = 103644278 AND industryId = 57 AND sectorId = 16) AND duplicatePositionMemberId = -1 AND monthsAgo IN (0) AND organizationId != -9223372036854775808 GROUP BY organizationId,monthsAgo TOP 50000";
  PinotQuery2BrokerRequestConverter converter = new PinotQuery2BrokerRequestConverter();
  BrokerRequestOptimizer optimizer = new BrokerRequestOptimizer();
  BrokerRequest brSql = converter.convert(CalciteSqlParser.compileToPinotQuery(sql));
  BrokerRequest brSqlOptimized = optimizer.optimize(brSql, null);
  Pql2Compiler pql2Compiler = new Pql2Compiler();
  BrokerRequest brPql = pql2Compiler.compileToBrokerRequest(pql);
  BrokerRequest brPqlOptimized = optimizer.optimize(brPql, null);
  System.out.println("break point");
}
``` Both brSqlOptimized and brPqlOptimized are identical
@steotia: which means the optimizer is maintaining compatibility in the final filter tree, as it should
@mayanks: I recommend taking one segment and debugging
@steotia: Yeah I just downloaded one. Will do e2e execution on that
@mayanks: you can also simplify the filters to the smallest number that shows a difference in filtered docs
@jackie.jxt: Let me check
@jackie.jxt: SQL and PQL are both using filter query tree from PQL side, so not sure why there is difference
@jackie.jxt: @steotia Do you put the exact same query into PQL and SQL?
@jackie.jxt: For the testing query? Not sure if PQL takes `LIMIT 50000`
@mayanks: filter should not impact that?
@jackie.jxt: Hmm, correct
@jackie.jxt: There is no difference between SQL and PQL on solving filters, they share the exact same code path
@steotia: I agree. I used the same queries. So, there is good news and bad news
@steotia: The good news is that I am seeing the problem on the perf cluster for a customer, which has pinot-server from June. I am not seeing it on prod. The same queries on prod have the exact same metrics.
@mayanks: Does the latest code have this issue?
@steotia: the bad news is that I don't know how this even happened in June. @jackie.jxt, as part of the query context changes, was there ever a time where the server briefly moved to PinotQuery (as part of the conversion of broker request to QueryContext) and we mistakenly started using the filter from PinotQuery?
@steotia: latest code doesn't reproduce this issue
@mayanks: I think we should just focus on: ```
1. Does the issue exist in prod - if so, mitigate/fix quickly.
2. Does the issue happen in latest code - fix forward
```
@mayanks: We don't need to debug old code
@steotia: it doesn't happen
@mayanks: latest code or prod or both?
@g.kishore: +100 to what Mayank said
@steotia: latest code (in IntelliJ) after downloading segments, and also on prod via pinot-query --sql --show-metadata
@mayanks: Do we have test that compares pql with sql filter? If not, perhaps we can add
@steotia: we do
@steotia: PqlSqlCompatibilityTest
@steotia: does exactly this
@steotia: it didn't fail for the above queries
@mayanks: ok
@mayanks: I see it does not have complex filter predicates. Perhaps add some
@jackie.jxt: I think I might have fixed some bug on the SQL path and made it the same behavior as PQL
@g.kishore: Any update here?
@mayanks: Yes, the issue does not exist in latest code
@g.kishore: Cool
@g.kishore: Can we add a test case?
@mayanks: Yes, either @steotia or I will do that
@mayanks: Apparently there is already one. But it likely does not have complex predicates. We just need to add a query
@steotia: I will add tests tonight.
#latency-during-segment-commit
@ssubrama: @ssubrama has joined the channel
@ssubrama: @ssubrama set the channel purpose: This channel is to discuss Issue #6121
@yupeng: @yupeng has joined the channel
@npawar: @npawar has joined the channel
@ssubrama: Hi Yupeng, let us discuss the problem here.
@yupeng: sg
@ssubrama: first off, there does seem to be a disconnect some place. If events are being sourced at an average of 1000 events/s in each partition (is that true?), then it should not take 2 hours to make 5MB segments unless the compression worked out that well. Can you confirm this for me?
@yupeng: there is some approximation
@yupeng: 1000events/s on the dashboard is calculated using the max value of the 1 min window
@yupeng: so i think it might be lower than that
@yupeng: let me tweak the diagram a bit to use average
@yupeng: hmm, even for avg. i saw 1k+ for the past day
@yupeng: ```
"rta_eats_canonical_rt_order_states__3__4340__20201011T2303Z",
"rta_eats_canonical_rt_order_states__3__4341__20201011T2332Z",
"rta_eats_canonical_rt_order_states__3__4342__20201012T0001Z",
"rta_eats_canonical_rt_order_states__3__4343__20201012T0031Z",
"rta_eats_canonical_rt_order_states__3__4344__20201012T0102Z",
"rta_eats_canonical_rt_order_states__3__4345__20201012T0136Z",
"rta_eats_canonical_rt_order_states__3__4346__20201012T0217Z",
"rta_eats_canonical_rt_order_states__3__4347__20201012T0308Z",
"rta_eats_canonical_rt_order_states__3__4348__20201012T0409Z",
"rta_eats_canonical_rt_order_states__3__4349__20201012T0541Z",
"rta_eats_canonical_rt_order_states__3__4350__20201012T0803Z",
"rta_eats_canonical_rt_order_states__3__4351__20201012T0954Z",
"rta_eats_canonical_rt_order_states__3__4352__20201012T1111Z",
"rta_eats_canonical_rt_order_states__3__4353__20201012T1235Z",
"rta_eats_canonical_rt_order_states__3__4354__20201012T1408Z",
"rta_eats_canonical_rt_order_states__3__4355__20201012T1528Z",
"rta_eats_canonical_rt_order_states__3__4356__20201012T1622Z",
"rta_eats_canonical_rt_order_states__3__4357__20201012T1704Z",
"rta_eats_canonical_rt_order_states__3__4358__20201012T1741Z"
```
@yupeng: these are the latest segments
@yupeng: so it became every 40-50min
@ssubrama: how many rows in each segment?
@yupeng: ```"realtime.segment.flush.threshold.size.llc": "2500000"```
@yupeng: somehow this value is also smaller than the last time I saw it
@yupeng: i need to figure out what happened
@ssubrama: You should set the rows to 0 if you want segment size to take effect.
@ssubrama: Scroll down to the bottom here
@ssubrama: Also, read this page in its entirety. Since I don't have visibility into your use case or your machines, it is hard for me to tell you how to debug this.
@tingchen: @tingchen has joined the channel
@yupeng: Thanks Subbu, let us read it through
@yupeng: we actually did not configure `realtime.segment.flush.desired.size` for this table
@tingchen: I reduced the .threshold.size.llc from 5M to 2.5M last week.
@ssubrama: if the config you showed me is correct, then I think what is happening is that you are ingesting 2.5M rows OR whatever your time limit is. I think you are hitting the time limit (the other configuration). The default time limit is 1 hr (I think).
@ssubrama: Don't use this config. Instead, use the segment size config, as explained in the documentation
@ssubrama: @tingchen ^^
@yupeng: we have this `"realtime.segment.flush.threshold.time": "86400000",`
@ssubrama: You can set this in easy-to-read format, btw.
@ssubrama: You can say `24h` instead.
@ssubrama: So then you are hitting the row limit of 2.5M rows per partition, and it looks like the segment size is 5MB (compressed). You should check what the uncompressed size is, since that is the amount of memory used in the server.
@ssubrama: If you have a multi-tenant setup, then it gets more complicated. You need to basically add up all the memory used by each table on each server and find the total mem usage.
@ssubrama: the idea is to keep the mem usage predictable, so you use the segment size config. The tool helps you determine the segment size.
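A minimal sketch of the stream config being recommended in this thread, using only the keys quoted above. The desired-size value is a made-up placeholder, and whether the `.llc` row-threshold key or the plain `realtime.segment.flush.threshold.size` key is the one to zero out should be confirmed against the docs referenced above:
```
{
  "realtime.segment.flush.threshold.size.llc": "0",
  "realtime.segment.flush.desired.size": "150M",
  "realtime.segment.flush.threshold.time": "24h"
}
```
The idea, per that doc, is that with the row threshold at 0 and a desired segment size set, the per-segment row count is adjusted automatically to hit the target size, while the time threshold caps how long a consuming segment stays open.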
@yupeng: this is from the table size API ```
"rta_eats_canonical_rt_order_states__1__4492__20201009T2155Z": {
  "reportedSizeInBytes": 362570805,
  "estimatedSizeInBytes": 362570805,
  "serverInfo": {
    "Server_streampinot-prod77-dca8_7090": {
      "segmentName": "rta_eats_canonical_rt_order_states__1__4492__20201009T2155Z",
      "diskSizeInBytes": 120856935
    },
    "Server_streampinot-prod75-dca8_7090": {
      "segmentName": "rta_eats_canonical_rt_order_states__1__4492__20201009T2155Z",
      "diskSizeInBytes": 120856935
    },
    "Server_streampinot-prod76-dca8_7090": {
      "segmentName": "rta_eats_canonical_rt_order_states__1__4492__20201009T2155Z",
      "diskSizeInBytes": 120856935
    }
  }
},
```
@ssubrama: If you specify 8 hosts (since that is what you have), then try running it with 8 hosts and memory of 128G and then 192G. It will show you mapped memory. Check if you can afford the paging.
@ssubrama: ah, 360M segment size. And this completes every 2 hours?
@ssubrama: roughly?
@yupeng: i think that’s for 3 replicas?
@yupeng: so each is 120MB
@ssubrama: ah OK.
@ssubrama: How many rows does the segment have? (look in metadata)
@yupeng: ```segment.total.raw.docs = 1250000```
@yupeng: is this?
@yupeng: hmm, why is it half of 2.5M
@tingchen: that is expected. The final segment row count is a function of the target threshold (2.5M) and the number of partitions and segments hosted on the server, IIRC -- the threshold gets divided among them.
@ssubrama: this is why you shouldn't use the numrows config.
@yupeng: @ssubrama do you suggest we only use `realtime.segment.flush.desired.size` ?
@yupeng: and use `RealtimeProvisioningHelperCommand` to determine the segment size during table onboarding?
@yupeng: is this the process for pinot at linkedin?
@ssubrama: yes
@ssubrama: If you want the exact flush threshold you configure, in terms of number of rows, then you can use another config (one that does not divide the num rows by the number of segments), but that is deprecated.
@yupeng: i think your flow makes more sense.
@yupeng: the exact flush threshold seems of little use
@yupeng: what we really care about is the segment size, to optimize realtime perf as you called out in the doc
@yupeng: @tingchen I think we need to adopt this flow
@tingchen: Using realtime.segment.flush.desired.size is fine. How can we use `RealtimeProvisioningHelperCommand` for a multi-tenant env though? Our tenant has multiple tables and users can self-onboard tables.
@tingchen: How can we specify the value for `maxUsableHostMemory` when using this tool?
@tingchen: ```maxUsableHostMemory: This is the total memory available in each host for hosting "retentionHours" worth of segments of this table. Remember to leave some for query processing (or other tables, if you have them in the same hosts)```
@ssubrama: this part is still to be figured out -- automatic provisioning in a multi-tenant cluster. For now, you can run this on all tables, figure out the total mem used (all of it should be mmap), and decide how much paging you are willing to take at the cost of response time.
@ssubrama: if you have a lot of inv indices, then heap mem will be de-allocated at segment build time
@tingchen: ```retentionHours : The realtime segments need to be in memory only until the offline segments are available for use in a hybrid table (for a realtime only table, specify the table retention time in hours). ```
@tingchen: what is the retentionHours for a realtime only table?
@tingchen: this note in the doc seems incomplete here -- for a realtime-only table, is this parameter applicable?
@tingchen: finding memory (heap memory?) for all tables is somewhat challenging.
@tingchen: is there a per table metrics for this?
@ssubrama: It will pick up from the table config. If you have realtime table only, do not specify any other argument
@ssubrama: sorry, "do not specify this argument"
