#general
@kautsshukla: Hi all, I have a real-time table consuming from Kafka. As of now it has 5 billion records. I’m performing lookup predicates on the inverted-index columns [userid, eventcategory, eventlabel], and using the “metricFieldSpecs” column timestampist for the range condition. My query is taking too long to finish, almost always > 10 seconds. How can I configure the table for the best performance? Query: ```select userid,eventlabel,sessionid, MIN(timestampist) as mint, MAX(timestampist) as maxt, (MAX(timestampist) - MIN(timestampist)) as diff_time from default.click_stream where eventlabel !='null' and timestampist between 1615833000000 and 1616225312000 group by userid,eventlabel,sessionid```
@mayanks: How long does it take if you just do count(*) without the group by but using the same where clause? And how many records does it select? My guess is your query selects too many rows, so the group by is slow. In that case, maybe try star-tree indexing.
@kautsshukla: @mayanks: `select count(*) from click_stream` is taking 250ms on average.
@kautsshukla: @mayanks: It’s taking 5 seconds for ```select MIN(timestampist) as mint, MAX(timestampist) as maxt, (MAX(timestampist) - MIN(timestampist)) as diff_time from click_stream where timestampist between 1612264456000 and 1619962345000```. numDocsScanned: 442963466
@kautsshukla: ```select userid,eventlabel, MIN(timestampist) as mint, MAX(timestampist) as maxt, (MAX(timestampist) - MIN(timestampist)) as diff_time from click_stream where timestampist between 1612264456000 and 1619962345000 group by userid,eventlabel``` This is taking 60 seconds. numDocsScanned: 221866025
@mayanks: Yeah, you are trying to group by 442M records. How many nodes, and what’s the JVM heap size? Your query is essentially very expensive.
@kautsshukla: I have re-created the table with a star-tree index:
```
{
  "tableName": "click_stream_REALTIME",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
    "timeColumnName": "created_at",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "90",
    "segmentPushType": "APPEND",
    "timeType": "DAYS",
    "replication": "1",
    "replicasPerPartition": "1",
    "schemaName": "click_stream"
  },
  "tenants": { "broker": "DefaultTenant", "server": "DefaultTenant" },
  "tableIndexConfig": {
    "autoGeneratedInvertedIndex": true,
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "simple",
      "stream.kafka.topic.name": "vedantu.dp.click_stream.click_stream",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.hlc.zk.connect.string": "zk1:2181/kafka",
      "stream.kafka.zk.broker.url": "zk1:2181/kafka",
      "stream.kafka.broker.list": "kafka01:9092",
      "realtime.segment.flush.threshold.time": "3600000",
      "realtime.segment.flush.threshold.size": "50000",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
    },
    "aggregateMetrics": false,
    "enableDefaultStarTree": false,
    "enableDynamicStarTreeCreation": false,
    "nullHandlingEnabled": true,
    "createInvertedIndexDuringSegmentGeneration": false,
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["eventcategory", "eventlabel", "userid", "sessionid"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": ["MIN__timestampist", "MAX__timestampist"]
      }
    ],
    "rangeIndexColumns": ["timestampist", "created_at"]
  },
  "metadata": { "customConfigs": {} },
  "routing": { "instanceSelectorType": "replicaGroup" },
  "isDimTable": false
}
```
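One likely reason a star-tree built this way won’t kick in: Pinot only serves a query from a star-tree when every filter column is one of the dimensions in `dimensionsSplitOrder` (and the aggregations match `functionColumnPairs`). The queries here filter on a `timestampist` range, and `timestampist` is a metric, not a split-order dimension, so the star-tree is bypassed. A minimal sketch of a variant that could cover the time filter, assuming a derived day-granularity column (`day_ist` is hypothetical and would have to be produced by an ingestion transform):
```
"starTreeIndexConfigs": [
  {
    "dimensionsSplitOrder": ["day_ist", "eventlabel", "userid", "sessionid"],
    "skipStarNodeCreationForDimensions": [],
    "functionColumnPairs": ["MIN__timestampist", "MAX__timestampist"],
    "maxLeafRecords": 10000
  }
]
```
Queries would then express the time range as a predicate on `day_ist`, so the star-tree can prune pre-aggregated nodes instead of scanning raw rows.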
@kautsshukla: @mayanks but it doesn’t help here. What should I do? Please suggest.
@mayanks: The query is really expensive.
@mayanks: How many servers, and what’s the JVM heap?
@kautsshukla: Yeah, there are 2 servers with 100GB heap space
@mayanks: What’s the use case? Is this going to be a real production query, or are you experimenting?
@kautsshukla: This is going to be a prod query: ```select userid,eventlabel,sessionid, MIN(timestampist) as mint, MAX(timestampist) as maxt, (MAX(timestampist) - MIN(timestampist)) as diff_time from default.click_stream where sessionid !='null' and timestampist between 1615833000000 and 1616225312000 group by userid,eventlabel,sessionid```
@kautsshukla: Scanning 3 months of data.
@mayanks: What’s your time column granularity?
@kautsshukla: 1 minute
@mayanks: Hmm, can you live with 1 day?
@mayanks: Since you are going back 3 months?
@mayanks: And use aggregate metrics. This will reduce the number of rows and improve latency.
@kautsshukla: I’ll try now
@kautsshukla: ```"dateTimeFieldSpecs": [ { "name": "created_at", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:DAY" } ]```
@kautsshukla: @mayanks should I try this?
@mayanks: Also enable metrics aggregation.
@kautsshukla: OK
@kautsshukla: `"aggregateMetrics": false` → `true`
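One caveat with this combination: the `granularity` field in `dateTimeFieldSpecs` is descriptive metadata and does not truncate incoming values by itself, so rows stay distinct at millisecond precision and `aggregateMetrics` has nothing to collapse. The values need to be rounded at ingestion time. A minimal sketch using Pinot’s ingestion transforms and the built-in `toEpochDays` function (`created_at_days` is a hypothetical derived column; this assumes a Pinot version with `ingestionConfig` support):
```
"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "created_at_days",
      "transformFunction": "toEpochDays(created_at)"
    }
  ]
}
```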
@mayanks: Let’s move to troubleshooting.
@kautsshukla: @mayanks yes
@kautsshukla: @mayanks The only aggregation supported with `"aggregateMetrics": true` right now is `SUM`, which we aren’t using in our queries. And now my queries aren’t even responding within 60 seconds with the new table definition with the star-tree index.
@kautsshukla: This is the schema:
```
{
  "schemaName": "click_stream",
  "dimensionFieldSpecs": [
    { "name": "id", "dataType": "STRING" },
    { "name": "ipaddress", "dataType": "STRING" },
    { "name": "city", "dataType": "STRING" },
    { "name": "region", "dataType": "STRING" },
    { "name": "country", "dataType": "STRING" },
    { "name": "continent", "dataType": "STRING" },
    { "name": "postalcode", "dataType": "STRING" },
    { "name": "latitude", "dataType": "DOUBLE" },
    { "name": "longitude", "dataType": "DOUBLE" },
    { "name": "isp", "dataType": "STRING" },
    { "name": "useragent", "dataType": "STRING" },
    { "name": "browsername", "dataType": "STRING" },
    { "name": "browserversion", "dataType": "STRING" },
    { "name": "osname", "dataType": "STRING" },
    { "name": "osversion", "dataType": "STRING" },
    { "name": "devicename", "dataType": "STRING" },
    { "name": "deviceversion", "dataType": "STRING" },
    { "name": "clientid", "dataType": "STRING" },
    { "name": "loggedin", "dataType": "STRING" },
    { "name": "grade", "dataType": "STRING" },
    { "name": "target", "dataType": "STRING" },
    { "name": "board", "dataType": "STRING" },
    { "name": "devicetype", "dataType": "STRING" },
    { "name": "referrer", "dataType": "STRING" },
    { "name": "utm_source", "dataType": "STRING" },
    { "name": "utm_medium", "dataType": "STRING" },
    { "name": "utm_campaign", "dataType": "STRING" },
    { "name": "utm_term", "dataType": "STRING" },
    { "name": "utm_content", "dataType": "STRING" },
    { "name": "channel", "dataType": "STRING" },
    { "name": "sessionid", "dataType": "STRING" },
    { "name": "domain", "dataType": "STRING" },
    { "name": "fullurl", "dataType": "STRING" },
    { "name": "pagepath", "dataType": "STRING" },
    { "name": "pagetitle", "dataType": "STRING" },
    { "name": "screenname", "dataType": "STRING" },
    { "name": "type", "dataType": "STRING" },
    { "name": "appname", "dataType": "STRING" },
    { "name": "apptrackingid", "dataType": "STRING" },
    { "name": "screenresolution", "dataType": "STRING" },
    { "name": "viewportsize", "dataType": "STRING" },
    { "name": "eventcategory", "dataType": "STRING" },
    { "name": "eventaction", "dataType": "STRING" },
    { "name": "eventlabel", "dataType": "STRING" },
    { "name": "k1", "dataType": "STRING" },
    { "name": "v1", "dataType": "STRING" },
    { "name": "k2", "dataType": "STRING" },
    { "name": "v2", "dataType": "STRING" },
    { "name": "k3", "dataType": "STRING" },
    { "name": "v3", "dataType": "STRING" },
    { "name": "k4", "dataType": "STRING" },
    { "name": "v4", "dataType": "STRING" },
    { "name": "k5", "dataType": "STRING" },
    { "name": "v5", "dataType": "STRING" },
    { "name": "k6", "dataType": "STRING" },
    { "name": "v6", "dataType": "STRING" },
    { "name": "k7", "dataType": "STRING" },
    { "name": "v7", "dataType": "STRING" },
    { "name": "k8", "dataType": "STRING" },
    { "name": "v8", "dataType": "STRING" },
    { "name": "k9", "dataType": "STRING" },
    { "name": "v9", "dataType": "STRING" },
    { "name": "k10", "dataType": "STRING" },
    { "name": "v10", "dataType": "STRING" },
    { "name": "useremail", "dataType": "STRING" },
    { "name": "firstpagetitle", "dataType": "STRING" },
    { "name": "first_fullurl", "dataType": "STRING" },
    { "name": "deviceuniqueid", "dataType": "STRING" },
    { "name": "deviceid", "dataType": "STRING" },
    { "name": "devicebrand", "dataType": "STRING" },
    { "name": "devicemanufacturer", "dataType": "STRING" },
    { "name": "devicemodel", "dataType": "STRING" },
    { "name": "serialnumber", "dataType": "STRING" },
    { "name": "buildnumber", "dataType": "STRING" },
    { "name": "carrier", "dataType": "STRING" },
    { "name": "batterylevel_new", "dataType": "STRING" },
    { "name": "firstinstalltime", "dataType": "INT" },
    { "name": "isemulator", "dataType": "STRING" },
    { "name": "istablet", "dataType": "STRING" },
    { "name": "islandscape", "dataType": "STRING" },
    { "name": "batterylevel", "dataType": "INT" },
    { "name": "firstinstalltimeist", "dataType": "INT" },
    { "name": "firstopened_datetime", "dataType": "LONG" },
    { "name": "daysfrom_firstopened", "dataType": "INT" },
    { "name": "doubtactionablecount", "dataType": "INT" },
    { "name": "gapp_userstatus", "dataType": "STRING" },
    { "name": "gaid", "dataType": "STRING" },
    { "name": "last_attributed_touch_type", "dataType": "STRING" },
    { "name": "last_attributed_touch_timestamp", "dataType": "STRING" },
    { "name": "last_attributed_touch_timestamp_iso", "dataType": "LONG" },
    { "name": "last_attributed_touch_data_tilde_campaign_id", "dataType": "STRING" },
    { "name": "last_attributed_touch_data_tilde_branch_ad_format", "dataType": "STRING" },
    { "name": "last_attributed_touch_data_tilde_technology_partner", "dataType": "STRING" },
    { "name": "last_attributed_touch_data_tilde_banner_dimensions", "dataType": "STRING" },
    { "name": "firsttransactiondate", "dataType": "LONG" },
    { "name": "gpslatitude", "dataType": "DOUBLE" },
    { "name": "gpslongitude", "dataType": "DOUBLE" },
    { "name": "gpscity", "dataType": "STRING" },
    { "name": "source", "dataType": "STRING" },
    { "name": "version", "dataType": "STRING" },
    { "name": "userid", "dataType": "STRING" },
    { "name": "calc_userid", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "timestampist", "dataType": "LONG" }
  ],
  "timeFieldSpec": {
    "incomingGranularitySpec": {
      "name": "created_at",
      "dataType": "LONG",
      "timeType": "MILLISECONDS"
    }
  },
  "primaryKeyColumns": [ "userid", "eventcategory", "eventlabel" ]
}
```
@kautsshukla: It’s not an upsert table. This is the table config I’m using:
```
{
  "REALTIME": {
    "tableName": "click_stream_REALTIME",
    "tableType": "REALTIME",
    "segmentsConfig": {
      "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
      "timeColumnName": "created_at",
      "retentionTimeUnit": "DAYS",
      "retentionTimeValue": "90",
      "segmentPushType": "APPEND",
      "timeType": "DAYS",
      "replication": "1",
      "replicasPerPartition": "1",
      "schemaName": "click_stream"
    },
    "tenants": { "broker": "DefaultTenant", "server": "DefaultTenant" },
    "tableIndexConfig": {
      "rangeIndexColumns": ["created_at", "timestampist"],
      "autoGeneratedInvertedIndex": false,
      "sortedColumn": ["userid", "eventcategory", "eventlabel"],
      "loadMode": "MMAP",
      "streamConfigs": {
        "streamType": "kafka",
        "stream.kafka.consumer.type": "simple",
        "stream.kafka.topic.name": "click_stream",
        "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
        "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
        "stream.kafka.hlc.zk.connect.string": "zk1:2181/kafka",
        "stream.kafka.zk.broker.url": "zk1:2181/kafka",
        "stream.kafka.broker.list": "kfk1:9092",
        "realtime.segment.flush.threshold.time": "3600000",
        "realtime.segment.flush.threshold.size": "50000",
        "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
      },
      "aggregateMetrics": true,
      "enableDefaultStarTree": true,
      "enableDynamicStarTreeCreation": true,
      "nullHandlingEnabled": false,
      "invertedIndexColumns": ["userid", "eventcategory", "eventlabel"],
      "createInvertedIndexDuringSegmentGeneration": true
    },
    "metadata": { "customConfigs": {} },
    "routing": { "instanceSelectorType": "replicaGroup" },
    "isDimTable": false
  }
}
```
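A side note on this config: Pinot honors only a single sorted column, so of the three entries listed, only the first (`userid`) is actually sorted; the other two still get inverted indexes from `invertedIndexColumns`. If sorting on `userid` is the intent, the minimal form would be:
```
"sortedColumn": ["userid"]
```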
#troubleshooting
@mayanks: @kautsshukla Moving to troubleshooting:
@mayanks: Given that the query is inherently expensive, perhaps you can share a bit about the use case so we can see if the query itself can be modified
@kautsshukla: @mayanks We have two types of queries ```select userid,eventlabel,sessionid, MIN(timestampist) as mint, MAX(timestampist) as maxt, (MAX(timestampist) - MIN(timestampist)) as diff_time from default.click_stream where sessionid !='null' and timestampist between 1615833000000 and 1616225312000 group by userid,eventlabel,sessionid ```
@kautsshukla: ```select userid, v1, max(timestampist) as timestampist from click_stream where eventlabel='app_classroom_video_seen' and k1 = 'video_id' and timestampist between 1608057000000 and 1615965662009 group by userid, v1```
@kautsshukla: These queries are taking too much time to process every time. They take more than 60 seconds and get timed out. Total docs as of now: 310767668
@mayanks: When I say use case, I mean: what is the end goal? Is this a dashboard? How is it going to be used, etc.?
@kautsshukla: It’s going to be a dashboard, and around 500 concurrent users may use it in parallel. Every time a user opens this dashboard it will fire these queries.
@mayanks: So if there are 500M records selected in the query with 1-minute granularity, there will be millions of groups (at least)
@mayanks: Is the dashboard going to show all that?
@kautsshukla: As you suggested, I changed it to 1-day granularity, since I can live with that.
@mayanks: Can you do offline also?
@mayanks: For example, real-time for 1 day, and pre-aggregate in offline with star-tree for 29 days.
@kautsshukla: How is that going to help here?
@mayanks: Offline preaggregation with star tree will be very fast
@kautsshukla: I have to maintain 2 tables, one real-time with one-day granularity and one with 29 days?
@mayanks: Nope
@kautsshukla: Offline with 29 days?
@mayanks: Single table
@kautsshukla: OK, what changes should I make?
@mayanks: Single table with offline and real-time components. Generate the star-tree in offline and push.
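For reference, a minimal sketch of what the OFFLINE half of such a hybrid table could look like, reusing the star-tree config from the earlier attempt (same logical table name and schema; the broker merges offline and real-time results across the time boundary, and segments can be pushed by a batch job or, in newer releases, moved by the `RealtimeToOfflineSegmentsTask`):
```
{
  "tableName": "click_stream",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "timeColumnName": "created_at",
    "timeType": "DAYS",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "90",
    "segmentPushType": "APPEND",
    "replication": "1",
    "schemaName": "click_stream"
  },
  "tenants": { "broker": "DefaultTenant", "server": "DefaultTenant" },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["eventcategory", "eventlabel", "userid", "sessionid"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": ["MIN__timestampist", "MAX__timestampist"]
      }
    ]
  },
  "metadata": { "customConfigs": {} }
}
```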
@mayanks: Do you have an estimate of how many unique sessions per day? And how many unique groups do you expect?
@mayanks: If we can somehow control the number of unique groups a query has to process, that would help.
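One way to bound the group count, assuming each dashboard view is scoped to a single user: add a `userid` equality predicate so the inverted index on `userid` prunes rows before the group-by runs (the user id literal below is hypothetical):
```
select userid, eventlabel, sessionid,
       MIN(timestampist) as mint,
       MAX(timestampist) as maxt,
       MAX(timestampist) - MIN(timestampist) as diff_time
from click_stream
where userid = 'u-12345' -- hypothetical user id
  and sessionid != 'null'
  and timestampist between 1615833000000 and 1616225312000
group by userid, eventlabel, sessionid
```
With an equality filter on an inverted-index column, numDocsScanned drops to roughly one user’s events instead of hundreds of millions, and the group-by only sees that user’s sessions.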
@kautsshukla: I’ll check and let you know