#general
@vishal.garg: @vishal.garg has joined the channel
@ahmednagwa6: @ahmednagwa6 has joined the channel
@karinwolok1: :wave: Hi and welcome to all the new Apache Pinot community members! :smiley: We're happy to have you! :heart: Please introduce yourselves, tell us who you are and what brought you here! :pencil: @sebastiaan.nicaise @vishal.garg @ahmednagwa6 @abhishek @krtalic.vedran @krivoire @michael.bracklo @kishansairam.adapa9 @cristobal @contact933 @ellie.shen98 @mathieu.druart @lipicsbarna @shyam @prashant.korade @maarten.dubois @sebastian.schulz @xingxingyang @linhan @changtongsu @pavel.stejskal650 @flora.fong @kiril.lstpd @sainikeshk
@krtalic.vedran: Hello everyone :clap:, first of all thanks for accepting me. I'm a software/data engineer at
@zeke.dean: @zeke.dean has joined the channel
@jeff.moszuti: @jeff.moszuti has joined the channel
@ssawyer: @ssawyer has joined the channel
#random
@vishal.garg: @vishal.garg has joined the channel
@ahmednagwa6: @ahmednagwa6 has joined the channel
@zeke.dean: @zeke.dean has joined the channel
@jeff.moszuti: @jeff.moszuti has joined the channel
@ssawyer: @ssawyer has joined the channel
#troubleshooting
@weixiang.sun: One quick question about the instanceAssignmentConfigMap part of my upsert table configuration: ```
"instanceAssignmentConfigMap": {
  "CONSUMING": {
    "tagPoolConfig": {
      "tag": "Upsert_REALTIME",
      "poolBased": true
    },
    "replicaGroupPartitionConfig": {
      "replicaGroupBased": true,
      "numReplicaGroups": 3,
      "numInstancesPerPartition": 0,
      "numPartitions": 1
    }
  },
  "COMPLETED": {
    "tagPoolConfig": {
      "tag": "Upsert_REALTIME",
      "poolBased": true
    },
    "replicaGroupPartitionConfig": {
      "replicaGroupBased": true,
      "numReplicaGroups": 3,
      "numInstancesPerPartition": 0,
      "numPartitions": 1
    }
  }
}
``` Why does the above configuration cause duplicate records in the upsert table? And why does it work correctly after changing “numInstancesPerPartition” from 0 to 1?
@vishal.garg: @vishal.garg has joined the channel
@vishal.garg: How can I download the table data as CSV? From the UI, only 10 records are exported.
@mayanks: If you are looking to export all data out of Pinot, that is not a good use case for it. As for the UI, you can increase the number of records returned by using `limit`
@vishal.garg: thanks :slightly_smiling_face:. It seems that if we don't restrict results with a limit, only 10 records are shown in the UI.
@dunithd: Alternatively, you can use the Java or Python APIs to extract the data from Pinot. For example, you can write a simple Python script to query a Pinot table and write the output into a CSV file.
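A minimal sketch of the Python approach @dunithd describes, using only the standard library against the broker's `/query/sql` REST endpoint. The broker address (`localhost:8099`), table name (`myTable`), and output path are placeholders; adjust them for your cluster.

```python
# Query a Pinot table over HTTP and dump the result to a CSV file.
# Note: without an explicit LIMIT, Pinot caps the result at 10 rows.
import csv
import json
import urllib.request


def query_pinot(broker_url, sql):
    """POST a SQL query to the Pinot broker; return (column_names, rows)."""
    body = json.dumps({"sql": sql}).encode("utf-8")
    req = urllib.request.Request(
        f"{broker_url}/query/sql",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)["resultTable"]
    return result["dataSchema"]["columnNames"], result["rows"]


def write_csv(columns, rows, path):
    """Write the result set to a CSV file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(rows)


if __name__ == "__main__":
    cols, rows = query_pinot(
        "http://localhost:8099", "SELECT * FROM myTable LIMIT 100000"
    )
    write_csv(cols, rows, "myTable.csv")
```

For very large tables even this hits the broker's result-size limits, which is why exporting everything out of Pinot is discouraged above.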
@vishal.garg: it's really helpful. thanks a lot :slightly_smiling_face:
@ahmednagwa6: @ahmednagwa6 has joined the channel
@ahmednagwa6: Hello guys, a basic question: I know that Pinot currently doesn't handle null values correctly. Reading this article, my understanding is that there is a workaround to at least filter by NULL in a SELECT statement
@mark.needham: do you know an identifier of one of the rows that you expect to be null so that we can see what it's actually returning?
@jmeyer: Hello :wave: Is there any way to know / get notified (e.g. via a Kafka message) when a message is done being ingested and available for query?
@mayanks: There isn't one at the message level at the moment. When a segment is committed, the Kafka offset is stored in the segment's ZK metadata.
@mayanks: May I ask how you are planning to use that information?
@jmeyer: I see, thanks for the answer @mayanks. The idea is that Pinot consumes some messages which will affect query results (pretty logical ^^), but some other components rely on these results and *pre-compute* other results outside Pinot (typically a Node.js microservice). Therefore the need is to update the pre-materialized view whenever Pinot results change, i.e. whenever a new message is ingested. We're considering listening to the original message (the one consumed by Pinot) as a trigger for recomputation, but then we'd have a race condition between this consumer and Pinot
@jmeyer: Ideally this pre-materialized view wouldn't exist and everything would be computed directly by Pinot, but that is unfortunately not the case
@jmeyer: But even then, that would be implemented as subqueries, which are not currently supported, so I'm not sure that'd solve the problem
@jmeyer: Sorry for the wall of text :no_mouth:
@mayanks: Maybe create a column that holds some counter/offset that could be used for this synchronization?
@jmeyer: Periodically check that the counter hasn't moved, if so, recompute using latest data ?
@mayanks: I was thinking more like: get the counter value that you want to synchronize on (perhaps from the app's side), and then add a filter on the Pinot query, `where counter <= x`?
@jmeyer: How would this "get the counter" step be triggered? Periodically? Using the same message Pinot will / has (hopefully) already ingested?
@mayanks: It is a high-water-mark of sorts that the other systems wanting to synchronize with Pinot need to maintain (sorry if I am making incorrect assumptions)
@jmeyer: So the other system would periodically check whether this high-water-mark is greater than its latest computation, and if so, fetch updated results? (I've omitted quite a few details, no worries :slightly_smiling_face:)
@jmeyer: Maybe some useful information: new messages won't affect all pre-computations, meaning the recomputations must be partial (e.g. per userId)
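The high-water-mark pattern discussed above can be sketched as a small polling loop: the downstream service remembers the last counter value it pre-computed for, asks Pinot for the current maximum, and recomputes only when the mark has advanced. `advance_watermark` and `fetch_max_counter` are hypothetical names; `fetch_max_counter` stands in for a query like `SELECT MAX(counterCol) FROM myTable` against the broker.

```python
def advance_watermark(last_mark, fetch_max_counter, recompute):
    """Return the new watermark, invoking `recompute` only if Pinot has new data.

    fetch_max_counter: callable returning the current max counter in Pinot
                       (or None if the table is empty / unreachable).
    recompute:         callable taking the observed mark; queries should pin
                       results with `WHERE counterCol <= mark` for stability.
    """
    current = fetch_max_counter()
    if current is not None and current > last_mark:
        recompute(current)
        return current
    return last_mark
```

This avoids the race condition with a second Kafka consumer, at the cost of polling latency; partial recomputation (e.g. per userId) would need a per-key watermark rather than a single global one.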
@zeke.dean: @zeke.dean has joined the channel
@elon.azoulay: We have an upsert table and are trying to use pool-based instance assignment. Only 1 instance from each pool contains all the consuming segments, and we get duplicate rows if we set `COMPLETED` segments to have `numInstancesPerPartition` = 0 (so they can use all instances). Is upsert compatible with pool-based instance assignment?
@elon.azoulay: Here is the instanceAssignmentConfig: ```
"routing": {
  "instanceSelectorType": "strictReplicaGroup"
},
"upsertConfig": {
  "mode": "FULL"
},
"instanceAssignmentConfigMap": {
  "CONSUMING": {
    "tagPoolConfig": {
      "tag": "UpsertAnalytics_REALTIME",
      "poolBased": true,
      "numPools": 3
    },
    "replicaGroupPartitionConfig": {
      "replicaGroupBased": true,
      "numReplicaGroups": 3,
      "numInstancesPerPartition": 1,
      "numPartitions": 1
    }
  },
  "COMPLETED": {
    "tagPoolConfig": {
      "tag": "UpsertAnalytics_REALTIME",
      "poolBased": true,
      "numPools": 3
    },
    "replicaGroupPartitionConfig": {
      "replicaGroupBased": true,
      "numReplicaGroups": 3,
      "numInstancesPerPartition": 0,
      "numPartitions": 1
    }
  }
}
```
@elon.azoulay: We also tried `numPartitions` = # of Kafka partitions for `COMPLETED` segments (e.g. 12); the completed segments spread evenly across instances, but we still get duplicate records.
@elon.azoulay: We're using the Pinot 0.8.0 release and this is for a realtime-only upsert table
@elon.azoulay: cc @jackie.jxt lmk if we should not use the pool based assignment with upsert. Thanks!
@elon.azoulay: cc @mingfeng.tan
@jackie.jxt: @elon.azoulay In order to have upsert work, all the segments from the same partition must be served from the same instance. In other words, the `COMPLETED` segments cannot be relocated to another instance
@jackie.jxt: Upsert is compatible with pool-based instance assignment, but it must have `numInstancesPerPartition` set to 1, and should not relocate the `COMPLETED` segments
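Under these constraints, one reading of the fix is a config that keeps only the `CONSUMING` block with `numInstancesPerPartition: 1` and drops the `COMPLETED` override entirely, on the assumption that without an explicit `COMPLETED` config (and without a relocation task) completed segments stay on the instance that consumed them. This is a sketch based on the thread, not a confirmed answer; tag, pool, and partition values are carried over from the earlier examples. ```
"instanceAssignmentConfigMap": {
  "CONSUMING": {
    "tagPoolConfig": {
      "tag": "UpsertAnalytics_REALTIME",
      "poolBased": true,
      "numPools": 3
    },
    "replicaGroupPartitionConfig": {
      "replicaGroupBased": true,
      "numReplicaGroups": 3,
      "numInstancesPerPartition": 1,
      "numPartitions": 1
    }
  }
}
```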
@elon.azoulay: Thanks! We tried that but noticed that the segments were only on 1 instance in each pool
@elon.azoulay: i.e. we had 3 instances per pool and 2 of them were empty.
@elon.azoulay: i.e. when we tried `numInstancesPerPartition` = 1, then pool0-instance0, pool1-instance0, and pool2-instance0 had all the segments, while pool0-instance1, pool0-instance2, pool1-instance1, pool1-instance2, pool2-instance1, and pool2-instance2 had no segments.
@weixiang.sun: @jackie.jxt Do you have any documentation about why it is designed this way for upsert tables?
@elon.azoulay: Hopefully it's our config. @jackie.jxt does the config above, but with `COMPLETED` `numInstancesPerPartition` = 1, look correct?
@elon.azoulay: That's the config that put all segments on only 1 node in each pool, leaving the others empty
@elon.azoulay: what's the minimum # of nodes used in typical configs? we are trying in staging with 9 nodes, 3 in each of pools 0, 1, and 2 respectively
@jeff.moszuti: @jeff.moszuti has joined the channel
@ssawyer: @ssawyer has joined the channel
#pinot-dev
@kishansairam.adapa9: @kishansairam.adapa9 has joined the channel
#getting-started
@krtalic.vedran: @krtalic.vedran has joined the channel
#debug_upsert
@weixiang.sun: @weixiang.sun has joined the channel