#general


@hackinguall: @hackinguall has joined the channel
@tymm: @tymm has joined the channel
@tymm: Hello, I followed this guide:  and have successfully set up the cluster in Docker on my Ubuntu VM. However, once I restart the Docker service / reboot the Ubuntu VM, I find that the broker log has this error: `No server found for request 8: select * from sitevisitordata limit 10` Please advise. Thanks
  @g.kishore: does your ip change?
  @fx19880617:
@joao.comini: Hello guys! I'm here to talk about the use case that my team and I are facing right now. We have a realtime data processing platform that provides general data aggregations (sums, counts, averages, etc.) to a transactional fraud detection engine. In summary, we consume data from Kafka topics, and with this data we increment counters for a given ID (e.g. a credit card hash) and then update them in the database. This solution has really great performance, as we are accessing key-value indexed columns, but we have a really tough time creating new pre-aggregation flows. So we thought: "what if we had a self-service data aggregation engine, where we didn't need to code every step of every pre-aggregation flow?". And here we are! We've been looking at Pinot for a long time now, and we're still not sure if it is going to fit our scenario. The major problems that we have today are:
• It takes us approximately 2 or 3 days to deliver a new pre-aggregation flow.
• Our pre-aggregation algorithm has some "imprecision", as we work with a big time window inside our aggregation technique.
I want to give you some metrics here, so maybe you can help me figure out whether Pinot is suitable for us:
• Our fraud detection engine runs, at its peak, a throughput of 7~8k transactions per minute.
• For each transaction we make dozens (if not hundreds) of requests to our pre-aggregation platform, which gives us a throughput of ~100k queries per minute.
• Our pre-aggregation latency SLA is 1 second to return *ALL* queries.
I'm just putting it here to discuss similar use cases and understand whether the team's effort in starting something new, versus maintaining what already exists, is worth it. I apologize if this is not the best way to introduce myself and start a discussion here. :smile:
  @mayanks: Hello
  @mayanks: Are you trying to pre-cube all data and store it in a KV store? With Pinot, you don’t need to pre-cube; you can simply ingest the records (with a count of 1), and Pinot can aggregate at read time
  @joao.comini: Yes, that is what we are doing right now. We are not using Pinot yet. We read about the star-tree index, and it looks like exactly what we are looking for.
  @mayanks: Pinot can ingest at much higher rates than what you currently have. Also, it is optimized for read-time aggregations. From what I gather from your description, it seems like a good use case for Pinot.
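  For illustration, the read-time aggregation described above might look like this, assuming a hypothetical `payments` table where each Kafka record is ingested as a single raw row (table and column names are placeholders for this sketch):
  ```
  -- No pre-cubing step: each Kafka record lands as one raw row,
  -- and the counters are computed at query time.
  SELECT customer_id,
         COUNT(*)   AS txn_count,   -- every raw row contributes a count of 1
         SUM(value) AS txn_total
  FROM payments
  WHERE created_at >= 1600000000000  -- window start, epoch millis
  GROUP BY customer_id
  ```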
  @mayanks: Happy to chat further to assist
  @joao.comini: Nice, thanks! I think that the best way to answer my questions is building a Proof of Concept, so I can bring the results here and open a new thread if needed.
  @mayanks: Awesome
  @joao.comini: Given our scenario, is it possible to get an idea of which filesystem would be best for our deep storage? Or is there not much performance difference between them?
  @mayanks: Folks have used common ones like S3, GCS, etc.
  @mayanks: Deep store is not in the read path, only in the ingestion path
  @joao.comini: Oh, you're right!
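  For reference, pointing a cluster at S3 for deep storage is a configuration concern on the controller side. A minimal sketch, based on Pinot's S3 filesystem plugin docs (bucket, path, and region are placeholders):
  ```
  controller.data.dir=s3://your-bucket/pinot/controller-data
  controller.local.temp.dir=/tmp/pinot-tmp-data
  pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
  pinot.controller.storage.factory.s3.region=us-east-1
  pinot.controller.segment.fetcher.protocols=file,http,s3
  pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
  ```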
  @joao.comini: Ok, that's enough for today, thank you!
  @mayanks: :+1:
  @g.kishore: do you have rough schema of the topic or the table you plan to create?
  @joao.comini: That's a simplified DTO that we use inside our pre-aggregation system; I think the table schema would look about the same ```data class Payment( val id: String, val orderId: UUID, val value: BigDecimal?, val status: String, val customerId: UUID, val createdAt: Instant?, val cardCvvHash: String?, val cardHash: String?, val deviceId: String? )```
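  For reference, that DTO would translate to a Pinot schema roughly like the one below. This is a sketch: the type mappings (UUIDs as STRING, BigDecimal as DOUBLE, Instant as epoch millis) are assumptions, not something the thread settled on.
  ```
  {
    "schemaName": "payments",
    "dimensionFieldSpecs": [
      { "name": "id", "dataType": "STRING" },
      { "name": "orderId", "dataType": "STRING" },
      { "name": "status", "dataType": "STRING" },
      { "name": "customerId", "dataType": "STRING" },
      { "name": "cardCvvHash", "dataType": "STRING" },
      { "name": "cardHash", "dataType": "STRING" },
      { "name": "deviceId", "dataType": "STRING" }
    ],
    "metricFieldSpecs": [
      { "name": "value", "dataType": "DOUBLE" }
    ],
    "dateTimeFieldSpecs": [
      { "name": "createdAt", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS" }
    ]
  }
  ```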
  @g.kishore: and the query patterns
  @joao.comini: All queries have a similar pattern, something like this: ```SELECT SUM(value) from payments WHERE customer_id = 'some-id' AND created_at >= NOW() - interval 1 day --------------- SELECT count(id) from payments WHERE customer_id = 'some-id' AND status = 'REFUSED' AND created_at >= NOW() - interval 1 day```
  @g.kishore: ok, then make customer_id as the sorted index
  @g.kishore: you may not really need star-tree index
  @joao.comini: Ok, but then we might combine IDs, like: ```SELECT SUM(value) from payments WHERE customer_id = 'some-id' AND device_id = 'other-id' AND created_at >= NOW() - interval 1 day```
  @joao.comini: Or just make queries about the card hashes ```SELECT count(id) from payments WHERE card_hash = 'some-hash' AND status = 'REFUSED' AND created_at >= NOW() - interval 1 day```
  @mayanks: `card_hash` can have an inverted index on it, when the data is sorted on customer_id
  @mayanks: You can pick the column that appears in most queries (like customer_id) as the sorted column. For other queries where this column does not appear, we can have an inverted index on some of the other columns (which prunes most of the rows).
  @joao.comini: Right, that looks good
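  Putting the indexing advice above together, the relevant fragment of the table config would look something like this (a sketch of Pinot's `tableIndexConfig` section; the exact column choices depend on the final query mix):
  ```
  "tableIndexConfig": {
    "sortedColumn": ["customer_id"],
    "invertedIndexColumns": ["card_hash", "device_id"],
    "loadMode": "MMAP"
  }
  ```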
  @karinwolok1: I actually think this was a GREAT introduction. :smile:
  @rhafik.gonzalez: Hi, guys! Thank you so much for all the support! João and I intend to run some proofs of concept soon, and we will be very happy to keep in touch and share our experiences and struggles!
  @g.kishore: :+1:
@renatomoura: @renatomoura has joined the channel
@lxu1: @lxu1 has joined the channel

#random


@hackinguall: @hackinguall has joined the channel
@tymm: @tymm has joined the channel
@renatomoura: @renatomoura has joined the channel
@lxu1: @lxu1 has joined the channel

#troubleshooting


@pradeepgv42: Hi, I added an OFFLINE version of an existing REALTIME table. When I try to query the offline table using `select * from table_OFFLINE`, I get a BrokerResourceMissing error; the tenant is the same as on the brokers I have. So I'm wondering if anybody has any ideas on what's going on?
  @jackie.jxt: Can you please check the `brokerResource` ideal state and see if the table exists?
  @pradeepgv42: Thanks @jackie.jxt for the help, I needed to update the schemaName to be the same as the table name.
@elon.azoulay: We have users that want to query offline data that is newer than 24 hours old via the broker (for aggregations). Is there any way to remove the time boundary restriction for offline tables, or would it be very involved to change (add a config, etc.)? Let me know if there's a workaround, or if someone can point me to the code.
  @ssubrama: You can specify the table name in the query with the `_OFFLINE` suffix. The broker will skip querying realtime servers
  @elon.azoulay: Thanks so much! That's a great idea:)
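  Concretely, the suggestion is to query the offline table directly, which bypasses the hybrid-table time boundary (table and column names here are hypothetical):
  ```
  -- Hits only OFFLINE segments, including data newer than the time boundary
  SELECT SUM(value)
  FROM myTable_OFFLINE
  WHERE created_at >= 1600000000000
  ```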