#general
@liandycg_slack: @liandycg_slack has joined the channel
@nikhil.varma: @nikhil.varma has joined the channel
@janardhan.bodu: @janardhan.bodu has joined the channel
@lars-kristian_svenoy: Hello team :wave: Any chance we could publish linux/arm64 images for pinot? I see we've started doing that in 0.11.0, but 0.10 and below do not support that architecture. I'm running into problems running pinot locally on the Mac M1 due to the chipset.
@kharekartik: Hi, we can do that. For current needs, you can build Pinot from source on the M1. Add this to your `~/.m2/settings.xml`:
```
<settings>
  <activeProfiles>
    <activeProfile>apple-silicon</activeProfile>
  </activeProfiles>
  <profiles>
    <profile>
      <id>apple-silicon</id>
      <properties>
        <os.detected.classifier>osx-x86_64</os.detected.classifier>
      </properties>
    </profile>
  </profiles>
</settings>
```
and then run the following from the Pinot source directory: `mvn clean package -DskipTests -Pbin-dist`
@navina: @kharekartik can we document this in the pinot website?
@kharekartik: yes, will add it
@francois: Hi :slightly_smiling_face: Little question, as prod is getting closer and real data is arriving :smile: A few things are going wrong. I have two tables reading the same Kafka topic. Both of them use a complexTypeConfig to unnest 30-day arrays, and I'm getting an infinite recursion error:
```
java.lang.RuntimeException: shaded.com.fasterxml.jackson.databind.JsonMappingException: Infinite recursion (StackOverflowError) (through reference chain: org.apache.pinot.spi.data.readers.GenericRow["fieldToValueMap"]->java.util.Collections$UnmodifiableMap["$MULTIPLE_RECORDS_KEY$"]->java.util.ArrayList[0]->org.apache.pinot.spi.data.readers.GenericRow["fieldTo>
    at org.apache.pinot.spi.data.readers.GenericRow.toString(GenericRow.java:247) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
    at java.util.Formatter$FormatSpecifier.printString(Formatter.java:3031) ~[?:?]
    at java.util.Formatter$FormatSpecifier.print(Formatter.java:2908) ~[?:?]
    at java.util.Formatter.format(Formatter.java:2673) ~[?:?]
    at java.util.Formatter.format(Formatter.java:2609) ~[?:?]
    at java.lang.String.format(String.java:2897) ~[?:?]
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.processStreamEvents(LLRealtimeSegmentDataManager.java:543) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.consumeLoop(LLRealtimeSegmentDataManager.java:420) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager$PartitionConsumer.run(LLRealtimeSegmentDataManager.java:598) [pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
    at java.lang.Thread.run(Thread.java:829) [?:?]
```
The complexTypeConfig is as follows:
```
"complexTypeConfig": {
  "fieldsToUnnest": [
    "data.attributes.regularTimes"
  ],
  "delimiter": ".",
  "collectionNotUnnestedToJson": "NON_PRIMITIVE"
}
```
The other table has the same complexTypeConfig but based on another field. Any idea?
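(For context, a rough sketch of what `fieldsToUnnest` does here, using a made-up payload rather than the real production data: each element of the unnested array becomes its own row, with parent fields repeated and nested names flattened using the `.` delimiter.)
```
// hypothetical input event (illustrative field names)
{"data": {"attributes": {"siteId": 42, "regularTimes": [
  {"day": "2022-04-01", "hours": 8},
  {"day": "2022-04-02", "hours": 7}
]}}}

// rows produced after unnesting (illustrative)
{"data.attributes.siteId": 42, "data.attributes.regularTimes.day": "2022-04-01", "data.attributes.regularTimes.hours": 8}
{"data.attributes.siteId": 42, "data.attributes.regularTimes.day": "2022-04-02", "data.attributes.regularTimes.hours": 7}
```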
@mayanks: @jackie.jxt
@francois: I will try to increase the Xss size on the server side to avoid that :confused: It's getting 39k messages with at least 30 days in each to unnest; not sure it likes that :smile:
@francois: Reducing the read rate did the trick, using `"topic.consumption.rate.limit": "2"`
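(For reference, a minimal sketch of where that property goes, assuming the usual realtime table `streamConfigs` block; the other entries are placeholders:)
```
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.topic.name": "<topic>",
  "topic.consumption.rate.limit": "2"
}
```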
@mayanks: Hmm how big is each event and how many levels deep is the nesting
@francois: Big: 30 days per event and 27 cols
@mayanks: What does 30 days per event mean?
@francois: an array with 30 days in a single event that I'm unnesting
@mayanks: Ok. What is the event rate?
@francois: For now it's quite big because there are a lot (39k) of messages in the Kafka queue, but it's expected to be 3 to 4 per second
@francois: It keeps failing even with a slow message rate :(
@mayanks: Any reason of having 30 days worth of data in one event? That seems like an anti pattern
@francois: Yes, the 30 days are linked to a top-level object summing up a few things
@francois: It's a time report.
@francois: Might using more partitions help? Only two partitions now :)
@mayanks: Your ingestion rate is really low, not sure if that will help
@mayanks: Can the upstream not flatten the events to be 1 row? Also, do I understand it right, one event has an array of 30 elements, and each element is a row of 27 columns?
@francois: Yes you are right
@mayanks: If so, it doesn't seem terribly bad. The root cause of the infinite recursion might be something different, and it's worth filing an issue
@mayanks: If you can provide a sample payload with the issue that can help reproduce it, we should be able to identify the root cause, and hopefully fix it
@mayanks: Can you file an issue and paste a link here?
@francois: I will look for a bad message and try to reproduce the issue on my local instance.
@mayanks: Sounds good, thanks
@janardhan.bodu: Hi team @mayanks @g.kishore @xiangfu0. I'm doing a POC to use Pinot: we have some 70 tables (in an older database), and mainly 10 to 15 of them are currently queried with joins to get aggregations and analytics for the system. Our data is growing at a fast pace and I wanted to check whether Pinot satisfies our needs. I want to use Presto with Pinot, keeping the existing join queries and minimizing new data modelling for Pinot. I found some benchmarks here (
@mayanks: The number of tables shouldn’t really impact join performance. @yupeng for any data points
#random
@liandycg_slack: @liandycg_slack has joined the channel
@nikhil.varma: @nikhil.varma has joined the channel
@janardhan.bodu: @janardhan.bodu has joined the channel
#feat-text-search
@francois: @francois has joined the channel
#troubleshooting
@liandycg_slack: @liandycg_slack has joined the channel
@nikhil.varma: @nikhil.varma has joined the channel
@saumya2700: Hi everyone, we have realtime tables with data ingestion happening from Kafka, but our query performance is very poor even though we only have around 13 lakhs (1.3M) rows in total. Query time is 17 secs; we have 1 tenant, 1 broker, and 2 servers. Do we need to create indexes separately, or is it done by default on columns? I saw that some indexes are created. Also, is there an option to create segments per Kafka topic key? We usually query on timestamp and id, and our Kafka topics have id as the key.
@mayanks: do you have query response metadata you can share?
@saumya2700: "numServersQueried": 2, "numServersResponded": 2, "numSegmentsQueried": 65, "numSegmentsProcessed": 18, "numSegmentsMatched": 13, "numConsumingSegmentsQueried": 10, "numDocsScanned": 10603, "numEntriesScannedInFilter": 391632, "numEntriesScannedPostFilter": 137839, "numGroupsLimitReached": false, "totalDocs": 1220116, "timeUsedMs": 87, "offlineThreadCpuTimeNs": 0, "realtimeThreadCpuTimeNs": 0, "offlineSystemActivitiesCpuTimeNs": 0, "realtimeSystemActivitiesCpuTimeNs": 0, "offlineResponseSerializationCpuTimeNs": 0, "realtimeResponseSerializationCpuTimeNs": 0, "offlineTotalCpuTimeNs": 0, "realtimeTotalCpuTimeNs": 0, "segmentStatistics": [], "traceInfo": {}, "numRowsResultSet": 5000, "minConsumingFreshnessTimeMs": 1649763344790
@mayanks: Ok, one thing I can see is that too much data is being scanned (`391632` entries in the filter), so you probably need to set up some indexing
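(As an illustration only: indexes are declared in the table config under `tableIndexConfig`. A minimal sketch, assuming the filter columns are named `id` and `eventTimestamp`; substitute the actual columns used in the query's WHERE clause.)
```
"tableIndexConfig": {
  "invertedIndexColumns": ["id"],
  "rangeIndexColumns": ["eventTimestamp"]
}
```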
@mayanks: But 17s is too much. So the next questions: a) what is the query, b) what is the CPU/mem for the servers, c) how many segments?
@mayanks: Wait `"timeUsedMs": 87,` this is the time Pinot used to compute the query
@mayanks: Where are you seeing 17s?
@mayanks: If client side, then my guess is that the response is big and your JSON deser is the bottleneck
@saumya2700: Yes, I do have JSON fields. And can you please tell me, is there a way we can put all data related to one key in the same segment? In other words, can we create segments per Kafka topic key?
@mayanks: No I am not talking about json fields. I am saying your query took 87ms and not 17s
@saumya2700: Yes, Mayank, from the Pinot query console it takes some ms, but from sqlalchemy in our Python app it takes 15 secs even when we're not doing anything else, just querying the data.
@mayanks: That would be a sqlalchemy issue. My guess is it is spending time in deserializing the response.
@mayanks: What is the Pinot client you are using?
@saumya2700: pinotdb and sqlalchemy
@saumya2700: Mayank, thank you for your support. It seems my local network was messed up by the VPN; I ran the same code on the server and it returns results in ms
@zliu: Hi everyone, how do I configure Kafka SSL for Kafka clients in Pinot?
@navina: Here is an example of how to talk to Kafka with SSL -
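(The linked example isn't captured in this log. As a rough sketch only: SSL settings generally go into the table's `streamConfigs` next to the usual Kafka consumer properties and are passed through to the Kafka client; treat the property names below as illustrative and check the docs for your Kafka plugin version.)
```
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.broker.list": "broker:9093",
  "security.protocol": "SSL",
  "ssl.truststore.location": "/path/to/truststore.jks",
  "ssl.truststore.password": "<truststore-password>",
  "ssl.keystore.location": "/path/to/keystore.jks",
  "ssl.keystore.password": "<keystore-password>"
}
```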
@zliu: thanks
@janardhan.bodu: @janardhan.bodu has joined the channel
@erik.bergsten: Hi! We are trying to use tiered storage with an NFS volume mounted on "server-b". When we trigger the rebalance and segments move from server-a to server-b, we get a lot of errors like: ```Caused by: java.nio.file.FileSystemException: /var/pinot/server/data/index/environment_OFFLINE/environment_OFFLINE_1618208070664_1649743939567_7/v3/.nfs000000000134004000000058: Device or resource busy``` in the logs from server-b. Could this be a problem with how the server is implemented, or is it strictly an NFS problem on our end? The end result is that some or all segments go into an error state and the data goes missing during a rebalance.
@dlavoie: Seems like a Linux mounting / NFS issue. Pinot could be responsible for overloading the NFS service if it's not scaled as it needs to be.
@mayanks: Do server-a and server-b share the same NFS? And if so, what's the dataDir specified on this server? Wondering if both are trying to overwrite each other
#pinot-dev
@dadelcas: hey there, I've raised this issue, which I've already started looking into. Once I have some code to show, I'll get back to you on this channel
@dadelcas: @mayanks
@mayanks: Thanks @dadelcas
#presto-pinot-connector
@liandycg_slack: @liandycg_slack has joined the channel
#pinot-perf-tuning
@francois: @francois has joined the channel
#getting-started
@liandycg_slack: @liandycg_slack has joined the channel
@nikhil.varma: @nikhil.varma has joined the channel
@fizza.abid: Hello, can anyone tell me how we can connect Spark Streaming to Apache Pinot?
@mayanks: The Spark connector to Pinot is not production ready (it needs a volunteer to take it to completion). May I ask what the use case is?
@janardhan.bodu: @janardhan.bodu has joined the channel
#releases
@francois: @francois has joined the channel
#complex-type-support
@francois: @francois has joined the channel
#pinot-docsrus
@francois: @francois has joined the channel
#pinot-trino
@francois: @francois has joined the channel