#general


@ysuo: Hi, I have a question. What's the name of the consumer group in Kafka when Pinot ingests Kafka stream data?
  @mayanks: Pinot doesn’t use consumer groups for consumption
@akumar: Hi, I am going to set up Apache Pinot for production on AWS EKS, and I want a multi-AZ setup. How would I manage the data, given that EBS volumes only work within a single AZ? Can we use EFS? Please suggest; the setup should be scalable and fault tolerant. Thanks
  @mayanks: I recommend that a single Pinot cluster not span across AZs. You can replicate the cluster in a different AZ by replicating the data pipelines, for DR.
  @mayanks: Within an AZ, you can still have replication on the table for scalability as well as fault tolerance
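  For reference, a minimal sketch of the per-table replication @mayanks describes, as it would appear in the `segmentsConfig` block of a table config (the field values and time column are illustrative; `replication` applies to offline segments and `replicasPerPartition` to real-time consuming segments):
  ```
  "segmentsConfig": {
    "timeColumnName": "ts",
    "replication": "2",
    "replicasPerPartition": "2"
  }
  ```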
  @akumar: Thanks @mayanks. Is there any document that guides me through setting up multi-AZ data replication?
  @diogo.baeder: I think he meant replication in a single zone, like on multiple nodes. For multiple zones though you could manually structure your data ingestion to be done at all the desired zones. For example publishing events to multiple Kafka instances.
  @mayanks: Yes, that's what I meant. For streaming, something like Kafka MirrorMaker will replicate the Kafka topics. For offline ingestion, the same pipeline can push data to multiple Pinot clusters in different AZs
  @mayanks: cc: @mark.needham
@ysuo: If I want to run the pinot-admin.sh StartController command using args instead of a config file, how can I set properties like controller.host, controller.data.dir, etc.?
  @kharekartik: Yes, that is possible. You can see the supported CLI configs here -
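  For reference, a sketch of starting the controller with CLI args only; the exact flag names can vary slightly by version, so `pinot-admin.sh StartController -help` is the authoritative list (host, port, and paths below are illustrative):
  ```
  bin/pinot-admin.sh StartController \
    -zkAddress localhost:2181 \
    -clusterName PinotCluster \
    -controllerHost controller-1.example.com \
    -controllerPort 9000 \
    -dataDir /var/pinot/controller/data
  ```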
  @ysuo: what about pinot.server.instance.id property?
  @kharekartik: That is not possible, I think; you will have to use a config file for that. Can I ask why you want to avoid a config file? Is it due to limited access to the environment, security reasons, or something else?
  @ysuo: It's not a problem if I build a Docker image and put the config file in it. Otherwise, I have to use a data volume.
  @ysuo: I think it's not necessary to set instanceId if hostname and port are already set, so it's no longer a problem for me. :sweat_smile: Thanks @kharekartik
  @kharekartik: Yeah, it is not necessary. By default, it will use host and port
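  For context, the generated instance ID typically follows the `<Type>_<host>_<port>` convention, so `pinot.server.instance.id` is only needed to override it. A hypothetical override in the server config file, if one were ever required (host and port are illustrative):
  ```
  pinot.server.instance.id=Server_pinot-server-0.example.com_8098
  ```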
@bsharma: @bsharma has joined the channel
@jfarrelly: @jfarrelly has joined the channel
@diana.arnos: Is there something similar to `RealtimeToOfflineTask` for an upsert table?
  @mayanks: The upsert feature works on real-time-only tables as of now
  @diana.arnos: Thanks. So I need to find another way to relieve memory pressure =/
  @yupeng: there are some practices on upsert table mem management:
@dmansh: @dmansh has joined the channel
@ysuo: Hi team, I'm a little confused. My table stopped consuming data from a Kafka topic, but a new table with the same schema and table config can ingest data from the same topic. There's new data consistently written to this Kafka topic. Any idea why this is happening? I've tried restarting the controller, broker, and server nodes, but it didn't help.
  @mayanks: Check the debug endpoint and server logs
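  For reference, a minimal sketch of querying the controller's table debug API (host, port, and table name are illustrative; the exact path and query params are listed in the controller's Swagger UI):
  ```
  curl "http://localhost:9000/debug/tables/myTable?type=REALTIME"
  ```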

#random


@bsharma: @bsharma has joined the channel
@jfarrelly: @jfarrelly has joined the channel
@dmansh: @dmansh has joined the channel

#troubleshooting


@kaushalaggarwal349: can anyone help with this?
  @mayanks: What version of java?
  @kaushalaggarwal349:
  @mayanks: Hmm, this is a JVM crash (SIGSEGV). Your logs don't say much.
  @kaushalaggarwal349: is this a usual thing? such high CPU usage?
  @mayanks: No. Did you create a realtime table and point it to a kafka stream ingesting from smallest?
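  For context, the "ingesting from smallest" behavior mentioned above is driven by the table's streamConfigs; a partial sketch (topic name illustrative):
  ```
  "streamConfigs": {
    "streamType": "kafka",
    "stream.kafka.topic.name": "myTopic",
    "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
  }
  ```
  With `smallest`, a new table starts from the earliest available offsets and consumes the whole backlog at once, which is what tends to produce the CPU/memory spike.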
  @diana.arnos: I'm facing the same issue. I'm creating a new upsert table that's consuming a pretty loaded Kafka topic. I noticed that servers die when using around 70% of available memory. I still haven't found a nice way to deal with this :cry:
  @mayanks: There’s a config to throttle ingestion cc: @walterddr.
  @mayanks: @moradi.sajjad have we documented the config? If so, could you share the link?
  @walterddr: I think it was released correctly (it's in the release notes), but let me find the doc
  @walterddr: Nope, it is not documented in pinot-docs. Creating a PR
  @moradi.sajjad: @walterddr if you haven't started adding the docs, I can quickly add it here:
  @walterddr: yeah I haven't. Please add it if you have access
  @moradi.sajjad: It's updated now.
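  For reference, my understanding of the throttle discussed above (unverified; the doc @moradi.sajjad added is authoritative, and the key name here is an assumption) is that it is a streamConfigs entry along these lines, with the limit expressed in rows per second:
  ```
  "streamConfigs": {
    "topic.consumption.rate.limit": "1000"
  }
  ```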
@bajpai.arpita746462: Hi all, I am trying to read data from an Avro file into a Pinot table using batch ingestion, and I am facing an error for the STRING datatype. For now, I am providing null for the STRING field in the Avro input file, and it gives me the error below: java.lang.RuntimeException: Caught exception while extracting data type from field:abcd Caused by: java.lang.UnsupportedOperationException: Unsupported Avro type: NULL. In the schema I have provided the below config for the field: { "name": "abcd", "dataType": "STRING", "defaultNullValue": "none" }. I tried with and without "defaultNullValue"; in both cases I get the same error mentioned above. I am passing the value of field "abcd" as null in the Avro input file. Can anyone help me with this?
  @mayanks: Looking at the code, it seems that you defined the type of data in AVRO as NULL instead of STRING
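  For reference, the usual Avro way to express a nullable string is a union of `null` and `string` in the Avro schema, rather than type NULL; a minimal sketch of the field declaration in the Avro schema (not the Pinot schema):
  ```
  {
    "name": "abcd",
    "type": ["null", "string"],
    "default": null
  }
  ```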
@bsharma: @bsharma has joined the channel
@ryantle1028: Hi team, I'm trying to enable TLS/SSL following this link --> and I found some issues. 1. After I start the server, why is the HTTP port 8098 still running? #tls/ssl pinot.server.tls.keystore.path=/data/apache-pinot/cert/poc-pinot02.server.keystore.jks pinot.server.tls.keystore.password=hellorealtime pinot.server.tls.truststore.path=/data/apache-pinot/cert/poc-pinot02.server.truststore.jks pinot.server.tls.truststore.password=hellorealtime #pinot.server.netty.enabled=true #pinot.server.netty.port=8098 pinot.server.nettytls.enabled=true pinot.server.nettytls.port=8089 pinot.server.adminapi.access.protocols=https #pinot.server.adminapi.access.protocols.http.port=8097 pinot.server.adminapi.access.protocols.https.port=7443 2. After I create a table, I cannot query via the web UI; the error is --> ProcessingException(errorCode:450, message:InternalError: java.net.SocketException: Unexpected end of file from server at java.base/sun.net.(HttpClient.java:866) at java.base/sun.net.(HttpClient.java:689) at java.base/sun.net.(HttpClient.java:863) at java.base/sun.net.(HttpClient.java:689)). Has anyone enabled TLS/SSL in production? Please share an example config.
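For reference on point 1, a hedged sketch using the keys already shown above: the plain HTTP Netty port generally keeps listening unless it is explicitly turned off, so something along these lines (verify against the TLS docs for your version):
```
pinot.server.netty.enabled=false
pinot.server.nettytls.enabled=true
pinot.server.nettytls.port=8089
pinot.server.adminapi.access.protocols=https
pinot.server.adminapi.access.protocols.https.port=7443
```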
@francois: Hi :slightly_smiling_face: I have a misunderstanding about something that is giving me a lot of trouble. I have a nested array in my JSON. I've managed to flatten it using the complexTypeConfig. All sounds good, and it generates all the rows with correct values. But I'm trying to rename the generated cols using either a groovy function or a JSON path string, and neither of them is working :disappointed: Ingestion config: ``` "ingestionConfig": { "transformConfigs": [ { "columnName": "type", "transformFunction": "JSONPATHSTRING(data,'$.type')" } ], "complexTypeConfig": { "fieldsToUnnest": [ "data.attributes.actualExpenses" ], "delimiter": "." } },``` The schema is already defined with type STRING
  @francois: Output
  @francois: I've also tried with a groovy function, same result :disappointed:
  @francois: I have the strange feeling there's an ordering like transformConfigs -> complexTypeConfig, so all my transformConfigs are useless and return null values. Am I right?
  @mark.needham: Do you have a field called 'data' in the schema?
  @mark.needham: I think that's what the transform fn is trying to do - read `.type` from `data`
  @francois: Yes. My only goal is to rename data.type to type. It works like a charm with no unnest, but gives null values with unnest
  @mark.needham: oh I see
  @francois: Plus some other modifications in transformConfigs, to parseDateToTimestamp etc., but they all return null :confused: As if the field does not exist
  @mark.needham: ``` "ingestionConfig": { "transformConfigs": [ { "columnName": "type", "transformFunction": "\"data.type\"" } ], "complexTypeConfig": { "fieldsToUnnest": [ "data.attributes.actualExpenses" ], "delimiter": "." } },```
  @mark.needham: give this a try
  @francois: This is working :slightly_smiling_face:
  @francois: So I have a “” that I need to protect :slightly_smiling_face:
  @mark.needham: you might not need the inside quotes. I put them there just in case
  @francois: ok thank you for your help :slightly_smiling_face:
  @francois: This works like a charm but leads to another issue, maybe linked. ```Select * from mytable``` Perfect, it works. ```Select "mycol.otherName" from mytable``` Throws ```[ { "message": "UnknownColumnError:\norg.apache.pinot.spi.exception.BadQueryRequestException: Unknown columnName 'actualExpenses.activityType' found in the query\n\tat org.apache.pinot.broker.requesthandler.BaseBrokerRequestHandler.getActualColumnName(BaseBrokerRequestHandler.java:1604)\n\tat org.apache.pinot.broker.requesthandler.BaseBrokerRequestHandler.fixColumnName(BaseBrokerRequestHandler.java:1538)\n\tat org.apache.pinot.broker.requesthandler.BaseBrokerRequestHandler.updateColumnNames(BaseBrokerRequestHandler.java:1421)\n\tat org.apache.pinot.broker.requesthandler.BaseBrokerRequestHandler.handleSQLRequest(BaseBrokerRequestHandler.java:254)", "errorCode": 710 } ]```
  @mark.needham: that column definitely exists?
  @francois: yes
  @francois:
  @mark.needham: hmmm, not sure
  @francois: Not sure of what? The existence of the column?
  @mark.needham: not sure why it's not working!
  @francois: ah :slightly_smiling_face:
  @mark.needham: @mayanks might know
  @mayanks: Are these two separate columns or is this a nested JSON column? If the latter, you probably want to use JSON functions.
  @francois: Two separate columns, but all in the same nested structure inside data.attributes.actualExpenses. By JSON functions you mean? I need to aggregate a few metrics stored in this structure (30 rows per payload)
  @mayanks: Wait, are you trying to rename a column of an existing table? That would be a backward incompatible change, and is not allowed.
  @francois: Nope brand new table :wink:
  @mayanks: @jackie.jxt can we chain flatten and then rename a column today?
  @jackie.jxt: Yes, flatten happens first before other transforms
  @jackie.jxt: @francois Can you show the raw json response for the `select *` query? Want to make sure the column name matches
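  As an illustration of the JSON-function route @mayanks mentioned earlier in the thread, a sketch only, assuming the nested payload were kept in a single JSON/STRING column named `data` instead of being unnested (the column and JSON path names here are hypothetical):
  ```
  SELECT JSON_EXTRACT_SCALAR(data, '$.attributes.actualExpenses[0].activityType', 'STRING', 'null')
  FROM mytable
  ```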
@jfarrelly: @jfarrelly has joined the channel
@luisfernandez: Pinot does not use Spring, right? Asking because of this
  @dlavoie: Nope, the CVE is very specific to apps packaged as WAR and deployed in a standalone tomcat server.
  @dlavoie: FYI, regular Spring Boot applications built and executed as JARs are not impacted:
  @dlavoie: > If the application is deployed as a Spring Boot executable jar, i.e. the default, it is not vulnerable to the exploit
  @luisfernandez: Yep, only if it's standalone Tomcat, not embedded
  @luisfernandez: thank you for your answer :pray:
  @dlavoie: but regardless, Pinot has no Spring dependency
@diana.arnos: The broker is failing to find the available servers for a lot of segments. This is the log message: ```Failed to find servers hosting segment: <segment> for table: <tableName>_REALTIME (all ONLINE/CONSUMING instances: [] and OFFLINE instances: [] are disabled, counting segment as unavailable)``` How can I: 1- Make the brokers find the segments? 2- If 1 is not possible, how can I make the servers download or fetch all the missing segments?
  @mayanks: You can try (Swagger APIs): ```- Rebuild routing table for broker: in case the routing table got out of sync for some reason. - Reload API to reload segments. - Use the debug API to see if there are any issues it surfaces.```
  @mayanks: You can also check what external view says for these segments (if there are servers assigned to it in ONLINE state).
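  For reference, hedged sketches of the corresponding controller calls (the exact paths are easiest to confirm in the Swagger UI; host and table name are illustrative):
  ```
  # reload all segments of the realtime table
  curl -X POST "http://localhost:9000/segments/myTable_REALTIME/reload"

  # rebuild the broker resource (routing) for the table
  curl -X POST "http://localhost:9000/tables/myTable_REALTIME/rebuildBrokerResourceFromHelixTags"
  ```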
  @diana.arnos: ```Rebuild routing table for broker: In case routing table got out of sync for some reason.``` I don't see this in Swagger and I couldn't find it in the docs either. Swagger has `tables/tableNameWithType/rebuildBrokerResourceFromHelixTags`, but this one only runs if I update the broker's tags.
  @luisfernandez: I also had this issue before, and what I did was restart the servers
@abhinav.wagle1: Trying to run the QueryRunner class with the following command, and am seeing the output below. ```java -jar pinot-tool-launcher-jar-with-dependencies.jar QueryRunner -mode singleThread -queryFile test.q -numTimesToRunQueries 0 -numIntervalsToReportAndClearStatistics 5 -brokerHost <host-name> ```
  @abhinav.wagle1: ```WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance. ERROR StatusLogger Unrecognized format specifier [d] ERROR StatusLogger Unrecognized conversion specifier [d] starting at position 16 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [thread] ERROR StatusLogger Unrecognized conversion specifier [thread] starting at position 25 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [level] ERROR StatusLogger Unrecognized conversion specifier [level] starting at position 35 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [logger] ERROR StatusLogger Unrecognized conversion specifier [logger] starting at position 47 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [msg] ERROR StatusLogger Unrecognized conversion specifier [msg] starting at position 54 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [n] ERROR StatusLogger Unrecognized conversion specifier [n] starting at position 56 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [d] ERROR StatusLogger Unrecognized conversion specifier [d] starting at position 16 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [thread] ERROR StatusLogger Unrecognized conversion specifier [thread] starting at position 25 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [level] ERROR StatusLogger Unrecognized conversion specifier [level] starting at position 35 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [logger] ERROR StatusLogger Unrecognized conversion specifier [logger] starting at position 47 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [msg] ERROR StatusLogger Unrecognized conversion specifier [msg] starting at position 54 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [n] ERROR StatusLogger Unrecognized conversion specifier [n] starting at position 56 in conversion pattern. WARNING: An illegal reflective access operation has occurred WARNING: Illegal reflective access by org.codehaus.groovy.reflection.CachedClass (file:/Users/awagle/DataPlatform/pinot/pinot-tools/target/pinot-tool-launcher-jar-with-dependencies.jar) to method java.lang.Object.finalize() WARNING: Please consider reporting this to the maintainers of org.codehaus.groovy.reflection.CachedClass WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release %d [%thread] %-5level %logger - %msg%nUsage: QueryRunner```
  @abhinav.wagle1: Is there a doc available on how to use QueryRunner?
  @abhinav.wagle1: Am I doing something wrong by providing BrokerHost?
  @abhinav.wagle1: test.q ```select * from airlineStats```
@dmansh: @dmansh has joined the channel

#pinot-dev


@raluca.lazar: @raluca.lazar has joined the channel
@ken: Thought this was a very interesting article on fuzzing a DB to find bugs:

#getting-started


@bsharma: @bsharma has joined the channel
@bsharma: Hi - If I have a large set of S3 files in an object store, what is a good way to run query analysis on them? Can Pinot be used to solve this use case? The S3 files aren't fully static... they can change from time to time. Would love to be pointed at something
  @bsharma: I came across this tutorial: . But wondering: 1) What happens when the data in the S3 files changes? Do we just re-run the Spark jobs to re-ingest the data on some cadence... there will be times when the data is stale, no? 2) In this tutorial, can Spark be replaced with, say, Flink?
  @mark.needham: If the S3 files change, you could manually delete the segments from Pinot and reingest the data. Depending on what the data looks like, the segments might actually get automatically replaced.
  @mark.needham: At the moment batch ingestion supports Spark/Hadoop/standalone, so not Flink for now.
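  For reference, a hedged sketch of the delete-and-reingest path @mark.needham describes (the endpoint, table name, and job spec file are assumptions; the controller Swagger UI and the batch ingestion docs have the exact shapes):
  ```
  # drop the existing offline segments for the table
  curl -X DELETE "http://localhost:9000/segments/myTable?type=OFFLINE"

  # re-run the ingestion job against the updated S3 files
  bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile ingestion-job-spec.yaml
  ```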
@jfarrelly: @jfarrelly has joined the channel
@dmansh: @dmansh has joined the channel