#general


@avinash882: @avinash882 has joined the channel
@ashwinviswanath: Anybody know the differences in use cases between Pinot, Druid, ClickHouse, and Rockset?
  @mayanks: While all of them are modern OLAP engines for real-time analytics, where Apache Pinot stands out from the rest is: ```- Battle tested at really high throughput and low latency (> 200k read qps, 1M events ingestion) at places like LinkedIn, Uber and many more.
- Power of indexing (supports a wide variety of indexing schemes, including advanced ones like GeoSpatial, JSON, etc.)
- Capability to upsert (mutate a single row)
- Much easier to scale in/out, easier operations in general
- Much lower cost to serve as compared to others.```
  @mayanks: These observations come from community users.
@arazdan: @arazdan has joined the channel
@siddhartha.varma: @siddhartha.varma has joined the channel
@chetan: @chetan has joined the channel
@nicolas.kovacs: @nicolas.kovacs has joined the channel
@ysuo: Hi team, when HTTP Basic Auth is enabled in Pinot 0.10, some Swagger REST API calls return 403. Is there some property I can set to make sure these endpoints can be accessed with the admin account? Thanks.
@ysuo:
  @francois: Swagger is not auth compatible. Use Postman or another kind of tool to provide basic auth with the request :wink:
  @ysuo: Ok. Thanks.
  @jadami: Swagger definitely supports basic auth. The UI will come with a little lock that pops up a window to type the password in. Maybe submit an issue for this? The spec needs to have the security field set in this case
  @francois: Sorry, I misspoke: Swagger is not configured in Pinot to allow usage of basic auth :wink:
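  A minimal sketch of the "other tool" approach mentioned above - passing basic auth from curl against the controller REST API (host, port and credentials here are assumptions, substitute your own):
  ```
  curl -u admin:verysecret http://localhost:9000/tables
  ```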
@prashant.pandey: Hi Pinot folks, we observed a peculiar incident today wherein consumption stopped from just 1 partition of a topic (this topic has 96 partitions, 95 are working fine). This segment was moved from CONSUMING to OFFLINE state due to some exception during consumption. ```2022/04/06 12:06:04.179 ERROR [LLRealtimeSegmentDataManager_span_event_view_1__50__287__20220406T1205Z] [span_event_view_1__50__287__20220406T1205Z] Exception while in work
2022/04/06 12:06:04.365 INFO [FileUploadDownloadClient] [span_event_view_1__50__287__20220406T1205Z] Sending request: dConsuming?reason=java.lang.NullPointerException&streamPartitionMsgOffset=1059610656&instance=Server_server-span-event-view-realtime-7.span-event-view-realtime-headless.pinot.svc.cluster.local_8098&offset=-1&name=span_event_view_1__50__287__20220406T1205Z to controller: controller-0.controller-headless.pinot.svc.cluster.local, version: Unknown
2022/04/06 12:06:04.366 INFO [ServerSegmentCompletionProtocolHandler] [span_event_view_1__50__287__20220406T1205Z] Controller response {"isSplitCommitType":false,"streamPartitionMsgOffset":null,"buildTimeSec":-1,"status":"PROCESSED","offset":-1} for ffset=1059610656&instance=Server_server-span-event-view-realtime-7.span-event-view-realtime-headless.pinot.svc.cluster.local_8098&offset=-1&name=span_event_view_1__50__287__20220406T1205Z
2022/04/06 12:06:04.366 INFO [LLRealtimeSegmentDataManager_span_event_view_1__50__287__20220406T1205Z] [span_event_view_1__50__287__20220406T1205Z] Got response {"isSplitCommitType":false,"streamPartitionMsgOffset":null,"buildTimeSec":-1,"status":"PROCESSED","offset":-1}``` I have attached the server logs when this happened.
  @prashant.pandey: This segment is OFFLINE right now. Is there any way to move it back to CONSUMING?
  @richard892: what version are you using?
  @prashant.pandey: 0.9.1
  @prashant.pandey: It created the next segment with the same starting offset as the OFFLINE segment, and then consumption started happening again, as you can see.
  @prashant.pandey:
  @npawar: There's a periodic background task on the controller which fixes this. That's how you got the new segment. It runs every hour
  @prashant.pandey: Ah. But does it fix OFFLINE segments as well? I thought it fixes only those segments that are in ERROR state.
  @prashant.pandey: ```* Validates realtime ideal states and segment metadata, fixing any partitions which have stopped consuming, * and uploading segments to deep store if segment download url is missing in the metadata.``` Okay got it. But not sure why it suddenly errored out during consumption in the first place.
  @kharekartik: @npawar the PR for updating ideal states during segment commit should take care of this in future, right?
  @ssubrama: @prashant.pandey the offline segments will go away in a few days (default 7d I think). They only exist in the idealstate, and do not have any data. They are there for debugging purposes so that you can search logs with that segment name and find out what went wrong. The automatic correction mechanism kicks in and creates new segments from the same offset (as long as that is still available in the stream).
  @npawar: @kharekartik the existing ValidationManager logic is the one that’s expected to fix this (as it did). I don’t think the PR you’re referring to would’ve addressed this (as Prashant’s problem wasn’t about a new partition)
  @prashant.pandey: @ssubrama thanks for the response. In any such future incidents, I guess we can trigger this job manually to speed up the repair process. However, can we do something to know better what exactly went wrong during consumption? It says it’s an NPE but doesn’t tell exactly where.
  @ssubrama: It should log a stack
  @ssubrama: Our experience has been that on most occasions the OFFLINE transition is because of a transient failure in the underlying stream. There is logic to try multiple times, but it is possible that the retries fail as well. In that case, the segment turns to OFFLINE state, and a new one is created at some point in the future automatically to recover. Until that time, data will be stale on the partitions that had consumption issues.
  @ssubrama: However, there have been extremely rare occasions where the input data is either invalid or unsupported in Pinot, causing an exception. In this case, no matter how many times new segments get created, the row in question will disrupt consumption. The only way (at the moment) is to wait until this row is evicted from the stream (due to retention), and Pinot will then automatically find a new (more recent) offset to consume data from -- at the cost of data loss, of course. A metric is raised to indicate data loss.
  @ssubrama: Either way, an exception stack should be logged on the server that attempted to consume the row. You can retrieve the stack via status APIs as well, I believe (not sure about this). That said, I have seen that if there are too many exceptions, the JVM tends to log a single line after a few occurrences, so you may need to go back to locate the first exception stack log.
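  On Prashant's note about triggering the repair manually: recent controllers expose an on-demand periodic-task endpoint, so something along these lines should kick the realtime validation run (endpoint name and parameters vary by version - confirm against your controller's Swagger before relying on it):
  ```
  curl -X GET "http://<controller-host>:9000/periodictask/run?taskname=RealtimeSegmentValidationManager&tableName=span_event_view_1"
  ```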
@nicolas.kovacs: Hello everyone, I’m struggling to create a Kafka stream ingestion with the HLC consumer and a custom consumer group id. Has anyone worked on a similar case?
  @nicolas.kovacs: I got this kind of error, but I can’t figure out why, the documentation is quite poor on this subject ```Caught exception while processing resource contactEvents_REALTIME, skipping. java.lang.StringIndexOutOfBoundsException: begin 0, end 8, length 7 at java.lang.String.checkBoundsBeginEnd(String.java:3319) ~[?:?] at java.lang.String.substring(String.java:1874) ~[?:?] at org.apache.pinot.common.utils.HLCSegmentName.<init>(HLCSegmentName.java:99) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]```
  @kharekartik: Hi, can you share your table config and schema? Please remove any secrets from it if there are any.
  @nicolas.kovacs: Don’t worry, I’m using helm with env var config secret ``` { "tableName": "contactEvents", "tableType": "REALTIME", "segmentsConfig": { "timeColumnName": "created_at", "timeType": "DAYS", "retentionTimeUnit": "DAYS", "retentionTimeValue": "3650", "segmentPushType": "APPEND", "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy", "schemaName": "contactEvents", "replication": "1", "replicasPerPartition": "1" }, "tenants": { "broker": "DefaultTenant", "server": "DefaultTenant", "tagOverrideConfig": {} }, "tableIndexConfig": { "loadMode": "MMAP", "streamConfigs": { "streamType": "kafka", "stream.kafka.consumer.type": "highlevel", "stream.kafka.topic.name": "united-pg.public.contact_events", "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory", "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder", "stream.kafka.broker.list": "strimzi-kafka-bootstrap.strimzi.svc.cluster.local:9093", "stream.kafka.hlc.bootstrap.server": "strimzi-kafka-bootstrap.strimzi.svc.cluster.local:9093", "stream.kafka.hlc.zk.connect.string": "strimzi-zookeeper-client.strimzi.svc.cluster.local:2181", "stream.kafka.consumer.prop.auto.offset.reset": "earliest", "stream.kafka.hlc.group.id":"pinot", "security.protocol": "SSL", "ssl.truststore.type": "PKCS12", "ssl.keystore.type": "PKCS12", "ssl.truststore.location": "/ssl/ca.p12", "ssl.keystore.location": "/ssl/strimzi-user-pinot.p12", "ssl.truststore.password": "${SSL_TRUSTSTORE_PASSWORD}", "ssl.keystore.password": "${SSL_KEYSTORE_PASSWORD}" }, "bloomFilterColumns": [ "tenant_id", "contact_id", "content_id" ] }, "metadata": { "customConfigs": {} } }```
  @nicolas.kovacs: I tried many things so maybe some config is redundant, but while looking at the Pinot source code I spotted this option `"stream.kafka.hlc.group.id"`. I think it is causing an issue with the `HLCSegmentName` class
  @kharekartik: Yes, that is the cause of the issue. Will it be possible for you to use the `lowlevel` consumer? I am meanwhile looking for a fix. I guess just using a longer groupId string should fix it, but I need to verify.
  @nicolas.kovacs: In my case, I really need a consumer group so I can’t use the `lowlevel` consumer type right now :confused: I will try with a bigger group id (`apachepinot` instead) and let you know
  @kharekartik: The error seems to be in `PinotTableIdealStateBuilder` class. Working on alternative/fix.
  @npawar: why do you particularly need a consumer group? HLC has been deprecated.. here are some reasons why we did that:
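  For reference, a rough sketch of the streamConfigs changes for moving the table above to the low-level consumer - the hlc.* properties and the consumer group go away, since LLC tracks offsets in Pinot's own segment metadata (SSL, decoder and factory settings stay as they are):
  ```
  "stream.kafka.consumer.type": "lowlevel",
  "stream.kafka.topic.name": "united-pg.public.contact_events",
  "stream.kafka.broker.list": "strimzi-kafka-bootstrap.strimzi.svc.cluster.local:9093",
  "stream.kafka.consumer.prop.auto.offset.reset": "earliest"
  ```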
@cyrusnew456: @cyrusnew456 has joined the channel
@dh258: Hello everybody. Does Pinot support data export out of Pinot via bulk extract or CDC?
  @mayanks: Not at the moment, but this can be implemented easily via the grpc endpoint. Would you like to contribute?
@facundo.bianco: Hi All, do you know how Pinot stores data between brackets? (not JSON). Let me explain: I have this data ```id,timestamp,application
1,1649268351,"{'app_name': 'foo', 'version': '1.0.0', 'app_id': None, 'business': 'ponzico'}"``` And when I load that info I get ```| id | timestamp  | application       |
|----|------------|-------------------|
| 1  | 1649268351 | foo,1.0.0,ponzico |``` (In table-schema.json the "_application_" column is configured as "STRING".) Is there a way to query the "_application_" column based on one of the values inside? (i.e. `SELECT * FROM testing WHERE application.app_name = "foo"`). Thanks in advance!
  @g.kishore: what's the decoder you are using?
  @facundo.bianco: It's a parquet file
  @npawar: this is not how it is expected to be stored. can you share your schema and table config too?
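  If the payload can be produced as proper JSON (double quotes, null instead of None), one option is to keep "application" as a STRING column holding the raw JSON and query into it with the JSON functions - a sketch only, assuming the table and column names from the message above:
  ```
  SELECT id, JSON_EXTRACT_SCALAR(application, '$.app_name', 'STRING') AS app_name
  FROM testing
  WHERE JSON_EXTRACT_SCALAR(application, '$.app_name', 'STRING') = 'foo'
  ```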
@shyamalavenkatakrish: @shyamalavenkatakrish has joined the channel
@547535653: @547535653 has joined the channel

#random


@avinash882: @avinash882 has joined the channel
@arazdan: @arazdan has joined the channel
@siddhartha.varma: @siddhartha.varma has joined the channel
@chetan: @chetan has joined the channel
@nicolas.kovacs: @nicolas.kovacs has joined the channel
@cyrusnew456: @cyrusnew456 has joined the channel
@shyamalavenkatakrish: @shyamalavenkatakrish has joined the channel
@547535653: @547535653 has joined the channel

#troubleshooting


@avinash882: @avinash882 has joined the channel
@arazdan: @arazdan has joined the channel
@ysuo: Hi, I’m still testing the authentication on Pinot 0.10. With the following config, admin can log in, but user fails with an invalid username/password message. I’m sure I use the same password as in the config. Any idea of the reason? Besides, with or without this config, #controller.segment.fetcher.auth.token=Basic YWRtaW46dmVyeXNlY3JldA, admin can log in. So what is it set for?
  @francois: When I deployed it on my server I found special “hidden characters” in my copy paste :wink: Check for them.
  @ysuo: @francois Thank you. I’ll try and write it myself instead of copying it from the doc.
  @ysuo: @francois It worked. Thank you so much! :thumbsup:
  @npawar: woah I would not have guessed that :laughing: thanks @francois!
@siddhartha.varma: @siddhartha.varma has joined the channel
@chetan: @chetan has joined the channel
@bajpai.arpita746462: Hi All, we are trying to pull all distinct records from our Pinot table along with pagination, but are unable to do so. Below are the details of our use case: in our Pinot table we have a "field_a" and a "field_b", and every value in "field_a" is associated with multiple values in "field_b": ```field_a field_b
aa      12
aa      13
aa      13
bb      45
bb      67
bb      78``` We want all the unique or distinct combinations of field_a and field_b, which would be: ```field_a field_b
aa      12
aa      13
bb      45
bb      67
bb      78``` Also we have to run our query in batches of 1000 records at a time, which we are trying to achieve through pagination, but it is not giving distinct records every time. Below is the sample query: ```select field_a, field_b from table_name group by field_a, field_b limit 0, 1000``` Note: I was going through the Pinot docs and I found out that pagination does not work on group-by queries. Any suggestions on how we can achieve the above use case?
  @npawar: that’s right, no pagination support in group by. you would have to increase the limit and paginate on your side. @jackie.jxt anything you’d like to add here?
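  One workaround along those lines is to order the group-by output and page with a keyset-style filter on the client side - a sketch only, assuming the columns sort cleanly and the table is not ingesting underneath you (offsets are not stable on a live real-time table):
  ```
  SELECT field_a, field_b FROM table_name
  WHERE field_a > 'aa' OR (field_a = 'aa' AND field_b > 13)
  GROUP BY field_a, field_b
  ORDER BY field_a, field_b
  LIMIT 1000
  ```
  Here ('aa', 13) is the last combination seen on the previous page; the first page is the same query without the WHERE clause.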
@nicolas.kovacs: @nicolas.kovacs has joined the channel
@chetan: Hello team, we are evaluating Pinot and we set up Pinot on GKE with helm chart version 0.2.6-SNAPSHOT. We created an ingestionJobSpec to ingest from a GCS bucket and write segments to GCS as the deep store. Although the ingestion job succeeds, there are no apparent error/warn logs in the job output or the controller logs, and a segment tar file appears in the GCS output bucket, the Pinot console shows 0 bytes in the table even long after. What are we missing? Another question is whether the `outputDirURI` in the `ingestionJobSpec` can be the same as the `controller.data.dir`? We are assuming yes. Config files, table config and job spec attached.
  @nicolas.kovacs: Did you try the query console ?
  @chetan: Yes - it shows 0 bytes
  @nicolas.kovacs: I have a case where it shows 0 bytes but making a query shows results
  @chetan: Even queries show 0 results
  @chetan: I was able to get past this problem with two changes: the `controller.data.dir` is pointed to `` and the `controller.local.tmp.dir` is pointed to some `/tmp/dir` on the pod instead of the persistent disk directory `/PDmountdir/some/dir`. I don't know how exactly these changes fixed it though
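  For anyone hitting the same thing, the controller settings involved look roughly like this when GCS is the deep store (property names per the GCS plugin setup; bucket, project and key paths are placeholders, and the exact temp-dir key may differ across versions):
  ```
  controller.data.dir=gs://<bucket>/pinot/controller-data
  controller.local.temp.dir=/tmp/pinot-tmp-data
  pinot.controller.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
  pinot.controller.storage.factory.gs.projectId=<gcp-project>
  pinot.controller.storage.factory.gs.gcpKey=<path-to-service-account-json>
  pinot.controller.segment.fetcher.protocols=file,http,gs
  pinot.controller.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
  ```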
@andre578: Hey folks, I am trying to ingest CSVs from S3 with hourly `SegmentGenerationAndPushTask` via the Minions. All jobs keep failing with > `"INFO": "java.lang.RuntimeException: Failed to execute SegmentGenerationAndPushTask"` Can you help me find a means to trace and debug the issue? AFAIK it is not easy to monitor Minion tasks, but do you know of any hints towards where it may fail - at the config level and/or when parsing the files (although each individually seems to be valid)?
  @andre578: Let me know if I can help you help me, e.g. by providing more info.
  @npawar: Any exceptions in the minion logs before this message? The most common failure is due to the minion config not being set up for S3 access (similar to the controller and server configs)
  @npawar: this is the minion config I’m talking about
  @andre578: Thanks @npawar for trying to help, much appreciated!!! After quite some tedious bug hunting I was able to find the issue - apparently Groovy was set to disabled, so some transformations failed.
  @npawar: I see, yes, Groovy has been disabled by default starting with the latest release. Was no exception logged in the minion/controller logs?
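  In case the attachment doesn't come through: the minion-side S3 settings mirror the controller/server ones and look roughly like this (the region is a placeholder; verify the property names against your release's docs):
  ```
  pinot.minion.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
  pinot.minion.storage.factory.s3.region=us-west-2
  pinot.minion.segment.fetcher.protocols=file,http,s3
  pinot.minion.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
  ```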
  @andre578: Unfortunately not.
  @andre578: But good to know, thanks! We are on a managed cluster, so access to logs and configs is a little non-trivial.
@arazdan: Hi Team, I am trying to read data from a parquet file into a Pinot table using Spark batch ingestion. I am facing an error for a date-time STRING datatype. Here the date ('yyyy-MM-dd') is getting loaded in EPOCH format (18234) whereas I need it in the original string format with granularity: DAYS (2020-01-02). For now, I am using the derived column method and transforming it into a string using transformConfigs. With this, I am no longer able to use functions like dateTrunc('week', sql_date_entered_str, 'DAYS') ```{ "name": "sql_date_entered", "dataType": "INT", "format": "1:DAYS:EPOCH", "granularity": "1:DAYS" },
{ "name": "sql_date_entered_str", "dataType": "STRING", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd", "granularity": "1:DAYS" }``` The other way to handle it is using query transformations: ```select sql_date_entered , DATETIMECONVERT(dateTrunc('week' , sql_date_entered, 'DAYS'), '1:DAYS:EPOCH', '1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd', '1:DAYS') as week from kepler_product_pipegen``` Is there any way that I can load the date in ('yyyy-MM-dd') format and still run transformations like dateTrunc on top of it? Pinot version = 0.7.1
  @walterddr: have you tried using the `TIMESTAMP` dataType instead of STRING?
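  A sketch of what that would look like in the schema, assuming an upgrade past 0.7.1 (the TIMESTAMP data type arrived in a later release) and that the raw value can be converted to millis at ingestion; dateTrunc can then run directly on the column:
  ```
  {
    "name": "sql_date_entered",
    "dataType": "TIMESTAMP",
    "format": "1:MILLISECONDS:TIMESTAMP",
    "granularity": "1:DAYS"
  }
  ```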
@alihaydar.atil: Hello everyone, is there any way to make a query like this - with a range filter, order by, and a high limit/offset - faster? I'd appreciate any help :) 'messageTime' is the dateTimeFieldSpecs column of the table. ```select fieldA, fieldB, fieldC, messageTime from mytable where messageTime >= 1648813002065 and messageTime <= 1649245002065 order by messageTime limit 3000000, 1000000 option(timeoutMs=120000)```
  @walterddr: one config you can try is adding a range index or even configuring a sorted index on the messageTime field. But if you already use `messageTime` as your time column then the bottleneck is probably somewhere else. Could you share the query result metadata/stats returned from the broker?
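  For reference, a rough sketch of the tableIndexConfig change being suggested (column name taken from the query above); note that a 3,000,000-row offset still forces the broker to materialize and skip everything before it, so very deep pagination stays expensive regardless of indexing:
  ```
  "tableIndexConfig": {
    "rangeIndexColumns": ["messageTime"],
    "sortedColumn": ["messageTime"]
  }
  ```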
@cyrusnew456: @cyrusnew456 has joined the channel
@grace.lu: Hi team, when I was running the Spark Pinot batch ingestion with s3 parquet data as input, I noticed that the job fails when there is an empty file / non-parquet file in the input folder, but it’s very common for upstream data produced by Spark to have an empty _SUCCESS marker file in the folder. I wonder if it is possible to let the ingestion job ignore these non-parquet / empty files by changing some config, otherwise we will need to clean up the _SUCCESS file every time for Pinot ingestion jobs.
  @g.kishore: IIRC there is a file pattern to include and exclude
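  The fields being referred to live in the ingestion job spec; roughly like this (glob syntax as in the batch ingestion docs, paths are placeholders):
  ```
  inputDirURI: 's3://my-bucket/events/'
  includeFileNamePattern: 'glob:**/*.parquet'
  excludeFileNamePattern: 'glob:**/_SUCCESS'
  ```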
@luisfernandez: is ORDER BY on multiple fields not working properly in Pinot? ```SELECT product_id, COUNT(*) as views FROM table ORDER BY views, product_id DESC``` in a regular RDBMS I would expect it to order by views first and then product_id, so something like ```views, product_id
3 3
3 2
3 1``` but Pinot does something completely different, it's doing ```views, product_id
1 10
1 9
1 8``` does anyone have an idea why this may be?
  @diogo.baeder: The result seems correct to me, it's ordering by descending order on both columns, except that `views` has only value 1
  @luisfernandez: but I'm specifying I want it to be sorted by views
  @luisfernandez: first so I want products with higher counts up front
  @diogo.baeder: Are you GROUPing BY?
  @walterddr: upon checking the code, DESC applies to each individual column
  @walterddr: e.g. the syntax that will achieve what you wanted is ```ORDER BY views DESC, product_id DESC```
  @ken: Except isn’t `views` just a `COUNT(*)` without any group? So what would that mean? I think there’s a missing `GROUP BY product_id`
  @luisfernandez: oooo so I just have to specify on both
  @walterddr: I believe this is common practice, upon searching on stackoverflow. I don’t know however whether this is ANSI standard.
  @walterddr: “this” as in “specifying DESC/ASC on each column individually”
  @luisfernandez: cool cool thanks I will test it out
  @walterddr: but what @ken mentioned is another point to investigate; upon checking, your SQL wouldn’t compile since it is unclear what you want to count(*) on
  @luisfernandez: yea sorry, I think I mis-copied the query, the group by is missing
  @walterddr: that’s what I’d guessed :laughing:
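  Putting the thread together, the query with the missing GROUP BY restored and an explicit direction on each sort key would be:
  ```
  SELECT product_id, COUNT(*) AS views
  FROM table
  GROUP BY product_id
  ORDER BY views DESC, product_id DESC
  ```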
@shyamalavenkatakrish: @shyamalavenkatakrish has joined the channel
@547535653: @547535653 has joined the channel

#pinot-k8s-operator


@tozhang: Is there any advice on JVM options and resource configuration for a production-quality deployment?
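A rough starting point, via the helm chart's per-component jvmOpts values - the heap sizes below are placeholders to tune against observed usage, not recommendations:
```
controller:
  jvmOpts: "-Xms1G -Xmx2G -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
broker:
  jvmOpts: "-Xms2G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
server:
  jvmOpts: "-Xms4G -Xmx8G -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
```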

#pinot-dev


@dadelcas: hi @richard892, @walterddr thanks for reviewing this PR. I've now rebased it and moved the converters to the schema utils class. I've left a comment with regard to the config that I thought would be easier to discuss over slack
  @dadelcas: the conversions should be enabled on demand to avoid affecting existing ingestions, so I need to pass a config prop down to the record reader. I was wondering if anyone is working on this feature at the moment
  @dadelcas: I was also wondering how the test data for these classes was generated so I can top it up
  @walterddr: looks good to me. I think you can create an AvroRecordExtractorConfig in the same PR.
  @dadelcas: I've done this now but I can't figure out how this class is created for a realtime table
  @dadelcas: From what I've seen it doesn't seem like this is possible at the moment, I may need to do some refactoring
  @walterddr: I am not sure I understood the issue that would require a refactoring. Would you mind commenting on the PR so we can take a look please?
@saurabhd336: @saurabhd336 has joined the channel
@cyrusnew456: @cyrusnew456 has joined the channel
@kharekartik: Has anyone here tried compiling pinot on M1 mac?
@mayanks: @saurabhd336 has done this, @kharekartik
@jadami: I was unsuccessful a few weekends ago, but admittedly did not spend more than 15 minutes on it. I went back to an Intel mac
@saurabhd336: @kharekartik @jadami the master branch should compile just fine now. The required pom changes were merged recently. Beyond that, I just made sure I used the right JDK (an official OpenJDK 11 build isn't available as such for M1 yet, so I had to use ). It compiled for me with this.
@jadami: Nifty and ty for that. I'll give it another try this weekend
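For completeness, once an ARM-capable JDK 11 is on the PATH the build itself is the standard Maven invocation (the -Pbin-dist profile is what the Pinot build docs use; drop it if you only need the jars compiled):
```
mvn clean install -DskipTests -Pbin-dist
```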

#announcements


@cyrusnew456: @cyrusnew456 has joined the channel

#getting-started


@avinash882: @avinash882 has joined the channel
@arazdan: @arazdan has joined the channel
@siddhartha.varma: @siddhartha.varma has joined the channel
@siddhartha.varma: hey! I was trying to run Pinot + ThirdEye, but the ThirdEye docker image does not run. I get `ERR_EMPTY_RESPONSE` every time… has it not been updated?
  @mayanks: @pyne.suvodeep ^^
  @pyne.suvodeep: Hey @siddhartha.varma Can you share the error? The OSS env has a few issues. We’ll be launching a bunch of updates at the end of the month. Meanwhile @cyril, would it be possible for you to share your fork/env?
  @cyril: Hey @siddhartha.varma, can you try with this fork? You can go through the quickstart here: Once you’re good with the quickstart, plugging in your Pinot instance should be a matter of updating the data source config:
@chetan: @chetan has joined the channel
@nicolas.kovacs: @nicolas.kovacs has joined the channel
@cyrusnew456: @cyrusnew456 has joined the channel
@cyril: @cyril has joined the channel
@shyamalavenkatakrish: @shyamalavenkatakrish has joined the channel
@547535653: @547535653 has joined the channel