#general


@jurio0: @jurio0 has joined the channel
@priyam: @priyam has joined the channel
@vaibhav.gupta: Hi, I have a question regarding fields with low cardinality (fields like transaction_status). Should we keep those as integers in Pinot and maintain enums on the application end, or can we keep the field values as strings in Pinot, given that Pinot's forward index already optimizes for low-cardinality string fields? Will the *lookup* and *aggregation* performance for string values be nearly the same as with integers?
  @g.kishore: Pinot will optimize it by creating a dictionary
  @g.kishore: What kind of aggregation do you plan to perform on enum column
  @vaibhav.gupta: These queries: `select count(1), status from trips group by status` and `select * from trips where status = 'Early'`
  @jackie.jxt: Yes, the performance should be fairly close (int is slightly faster) if you use dictionary encoding and add inverted index to the column
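  For reference, Pinot creates a dictionary for the column by default, so adding the inverted index is the only explicit change needed. A minimal sketch of the relevant table-config section, assuming the `trips` table and `status` column from the example queries above:
  ```
  {
    "tableIndexConfig": {
      "invertedIndexColumns": ["status"]
    }
  }
  ```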
@ryan: @ryan has joined the channel
@diogo.baeder: Hi folks! I'd like to ask some questions about the next release (0.9.0): • Is it going to include any sort of table truncation, even if rudimentary? • Is there a rough estimate of when it will be released?
  @mayanks: Hi @diogo.baeder, There's a delete all segments api you could use? For the release we have started the discussions already (so a few more weeks). Is there anything in particular you are looking for in the next release?
  @diogo.baeder: Hey man! I tried using the "delete all segments" endpoint but that didn't seem to work well for me - after deleting them Pinot didn't automatically create one afterwards, leaving me with no way of ingesting new data. So I found this method to not be very reliable, and instead having some sort of "truncation" mechanism that is just a straightforward way of cleaning up a whole table would be the ticket for me. I was pointed to the code to try to look at how to create new segments manually but I found that to be very confusing, especially since my project is in Python and not Java, so it's harder to translate what has to be done.
  @diogo.baeder: RDBMSs in general have this truncation feature which is beautiful for when we screw up with not-so-relevant data or are just testing stuff, and this is exactly what I need, but I don't really want to have to dig deep into Pinot's internals to be able to accomplish that.
  @g.kishore: @ssubrama ^^
  @g.kishore: @diogo.baeder can you please upvote the truncate issue? We discussed this multiple times but did not decide on a solution because there were too many error conditions, especially with real-time tables... maybe we can start off with supporting truncate for offline tables
  @ssubrama: It seems to me (from "after deleting them Pinot didn't automatically create one afterwards") that @diogo.baeder is trying to truncate a realtime table
  @ssubrama: One way to do this is to drop the table and re-create it.
  @ssubrama: But yes, please upvote the issue on table truncation.
  @g.kishore: I think that one was also non-deterministic because deletion is asynchronous and segments might still exist
  @diogo.baeder: I tried dropping the table and then recreating it, but this didn't work reliably either - after recreating the table I was getting multiple segments in it, which was unexpected. And this was in an integration test, so it was being done in a very controlled way, including checking when the table was deleted and when it was created again so that I could add data again.
  @diogo.baeder: But I'll upvote, yes. Thanks, guys! :slightly_smiling_face:
  @ssubrama: So, dropping a table takes a while. You need to make sure that all the segments are gone from the externalview. I agree this is not the best way to do it. A better way would be for the "drop" command to wait until segments have disappeared from the externalview before returning to the user. Once the table is fully dropped, you should be able to recreate the table.
  @diogo.baeder: Hmmm, got it. Well, that would be a great improvement, I like that. Because then the behavior would be more deterministic, IMO. If that gets implemented, it could be a "poor man's truncation" initially - at least from a testing perspective.
  @diogo.baeder: The part that was most surprising for me was that I dropped a table, recreated it and it still had segments - as if the table was never deleted -, so this is the part I don't like. Therefore, making the table delete endpoint actually wait until it's all clear for that table sounds like a good idea to me. (Hopefully it doesn't timeout though.)
  @ssubrama: That is a good thing to have, but it must be designed carefully. Maintaining state inside of the API means that the API should be idempotent. If the controller (for example) got restarted while it was in the middle of executing the API, then a second invocation of the same API should be able to pick up where it left off. I think table drop is do-able in this fashion; I am not sure about the `truncate` API (see the notes in the issue).
  @diogo.baeder: Ah, got it. Makes sense to me, good point.
@mingfeng.tan: @mingfeng.tan has joined the channel
@navi.trinity: @navi.trinity has joined the channel
@stuart.coleman: @stuart.coleman has joined the channel
@stuartcoleman81: @stuartcoleman81 has joined the channel
@alihaydar.atil: Hey Everyone :raised_hand_with_fingers_splayed: I just started using Pinot and I'm loving it already! I am using stream ingestion with Apache Kafka to import data. I just can't get a grasp on the deep storage concept. I don't have an HDFS or cloud storage setup currently. • Do segments still get flushed to local disk periodically in the absence of a deep storage? • If I restart my Pinot Servers, would they recover old segments in the absence of a deep storage? • What is the main purpose of a deep storage in a Pinot cluster setup? • Is deep storage a must in a production Pinot cluster setup? I would appreciate it if you could share your knowledge with me.
  @mayanks: Welcome @alihaydar.atil! You can think of the deep store as the persistent store that keeps a backup copy of the data ingested into Pinot. • Serving nodes flush data to "local" disk periodically, but that is their local copy; a segment goes through a "commit" protocol that involves saving a copy of the data in the deep store before it is considered committed into Pinot. • The local disk attached to Pinot servers is not viewed as a persistent store. However, that is where a Pinot server first looks to load the data (that the Controller asks it to load in IdealState); only if it doesn't find it locally will it download it from the deep store. • As mentioned above, the deep store is used as the persistent copy of the data ingested into Pinot, for servers to download (note that new servers may join the cluster), for disaster recovery, etc. • Yes, in a production setup it is recommended to have storage that is shared across the controllers. It could be something like NFS, or something like S3, ADLS, GCS, etc.
  @alihaydar.atil: Thanks for your detailed answer @mayanks. The matter is clear to me now:blush:
@nsanthanam: @nsanthanam has joined the channel

#random


@jurio0: @jurio0 has joined the channel
@priyam: @priyam has joined the channel
@ryan: @ryan has joined the channel
@mingfeng.tan: @mingfeng.tan has joined the channel
@navi.trinity: @navi.trinity has joined the channel
@stuart.coleman: @stuart.coleman has joined the channel
@stuartcoleman81: @stuartcoleman81 has joined the channel
@nsanthanam: @nsanthanam has joined the channel

#troubleshooting


@jurio0: @jurio0 has joined the channel
@priyam: @priyam has joined the channel
@mapshen: for realtime segments, do we need to specify both `replication` and `replicasPerPartition`? Currently we only have `replicasPerPartition` set to 2, but in the segment builds their configs still show the number of replicas as 1
  @npawar: only replicasPerPartition should be enough. What do you mean by config showing num replicas as 1 in segment builds? are you ending up with only 1 replica in ideal state for CONSUMING/ONLINE segments?
  @mapshen: Yes, that's what I meant. Do you know how/where to check the number of replicas in existence? Perhaps I was looking in the wrong place. It's the PropertyStore->Segment in the Zookeeper Browser that I checked.
  @npawar: In that case, to unblock yourself just set it in `replication` as well. It definitely works when both are set. Separately we can investigate why it doesn’t work with just `replicasPerPartition`
  @mapshen: I might be onto something. It seems whatever is shown in the PropertyStore is not in effect. I went to check the IdealStates for this table and it just returned a bunch of garbled characters… And I found several tables had this problem
  @mapshen: After dropping the table configs and recreating them, `replicasPerPartition` now works as expected….
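  For anyone hitting the same issue, a rough sketch of the `segmentsConfig` section with both properties set, as suggested above (the replica count is a placeholder):
  ```
  {
    "segmentsConfig": {
      "replication": "2",
      "replicasPerPartition": "2"
    }
  }
  ```
  In general, `replicasPerPartition` governs realtime (consuming) segments while `replication` governs offline segments, so keeping both at the same value avoids surprises.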
@ryan: @ryan has joined the channel
@mingfeng.tan: @mingfeng.tan has joined the channel
@navi.trinity: @navi.trinity has joined the channel
@stuart.coleman: @stuart.coleman has joined the channel
@stuartcoleman81: @stuartcoleman81 has joined the channel
@stuartcoleman81: hey - I am trying to use the support for complex types added in 0.8.0 for Avro. When I run the example command `bin/pinot-admin.sh AvroSchemaToPinotSchema -timeColumnName fields.hoursSinceEpoch -avroSchemaFile /tmp/test.avsc -pinotSchemaName myTable -outputDir /tmp/test -fieldsToUnnest entries` with the schema in the PR (), I get the exception below - any idea what I am doing wrong?
```
Exception caught: java.lang.RuntimeException: Caught exception while extracting data type from field: entries
    at org.apache.pinot.plugin.inputformat.avro.AvroUtils.extractFieldDataType(AvroUtils.java:252) ~[pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-c4ceff06d21fc1c1b88469a8dbae742a4b609808]
    at org.apache.pinot.plugin.inputformat.avro.AvroUtils.getPinotSchemaFromAvroSchema(AvroUtils.java:69) ~[pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-c4ceff06d21fc1c1b88469a8dbae742a4b609808]
    at org.apache.pinot.plugin.inputformat.avro.AvroUtils.getPinotSchemaFromAvroSchemaFile(AvroUtils.java:148) ~[pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-c4ceff06d21fc1c1b88469a8dbae742a4b609808]
    at org.apache.pinot.tools.admin.command.AvroSchemaToPinotSchema.execute(AvroSchemaToPinotSchema.java:99) ~[pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-c4ceff06d21fc1c1b88469a8dbae742a4b609808]
    at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:166) [pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-c4ceff06d21fc1c1b88469a8dbae742a4b609808]
    at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:186) [pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-c4ceff06d21fc1c1b88469a8dbae742a4b609808]
Caused by: java.lang.IllegalStateException: Not one field in the RECORD schema
    at shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:444) ~[pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-c4ceff06d21fc1c1b88469a8dbae742a4b609808]
    at org.apache.pinot.plugin.inputformat.avro.AvroUtils.extractSupportedSchema(AvroUtils.java:280) ~[pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-c4ceff06d21fc1c1b88469a8dbae742a4b609808]
    at org.apache.pinot.plugin.inputformat.avro.AvroUtils.extractFieldDataType(AvroUtils.java:247) ~[pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-c4ceff06d21fc1c1b88469a8dbae742a4b609808]
    ... 5 more
```
  @npawar: @yupeng ^
@nsanthanam: @nsanthanam has joined the channel

#getting-started


@bagi.priyank: @bagi.priyank has joined the channel
@bagi.priyank: hello, I am just getting started. I am trying to consume Avro records from a Kafka 2.x stream which doesn't use a schema registry. Does this look correct? ```"stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.KafkaAvroMessageDecoder", "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory"``` The table status says bad in the cluster manager and I am trying to figure out what I am missing. I am looking at the code on GitHub, and it seems like I need to provide a schema for parsing; however, there is a comment saying not to use the schema as it will be dropped in a future release. Any pointers will be greatly appreciated. Thanks in advance!
@bagi.priyank: Should I use `SimpleAvroMessageDecoder`? Even that one has the same comment: ```Do not use schema in the implementation, as schema will be removed from the params```
@bagi.priyank: I am using version 0.7.1 with Java 8
@niteeshhegde: @niteeshhegde has joined the channel
@niteeshhegde: Hi, I am new to Pinot. Can I ingest data into Pinot from Postgres logs?
@bagi.priyank: this one didn't work either. :(
@npawar: if you don't have a schema registry, you need to provide the schema as a config in the stream config @bagi.priyank
  @bagi.priyank: What does providing the table schema do then? I confused myself by thinking of it as the Kafka record schema.
  @npawar: The table schema is for Pinot. The avroSchema.toString() value that you would set in the streamConfig will be used by the SimpleAvroMessageDecoder to decode the payload from Kafka
  @npawar: you can read the `SimpleAvroMessageDecoder` code if you're interested in seeing exactly where this is happening:
```
@Override
public void init(Map<String, String> props, Set<String> fieldsToRead, String topicName)
    throws Exception {
  Preconditions.checkState(props.containsKey(SCHEMA), "Avro schema must be provided");
  _avroSchema = new org.apache.avro.Schema.Parser().parse(props.get(SCHEMA));
```
  @npawar: we literally read the property and parse it into an Avro schema and then use it in decoding
  @bagi.priyank: Yeah I saw that yesterday. For some reason I believed that somehow table schema was providing that schema and I just needed to figure out how to configure it :joy:
@npawar: `"stream.kafka.decoder.prop.schema" : "<your avro schema here>"`
@bagi.priyank: got it. thanks!
@bagi.priyank: one more question - do I need to keep ports 8098 and 8099 open on the server and broker nodes? I am setting everything up manually right now.
  @npawar: didn't follow... are you starting the server and broker on those ports?
  @bagi.priyank: Let me take a step back. I am starting empty AWS EC2 instances and following the instructions for setting up Pinot components manually via launcher scripts from . I have to set up security groups (opening up ports to allow traffic between the different instances for the broker, controller, ZK and server) and was wondering if not opening up these two ports was why I was seeing a bad table state.
  @bagi.priyank: The reason I asked is that I saw this in the controller logs: ```2021/11/01 23:00:13.776 INFO [AssignableInstanceManager] [HelixController-pipeline-task-PinotCluster-(4d561598_TASK)] Current quota capacity: {"Controller_<controller_ip_address>_9000":{"TASK_EXEC_THREAD":{"DEFAULT":"0/40"}},"Server_<server_ip_address>_8098":{"TASK_EXEC_THREAD":{"DEFAULT":"0/40"}},"Broker_<broker_ip_address>_8099":{"TASK_EXEC_THREAD":{"DEFAULT":"0/40"}}}```
  @npawar: @xiangfu0 ^^ can you help answer this?
  @xiangfu0: Yes, you need these two ports open for Pinot to query each instance.
  @xiangfu0: The broker port is the query entry point; the server port is used for broker-server communication.
  @bagi.priyank: Perfect, thanks a ton Neha and Xiang!
@bagi.priyank: I finally got it working. Thanks a ton for all the help. Had to wrangle with the schema json a bit but finally victory!
  @npawar: What would have helped you documentation-wise? I've noted: 1. Docs for SimpleAvroMessageDecoder. Anything else?
  @bagi.priyank: The ports that need to be open.
  @bagi.priyank: I honestly think that documentation is already really good. If only I had read better without making assumptions I would have been able to figure things out myself. I think I struggled more with schema json than finding information for setting up pinot. So definitely great job by you guys!
@nsanthanam: @nsanthanam has joined the channel