#general
@hasanat.muztaba: @hasanat.muztaba has joined the channel
@mark.r.watkins: @mark.r.watkins has joined the channel
@banlinhlinhcon: @banlinhlinhcon has joined the channel
@diogo.baeder: @diogo.baeder has joined the channel
#random
@hasanat.muztaba: @hasanat.muztaba has joined the channel
@mark.r.watkins: @mark.r.watkins has joined the channel
@banlinhlinhcon: @banlinhlinhcon has joined the channel
@diogo.baeder: @diogo.baeder has joined the channel
#troubleshooting
@nadeemsadim: can we apply inverted indexing or range indexing to existing old table columns with tens of millions of records and what impact it will have on the table performance and should I expect query latency to improve for old data queried as well ..
@xiangfu0: it will improve the queries with predicates on the columns you put indexes
@xiangfu0: you can apply it on old tables then reload all the segments
@nadeemsadim: what impact will it have on heap memory
@xiangfu0: there will be extra disk space overhead
@nadeemsadim: will I need to increase heap memory, since the indexing will be added separately as a bitmap in heap memory
@nadeemsadim: will it be stored in ram or ssd
@xiangfu0: indexes are on ssd; at runtime pinot will load data/indexes
@nadeemsadim: ok once processed .. they will be written on ssd and gc will clear the heap eventually for hot segments which are being consumed?
@xiangfu0: right
@xiangfu0: it will be off-heap
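For reference, a minimal sketch of what that could look like (table and column names here are made-up examples, and the reload path should be verified in your controller's Swagger UI): list the columns in `tableIndexConfig`, then reload the segments so the indexes get built for existing data.
```
"tableIndexConfig": {
  "invertedIndexColumns": ["tenantId"],
  "rangeIndexColumns": ["eventTimeMillis"]
}
```
```
# reload all segments of the table so the new indexes are built for old data as well
curl -X POST "http://<controller-host>:9000/segments/myTable_OFFLINE/reload"
```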
@nadeemsadim: how difficult is it to apply lucene fst indexing for regexp queries
@nadeemsadim: I mean do I only need to apply the indexing on the string columns, or is anything else also needed
@xiangfu0: yes
@xiangfu0: apply index, then it will improve the text search query
@xiangfu0: but you pay extra disk space for index
@nadeemsadim: what is the preferred or recommended indexing for substring while doing group by and filter .. lucene fst?
@xiangfu0: lucene will help on filter not group by
@nadeemsadim: ok filter will prune a lot of segments / records
@nadeemsadim: that will also improve latency drastically i guess
@nadeemsadim: since we are also doing filter with substring on a string column
@nadeemsadim: so lucene fst on that string column where we are doing substring while filter will help?
@nadeemsadim: *apply index, then it will improve the text search query* => so regexp or substring query performance won't improve if I use lucene fst, and it will only improve if I use the text_match clause?
@nadeemsadim: cc: @mayanks @jackie.jxt @ssubrama @g.kishore @npawar @ken @dlavoie
@nadeemsadim: cc: @mohamedkashifuddin @shaileshjha061 @hussain
@xiangfu0: it will improve the text_match
@xiangfu0: substring or regex won’t get the benefit
@nadeemsadim: ok
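A minimal sketch of enabling the Lucene text index and querying it with text_match, assuming a raw-encoded string column named `logMessage` (names are placeholders; check the Pinot text search docs for the exact config in your version):
```
"fieldConfigList": [
  { "name": "logMessage", "encodingType": "RAW", "indexType": "TEXT" }
]
```
```
SELECT tenantId, COUNT(*)
FROM myTable
WHERE TEXT_MATCH(logMessage, 'checkout')
GROUP BY tenantId
```
As noted above, the index only speeds up TEXT_MATCH filters; a substring or regexp filter on the same column won't use it.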
@nadeemsadim: if the time column is the primary time column in the schema and stored as a long data type in milliseconds epoch with granularity as seconds .. is it necessary to use range indexing on the time column on my tables if all queries have a filter (where clause) with respect to time, say last 5 min or last 1 hour or last 30 days @xiangfu0 .. I mean by default segments do have a start time and end time if time is the primary time column I GUESS .. so will segment pruning happen by default, or is range indexing on the time column needed?
@xiangfu0: segment prune will take care of start/end time if it’s in your predicate
@nadeemsadim: so range indexing needed ?
@xiangfu0: no
@nadeemsadim: ok got it
@xiangfu0: it will help, but since your segments are already ordered by time, I feel it's not necessary to put an index on it
@xiangfu0: do you have sort index
@xiangfu0: you can sort time
@nadeemsadim: ok
@nadeemsadim: but the time coming in on the kafka topic may not be sorted .. I mean some unordered/unsorted timestamps may come in during ingestion
@nadeemsadim: so is a sorted index recommended in such a scenario?
@nadeemsadim: for time column
@xiangfu0: pinot will sort data when it’s persisting the data to disk
@xiangfu0: but you can only pick one sorted index
@nadeemsadim: ok then I will select the time column as the sorted indexing column
@nadeemsadim: then range indexing won't be needed on the time column .. right
@nadeemsadim: also, a sorted index will internally also serve as an inverted index
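A sketch of how the sorted column is declared in the table config (column name is a placeholder). For realtime tables Pinot sorts on this column when it flushes a consuming segment to disk, and a sorted column gives inverted-index-like lookups on it, so a separate range/inverted index on the time column is usually unnecessary:
```
"tableIndexConfig": {
  "sortedColumn": ["eventTimeMillis"]
}
```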
@nadeemsadim: also one last doubt .. will inverted indexing performance be the same for string type columns and long/int columns, or will int/long columns give better performance even after the indexing is applied? my use case is that I have some tenantId or other ID columns as strings, but they are digits only .. so is it necessary to keep them as long type in the schema and only then apply inverted indexing, or can I use the string data type with inverted indexing without much impact on query latency while filtering or group by ..
@xiangfu0: you can use string, but you will pay more storage cost
@nadeemsadim: latency wise not much impact?
@xiangfu0: if you can use long type to represent id, then suggest to use long
@nadeemsadim: ok
@xiangfu0: latency wise, using long will definitely help
@nadeemsadim: cool
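For illustration, the id column declared as LONG in the schema (instead of STRING), with the inverted index still listed in the table config; names here are placeholders:
```
"dimensionFieldSpecs": [
  { "name": "tenantId", "dataType": "LONG" }
]
```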
@ssubrama: @nadeemsadim please use the controller endpoint to recommend indexes. As for inverted indexes, the only time they are on heap is when a segment is in consuming state. Once segments are completed (or they are offline segments) there is nothing stored on the heap
@nadeemsadim: Ok @ssubrama got it
@hasanat.muztaba: @hasanat.muztaba has joined the channel
@mark.r.watkins: @mark.r.watkins has joined the channel
@banlinhlinhcon: @banlinhlinhcon has joined the channel
@31cswarszawa: Hey everyone! I’m working on creating a realtime pinot table that is based on messages published on kafka topic. My goal is to see messages pushed to kafka in pinot table as fast as it is possible. Here’s my current table config:
```
"tableIndexConfig": {
  "loadMode": "MMAP",
  "streamConfigs": {
    "streamType": "kafka",
    "stream.kafka.consumer.type": "simple",
    "stream.kafka.topic.name": "bb8_logs",
    "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
    "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
    "stream.kafka.zk.broker.url": "zookeeper:2181/kafka",
    "stream.kafka.broker.list": "kafka:9092",
    "realtime.segment.flush.threshold.rows": "0",
    "realtime.segment.flush.threshold.time": "1h",
    "realtime.segment.flush.desired.size": "50M",
    "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
  }
}
```
Wanted to ask you what should be changed to see kafka topic messages in pinot table as fast as it is possible?
@nadeemsadim: use "stream.kafka.consumer.type": "lowlevel",
@nadeemsadim: and give more partitions on the kafka topic
@nadeemsadim: pinot throughput should increase considerably
@nadeemsadim: if your producer app is writing data on all partitions of the kafka topic from where pinot is consuming
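That would be the same streamConfigs as in the table config posted above with only the consumer type changed (everything else left as-is):
```
"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.consumer.type": "lowlevel",
  "stream.kafka.topic.name": "bb8_logs",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
  "stream.kafka.zk.broker.url": "zookeeper:2181/kafka",
  "stream.kafka.broker.list": "kafka:9092",
  "realtime.segment.flush.threshold.rows": "0",
  "realtime.segment.flush.threshold.time": "1h",
  "realtime.segment.flush.desired.size": "50M",
  "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
}
```
With the lowlevel (partition-level) consumer, Pinot consumes each Kafka partition in parallel, which is why adding partitions can raise ingestion throughput.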
@31cswarszawa: How can I increase a number of partitions per topic?
@31cswarszawa: Probably there’s some config I’m missing
@nadeemsadim: on kafka
@nadeemsadim: you can use the kafka-topics alter command
@nadeemsadim: ```bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --partitions 40 ```
@31cswarszawa: But currently I can consume messages from kafka topic right after they’re published (with kafkacat). I think that lag is on a Pinot table side.
@nadeemsadim: yeah
@31cswarszawa: So number of partitions will improve this?
@31cswarszawa: (I thought that kafka setting does not influence how pinot consumes data)
@nadeemsadim: yep hopefully
@nadeemsadim: otherwise some other issue
@31cswarszawa: yep, your change worked like a charm!
@31cswarszawa: Many thanks:blush:
@nadeemsadim: welcome
@ssubrama: @31cswarszawa increasing the number of partitions in kafka should be done carefully. Pinot has per-partition overheads, so it is not wise to just set the number of partitions to (say) 512. Ideally, you want to start with a small number of partitions and increase it
@ssubrama: kafka does not allow decrease of partitions
@ssubrama: @31cswarszawa run the realtime prov tool. Try with different numbers of partitions, and see what the mem usage looks like. And then you can decide on the number of partitions. Data is available instantly after it is in kafka. If you are not seeing that, then increasing kafka partitions is not going to do you any good.
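The realtime provisioning tool mentioned above is run through pinot-admin; a rough sketch is below (flag values and paths are placeholders, and the exact flag names should be checked with `RealtimeProvisioningHelper -help` in your distribution):
```
bin/pinot-admin.sh RealtimeProvisioningHelper \
  -tableConfigFile /path/to/realtime-table-config.json \
  -numPartitions 8 \
  -numHosts 2,4,6 \
  -numHours 6,12,24 \
  -sampleCompletedSegmentDir /path/to/one/completed/segment \
  -ingestionRate 1000 \
  -retentionHours 72
```
It prints estimated memory usage and segment sizes for the different host/hour combinations, which helps pick a partition count.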
@diogo.baeder: @diogo.baeder has joined the channel
@gqian3: Hi team, we are trying to do an aggregation on a string field, e.g. `select name, max(url) from table group by name`, but it results in a number format exception. Is there any other way we can get one aggregated string value from a group? Thanks
@mayanks: How about distinct + order by?
@gqian3: Yes we tried that, but the problem is one name can map to multiple url, so it would return multiple rows for each distinct name. We were hoping to get just one url from the group by name.
@mayanks: @jackie.jxt we should be able to allow applicable aggr functions (max,min) on String columns?
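For reference, the distinct + order by workaround discussed above would look roughly like this (same table/column names as in the example query); as noted, it still returns one row per distinct (name, url) pair rather than a single url per name:
```
SELECT DISTINCT name, url
FROM table
ORDER BY name
LIMIT 1000
```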
@nisheetkmr: Hi, I am trying to load parquet files using a spark ingestion task. I had built the jars for java 8 using the command ```mvn install package -DskipTests -DskipiTs -Pbin-dist -Drat.ignoreErrors=true -Djdk.version=8 -Dspark.version=2.4.5``` While running the task I am getting the error ```21/08/20 15:11:24 ERROR LaunchDataIngestionJobCommand: Exception caught: Can't construct a java object for tag:
@nisheetkmr: Ingestion spec:
```
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentUriPushJobRunner'
  extraConfigs:
    stagingDir: 's3://<bucket>/<staging_path>/'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://<bucket>/<parquet_file_path>/'
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: 's3://<bucket>/<path>/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: '<region>'
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: '<pinot_table_name>'
pinotClusterSpecs:
  - controllerURI: '<controller_uri>'
pushJobSpec:
  pushParallelism: 2
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
```
@nisheetkmr: I have checked the spark cluster logs and I can see that the jars are present, and I tried running `Class.forName("org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec")` from a notebook and it ran successfully. Cluster spark conf:
```
spark.driver.extraClassPath=/home/yarn/pinot/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.8.0-shaded.jar:/home/yarn/pinot/lib/pinot-all-0.8.0-jar-with-dependencies.jar:/home/yarn/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.8.0-shaded.jar:/home/yarn/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.8.0-shaded.jar:/home/yarn/pinot/pinot-spi-0.8.0-SNAPSHOT.jar
spark.driver.extraJavaOptions=-Dplugins.dir=/home/yarn/pinot/plugins -Dplugins.include=pinot-s3,pinot-parquet
```
In the logs I can see the jars loaded
```
21/08/20 15:10:25 INFO DriverCorral: Successfully attached library s3a://<jar_path>/pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar to Spark
.......
.......
21/08/20 15:11:21 WARN SparkContext: The jar /local_disk0/tmp/addedFile4953461200729207388pinot_all_0_8_0_SNAPSHOT_jar_with_dependencies-2dd68.jar has been added already. Overwriting of added jars is not supported in the current version.
```
And the plugins are also successfully loaded
```
21/08/20 15:11:23 INFO PluginManager: Plugins root dir is [/home/yarn/pinot/plugins]
21/08/20 15:11:23 INFO PluginManager: Trying to load plugins: [[pinot-s3, pinot-parquet]]
21/08/20 15:11:23 INFO PluginManager: Trying to load plugin [pinot-parquet] from location [/home/yarn/pinot/plugins/pinot-input-format/pinot-parquet]
21/08/20 15:11:23 INFO PluginManager: Successfully loaded plugin [pinot-parquet] from jar files: [file:/home/yarn/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.8.0-SNAPSHOT-shaded.jar]
21/08/20 15:11:23 INFO PluginManager: Successfully Loaded plugin [pinot-parquet] from dir [/home/yarn/pinot/plugins/pinot-input-format/pinot-parquet]
21/08/20 15:11:23 INFO PluginManager: Trying to load plugin [pinot-s3] from location [/home/yarn/pinot/plugins/pinot-file-system/pinot-s3]
21/08/20 15:11:23 INFO PluginManager: Successfully loaded plugin [pinot-s3] from jar files: [file:/home/yarn/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.8.0-SNAPSHOT-shaded.jar]
21/08/20 15:11:23 INFO PluginManager: Successfully Loaded plugin [pinot-s3] from dir [/home/yarn/pinot/plugins/pinot-file-system/pinot-s3]
```
@mayanks: Seems like the yaml file has issues, can you double check on that?
@nisheetkmr: According to the stack trace, the cause is a class not found:
```
Caused by: org.yaml.snakeyaml.error.YAMLException: Class not found: org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
	at org.yaml.snakeyaml.constructor.Constructor.getClassForNode(Constructor.java:660)
	at org.yaml.snakeyaml.constructor.Constructor$ConstructYamlObject.getConstructor(Constructor.java:335)
```
The line of code where it breaks also suggests the same:
```
try {
  cl = getClassForName(name);
} catch (ClassNotFoundException e) {
  throw new YAMLException("Class not found: " + name);
}

protected Class<?> getClassForName(String name) throws ClassNotFoundException {
  return Class.forName(name);
}
```
This doesn’t look like a yaml file issue. I have cross-verified the yaml file and it looks fine. I have attached the yaml file for cross verification.
@nisheetkmr: Complete stack trace: ```21/08/20 15:11:24 ERROR LaunchDataIngestionJobCommand: Got exception to generate IngestionJobSpec for data ingestion job - Can't construct a java object for tag:
@npawar: Are there extra spaces before "executionFrameworkSpec" ? Looks like it from the message you pasted. Yaml is very particular about spaces
@nisheetkmr: No extra spaces before `executionFrameworkSpec`. I have attached yaml file as well in above message for cross reference
@bruce.ritchie: Try using --jars instead of extraClasspath
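A sketch of what a `--jars`-based submission could look like, reusing the jar paths from the message above (the launcher class and the `-jobSpecFile` flag are the usual Pinot ingestion entry point, but verify them against your Pinot and Spark versions; the spec path is a placeholder):
```
spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-Dplugins.dir=/home/yarn/pinot/plugins -Dplugins.include=pinot-s3,pinot-parquet" \
  --jars "/home/yarn/pinot/lib/pinot-all-0.8.0-jar-with-dependencies.jar,/home/yarn/pinot/plugins/pinot-batch-ingestion/pinot-batch-ingestion-spark/pinot-batch-ingestion-spark-0.8.0-shaded.jar,/home/yarn/pinot/plugins/pinot-file-system/pinot-s3/pinot-s3-0.8.0-shaded.jar,/home/yarn/pinot/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.8.0-shaded.jar" \
  /home/yarn/pinot/lib/pinot-all-0.8.0-jar-with-dependencies.jar \
  -jobSpecFile /path/to/ingestion-spec.yaml
```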
@mayanks: @kulbir.nijjer I recall you recently faced the issue, spark wasn’t seeing the class. Can we add FAQ around it?
@kulbir.nijjer: @nisheetkmr please try with the --jars option first, but if that doesn't help then in your ingestion spec YAML add a `dependencyJarDir` entry in extraConfigs, where you will first need to copy all the jars present in the <PINOT_HOME>/plugins folder to S3:
```
extraConfigs:
  # stagingDir is used in distributed filesystem to host all the segments then move this directory entirely to output directory.
  stagingDir: 's3://<somepath>/'
  dependencyJarDir: 's3://<somePath>/plugins'
```
and that should address the CNFE. This is a known issue and we are debugging why the jars are not getting deployed to the YARN nodes, so the above could be a short-term workaround in the meantime.
@mayanks: Thanks @kulbir.nijjer
@anu110195: Hey, Getting this sometimes while making connection from local
@anu110195: requests.exceptions.ConnectionError: HTTPConnectionPool(host='pinot2-controller-external.data2.svc', port=9000): Max retries exceeded with url: /tables/requests2/schema (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x105fc4280>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
@anu110195: sometimes it works, but most of the time it doesn't and throws the above error
#getting-started
@kangren.chia: am i missing something, but if you have ```# table1
@kangren.chia: i tried keeping segments for different tables in the same root dir with different prefixes and it worked
@kangren.chia: although this wasn't obvious when i tried to look in the docs, maybe i missed where?
@g.kishore: Are you storing the segments on the controller?
@kangren.chia: no, im using deep store
@kangren.chia: i am trying to tweak my settings for fast queries now
@kangren.chia: i haven’t really seen a guide or resource for this, so trying different segment size / server settings / index types
@g.kishore: Agree, we can definitely improve the docs
@kangren.chia: ah im not complaining here. just explaining the context why i have different segments for different tables (benchmarking)
@mayanks: I think the current usage model is that you may have controller data dir as the parent of the two tables in s3. cc @kulbir.nijjer @xiangfu0
@tiger: When using S3 as the deepstore with the SegmentCreationAndMetadataPush, should the `controller.data.dir` (from the controller conf) be the same as the `outputDirURI` (from the ingestion jobspec) ?
@kangren.chia: i’ve tried this and you can have: ```outputDirUri for ingestion job 1:
@kulbir.nijjer: @kulbir.nijjer has joined the channel