Apache Pinot Daily Email Digest (2021-06-11)

Pinot Slack Email Digest Fri, 11 Jun 2021 19:00:35 -0700

#general

@lovenishgoyal: @lovenishgoyal has joined the channel
@prashant.pandey: @prashant.pandey has joined the channel
@patidar.rahul8392: Hi everyone , How to convert a string into double in pinot with sum.functiom .I have tried these 2 queries but getting this error. In my pinot schema file I have data type as string for this field so I m not able to take sum. While wring the transformation in config file I have assign default value as null for this field so it has both non null values and null where data is not available.i guess it is not able to cast the null values into double /decimal. Any way to ignore nulls . I have tried where gross_amount is not null but didn't work. Kindly suggest
@fx19880617: I think you have “null” value in the string which failed the conversion
@fx19880617: can you try to filter out the null value
@fx19880617: select … from myTable where gross_amout <> ‘null’
@patidar.rahul8392: Yes I have nulls in table.ok let me try with <>. I was using is not null.
@patidar.rahul8392: @fx19880617 Worked.
@animeshmishra.ml: @animeshmishra.ml has joined the channel
@gagandeep.singh: Hello guys:wave: For my studies, I am talking about Pinot and its architecture. I created an activity Diagram for demonstrating the query process within the Cluster, but I think some things are missing. I read the original Paper and orient myself on the Query process section. Unfortunately, I was not fully capable of illustrating it. I would appreciate it if some experts could give me some feedback. Thank you very much!
@mayanks: has all what you need
@aaron: I'm trying to understand the difference between Segment URI Push and Segment Metadata Push. I was using Segment URI Push and I filled up the disk on the controller. That seems to make sense to me since the controller had to download all the segments. A couple related questions: 1. If I use metadata push, my understanding is that the controller will direct one of the servers to download the segment instead, is that right? 2. Does that mean the controller will use less disk in that case? 3. Is the final state after URI Push and Metadata Push different? I'd assume in both cases, you should end up with segments distributed across servers, is that right? So I'm just curious why the controller's disk filled up, is it supposed to clean up and isn't doing that, or is this behavior expected?
@mayanks: ```1. Yes. It is true for all pushes though, controller always directs servers to download segments from deepstore (or whatever is configured PinotFs). 2. Yes. Controller only needs to look at segment metadata. There's no need for it to have the entire segment. So metadata push is an optimization to achieve this. 3. Yes, final state is the same. If controller disk filled, you should check what is filling it. if it is the segments, then perhaps controller is not using the deepstore (dataDir not configured)?```
@aaron: Thank you! I'll see if I didn't configure dataDir correctly
@aaron: Hmm so from (1), maybe I don't understand the difference between URI and metadata push
@aaron: It's just that the controller doesn't download from deep store?
@aaron: I do have dataDir configured, here's from the controller conf: ```controller.data.dir= controller.local.temp.dir=/usr/scratch/pinot controller.enable.split.commit=true pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS pinot.controller.segment.fetcher.protocols=file,http,s3 pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher # This is required pinot.controller.storage.factory.s3.region=xxx pinot.controller.storage.factory.s3.accessKey=xxx pinot.controller.storage.factory.s3.secretKey=xxx pinot.controller.storage.factory.s3.endpoint=xxx```
@aaron: OH interesting
@aaron: I screwed up and set controller.data.dir twice in the config.
@aaron: First to a path on local disk, second to s3
@aaron: I'm assuming it took the first data dir?
@mayanks: Seems so.
@aaron: Works now, thanks for the help
@aaron: FWIW it seems like it concatenated the two data dirs :slightly_smiling_face:
@aaron: So it was writing to like, `/usr/.../s3:/pinot` on local disk
@mayanks: I see. Do you mind adding these into the FAQ, for the rest of the community? <#C023BNDT0N8|pinot-docsrus>
@pugarg: Hello Team, i wanted to understand one thing does pinot provide capability of overwrite any segment data ( like we do overwrite partitioned data in hive table)
@mayanks: Yes, you can overwrite a segment simply by pushing it again.
@mayanks: The contents of the old segment are overwritten by the contents of the new segment. Segments are matched based on their name
@pugarg: I dont understand, let me explain my use case.. i have date partitioned parquet data in s3. Let say if i dump 10th june on 10th june and again i pushed 10th june on 11th june, i want that pinot overwrite the data and keep the latest data only
@mayanks: So when you generate segments, there's configurable segment naming convention that defaults to (table_name_minTime_maxTime_num).
@mayanks: for example myTable_2021-06-10_2021-06-10_0
@mayanks: When you regenrate the June 10th data on June 11th, the segment will still be named ^^. And when you push it to Pinot, it will overwrite the previously pushed segment of the same name
@mayanks: It is segment name based, not not really time or partition based.
@pugarg: Can you share some reference where i can see how to create segment name??
@mayanks: this is the default
@mayanks: You can also refer to
@lekant: @lekant has joined the channel
@akhilesh.chg: @akhilesh.chg has joined the channel

#random

@lovenishgoyal: @lovenishgoyal has joined the channel
@prashant.pandey: @prashant.pandey has joined the channel
@animeshmishra.ml: @animeshmishra.ml has joined the channel
@lekant: @lekant has joined the channel
@akhilesh.chg: @akhilesh.chg has joined the channel

#troubleshooting

@lovenishgoyal: @lovenishgoyal has joined the channel
@prashant.pandey: @prashant.pandey has joined the channel
@calvin.mwenda: Hey guys, what’s the best way - if any - to deal with JSON queries where 1) Some data points might be null 2) Not all data points have the same schema … here’s some sample data: ```Pinot schema column is called "properties" with a type of string # First Record { timestamp: '...', browser: '...' } # Second Record null # Third Record { balance: '...' }``` This data is provided by the end user and can be whatever they’d like (this is by design) … so there’s really no “schema” for the data in the JSON column.
@pedro.cls93: Depends on what you intend to do with the field. I had a similar case where a json field could be "anything", all I knew is that the max size of the field was 2147483647 chars (2GB per field). I defined the field as a string dimension field in the schema with a reasonable maxLength (by default pinot truncates to 512 chars). Then I added that column to `noDictionaryColumns` & `jsonIndexColumns` in the table configuration. To query the field I use making sure to always specify the default value (usually to `null` )
@calvin.mwenda: Thanks @pedro.cls93… I’d actually set up most of what you mentioned but was using the `JSONEXTRACTSCALAR` function wrong. One more issue I had, for records with `null` I had to convert it to an empty object for it work, otherwise I would get the exception below: ```QueryExecutionError:\njava.lang.IllegalArgumentException: json can not be null ...``` Which makes sense :sweat_smile:
@mayanks: Thanks @pedro.cls93. Would you mind adding this to the FAQ (<#C023BNDT0N8|pinot-docsrus> for details)?
@calvin.mwenda: Hey @mayanks sure thing. Just to clarify, is this about the null entries?
@mayanks: The FAQ question would be about your original question and you can add all details including null from here
@calvin.mwenda: Ah I see :thumbsup:
@animeshmishra.ml: @animeshmishra.ml has joined the channel
@mayanks: @mayanks set the channel topic: Have a question? Check first. Give back to the community by updating the docs (<#C023BNDT0N8|pinot-docsrus>) when you get your answer here.
@lekant: @lekant has joined the channel
@luanmorenomaciel: Hi experts, has someone faced this problem before while submitting a task to pinot's controller? ```root@pinot-controller-0:/opt/pinot# bin/pinot-admin.sh AddTable \ > -schemaFile /opt/pinot/sch_realtime_enriched_music_data_users.json \ > -tableConfigFile /opt/pinot/realtime_enriched_music_data_users_json.json \ > -exec Exception in thread "main" java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:386) at sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:401) Caused by: java.net.BindException: Address already in use at sun.nio.ch.Net.bind0(Native Method) at sun.nio.ch.Net.bind(Net.java:461) at sun.nio.ch.Net.bind(Net.java:453) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:222) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:85) at sun.net.httpserver.ServerImpl.bind(ServerImpl.java:133) at sun.net.httpserver.HttpServerImpl.bind(HttpServerImpl.java:54) at io.prometheus.jmx.shaded.io.prometheus.client.exporter.HTTPServer.<init>(HTTPServer.java:145) at io.prometheus.jmx.shaded.io.prometheus.jmx.JavaAgent.premain(JavaAgent.java:31) ... 6 more FATAL ERROR in native method: processing of -javaagent failed Aborted (core dumped) root@pinot-controller-0:/opt/pinot#```
@fx19880617: hmm, can you set `JAVA_OPTS=""`
@luanmorenomaciel: what a precision @fx19880617 impressed !!! seems that worked, let me double check
@akhilesh.chg: @akhilesh.chg has joined the channel

#pinot-dev

@fx19880617: I’m going to merge java 11 upgrade PR() today, for devs still on java8, please use ```mvn clean install -DskipTests -Pbin-dist -T 4 -Djdk.version=8``` to build pinot source code.
@ssubrama: Does it make sense to have a flag for java 11 instead? I am thinking code that links with the segment build logic -- may not get to choose their own java RTE. This is certainly the case with Linkedin. Are there others in a similar situation?
@fx19880617: that’s why we would like to call it out now and say we still support java 8 until next release
@fx19880617: then we will drop java8 support and have people use java11
@fx19880617: otherwise we will never get to upgrade :stuck_out_tongue:
@dlavoie: :tada:

#pinot-docs

@ken: I’m updating some doc, but trying to remember whether the internal representation for BOOLEAN is a string (what I thought) or an int (from existing docs)
@jackie.jxt: @ken We recently introduced the native BOOLEAN support which stores boolean as int in the backend. In the old version (`0.7.1` and older version), there is no native BOOLEAN support and BOOLEAN data type is treated as STRING
@ken: Hi @jackie.jxt - OK, that explains it, thanks!
@ken: So should I doc as-is currently (it’s a string), and when the new support gets released somebody will update the docs? Curious about doc standards for calling out changes in versions.
@jackie.jxt: Good point. We should have version specific docs
@jackie.jxt: I'm also editing the `schema` doc now for the newly added data types, let me add some note describing the behavior
@ken: Sounds good. If you’re in there editing the doc, I’ve got an old note about adding the default max length (for only strings?).
@ken: And it seems weird to have basically the same data for fields at and
@ken: Normally you’d put some stripped down info (like “The field types we support are INT, LONG, ….” with a ref to the reference page for more details.
@ken: Unless, of course, everyone enjoys keeping multiple copies of documentation in sync :slightly_smiling_face:
@jackie.jxt: Good call. I don't even know we have 2 pages for schema...
@jackie.jxt: @ken I just added the new data types and the behavior note to
@jackie.jxt: Let's figure out how to organize these 2 pages and minimize the duplicate material

#pinot-docsrus

@calvin.mwenda: @calvin.mwenda has joined the channel
--------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@pinot.apache.org For additional commands, e-mail: dev-h...@pinot.apache.org