#general
@hongtaozhang: @hongtaozhang has joined the channel
@krishnalumia535: @krishnalumia535 has joined the channel
@kaustabhganguly: @kaustabhganguly has joined the channel
@kaustabhganguly: Hii everyone ! I'm Kaustabh from India
@kaustabhganguly: I'm a fresh CS grad just exploring things. I'm new to streaming data, Kafka, and Pinot. I want to merge batched data and streaming data and use Pinot on top of it. My plan is to use Kafka Connect, since it's an ideal solution for merging batch and streaming data into topics & partitions. So my pipeline is basically using Kafka for merging and then having Pinot consume the stream from Kafka. *Is there a better solution that comes to anyone's mind? Please correct me if there's any fallacy in my logic.*
@mayanks: Since Pinot can ingest data from offline directly, you could simply have Pinot ingest from a separate offline pipeline as well as Kafka stream.
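For reference, this pattern is usually set up as a hybrid table: an OFFLINE and a REALTIME table config sharing the same table name, so queries span both. A minimal sketch, where the table name, topic, broker address, and time column are made-up placeholders (not a complete, tested config):

```json
{
  "tableName": "myEvents",
  "tableType": "OFFLINE",
  "segmentsConfig": { "timeColumnName": "ts", "replication": "1" },
  "tenants": {},
  "tableIndexConfig": {},
  "metadata": {}
}
```

```json
{
  "tableName": "myEvents",
  "tableType": "REALTIME",
  "segmentsConfig": { "timeColumnName": "ts", "replication": "1" },
  "tenants": {},
  "tableIndexConfig": {
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "myEvents",
      "stream.kafka.broker.list": "localhost:9092",
      "stream.kafka.consumer.type": "lowlevel"
    }
  },
  "metadata": {}
}
```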
@kaustabhganguly: Thanks. Trying that out. Will ask here if I have some doubts.
@mayanks: Yes, feel free to ask any questions here
@mayanks:
@mayanks: Also see if you can find your answer ^^. If not, perhaps we can improve the docs
@kaustabhganguly: Sure thing.
@pedro.cls93: Hello, How does Pinot decide if a field in an incoming message is null to apply the defaultNullValue? Does the key of the field have to be missing? For a String field of name `fieldX` with default value `"default"`, ```{ "schemaName": "HitExecutionView", "dimensionFieldSpecs": [ { "name": "fieldX", "dataType": "STRING", "defaultNullValue": "default" },..., ]}``` if an incoming message has the following payload: ```{ ..., "fieldX": null, ..., }``` What is the expected value in Pinot? `null` or `"default"` ?
@mayanks: The incoming null gets translated into the default null value and stored in Pinot. So in your example, “default” will be stored
@pedro.cls93: I'm seeing differently, would you mind joining a call with me and taking a look?
@anusha.munukuntla: @anusha.munukuntla has joined the channel
@kylebanker: @kylebanker has joined the channel
@ken: My ops guy is setting up Docker containers, and wants to know why the base Pinot Dockerfile has ```VOLUME ["${PINOT_HOME}/configs", "${PINOT_HOME}/data"]``` since he sees that there’s nothing being stored in the `/data` directory. Any input?
@mayanks: Servers will store local copy of segments there?
@ken: But normally local copies of segments are stored in `/tmp/xxx`, or so I thought?
@dlavoie: By default, the OSS helm chart will configure $HOME/data as the data dir for pinot
@dlavoie: It’s in line with the default value of `controller.data.dir` of the helm chart.
@ken: Hmm, OK. So since we’re using HDFS as the deep store, this wouldn’t be getting used, right?
@dlavoie: Indeed
@dlavoie: But keep in mind that servers will use that path
@dlavoie: So the volume defined in the docker image is relevant for the segments stored by the servers.
@ken: But wouldn’t you want that to be temp storage, and not mapped outside of Docker?
@dlavoie: Nope
@dlavoie: It’s the same as kafka
@dlavoie: sure
@dlavoie: brokers can rebuild their data from other replicas and deepstore and everything
@dlavoie: But, trust me, if you want to avoid network jitter when your servers are restarting, you’ll be happy with a persistent volume for your servers’ segments
@dlavoie: Segment FS hosted by server should not be considered temporary
@dlavoie: Deepstore download is a fallback in case of loss
@ken: I’ll have to poke around in one of our server processes to see why the ops guy thinks there’s nothing in /data
@ken: Thanks for the input
@dlavoie: Check how your server data dir is configured
@dlavoie: If you want to speed up server restart and avoid redownloading segments from deepstore, configuring the data dir of server in a persistent volume will improve stability of your cluster greatly when things go wrong
@ken: Right. So this would be a `server.data.dir` configuration value?
@dlavoie: `pinot.server.instance.dataDir` :upside_down_face:
@dlavoie: the takeaway is that the volume defined in the dockerfile is opinionated towards the oss helm chart and not aligned with the default values from the… dockerfile itself…
@ken: Nice. I guess `pinot.server.instance.segmentTarDir` can be a temp dir then.
@dlavoie: not exactly
@dlavoie: turns out it’s more subtle than that :slightly_smiling_face:
@dlavoie: ``` dataDir: /var/pinot/server/data/index segmentTarDir: /var/pinot/server/data/segment```
@dlavoie: `pinot.server.instance.dataDir` is the index storage location, and `pinot.server.instance.segmentTarDir` is the tgz dir
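In server config terms, the helm defaults above presumably map to these two properties (paths copied from the helm values shown earlier, not an authoritative config):

```properties
pinot.server.instance.dataDir=/var/pinot/server/data/index
pinot.server.instance.segmentTarDir=/var/pinot/server/data/segment
```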
@dlavoie: helm chart stores them both in the same `data` volume of the dockerfile
@ken: OK - seems like
@dlavoie: The definition of temporary can be loose maybe? :smile:
@dlavoie: If you need to rebuild the segment indexes, there’s value in having the tgz persisted.
@dlavoie: If the definition of `all the data` is what is on the query path, it is accurate :stuck_out_tongue:
@ken: But if only indexes go into `pinot.server.instance.dataDir`, then you’d need to access the tgz to get data in a column that doesn’t have an index on it.
@dlavoie: @fx19880617 to the rescue for that last one :slightly_smiling_face:
@ken: :slightly_smiling_face: I’ll see what he says when I’m back online after dinner…
@ken: Thanks again
@dlavoie: my pleasure!
#random
@hongtaozhang: @hongtaozhang has joined the channel
@krishnalumia535: @krishnalumia535 has joined the channel
@kaustabhganguly: @kaustabhganguly has joined the channel
@anusha.munukuntla: @anusha.munukuntla has joined the channel
@kylebanker: @kylebanker has joined the channel
#troubleshooting
@hongtaozhang: @hongtaozhang has joined the channel
@chxing: Hi All. If we build a Pinot cluster without deep storage, the controller will store all the segments on the controller disk configured in `controller.data.dir`. Is there a way to delete controller segments based on retention time, since we don’t have enough disk space on the controller?
@npawar: In the table config, you can set retentionTimeValue and retentionTimeUnit. Check out the table config documentation
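For reference, those settings live under `segmentsConfig` in the table config; a sketch with an example 30-day retention (the value itself is just a placeholder):

```json
"segmentsConfig": {
  "retentionTimeUnit": "DAYS",
  "retentionTimeValue": "30",
  ...
}
```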
@npawar:
@chxing: Thanks for your reply. But I’m confused: if we keep the table retentionTimeValue very long, like 1 year, all the segments still need to be stored on the controller, which will fill the controller node quickly
@npawar: Typically, you would attach an NFS to the controllers for this.
@npawar: @fx19880617 any other suggestions from your experience ? ^^
@krishnalumia535: @krishnalumia535 has joined the channel
@pedro.cls93: Hi guys, is there a safeguard when applying ingestion transformations if the input field is the default value? I.e: Given this transformation: ```{ "columnName": "dateOfBirthMs", "transformFunction": "fromDateTime(dateOfBirth, 'yyyy-MM-dd''T''HH:mm:ss''Z')" }``` And schema definitions: ```"dimensionFieldSpecs": [ ,..., { "name": "dateOfBirth", "dataType": "STRING" },..., ], "dateTimeFieldSpecs": [ ..., { "name": "dateOfBirthMs", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS" } ],``` I get this exception: ```java.lang.IllegalStateException: Caught exception while invoking method: public static long org.apache.pinot.common.function.scalar.DateTimeFunctions.fromDateTime(java.lang.String,java.lang.String) with arguments: [null, yyyy-MM-dd'T'HH:mm:ss'Z]``` I was under the impression that Pinot would not apply the transformation if the input field is null or that the transformation itself would be resilient. Is there any way around this?
@mayanks: Yes, transform functions should be able to handle nulls. As a workaround, you can convert it into a Groovy function and add the null check for now.
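A rough sketch of that workaround using Pinot's `Groovy({script}, args...)` transform syntax. The `0` sentinel and the `SimpleDateFormat` parsing inside the script are assumptions for illustration, not tested code:

```json
{
  "columnName": "dateOfBirthMs",
  "transformFunction": "Groovy({dateOfBirth == null ? 0 : new java.text.SimpleDateFormat(\"yyyy-MM-dd'T'HH:mm:ss'Z'\").parse(dateOfBirth).getTime()}, dateOfBirth)"
}
```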
@pedro.cls93: A follow-up question.... How does Pinot decide if a field in an incoming message is null to apply the defaultNullValue? Does the key of the field have to be missing?
@pedro.cls93: For a String field of name `fieldX` with default value `"default"`, if an incoming message has the following payload: ```{ ..., "fieldX": null, ..., }``` What is the expected value in Pinot? `null` or `"default"`
@kaustabhganguly: @kaustabhganguly has joined the channel
@anusha.munukuntla: @anusha.munukuntla has joined the channel
@kylebanker: @kylebanker has joined the channel
#pinot-dev
@anusha.munukuntla: @anusha.munukuntla has joined the channel
#getting-started
@kmvb.tau: @kmvb.tau has joined the channel
#fix_llc_segment_upload
@changliu: Hi @ssubrama, @tingchen, I refactored the ZK access when we get the list of segments for upload retry. Would you mind taking another look? 1. During the commit phase, enqueue the segment without a download url. 2. `uploadToSegmentStoreIfMissing` reads the in-memory list of segments to fix, and only removes a segment from the in-memory list after a successful upload. 3. Only `prefetchLLCSegmentsWithoutDeepStoreCopy` has access to ZK, and it is triggered when setting up the periodic jobs. Leadership is also checked in this step. And to further reduce ZK access, a filter based on segment creation time is added.
@changliu: @ssubrama your concern about shared constant for time limit check is addressed. I added another constant for it: `MIN_TIME_BEFORE_FIXING_SEGMENT_STORE_COPY_MILLIS`
@ssubrama: I will look at it in the next couple of days. Thanks for addressing all comments
@changliu: Thanks Subbu