#general
@abinanths: @abinanths has joined the channel
@mayanks: Dear Community, looking for some help with `scala-maven-plugin` and gpg signing. We are trying to cut the next release for Pinot, and running into the following error. This seems to be due to the fact that `maven-source-plugin` and `scala-maven-plugin` have separate lifecycles, and the gpg files created by the former get deleted by the latter. ```[ERROR] Failed to execute goal org.apache.maven.plugins:maven-gpg-plugin:1.6:sign (sign-artifacts) on project pinot-spark-connector: Exit code: 2 -> [Help 1] org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-gpg-plugin:1.6:sign (sign-artifacts) on project pinot-spark-connector: Exit code: 2```
@mayanks: Our question is if this
@ackermanholden: @ackermanholden has joined the channel
@noahprince8: What’s the Pinot philosophy on dimension tables with time series datasets? Say time series `a` joins to time series `b` on some primary key. i.e. ```SELECT * FROM a JOIN b on a.id = b.id WHERE a.timestamp BETWEEN a and b``` If you were to do a traditional join, you potentially scan all of the data in time series `b`. With bloom filters, you could maybe scan fewer columns. You can also specify a time bound on `b.timestamp` to make it slightly less bad. Alternatively, you could create a pre-aggregation before Pinot that joins them together and puts the important metrics from time series `b` on to time series `a`. What’s the preferred way to solve this problem in the Pinot world?
@g.kishore: > Alternatively, you could create a pre-aggregation before Pinot that joins them together and puts the important metrics from time series `b` on to time series `a`. This is the preferred option for Pinot.
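For illustration, a minimal sketch of that pre-aggregation step, assuming a Spark job runs upstream of Pinot; the paths, column names, and metric names are hypothetical:
```
// Hypothetical Spark job that joins the important metrics from time series `b`
// onto time series `a` before ingestion into Pinot, so Pinot never joins at query time.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EnrichAWithB {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("enrich-a-with-b").getOrCreate();

    Dataset<Row> a = spark.read().parquet("s3://bucket/a/");
    Dataset<Row> b = spark.read().parquet("s3://bucket/b/")
        .selectExpr("id", "metric_x", "metric_y");   // keep only the metrics `a` needs

    // Denormalize: each row of `a` now carries b's metrics as extra columns.
    Dataset<Row> enriched = a.join(b, "id");

    // Write out the enriched dataset; this is what gets ingested into the Pinot table.
    enriched.write().parquet("s3://bucket/a_enriched/");
  }
}
```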
@noahprince8: Do people typically do this over a window, and just accept that you’re going to take a latency hit waiting for `b` to come through? Or are there designs that go back and add dimensions? I can see where you could maybe leverage upsert with null columns for this. Not sure how performant it’d be
@chundong.wang: When it comes to real-time segments taking precedence over offline segments at query time, is the boundary hard-coded to 24 hours, or is it a merge between real-time and offline segments, so that if there's no data in real-time for the specified time range, aggregates from the offline segments would be served as the query result?
@g.kishore: it's not hardcoded
@chundong.wang: Does this apply if it’s just an offline table, instead of hybrid table?
@g.kishore: no, only for hybrid table
@chundong.wang: Yep got it. Thanks. > In case of hybrid tables, the brokers ensure that the overlap between realtime and offline segment data is queried exactly once, by performing *offline and realtime federation*.
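As an aside, a tiny sketch of that federation idea, assuming the broker derives a time boundary from the offline segments and splits one logical query in two; the class, method, and boundary rule shown are illustrative, not the exact Pinot implementation:
```
// Toy illustration of the "offline and realtime federation" quoted above: the broker
// derives a time boundary from the offline table's data (not a hard-coded 24 hours)
// and splits one logical query so overlapping data is counted exactly once.
class HybridFederationSketch {
  static String[] split(String baseQuery, String timeColumn, long offlineMaxTimeMs) {
    long boundary = offlineMaxTimeMs;  // assumed: derived from the offline segments' max end time
    String offlinePart  = baseQuery + " WHERE " + timeColumn + " <= " + boundary;  // -> OFFLINE table
    String realtimePart = baseQuery + " WHERE " + timeColumn + " > "  + boundary;  // -> REALTIME table
    return new String[]{offlinePart, realtimePart};
  }
}
```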
#random
@abinanths: @abinanths has joined the channel
@ackermanholden: @ackermanholden has joined the channel
#troubleshooting
@yash.agarwal: Hey Team, I was working on batch replacement of multiple segments. I am looking to atomically replace all the segments together. I understand that there is an API for replacing segments, but how do I configure the whole flow? We are fine with doubling the storage during the ingestion phase.
@mayanks: @snlee ^^
@mayanks: There's work that needs to be done for atomic swap
@mayanks: One way would be to version the segments, and only when all segments (old and new) are online in the external view do you make the atomic switch.
@mayanks: Broker would have to know how to route query to one version vs other
@yash.agarwal: Sure. How close or far are we from it?
@mayanks: So far there are only APIs. The rest of the work needs to be done
@snlee: @yash.agarwal If you want to replace the segments atomically using the existing replaceSegment API, you need to do the following. Let's say that we want to replace segments (s1, s2, s3) -> (s4, s5, s6):
1. Call startReplaceSegments with:
   • segmentsFrom: s1, s2, s3
   • segmentsTo: s4, s5, s6
   The API will return an ID (store this).
2. Upload segments s4, s5, s6.
3. Call endReplaceSegments, providing the ID that you got from step 1.
@snlee: So, until the step 3 is done, uploaded segment (s4, s5, s6) should not be used.
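For reference, a rough sketch of that three-step flow against the controller's REST API; the controller address, table name, endpoint paths, JSON field names, and the `segmentLineageEntryId` parameter are assumptions here, so verify them against the controller's Swagger docs for your Pinot version:
```
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AtomicSegmentSwap {
  static final String CONTROLLER = "http://localhost:9000";  // assumed controller address
  static final String TABLE = "myTable_OFFLINE";             // assumed table name

  public static void main(String[] args) throws Exception {
    HttpClient http = HttpClient.newHttpClient();

    // Step 1: startReplaceSegments with segmentsFrom -> segmentsTo; keep the returned id.
    String body = "{\"segmentsFrom\":[\"s1\",\"s2\",\"s3\"],\"segmentsTo\":[\"s4\",\"s5\",\"s6\"]}";
    HttpRequest start = HttpRequest.newBuilder()
        .uri(URI.create(CONTROLLER + "/segments/" + TABLE + "/startReplaceSegments"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
    // The response JSON carries the lineage entry id; shown raw here, parse it in real code.
    String lineageId = http.send(start, HttpResponse.BodyHandlers.ofString()).body();

    // Step 2: upload segments s4, s5, s6 through the normal segment upload path (omitted).

    // Step 3: endReplaceSegments; only after this do queries switch to s4, s5, s6 atomically.
    HttpRequest end = HttpRequest.newBuilder()
        .uri(URI.create(CONTROLLER + "/segments/" + TABLE
            + "/endReplaceSegments?segmentLineageEntryId=" + lineageId))
        .POST(HttpRequest.BodyPublishers.noBody())
        .build();
    http.send(end, HttpResponse.BodyHandlers.ofString());
  }
}
```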
@yash.agarwal: should not be used or would not be used? i.e. will Pinot handle not using s4, s5, s6, or does the user of the system need to handle it?
@nguyenhoanglam1990: Hi team, I have a problem: when I delete a realtime table and recreate it with the same name, the segment status always says bad. Can you help me fix this?
@ssubrama: When you delete the table, you should wait until the segments go off the external view. Perhaps you recreated the table with the same name too soon? Can you try creating a table with another name (and consuming from the same topic)?
#time-based-segment-pruner
@noahprince8: So I’m a bit confused in the code here — it appears that when nothing is referencing a `SegmentDataManager` anymore, it gets destroyed. But it still remains in the `_segmentDataManagerMap`. Which means potentially people can acquire destroyed segments?
@g.kishore: @ssubrama ^^
@noahprince8: Seems like they _must_ be getting reloaded every time? Or something?
@ssubrama: IIRC it gets taken off the map when the segment gets unloaded, and the last call to release the segment destroys the segment.
@ssubrama: Look in `BaseTableDataManager`. The code has moved around a bit since I wrote it, but the logic is still the same.
@noahprince8: I have been. The issue is that it destroys the segment, but it still exists in the map
@noahprince8: Now maybe that’s fine because all the metadata on that segment is already defined at this point?
@noahprince8: But it seems for sure the only way for it to get formally removed so that it cannot be retrieved via `acquireSegment` is `removeSegment`. Therefore `acquireSegment` can potentially return segments that have been “released”. Is this an issue?
@ssubrama: The call `removeSegment` removes it from the map.
@noahprince8: Yes, but that call is not used on `releaseSegment`
@ssubrama: No, that is not an issue. The last call to release the segment will destroy it. This is because there can be queries in the pipeline
@noahprince8: What is destroyed vs removed?
@noahprince8: Because it sounds like a segment can be destroyed but not removed.
@ssubrama: removed == Segment has been dropped from the table. destroyed == data of the segment is released from the server's memory
@ssubrama: acquire == get a segment for querying. release == release the segment you have acquired (locked)
@noahprince8: Okay. So what happens, then, when someone acquires a segment manager that has been released?
@noahprince8: The underlying `ImmutableSegment` has been destroyed.
@ssubrama: So, the following sequence can happen:
T1 acquires a segment
T2 (helix) removes the segment while the query is in progress
T1 releases the segment
@ssubrama: In this case, T1 will end up destroying the data (unmapping the segment)
@ssubrama: Another possible sequence could be that no thread is querying the segment, and helix calls remove(). In that case, the segment is destroyed directly
@noahprince8: What about: ```Query 1 acquires a segment Query 1 releases the segment This is the last reference, so segment is destroyed Query 2 acquires the same segment (no remove call happened)```
@ssubrama: This usually cannot happen, unless the segment has been replaced by another one with the same name.
@noahprince8: In an offline table, presumably you have segments laying around that multiple queries use, past and future?
@noahprince8: Unless there is something that is calling `removeSegment` every time a query finishes? Or something is calling `addSegment` every time a new query starts?
@ssubrama: yes, multiple queries can use the same segment
@ssubrama: You may want to look at the test `BaseTableDataManagerTest`; it has a demonstration
@noahprince8: We can represent this as a state machine, right? Segment goes from ADDED --> IN USE --> RELEASED --> IN USE, because RELEASED != DROPPED
@ssubrama: I have a meeting now
@noahprince8: ``` /** * Increases the reference count. Should be called when acquiring the segment. * * @return Whether the segment is still valid (i.e. reference count not 0) */ public synchronized boolean increaseReferenceCount() { if (_referenceCount == 0) { return false; } else { _referenceCount++; return true; } }``` Figured it out.
@noahprince8: There’s a guard here that when the ref count goes to 0, this always returns false, which keeps it from being returned in acquireSegments.
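Putting the pieces together, a simplified sketch of how acquire/release/remove interact with the reference count and the map, loosely mirroring `SegmentDataManager` and `BaseTableDataManager` (class names and bodies here are illustrative, not the actual Pinot code):
```
// The ref count starts at 1 (the table's own reference); acquireSegment refuses managers
// whose count has already hit 0; releaseSegment never touches the map, only removeSegment does.
import java.util.concurrent.ConcurrentHashMap;

class SegmentDataManagerSketch {
  private int _referenceCount = 1;

  synchronized boolean increaseReferenceCount() {
    if (_referenceCount == 0) {
      return false;                   // already destroyed; cannot be handed out again
    }
    _referenceCount++;
    return true;
  }

  synchronized boolean decreaseReferenceCount() {
    return --_referenceCount == 0;    // true => caller must destroy the segment data
  }

  void destroy() { /* unmap / release the segment's memory */ }
}

class TableDataManagerSketch {
  private final ConcurrentHashMap<String, SegmentDataManagerSketch> _segmentDataManagerMap =
      new ConcurrentHashMap<>();

  // Query path: only hands out managers whose ref count is still > 0.
  SegmentDataManagerSketch acquireSegment(String segmentName) {
    SegmentDataManagerSketch sdm = _segmentDataManagerMap.get(segmentName);
    return (sdm != null && sdm.increaseReferenceCount()) ? sdm : null;
  }

  // Query path: the last release destroys the data; the count can only reach 0 here after
  // removeSegment has already dropped the table's own reference.
  void releaseSegment(SegmentDataManagerSketch sdm) {
    if (sdm.decreaseReferenceCount()) {
      sdm.destroy();
    }
  }

  // Helix path: drop the table's own reference and take the entry out of the map;
  // if queries are still in flight, the data is destroyed on their last release instead.
  void removeSegment(String segmentName) {
    SegmentDataManagerSketch sdm = _segmentDataManagerMap.remove(segmentName);
    if (sdm != null && sdm.decreaseReferenceCount()) {
      sdm.destroy();
    }
  }
}
```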
@noahprince8: A bit wasteful on memory. In my PR I'm also going to include removing it from the map when it gets released.
@noahprince8: Thanks for the help. I’m getting pretty close on the segment cold storage. Also just realized I posted in the wrong channel :face_palm:
@mayanks: Are you saying segments that would be deleted due to retention are still kept in SegmentDataManager? That sounds like a bug to fix
@noahprince8: Will have to look into it more, but maybe.
@noahprince8: It's just a dangling ref in the concurrent hashmap that would keep the segment manager object from being garbage collected
@ssubrama: OK, back. @noahprince8 I don't understand your logic.
@ssubrama: So, acquire and release can happen independently of removing the segment.
@ssubrama: the segment data should be cleaned up only when the last reference to it goes away.
@ssubrama: If there is no reference, then the remove operation should clean up the data.
@ssubrama: Agreed
@ssubrama: ?
@noahprince8: Yeah. Need to look at the code again. But I think my point of confusion was that it starts at ref count one. And it locks at 0. And the return result of inc or dec is used in filtering
@mayanks: So the ref counting is for query processing only
@mayanks: The removal happens from a different call stack
@noahprince8: Yeah. And query processing or acquisition can never result in a removal from the segment data manager hash map
@noahprince8: Should be fine. Just a bit of a learning curve here haha
@ssubrama: @ssubrama has joined the channel
