rohityadav1993 commented on code in PR #13107:
URL: https://github.com/apache/pinot/pull/13107#discussion_r1603148256
##########
pinot-common/src/main/java/org/apache/pinot/common/metadata/segment/SegmentPartitionMetadata.java:
##########
@@ -48,6 +53,21 @@ public SegmentPartitionMetadata(
@Nonnull @JsonProperty("columnPartitionMap") Map<String, ColumnPartitionMetadata> columnPartitionMap) {
Preconditions.checkNotNull(columnPartitionMap);
_columnPartitionMap = columnPartitionMap;
+ _uploadedSegmentPartitionId = -1;
+ }
+
+ /**
+ * Constructor for the class.
+ *
+ * @param columnPartitionMap Column name to ColumnPartitionMetadata map.
+ */
+ @JsonCreator
+ public SegmentPartitionMetadata(
+      @Nullable @JsonProperty("columnPartitionMap") Map<String, ColumnPartitionMetadata> columnPartitionMap,
+      @Nullable @JsonProperty(value = "uploadedSegmentPartitionId", defaultValue = "-1")
Review Comment:
> I might have missed it, but how to configure SegmentPartitionConfig in
TableConfig for tables that allow to upload segments built and partitioned
externally?
We don't need to configure the table here, similar to how no such configuration is needed for realtime stream ingestion for upsert tables.
Providing some more context on why the change is needed:
There are two scenarios where data partitioning comes into play:
1. Query routing:
[[docs](https://docs.pinot.apache.org/operators/operating-pinot/tuning/routing#data-ingested-partitioned-by-some-column)]
Data partitioning is not a requirement here but a good optimization.
2. Segment assignment:
    a. If the data is partitioned on a single column with a Pinot-supported
algorithm, we configure the table as:
```
...
"tableIndexConfig": {
...
"segmentPartitionConfig": {
"columnPartitionMap": {
"memberId": {
"functionName": "Modulo",
"numPartitions": 3
}
}
},
...
},
```
**Partitioning for upsert tables**:
Consuming segment assignment: The stream is always externally
partitioned (either on the PK or on another field that still ensures all PKs of
a key land in the same partition) and does not need to use one of Pinot's
supported algorithms. `segmentPartitionConfig` need not be set for the upsert
table either. Each `LLCSegmentName` contains a partitionId substring which is
derived from the stream's partitionId. When assigning a segment to an instance,
we get the partition id by parsing the LLCSegmentName in
`SegmentUtils.getRealtimeSegmentPartitionId`.
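As an illustration of that parsing step, here is a minimal sketch (not the actual `LLCSegmentName`/`SegmentUtils` code) that recovers the partition id, assuming the `{tableName}__{partitionGroupId}__{sequenceNumber}__{creationTime}` naming layout:

```java
// Illustrative sketch only: recovers the stream partition id embedded in an
// LLC-style segment name by splitting on the "__" separator.
public class LLCSegmentNameSketch {
  private static final String SEPARATOR = "__";

  // Returns the partition id (second token) from an LLC-style segment name.
  static int getPartitionId(String segmentName) {
    String[] parts = segmentName.split(SEPARATOR);
    if (parts.length != 4) {
      throw new IllegalArgumentException("Not an LLC segment name: " + segmentName);
    }
    return Integer.parseInt(parts[1]);
  }

  public static void main(String[] args) {
    // e.g. "myTable__3__0__20240101T0000Z" -> partition id 3
    System.out.println(getPartitionId("myTable__3__0__20240101T0000Z"));
  }
}
```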
**Uploaded segment assignment**: Uploaded segments are not generated
with the LLCSegmentName convention. The only way to specify partitioning info is
via `segmentPartitionConfig` in the table config, which is not possible if the
stream uses custom partitioning.
If one wants to backfill/upload segments to such a custom-partitioned
stream, the uploaded segment must provide the partitionId so that segment
assignment can place the segments on the same instances as the consuming
segments of the same partition in the upsert table.
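For illustration, with this change the uploaded segment's partition metadata could carry that id explicitly, along the lines of (values are hypothetical; `uploadedSegmentPartitionId` is the new field introduced in this PR):

```json
{
  "columnPartitionMap": {},
  "uploadedSegmentPartitionId": 3
}
```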
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]