Thanks Xiaotian. My comments are inline. Some of the questions below seem to relate to the overall Deep Store bypass design rather than to this table config change. I will provide brief answers below while directing most of them to the cwiki design discussion.
On Mon, May 18, 2020 at 5:39 PM Xiaotian Jiang <jackie....@gmail.com> wrote:

> Seems we always use HTTP for inner cluster communication (e.g. TableSizeReader).
> We should make peer download the default behavior. If the segment download from deep storage failed, we should always try with the peer.
> For the race condition issue, we can add CRC into the peer download request (CRC can be obtained from the ZK metadata).

The race condition pointed out by @mcvsubbu <mcvsu...@apache.org> is mainly for the offline segment case -- we will not address it in this design but leave it as a TODO.

When you say peer download should be "the default behavior", I assume you mean after (1) we test this feature on realtime tables and (2) we come up with a design for offline tables and test it on offline tables too. If that is the case, this table config change is the right step in that direction.

The following questions are more related to the overall design:

> There are some parts not clear to me:
> - Can we always obtain the deep storage address before successfully uploading it?
> - Should we try to at least upload once before committing the segment?

With the new time-bounded, best-effort segment uploader added recently (PR 5314 <https://github.com/apache/incubator-pinot/pull/5314>), the server waits for a configurable amount of time for the segment upload to succeed before committing the segment. If the upload succeeds within the timeout period, a deep storage address is returned.

> - If we put downloadUrl before successfully uploading the segment, how to detect whether the upload is successful or not?

Similar to my answer to the previous question: with PR 5314 <https://github.com/apache/incubator-pinot/pull/5314> we will upload the segment before the commit. I will update the cwiki to reflect this change. The main factor driving this change is to utilize the SplitCommitter framework.
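To make the time-bounded, best-effort upload behavior concrete, here is a minimal sketch (not the actual PR 5314 code -- the class and method names here are hypothetical illustrations): the server hands the upload to an executor, waits up to a configurable timeout for a deep-store URI, and commits either way, falling back to a peer-download marker when the upload has not finished in time.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of a time-bounded, best-effort segment uploader.
public class BestEffortSegmentUploader {
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  // Returns the deep-store URI if the upload finishes within timeoutMs,
  // otherwise null (the segment commit proceeds either way; a null result
  // means the committer reports the "available on peers" marker instead).
  public String uploadWithTimeout(Callable<String> uploadTask, long timeoutMs) {
    Future<String> future = executor.submit(uploadTask);
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Stop waiting; the upload may still complete in the background and
      // be reconciled later (e.g. by the RealtimeValidationManager).
      return null;
    } catch (InterruptedException | ExecutionException e) {
      return null;
    }
  }

  public void shutdown() {
    executor.shutdownNow();
  }

  public static void main(String[] args) {
    BestEffortSegmentUploader uploader = new BestEffortSegmentUploader();
    // A fast upload returns its deep-store URI within the timeout.
    String uri = uploader.uploadWithTimeout(() -> "fake://deep-store/seg_0", 2000);
    // A slow upload exceeds the timeout; the commit proceeds without a URI.
    String late = uploader.uploadWithTimeout(() -> {
      Thread.sleep(5000);
      return "fake://deep-store/seg_1";
    }, 100);
    System.out.println(uri + " / " + late);
    uploader.shutdown();
  }
}
```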
> - When we find segment not uploaded to the deep storage, who is responsible of uploading it, who is responsible of updating the downloadUrl (controller or server)?

There will be a follow-up PR <https://cwiki.apache.org/confluence/display/PINOT/By-passing+deep-store+requirement+for+Realtime+segment+completion?focusedCommentId=152115613#By-passingdeep-storerequirementforRealtimesegmentcompletion-RealtimeValidationManagerchanges> on the RealtimeValidationManager to check for such segments and have the servers upload them to the deep store. The controller will update the ZK data.

> Jackie
>
> On Mon, May 18, 2020 at 4:13 PM TING CHEN <tingc...@uber.com> wrote:
> >
> > ---------- Forwarded message ---------
> > From: TING CHEN <tingc...@uber.com>
> > Date: Tue, May 5, 2020 at 5:16 PM
> > Subject: Adding a new field in SegmentsValidationAndRetentionConfig for peer segment download
> > To: <dev@pinot.apache.org>
> >
> > As part of the proposal to bypass the deep store for segment completion, I plan to add a new optional string field peerSegmentDownloadScheme to the SegmentsValidationAndRetentionConfig in the TableConfig. Its value can be http or https.
> >
> > SplitSegmentCommitter will check this value. If it is set, the segment committer can finish the segment commit successfully even if the upload to the segment store fails. The committer will report a special marker to the controller indicating that the segment is available on peer servers.
> > When Pinot servers fail to download segments from the segment store, they can also check this field's value. If it is set, they can download segments from peer servers using either the HTTP or HTTPS segment fetcher, as configured. (A related PR, currently in review, covers how to discover such servers.)
> >
> > Note this is a table-level config. We will test the new download behavior on realtime tables in an incremental fashion. Once fully proven, this config can be promoted to a server-level config.
> >
> > Please let me know if you have any questions on this. Thanks @mcvsubbu for coming up with the idea and for the offline discussions.
> >
> > Ting Chen
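For illustration, the proposed table-level setting would look roughly like the fragment below. The peerSegmentDownloadScheme field name comes from the proposal above; the surrounding fields and values are only example placeholders, and the exact JSON key for SegmentsValidationAndRetentionConfig should be confirmed against the table config spec.

```json
{
  "tableName": "myTable_REALTIME",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "5",
    "peerSegmentDownloadScheme": "http"
  }
}
```

With this set, SplitSegmentCommitter can commit even when the deep-store upload fails, and servers fall back to fetching the segment from peers over the configured scheme.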