jaketf commented on a change in pull request #11596: URL: https://github.com/apache/beam/pull/11596#discussion_r421827964
##########
File path: sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/healthcare/HL7v2IO.java
##########
@@ -472,24 +548,120 @@ public void initClient() throws IOException {
       this.client = new HttpHealthcareApiClient();
     }
 
+    @GetInitialRestriction
+    public OrderedTimeRange getEarliestToLatestRestriction(@Element String hl7v2Store)
+        throws IOException {
+      from = this.client.getEarliestHL7v2SendTime(hl7v2Store, this.filter);
+      // filters are [from, to) to match logic of OffsetRangeTracker but need latest element to be
+      // included in results set to add an extra ms to the upper bound.
+      to = this.client.getLatestHL7v2SendTime(hl7v2Store, this.filter).plus(1);
+      return new OrderedTimeRange(from, to);
+    }
+
+    @NewTracker
+    public OrderedTimeRangeTracker newTracker(@Restriction OrderedTimeRange timeRange) {
+      return timeRange.newTracker();
+    }
+
+    @SplitRestriction
+    public void split(
+        @Restriction OrderedTimeRange timeRange, OutputReceiver<OrderedTimeRange> out) {
+      // TODO(jaketf) How to pick optimal values for desiredNumOffsetsPerSplit ?

Review comment:
Yeah, I think the "spiky backfill" (many cases within a small sendTime window) is a corner case of a hot split that would just be slow; users would have to accept that or take it up with their upstream system. messageType / sendFacility are probably the more popular logical filters, and splitting on them feels like a hack for a corner case that might mess with performance under the "typical" distribution of data in sendTime.
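To make the "hot split" concern concrete, here is a minimal, self-contained sketch. It does not use the Beam API or the PR's actual `OrderedTimeRange`; `TimeRange`, `split`, `countPerSplit`, and `maxSpan` are hypothetical names chosen for illustration. It assumes the restriction is a half-open `[from, to)` range over sendTime that gets cut into fixed-width sub-ranges, and shows how a spiky backfill lands almost entirely in one of them.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TimeRangeSplitSketch {

  /** Half-open range [from, to) over sendTime. Hypothetical stand-in, not the PR's class. */
  static final class TimeRange {
    final Instant from;
    final Instant to;

    TimeRange(Instant from, Instant to) {
      this.from = from;
      this.to = to;
    }
  }

  /** Cuts [from, to) into consecutive half-open sub-ranges no wider than maxSpan. */
  static List<TimeRange> split(TimeRange range, Duration maxSpan) {
    if (maxSpan.isZero() || maxSpan.isNegative()) {
      throw new IllegalArgumentException("maxSpan must be positive");
    }
    List<TimeRange> out = new ArrayList<>();
    Instant start = range.from;
    while (start.isBefore(range.to)) {
      Instant end = start.plus(maxSpan);
      if (end.isAfter(range.to)) {
        end = range.to; // clamp the last sub-range to the upper bound
      }
      out.add(new TimeRange(start, end));
      start = end;
    }
    return out;
  }

  /** Counts how many sendTimes fall inside each sub-range. */
  static int[] countPerSplit(List<TimeRange> splits, List<Instant> sendTimes) {
    int[] counts = new int[splits.size()];
    for (Instant t : sendTimes) {
      for (int i = 0; i < splits.size(); i++) {
        TimeRange r = splits.get(i);
        if (!t.isBefore(r.from) && t.isBefore(r.to)) {
          counts[i]++;
          break;
        }
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    Instant earliest = Instant.parse("2020-05-01T00:00:00Z");
    Instant latest = Instant.parse("2020-05-01T03:59:59.999Z");
    // Add 1 ms so the half-open upper bound still includes the latest message.
    TimeRange full = new TimeRange(earliest, latest.plusMillis(1));

    List<Instant> sendTimes = new ArrayList<>();
    sendTimes.add(earliest);
    sendTimes.add(earliest.plus(Duration.ofHours(2)));
    sendTimes.add(latest);
    // Spiky backfill: 1000 messages within one minute of the second hour.
    Instant spike = earliest.plus(Duration.ofHours(1));
    for (int i = 0; i < 1000; i++) {
      sendTimes.add(spike.plusMillis(i * 60L));
    }

    int[] counts = countPerSplit(split(full, Duration.ofHours(1)), sendTimes);
    System.out.println(Arrays.toString(counts)); // prints [1, 1000, 1, 1]
  }
}
```

With evenly sized time splits, one sub-range carries nearly all the work. Under the assumptions above that split is simply slow rather than incorrect, which matches the position that the spiky case can be left to the user or the upstream system.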