[
https://issues.apache.org/jira/browse/BEAM-2660?focusedWorklogId=132645&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-132645
]
ASF GitHub Bot logged work on BEAM-2660:
----------------------------------------
Author: ASF GitHub Bot
Created on: 08/Aug/18 21:23
Start Date: 08/Aug/18 21:23
Worklog Time Spent: 10m
Work Description: cjmcgraw commented on issue #3619: [BEAM-2660] Set
PubsubIO batch size using builder
URL: https://github.com/apache/beam/pull/3619#issuecomment-411557232
Currently my company is using this for loading prediction tuples in fast batch jobs. We have been running it on Dataflow since this fork was created, and our use case most likely won't need to be streaming, so the change is effective for my problem.
That being said, I am not fully grokking the issue here. I'd like to get clarity for when/if someone stumbles across this in the future.
@dadrian
> What? For one, this PR doesn't touch the source, just the sink. Second, if
that's the case, how do we get this fixed in the Dataflow runner? I currently
have code running in prod that rolls its own Pubsub client to compensate for
this size limitation, and I'd really like to get rid of it.
@reuvenlax
> @dadrian true of both the source and the sink, at least for Dataflow
streaming. Dataflow's batch runner does use this code.
@aromanenko-dev
> Yes, that is why I was wondering how it's related to any specific runner,
and @reuvenlax explained that the Dataflow runner happens to have its own
implementation of Pubsub support.
If I recall correctly, the limitation with the sink was that it used the
gcloud SDK to submit a gRPC request, and there was a hard-coded default for the
maximum number of bytes that one bulk request could contain. I simply made that
hard-coded value configurable.
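As a rough illustration of what that means (this is a simplified sketch, not the actual PubsubIO writer code, and the class and field names are made up), the writer essentially buffers messages and flushes a bulk publish request before either configured limit would be exceeded:
```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of byte-aware batching; not the real PubsubIO internals.
class BatchingWriterSketch {
  private final int maxBatchSize;       // max messages per bulk request
  private final int maxBatchBytesSize;  // max bytes per bulk request
  private final List<byte[]> buffer = new ArrayList<>();
  private int bufferedBytes = 0;

  BatchingWriterSketch(int maxBatchSize, int maxBatchBytesSize) {
    this.maxBatchSize = maxBatchSize;
    this.maxBatchBytesSize = maxBatchBytesSize;
  }

  void write(byte[] message) {
    // Flush first if adding this message would push the request past either limit.
    if (!buffer.isEmpty()
        && (buffer.size() >= maxBatchSize
            || bufferedBytes + message.length > maxBatchBytesSize)) {
      flush();
    }
    buffer.add(message);
    bufferedBytes += message.length;
  }

  void flush() {
    // In the real sink this is where the bulk publish request would be sent.
    buffer.clear();
    bufferedBytes = 0;
  }
}
```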
Since the implementation was in the builder for the sink, I applied the
values to both the bounded and unbounded sinks.
The source request didn't have a maximum message size API parameter, so that
limit will be enforced by Pubsub instead of Beam.
If I am understanding this all correctly, this means it can be used in both
the bounded and unbounded cases.
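For future readers, here is roughly how setting the batch size from the builder would look in a pipeline. Treat this as a sketch: the setter names (`withMaxBatchSize`, `withMaxBatchBytesSize`) and the values follow this PR's approach and are illustrative, not a guarantee of the final API.
```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class PubsubBatchSizeExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Keep each bulk publish request well under Pub/Sub's 10 MB request limit
    // by capping both the message count and the total bytes per request.
    p.apply(Create.of("record-1", "record-2", "record-3"))
        .apply(
            PubsubIO.writeStrings()
                .to("projects/my-project/topics/my-topic")
                .withMaxBatchSize(100)                    // max messages per request
                .withMaxBatchBytesSize(5 * 1000 * 1000)); // max bytes per request

    p.run().waitUntilFinish();
  }
}
```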
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 132645)
Time Spent: 3h 20m (was: 3h 10m)
> Set PubsubIO batch size using builder
> -------------------------------------
>
> Key: BEAM-2660
> URL: https://issues.apache.org/jira/browse/BEAM-2660
> Project: Beam
> Issue Type: Improvement
> Components: io-java-gcp
> Reporter: Carl McGraw
> Assignee: Chamikara Jayalath
> Priority: Major
> Labels: gcp, java, pubsub, sdk
> Time Spent: 3h 20m
> Remaining Estimate: 0h
>
> PubsubIO doesn't allow users to set the publish batch size. Instead, the value
> is hard-coded in both the BoundedPubsubWriter and the UnboundedPubsubSink.
> Google's Pub/Sub limits each request to a maximum size of 10 MB. My company has
> run into problems with events that are individually smaller than 1 MB but, when
> batched at the default batch sizes of 100 or 2000, cause Pubsub to fail to send
> the events (for example, 100 events of roughly 200 KB each already total about
> 20 MB, double the limit).
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)