Chloe Thonin created BEAM-6970:
----------------------------------
Summary: Java SDK: How to batch messages from PubSub?
Key: BEAM-6970
URL: https://issues.apache.org/jira/browse/BEAM-6970
Project: Beam
Issue Type: Bug
Components: runner-dataflow, sdk-java-core
Affects Versions: 2.11.0
Environment: Beam worker : Dataflow (GCP)
Reporter: Chloe Thonin
I want to set up a Dataflow pipeline in streaming mode.
My pipeline does these tasks:
* Read messages from Pub/Sub
* Build my "Document" object
* Insert this Document into BigQuery
* Store the initial message in Google Cloud Storage
The code builds and runs successfully, but some messages take a long time to
process.
I think it is slow because it processes Pub/Sub messages one by one:
-> I build a PCollection<Documents> from the Pub/Sub messages.
{code:java}
PCollection<Documents> rawtickets = pipeline
    .apply("Read PubSub Events", PubsubIO.readStrings().fromTopic(TOPIC))
    .apply("Make Document", ParDo.of(new DoFn<String, Documents>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            String rawXml = c.element();
            /** My code for building my "Documents" object **/
            Documents doc = buildDocument(rawXml); // placeholder for my build logic
            c.output(doc); // must output a Documents, not the raw String
        }
    }))
    .setCoder(AvroCoder.of(Documents.class));
{code}
Here is a picture of my complete process:
[Complete Process|https://zupimages.net/up/19/14/w4df.png]
We can see the latency of the process after reading the messages.
Concretely, I am trying to create a PCollection of N elements. I have tried the
methods found here:
[https://beam.apache.org/documentation/programming-guide/#triggers] but they
don't group my Pub/Sub messages.
How can I batch this process?
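For reference, here is a minimal sketch of one possible approach using Beam's GroupIntoBatches transform. The window duration, batch size, and dummy key ("batch") are illustrative assumptions, not values from my pipeline; GroupIntoBatches requires keyed input, hence the WithKeys/Values steps:
{code:java}
PCollection<Iterable<Documents>> batches = rawtickets
    // Window the unbounded stream so grouping can fire (assumed 10s windows)
    .apply("Window", Window.into(FixedWindows.of(Duration.standardSeconds(10))))
    // GroupIntoBatches needs KV input; attach a constant dummy key
    .apply("AddKey", WithKeys.of("batch"))
    // Emit batches of up to 100 elements per key and window (assumed size)
    .apply("Batch", GroupIntoBatches.<String, Documents>ofSize(100))
    // Drop the dummy key, keeping only the batched elements
    .apply("DropKey", Values.create());
{code}
Each element of the resulting PCollection is then an Iterable<Documents> that a downstream DoFn could write to BigQuery or GCS as one batch, if this approach is appropriate.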
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)