Your understanding is correct: both GroupIntoBatches and stateful
processing in ParDo work on a per-key basis. The DoFn in this example
uses bufferState to accumulate the batch and countState to keep track of
the batch size. I believe GroupIntoBatches implements the same logic: it
keeps the batch elements in state and emits the whole batch once the
element count reaches the batch size. So if you are trying this example,
you don't need to use GroupIntoBatches.
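
To make the per-key logic concrete, here is a minimal plain-Python sketch
of the buffer/count pattern described above. It is not Beam code:
PerKeyBatcher, process and flush are hypothetical names, and ordinary
dicts stand in for the per-key state cells, so treat it as an
illustration of the idea rather than how Beam or the Samza runner
actually implements it.

```python
from collections import defaultdict


class PerKeyBatcher:
    """Plain-Python imitation of the bufferState/countState pattern (not Beam API)."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer_state = defaultdict(list)  # key -> elements buffered so far
        self.count_state = defaultdict(int)    # key -> number of buffered elements

    def process(self, key, value):
        """Buffer one element; return (key, batch) once that key's batch is full."""
        self.buffer_state[key].append(value)
        self.count_state[key] += 1
        if self.count_state[key] >= self.batch_size:
            batch = self.buffer_state.pop(key)
            del self.count_state[key]
            return key, batch
        return None

    def flush(self):
        """Return leftover partial batches; a real pipeline needs some trigger for this."""
        leftovers = [(k, v) for k, v in self.buffer_state.items() if v]
        self.buffer_state.clear()
        self.count_state.clear()
        return leftovers


batcher = PerKeyBatcher(batch_size=2)
emitted = []
for key, value in [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]:
    out = batcher.process(key, value)
    if out is not None:
        emitted.append(out)
leftovers = batcher.flush()
```

Each emitted batch only ever contains elements of one key, which is the
per-key behavior discussed above.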

Note that this is unrelated to bundle support, which is an internal
implementation detail of the runner that helps performance. Internal
bundling will let the runner process elements in a micro-batch mode
(similar to Spark), so that particular computations can be optimized,
e.g. Python serde of protobuf messages. Overall, it is not visible
through the user APIs. Batching on the user side has to be done
explicitly through PTransforms.

Btw, it seems my email still doesn't work well with the Intuit domain, so
please include my Gmail account, which I have added here, directly in
this email chain.

Thanks,
Xinyu

On Fri, Jan 18, 2019 at 2:35 PM Daniel Chen <dch...@linkedin.com> wrote:

>
>
> On 1/18/19, 1:33 PM, "Deshpande, Omkar" <omkar_deshpa...@intuit.com>
> wrote:
>
>     Hey Xinyu,
>
>     I am trying to implement "Batched RPC" as described in -
> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbeam.apache.org%2Fblog%2F2017%2F08%2F28%2Ftimely-processing.html&amp;data=02%7C01%7Cdchen1%40linkedin.com%7C77e8e21b921042e5a8f908d67d8c8da8%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636834439965710970&amp;sdata=wWeVLlQVp3udG2Xg70SovqB66feNxMxYLxxCZRsJIc0%3D&amp;reserved=0
>     The documentation for GroupIntoBatches says "Batches will contain only
> elements of a single key".
>     And my understanding is for "Batched RPC", I need a batch of keys. So,
> I am not sure if I can use GroupIntoBatches.
>
>     On 1/18/19, 10:40 AM, "Xinyu Liu" <xinyuliu...@gmail.com> wrote:
>
>
>         sorry, the correct link to the first reference:
>
> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fbeam%2Fblob%2Fmaster%2Frunners%2Fsamza%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fbeam%2Frunners%2Fsamza%2Fruntime%2FDoFnOp.java&amp;data=02%7C01%7Cdchen1%40linkedin.com%7C77e8e21b921042e5a8f908d67d8c8da8%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636834439965710970&amp;sdata=FQ9nJn1S7%2BRTiRwvwKrxAvFhTF076SNuY6RfUnsdkRI%3D&amp;reserved=0
>         .
>
>         Thanks,
>         Xinyu
>
>         On Fri, Jan 18, 2019 at 10:35 AM Xinyu Liu <xinyuliu...@gmail.com>
> wrote:
>
>         > Hi, Omkar,
>         >
>         > Your observation is correct. Currently bundle is implemented in a
>         > per-event basis (Code is DoFnOp.processElement,
>         >
> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1pIUQ8J658B7GPNDt5dwiJBRQyWep1n2rXlkONyA6ZMM%2Fedit%23gid%3D1709587251&amp;data=02%7C01%7Cdchen1%40linkedin.com%7C77e8e21b921042e5a8f908d67d8c8da8%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636834439965710970&amp;sdata=Y48p6YZWoB5AaYgUeVHYrWD3JITAcjvVH5%2FgyZbIN44%3D&amp;reserved=0
> ).
>         > We are working on supporting bundles in Samza right now so in
> Beam we can
>         > take advantage of it. Bundling is also critical to have better
> python
>         > performance so we are trying to get it out very soon (Feb-March).
>         >
>         > On the other hand, in java if you want to process in a batch
> fashion, you
>         > can use the Beam GroupIntoBatches api (
>         >
> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbeam.apache.org%2Freleases%2Fjavadoc%2F2.0.0%2Forg%2Fapache%2Fbeam%2Fsdk%2Ftransforms%2FGroupIntoBatches.html&amp;data=02%7C01%7Cdchen1%40linkedin.com%7C77e8e21b921042e5a8f908d67d8c8da8%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636834439965710970&amp;sdata=ZgKm2XXoromrgxW%2BeEyALoZT%2BLR3fwh3eQkxM17rW5g%3D&amp;reserved=0
> ).
>         > This will group the elements into batches and then deliver to
> your ParDo
>         > afterwards. Please let us know whether this works for you.
>         >
>         > Thanks,
>         > Xinyu
>         >
>         >
>         >
>         > On Fri, Jan 18, 2019 at 10:27 AM Daniel Chen <
> dch...@linkedin.com> wrote:
>         >
>         >> + Xinyu
>         >>
>         >> On 1/18/19, 10:05 AM, "Deshpande, Omkar" <
> omkar_deshpa...@intuit.com>
>         >> wrote:
>         >>
>         >>     Hello,
>         >>
>         >>     I am using Samza runner with Apache Beam. Is there any
> documentation
>         >> available on how bundles are implemented in the Samza runner?
>         >>     I have observed every Kafka record getting processed in its
> own
>         >> bundle. How can I get larger bundles?
>         >>
>         >>      <beam.version>2.9.0 <samza.version>0.14.1
>         >>
>         >>     Thanks,
>         >>     Omkar
>         >>
>         >>
>         >>
