Re: FlumeJava - what happened to PObjects

Kenneth Knowles Fri, 31 Jan 2025 09:00:18 -0800

On Fri, Jan 31, 2025 at 11:20 AM Joey Tran <[email protected]>
wrote:


> Is there an equivalent to `PCollectionView` in python?
>
> > ... as there's not much novel one can do with a PObject vs. a singleton
> PCollection.
>
> Ah maybe I misunderstood how PObjects work. From the FlumeJava paper:
>
>
>> These features can be used to express a computation that needs
>> to iterate until the computed data converges:
>>
>> PCollection<Data> results = computeInitialApproximation();
>> for (;;) {
>>     results = computeNextApproximation(results);
>>     PCollection<Boolean> haveConverged =
>>         results.parallelDo(checkIfConvergedFn(),
>>                            collectionOf(booleans()));
>>     PObject<Boolean> allHaveConverged =
>>         haveConverged.combine(AND_BOOLS);
>>     FlumeJava.run();
>>     if (allHaveConverged.getValue()) break;
>> }
>> ... continue working with converged results ...
>
>
> I had understood this to mean that the `PObject`   will materialize the
> pcollection into in-memory values, which is maybe a little novel?
>

>
At least, in the python SDK, I've always been writing elements to disk by
> transform and then reading it manually outside of the pipeline back into
> the original python objects.
>

Ah, yes, the magic is more in the `FlumeJava.run()` which is equivalent to
doing a whole pipeline.run() and the PObject written to a place where it
can be then read by the main driver program. Our interactive runner and SQL
shell do this for in-memory pipelines, but we don't have anything like that
for our distributed runners. If I understand correctly, scio does something
along these lines. I think having some sort of "standard for storage
between driver main program and distributed pipeline execution" is
compatible with Beam, in theory. I also suspect it is very easy to
accidentally write a program that is weirdly slow in confusing ways,
because of the blurring of the different execution environments/phases.

Kenn


> On Fri, Jan 31, 2025 at 9:50 AM Kenneth Knowles <[email protected]> wrote:
>
>> Fun fact, this was one of my onboarding projects :-)
>>
>> It is the way it is because the invariant we need in order to apply
>> windowing independent of core business logic is: if you apply a transform
>> to windowed input, each window should contain the same output it would if
>> it were the entirety of the data. (composites are permitted to break
>> this rule, but the core compute primitives never do)
>>
>> And I guess it seemed wrong to call something an "Object" when it was
>> really an object per window. TBH the "P" was probably always a misnomer...
>>
>> Kenn
>>
>> On Thu, Jan 30, 2025 at 7:26 PM Robert Bradshaw via dev <
>> [email protected]> wrote:
>>
>>> On Thu, Jan 30, 2025 at 4:25 PM Reuven Lax via dev <[email protected]>
>>> wrote:
>>>
>>>> PCollectionView is the equivalent of PObject. Given that the Beam API
>>>> needed to work with the windowing model, we needed something like a PObject
>>>> that could be windowed. This is what PCollectionView provides.
>>>>
>>>
>>> +1, I'd forgotten about the windowing complexities as well.
>>>
>>>
>>>> On Thu, Jan 30, 2025 at 4:20 PM Joey Tran <[email protected]>
>>>> wrote:
>>>>
>>>>> I read the FlumeJava paper and I was just curious what happened to
>>>>> PObjects. They seem like a useful construct. Do they exist in the java SDK
>>>>> in some version still? Or were they done away with because they made
>>>>> pipeline optimization more difficult?
>>>>>
>>>>

Re: FlumeJava - what happened to PObjects

Reply via email to