Re: FlumeJava - what happened to PObjects

Kenneth Knowles Fri, 31 Jan 2025 09:26:29 -0800

On Fri, Jan 31, 2025 at 12:06 PM Joey Tran <[email protected]>
wrote:


> Yeah I briefly considered reimplementing PObject in python to make unit
> testing asserts a little more ergonomic (`results_pobject.getvalue() ==
> set([1, 2, 3])` feels more natural than `assert_that(results, equal_to([1,
> 2, 3]))`) but when I proposed it to some beam users at my company someone
> excitedly asked if they could start adding conditional forks in their
> pipeline and that made me a lot more hesitant.
>
> I suppose it wouldn't be that hard to have our internal distributed runner
> just check to see if there are any PObject transforms and reject a pipeline
> if it does.
>
> Anyways, I asked here because I was wondering if the same line of thought
> might've lead to the removal of PObject (just out of curiosity)
>

Well, I'd recommend this takeaway from the answers here: having an API so
that the main() driver program can read in the contents of a PCollection is
very doable and largely orthogonal to the rename/revamp of PObject to
PCollectionView.

One big win with newer ideas about portable side inputs is that they can
have more than one API. The API for ye olde PObject, or equivalently a
PCollectionView in the global window, is to just read the whole contents of
the PCollection. So this only really works for very small data, while we've
now got implementations for Map and Set views with more limited APIs that
work for big data. If you built an API on top of Beam that really focused
on having persistent queryable intermediate materializations, you might
want to use those. But more likely you would just limit it to small data to
control a loop, say. Or just materialize it to an appropriate storage
system that query natively like you already do.

Kenn


>
> On Fri, Jan 31, 2025 at 11:57 AM Kenneth Knowles <[email protected]> wrote:
>
>>
>>
>> On Fri, Jan 31, 2025 at 11:20 AM Joey Tran <[email protected]>
>> wrote:
>>
>>> Is there an equivalent to `PCollectionView` in python?
>>>
>>> > ... as there's not much novel one can do with a PObject vs. a
>>> singleton PCollection.
>>>
>>> Ah maybe I misunderstood how PObjects work. From the FlumeJava paper:
>>>
>>>
>>>> These features can be used to express a computation that needs
>>>> to iterate until the computed data converges:
>>>>
>>>> PCollection<Data> results = computeInitialApproximation();
>>>> for (;;) {
>>>>     results = computeNextApproximation(results);
>>>>     PCollection<Boolean> haveConverged =
>>>>         results.parallelDo(checkIfConvergedFn(),
>>>>                            collectionOf(booleans()));
>>>>     PObject<Boolean> allHaveConverged =
>>>>         haveConverged.combine(AND_BOOLS);
>>>>     FlumeJava.run();
>>>>     if (allHaveConverged.getValue()) break;
>>>> }
>>>> ... continue working with converged results ...
>>>
>>>
>>> I had understood this to mean that the `PObject`   will materialize the
>>> pcollection into in-memory values, which is maybe a little novel?
>>>
>>
>>>
>> At least, in the python SDK, I've always been writing elements to disk by
>>> transform and then reading it manually outside of the pipeline back into
>>> the original python objects.
>>>
>>
>> Ah, yes, the magic is more in the `FlumeJava.run()` which is equivalent
>> to doing a whole pipeline.run() and the PObject written to a place where it
>> can be then read by the main driver program. Our interactive runner and SQL
>> shell do this for in-memory pipelines, but we don't have anything like that
>> for our distributed runners. If I understand correctly, scio does something
>> along these lines. I think having some sort of "standard for storage
>> between driver main program and distributed pipeline execution" is
>> compatible with Beam, in theory. I also suspect it is very easy to
>> accidentally write a program that is weirdly slow in confusing ways,
>> because of the blurring of the different execution environments/phases.
>>
>> Kenn
>>
>>
>>> On Fri, Jan 31, 2025 at 9:50 AM Kenneth Knowles <[email protected]> wrote:
>>>
>>>> Fun fact, this was one of my onboarding projects :-)
>>>>
>>>> It is the way it is because the invariant we need in order to apply
>>>> windowing independent of core business logic is: if you apply a transform
>>>> to windowed input, each window should contain the same output it would if
>>>> it were the entirety of the data. (composites are permitted to break
>>>> this rule, but the core compute primitives never do)
>>>>
>>>> And I guess it seemed wrong to call something an "Object" when it was
>>>> really an object per window. TBH the "P" was probably always a misnomer...
>>>>
>>>> Kenn
>>>>
>>>> On Thu, Jan 30, 2025 at 7:26 PM Robert Bradshaw via dev <
>>>> [email protected]> wrote:
>>>>
>>>>> On Thu, Jan 30, 2025 at 4:25 PM Reuven Lax via dev <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> PCollectionView is the equivalent of PObject. Given that the Beam API
>>>>>> needed to work with the windowing model, we needed something like a 
>>>>>> PObject
>>>>>> that could be windowed. This is what PCollectionView provides.
>>>>>>
>>>>>
>>>>> +1, I'd forgotten about the windowing complexities as well.
>>>>>
>>>>>
>>>>>> On Thu, Jan 30, 2025 at 4:20 PM Joey Tran <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I read the FlumeJava paper and I was just curious what happened to
>>>>>>> PObjects. They seem like a useful construct. Do they exist in the java 
>>>>>>> SDK
>>>>>>> in some version still? Or were they done away with because they made
>>>>>>> pipeline optimization more difficult?
>>>>>>>
>>>>>>

Re: FlumeJava - what happened to PObjects

Reply via email to