Dear all,

My name is Aurélien, and I have been helping with the latest demo paper by devising ParquetSource operators, which I plan to commit soon.
While implementing the Spark forecast pipeline in Wayang, I noticed that the JoinOperator implemented in Python was not working for me. It behaves like a cartesian product because it fails to get the keys (and thus the probeTable in the JavaJoinOperator looks like {null: [all data quanta]} for both inputs).

I put together a simple test case (the same data as TestJavaJoinOperator):

Left:
  1,"b"
  1,"c"
  2,"d"
  3,"e"

Right:
  1,"x"
  1,"y"
  2,"z"
  4,"w"

I was wondering whether someone has already tested this operator, or whether anyone can get the expected results from this sample data.

I checked the implementation: the keys are indeed extracted in the JoinOperator and sent to worker.py, and then something seems to go wrong. The PythonWorkerManager does read the data quanta, but not the keys. Are the keys supposed to be read there, and does anyone know what could go wrong (e.g., when reading simple data like [1])?

Thank you in advance for your help.

Best regards,
Aurélien
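
P.S. In case it helps, here is a tiny plain-Python sketch (not Wayang code; the key extractors are made up just for illustration) of the result I expect from a hash join on this data versus the cartesian product I actually observe:

left  = [(1, "b"), (1, "c"), (2, "d"), (3, "e")]
right = [(1, "x"), (1, "y"), (2, "z"), (4, "w")]

def hash_join(build, probe, build_key, probe_key):
    # Build side: group records by their join key.
    probe_table = {}
    for rec in build:
        probe_table.setdefault(build_key(rec), []).append(rec)
    # Probe side: emit one pair per matching key.
    return [(b, p) for p in probe for b in probe_table.get(probe_key(p), [])]

# Expected: keys extracted correctly -> 5 matched pairs (four for key 1, one for key 2).
print(hash_join(left, right, lambda r: r[0], lambda r: r[0]))

# Observed: key extraction yields None on both sides, so the probe table degenerates
# to {None: [all records]} and the join becomes a cartesian product (16 pairs here).
print(hash_join(left, right, lambda r: None, lambda r: None))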