[ https://issues.apache.org/jira/browse/BEAM-4461?focusedWorklogId=178774&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-178774 ]
ASF GitHub Bot logged work on BEAM-4461:
----------------------------------------
Author: ASF GitHub Bot
Created on: 26/Dec/18 20:53
Start Date: 26/Dec/18 20:53
Worklog Time Spent: 10m
Work Description: reuvenlax commented on pull request #7353: [BEAM-4461]
Support inner and outer style joins in CoGroup.
URL: https://github.com/apache/beam/pull/7353
Multiple improvements to the schema CoGroup transform:
* Allow the user to use strings instead of TupleTags. TupleTags existed to
make Java type inference work, and this is not needed with the schema-based
join as the types are in the schema. This also allows a simpler builder for
PCollectionTuple.
* Instead of multiple CoGroup.byFieldNames, byFieldIds, etc. the new
syntax is CoGroup.join(By.fieldNames), CoGroup.join(By.fieldIds), etc. This
shrinks the API surface area, and also gives a place to set per-input
options (used for outer joins).
* Add a .crossProductJoin. This expands the per-key iterables into their
cross product, which gives inner-join semantics. For example:
    PCollection<Row> innerJoined = inputs.apply(
        CoGroup.join("input1", By.fieldNames("user"))
            .join("input2", By.fieldNames("user"))
            .crossProductJoin());
* Each input can be marked for "outer-join" participation semantics. This
means that if no records for that input are present for a join key, an output
is still generated from the cross product with the value for that input
replaced by a null. This generalizes normal left/right/full outer joins to N
inputs. For example with two inputs:
    PCollection<Row> leftOuterJoined = inputs.apply(
        CoGroup.join("input1",
                By.fieldNames("user").withOuterJoinParticipation())
            .join("input2", By.fieldNames("user"))
            .crossProductJoin());
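
To make the semantics described above concrete, here is a minimal plain-Java
sketch of how the grouped values for a single join key could be expanded. It
is not Beam's implementation and does not use the Beam API: the class name
CrossProductJoinSketch and the method expandKey are hypothetical, records are
plain strings rather than Rows, and the boolean flags stand in for the
per-input outer-join participation option.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: hypothetical names, not part of Apache Beam.
class CrossProductJoinSketch {

    /**
     * Expands the grouped values of ONE join key into output tuples.
     * groups.get(i)   : records of input i that carry this key
     * optional.get(i) : true if input i is marked for outer-join participation
     */
    static List<List<String>> expandKey(List<List<String>> groups, List<Boolean> optional) {
        List<List<String>> out = new ArrayList<>();
        out.add(new ArrayList<>()); // start from a single empty tuple
        for (int i = 0; i < groups.size(); i++) {
            List<String> values = groups.get(i);
            if (values.isEmpty()) {
                if (!optional.get(i)) {
                    // Inner semantics: a missing input drops the key entirely.
                    return new ArrayList<>();
                }
                // Outer semantics: substitute a single null for the missing input.
                values = Collections.<String>singletonList(null);
            }
            // Extend every partial tuple with every value of input i.
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : out) {
                for (String v : values) {
                    List<String> tuple = new ArrayList<>(prefix);
                    tuple.add(v);
                    next.add(tuple);
                }
            }
            out = next;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();

        // Both inputs present: plain cross product.
        System.out.println(expandKey(
            Arrays.asList(Arrays.asList("a1", "a2"), Arrays.asList("b1")),
            Arrays.asList(false, false))); // prints [[a1, b1], [a2, b1]]

        // First input missing but marked optional: null placeholder.
        System.out.println(expandKey(
            Arrays.asList(none, Arrays.asList("b1")),
            Arrays.asList(true, false))); // prints [[null, b1]]

        // First input missing and not optional: no output for this key.
        System.out.println(expandKey(
            Arrays.asList(none, Arrays.asList("b1")),
            Arrays.asList(false, false))); // prints []
    }
}
```

Because the loop treats every input the same way, the same sketch covers N
inputs: any subset can be flagged optional, which is the generalization of
left/right/full outer joins mentioned above.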
Issue Time Tracking
-------------------
Worklog Id: (was: 178774)
Time Spent: 19h (was: 18h 50m)
> Create a library of useful transforms that use schemas
> ------------------------------------------------------
>
> Key: BEAM-4461
> URL: https://issues.apache.org/jira/browse/BEAM-4461
> Project: Beam
> Issue Type: Sub-task
> Components: sdk-java-core
> Reporter: Reuven Lax
> Assignee: Reuven Lax
> Priority: Major
> Time Spent: 19h
> Remaining Estimate: 0h
>
> e.g. JoinBy(fields). Project, Filter, etc.