GitHub user chenlica created a discussion: Join Operator (from old wiki)

From page https://github.com/apache/texera/wiki/Join-Operator (may be dangling)

====
Author: [Sripad Kowshik Subramanyam](https://www.github.com/sripadks)

## Synopsis
Implement an operator that takes two operators as input and joins their 
tuples based on constraints specified in a predicate.

## Status
As of 9/25/2016: **COMPLETED**

## Modules
```java
edu.uci.ics.texera.dataflow.common
edu.uci.ics.texera.dataflow.join
```
## Related Issues
https://github.com/Texera/texera/issues/111

## Description
The Join operator joins the results of two other operators on a given field, 
subject to the constraints specified in a join predicate. Both the field to 
join on and the constraints to be satisfied are specified using 
`JoinPredicate`. The `getNextTuple()` method returns the next result of the 
operator.

Currently supported predicates are:
* `JoinDistancePredicate`: Takes the ID attribute, the attribute of the field 
to join on, and a distance threshold. Two spans of that field, one from each 
input, are joined if the distance between them is within the threshold.

## Example

The following setting and examples show how to use `JoinDistancePredicate` 
(assume the two tuples come from two different operators).

|        | id | author      | review | spanList |
|--------|----|-------------|--------|----------|
| tuple1 | 58 | Bruce Wayne | This book gives us a peek into the life of Bruce Wayne when he is not fighting crime as Batman | "book":<6,11> |
| tuple2 | 58 | Bruce Wayne | This book gives us a peek into the life of Bruce Wayne when he is not fighting crime as Batman | "gives":<12, 18>, <br> "us":<19, 22> |

 Where `<spanStartIndex, spanEndIndex>` represents a span.

 If we want to join over the **review** attribute with the condition **within 
10 character distance**, we can write:

 `JoinDistancePredicate joinPredicate = new JoinDistancePredicate(idAttr, reviewAttr, 10);`

 Since both tuples have the same ID, we can perform the join on the two span 
lists. 

 The span distance is computed as:

 `|(span 1 spanStartIndex) - (span 2 spanStartIndex)| OR |(span 1 spanEndIndex) - (span 2 spanEndIndex)|`
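
 As a minimal sketch of this check (not Texera's actual code), the condition can be written as a small Java helper. The class and method names below are hypothetical, and the sketch reads the "OR" in the formula literally, so one of the two distances being within the threshold is treated as sufficient; the real operator may be stricter.

```java
// Hypothetical helper for illustration only; not part of Texera.
final class SpanDistanceSketch {

    // Absolute difference between the start indices of the two spans.
    static int startDistance(int start1, int start2) {
        return Math.abs(start1 - start2);
    }

    // Absolute difference between the end indices of the two spans.
    static int endDistance(int end1, int end2) {
        return Math.abs(end1 - end2);
    }

    // Assumption: "OR" is taken literally, so either distance being
    // within the threshold satisfies the condition.
    static boolean withinThreshold(int start1, int end1,
                                   int start2, int end2, int threshold) {
        return startDistance(start1, start2) <= threshold
                || endDistance(end1, end2) <= threshold;
    }

    public static void main(String[] args) {
        // "book":<6,11> vs "gives":<12,18>: distances are 6 and 7.
        System.out.println(withinThreshold(6, 11, 12, 18, 10));
        // "book":<6,11> vs "us":<19,22>: distances are 13 and 11.
        System.out.println(withinThreshold(6, 11, 19, 22, 10));
    }
}
```

 For the spans in the table above, both distances fall on the same side of the threshold in each pair, so reading "OR" as "AND" would give the same outcome for this example.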

 Joining the two tuples above gives:
 1. The span `"book":<6,11>` from tuple1 and the span `"gives":<12, 18>` from 
tuple2 satisfy the condition _distance <= threshold_ (start distance 6, end 
distance 7). Therefore, the join combines the two spans into a new span 
`"book_gives":<6, 18>`.

 2. The span `"book":<6,11>` from tuple1 and the span `"us":<19, 22>` from 
tuple2 don't satisfy the condition (start distance 13, end distance 11), so 
they are not joined.
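
 The walk-through above can be reproduced with a short, self-contained Java sketch. It is not Texera's implementation: the `Span` class, `joinSpans`, and `merge` are hypothetical stand-ins, the tuples are assumed to have already matched on ID, and the merge rule (keys joined with `_`, smallest start, largest end) is inferred from the `"book_gives":<6, 18>` result rather than taken from the code base.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch for illustration only; not Texera's implementation.
final class SpanJoinSketch {

    // Minimal stand-in for a span: the matched key plus its character offsets.
    static final class Span {
        final String key;
        final int start;
        final int end;

        Span(String key, int start, int end) {
            this.key = key;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "\"" + key + "\":<" + start + ", " + end + ">";
        }
    }

    // Assumption: "OR" in the distance formula is read literally.
    static boolean withinThreshold(Span a, Span b, int threshold) {
        return Math.abs(a.start - b.start) <= threshold
                || Math.abs(a.end - b.end) <= threshold;
    }

    // Assumption: the merged span covers both inputs and joins their keys
    // with "_", as in the "book_gives":<6, 18> example.
    static Span merge(Span a, Span b) {
        return new Span(a.key + "_" + b.key,
                Math.min(a.start, b.start),
                Math.max(a.end, b.end));
    }

    // Nested-loop join over the span lists of two tuples with the same ID.
    static List<Span> joinSpans(List<Span> outer, List<Span> inner, int threshold) {
        List<Span> result = new ArrayList<>();
        for (Span a : outer) {
            for (Span b : inner) {
                if (withinThreshold(a, b, threshold)) {
                    result.add(merge(a, b));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Span> tuple1Spans = List.of(new Span("book", 6, 11));
        List<Span> tuple2Spans = List.of(new Span("gives", 12, 18),
                                         new Span("us", 19, 22));
        System.out.println(joinSpans(tuple1Spans, tuple2Spans, 10));
    }
}
```

 Running `main` prints `["book_gives":<6, 18>]`: only the first pair passes the distance check, so the result holds a single merged span.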

## TODOs
* Sort the spans of the results to improve the performance of the operator.
* Implement other kinds of predicates to make the operator more robust and 
more widely useful.

GitHub link: https://github.com/apache/texera/discussions/3974
