Re: [I] How to implement efficient Join between two RecordBatch? [arrow]

via GitHub Fri, 08 Dec 2023 19:40:14 -0800


llama90 commented on issue #39142:
URL: https://github.com/apache/arrow/issues/39142#issuecomment-1848207567


   Hello. 
   
   Why are you representing data as a single row? If it's not necessary, you 
might be able to proceed as follows.
   
   I think if you use a `RecordBatch` with a schema, you can perform the join using [`Acero`](https://arrow.apache.org/docs/cpp/streaming_execution.html).
   
   As far as I know, `ExecBatchFromJSON` is usually used for testing. In any case, a join operation works like this.
   
   <details><summary>code snippet</summary>
   
   ```cpp
     BatchesWithSchema input_left;
     input_left.batches = {ExecBatchFromJSON({int32(), int32()}, R"( [
                      [1, 24],
                      [2, 34],
                      [3, 25]
                    ] )")};
     input_left.schema = schema({
         field("col_1", int32()),
         field("col_2", int32()),
     });
   
     BatchesWithSchema input_right;
     input_right.batches = {ExecBatchFromJSON({int32(), int32()}, R"( [
                      [2, 24],
                      [2, 34],
                      [3, 25]
                    ] )")};
     input_right.schema = schema({
         field("col_1", int32()),
         field("col_2", int32()),
     });
   
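     // Hash join options for an inner join on col_1
     HashJoinNodeOptions join_opts{
         JoinType::INNER,
         /*left_keys=*/{"col_1"},
         /*right_keys=*/{"col_1"},
         /*filter=*/literal(true),
         /*output_suffix_for_left=*/"_l",
         /*output_suffix_for_right=*/"_r",
         /*disable_bloom_filter=*/false};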
   
     // Creating join declaration
     Declaration left{
         "source",
         SourceNodeOptions{input_left.schema,
                           input_left.gen(/*parallel=*/true, /*slow=*/false)}};
     Declaration right{
         "source",
         SourceNodeOptions{input_right.schema,
                           input_right.gen(/*parallel=*/true, /*slow=*/false)}};
     Declaration join{"hashjoin", {std::move(left), std::move(right)}, join_opts};
   
     ASSERT_OK_AND_ASSIGN(auto actual,
                          DeclarationToExecBatches(std::move(join), /*use_threads=*/false));
   
     std::cout << "Actual schema: " << actual.schema->ToString() << std::endl;
     for (const auto& batch : actual.batches) {
       std::cout << "Batch: " << batch.ToString() << std::endl;
     }
   
     /*
      * Output:
      *
      * Actual schema: col_1_l: int32
      * col_2_l: int32
      * col_1_r: int32
      * col_2_r: int32
      * Batch: ExecBatch
      *     # Rows: 3
      *     0: Array[2,2,3]
      *     1: Array[34,34,25]
      *     2: Array[2,2,3]
      *     3: Array[34,24,25]
      */
   ```
   
   </details>
   
   Then, if you need to, you can add a `project` operation to keep only the columns you need.
   
   The code you're looking at is used for unit testing. If this approach suits 
your purpose, you may need to write code like the following:
   
   1. Read data and create a `RecordBatch`.
   2. Use this `RecordBatch` to perform a join operation.
   3. (If necessary) Keep only the needed columns by performing a `project` operation on the result of the join.
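   
   As a rough, non-authoritative sketch of those three steps outside the test helpers: the version below assumes a recent Arrow C++ release where Acero lives in the `arrow::acero` namespace, and it feeds the data through a `table_source` node by wrapping each `RecordBatch` in a `Table`. That is a choice made here for brevity, not the only way to supply input. `MakeTable`, `JoinAndProject`, and the output names `key` and `right_value` are hypothetical names used only for illustration.
   
   ```cpp
   #include <arrow/acero/exec_plan.h>
   #include <arrow/acero/options.h>
   #include <arrow/api.h>
   #include <arrow/compute/expression.h>
   
   #include <cstdint>
   #include <memory>
   #include <string>
   #include <utility>
   #include <vector>
   
   namespace ac = arrow::acero;
   namespace cp = arrow::compute;
   
   // Step 1 (hypothetical helper): build a two-column int32 RecordBatch and
   // wrap it in a Table. Both vectors are assumed to have the same length.
   arrow::Result<std::shared_ptr<arrow::Table>> MakeTable(
       const std::vector<int32_t>& col_1, const std::vector<int32_t>& col_2) {
     arrow::Int32Builder b1, b2;
     ARROW_RETURN_NOT_OK(b1.AppendValues(col_1));
     ARROW_RETURN_NOT_OK(b2.AppendValues(col_2));
     ARROW_ASSIGN_OR_RAISE(auto a1, b1.Finish());
     ARROW_ASSIGN_OR_RAISE(auto a2, b2.Finish());
   
     auto schema = arrow::schema({arrow::field("col_1", arrow::int32()),
                                  arrow::field("col_2", arrow::int32())});
     std::vector<std::shared_ptr<arrow::Array>> columns{a1, a2};
     auto batch = arrow::RecordBatch::Make(
         schema, static_cast<int64_t>(col_1.size()), columns);
   
     std::vector<std::shared_ptr<arrow::RecordBatch>> batches{batch};
     return arrow::Table::FromRecordBatches(batches);
   }
   
   // Steps 2 and 3 (hypothetical helper): inner-join on col_1, then project
   // the result down to two columns.
   arrow::Result<std::shared_ptr<arrow::Table>> JoinAndProject(
       std::shared_ptr<arrow::Table> left_table,
       std::shared_ptr<arrow::Table> right_table) {
     ac::Declaration left{"table_source",
                          ac::TableSourceNodeOptions{std::move(left_table)}};
     ac::Declaration right{"table_source",
                           ac::TableSourceNodeOptions{std::move(right_table)}};
   
     ac::HashJoinNodeOptions join_opts{
         ac::JoinType::INNER,
         /*left_keys=*/std::vector<arrow::FieldRef>{"col_1"},
         /*right_keys=*/std::vector<arrow::FieldRef>{"col_1"},
         /*filter=*/cp::literal(true),
         /*output_suffix_for_left=*/"_l",
         /*output_suffix_for_right=*/"_r"};
     ac::Declaration join{"hashjoin",
                          {std::move(left), std::move(right)},
                          std::move(join_opts)};
   
     // The joined columns are named col_1_l, col_2_l, col_1_r, col_2_r,
     // as in the output shown earlier; keep two of them under new names.
     ac::ProjectNodeOptions project_opts{
         std::vector<cp::Expression>{cp::field_ref("col_1_l"),
                                     cp::field_ref("col_2_r")},
         std::vector<std::string>{"key", "right_value"}};
     ac::Declaration project{"project", {std::move(join)},
                             std::move(project_opts)};
   
     return ac::DeclarationToTable(std::move(project));
   }
   ```
   
   With the same data as the snippet above, calling `MakeTable({1, 2, 3}, {24, 34, 25})` and `MakeTable({2, 2, 3}, {24, 34, 25})` and passing both results to `JoinAndProject` should give a three-row table with the columns `key` and `right_value`. Row order is not something I would rely on.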
   
   I hope this information is helpful.


