llama90 commented on issue #39142: URL: https://github.com/apache/arrow/issues/39142#issuecomment-1848207567
Hello. Why are you representing the data as a single row? If that is not necessary, you might be able to proceed as follows.

I think that if you use a `RecordBatch` with a schema, you can perform the join with [`Acero`](https://arrow.apache.org/docs/cpp/streaming_execution.html). As far as I know, `ExecBatchFromJSON` is usually used for testing.

Anyway, a join operation works like this:

<details><summary>code snippet</summary>

```cpp
BatchesWithSchema input_left;
input_left.batches = {ExecBatchFromJSON({int32(), int32()}, R"(
    [
        [1, 24],
        [2, 34],
        [3, 25]
    ]
)")};
input_left.schema = schema({field("col_1", int32()), field("col_2", int32())});

BatchesWithSchema input_right;
input_right.batches = {ExecBatchFromJSON({int32(), int32()}, R"(
    [
        [2, 24],
        [2, 34],
        [3, 25]
    ]
)")};
input_right.schema = schema({
    field("col_1", int32()),
    field("col_2", int32()),
});

// Hash join options for inner join
HashJoinNodeOptions join_opts{
    JoinType::INNER,
    /*left_keys=*/{"col_1"},
    /*right_keys=*/{"col_1"}, literal(true), "_l", "_r", false};

// Creating join declaration
Declaration left{"source",
                 SourceNodeOptions{input_left.schema,
                                   input_left.gen(/*parallel=*/true, /*slow=*/false)}};
Declaration right{
    "source", SourceNodeOptions{input_right.schema,
                                input_right.gen(/*parallel=*/true, /*slow=*/false)}};
Declaration join{"hashjoin", {std::move(left), std::move(right)}, join_opts};

ASSERT_OK_AND_ASSIGN(auto actual, DeclarationToExecBatches(std::move(join), false));

std::cout << "Actual schema: " << actual.schema->ToString() << std::endl;
for (const auto& batch : actual.batches) {
  std::cout << "Batch: " << batch.ToString() << std::endl;
}

/**
 * Actual schema: col_1_l: int32
   col_2_l: int32
   col_1_r: int32
   col_2_r: int32
   Batch: ExecBatch
       # Rows: 3
       0: Array[2,2,3]
       1: Array[34,34,25]
       2: Array[2,2,3]
       3: Array[34,24,25]
 */
```

</details>

Then, if you need to, you can apply a `project` operation to keep only the columns you need.

Note that the code above relies on helpers that are used for unit testing.
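To make the result above easier to follow, here is a minimal standard-library sketch of what an inner hash join followed by a projection computes. It is not Acero code and not how Acero is implemented: the `Table` struct and the `InnerHashJoin` and `Project` functions are hypothetical names introduced only for this illustration, the data is stored row by row rather than in columns, and the suffixes are appended unconditionally (which happens to match the output above because every column name collides).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical row-oriented table, used only for this illustration.
struct Table {
  std::vector<std::string> names;
  std::vector<std::vector<int>> rows;
};

// Inner join on one key column per side.
Table InnerHashJoin(const Table& left, std::size_t left_key,
                    const Table& right, std::size_t right_key,
                    const std::string& left_suffix,
                    const std::string& right_suffix) {
  assert(left_key < left.names.size() && right_key < right.names.size());

  // Build phase: map each right-side key to the indices of its rows.
  std::unordered_map<int, std::vector<std::size_t>> index;
  for (std::size_t i = 0; i < right.rows.size(); ++i) {
    index[right.rows[i][right_key]].push_back(i);
  }

  Table out;
  for (const auto& n : left.names) out.names.push_back(n + left_suffix);
  for (const auto& n : right.names) out.names.push_back(n + right_suffix);

  // Probe phase: emit one output row per matching (left, right) pair.
  for (const auto& lrow : left.rows) {
    auto it = index.find(lrow[left_key]);
    if (it == index.end()) continue;  // inner join: unmatched rows are dropped
    for (std::size_t r : it->second) {
      std::vector<int> row = lrow;
      row.insert(row.end(), right.rows[r].begin(), right.rows[r].end());
      out.rows.push_back(std::move(row));
    }
  }
  return out;
}

// Projection: keep only the selected columns, in the given order.
Table Project(const Table& in, const std::vector<std::size_t>& cols) {
  Table out;
  for (std::size_t c : cols) {
    assert(c < in.names.size());
    out.names.push_back(in.names[c]);
  }
  for (const auto& row : in.rows) {
    std::vector<int> kept;
    for (std::size_t c : cols) kept.push_back(row[c]);
    out.rows.push_back(std::move(kept));
  }
  return out;
}
```

With the two inputs from the snippet, joining on column 0 of each side with suffixes `"_l"` and `"_r"` yields three rows, because the left row with key 1 has no match and the left row with key 2 matches two right rows. The order of the rows in this sketch may differ from the order Acero prints. Calling `Project` on the result with the column indices you want then drops the remaining columns.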
If this approach suits your purpose, you may need to write code like the following:

1. Read the data and create a `RecordBatch`.
2. Use this `RecordBatch` to perform a join operation.
3. (If necessary) Keep only the columns you need by performing a `project` operation on the result of the join.

I hope this information is helpful.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
