Hi,
    I'm trying to zip two streams that have the same order but go through
different processing. For example, the original stream has two columns,
'id' and 'age' (plus many others), and is split into two streams that are
processed distributedly using Acero: 1. hash 'id' into a stream with a
single column 'bucket_id', and 2. classify 'age' into ['child',
'teenager', 'adult', ...]. The two results are then zipped back into a
single stream:

       [   'id'    |    'age'    | many other columns ]
             |            |                |
       ['bucket_id'] ['classify']          |
             |            |                |
       [ zipped_stream    |    many_other_columns ]
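To make the intent concrete, here is a minimal pure-Python sketch (not Acero; the bucket count and the age thresholds are made-up illustration values) of the result I expect when both branches keep the original row order:

```python
# Pure-Python sketch of the intended pipeline (not Acero).
# NUM_BUCKETS and the age thresholds are made-up illustration values.
NUM_BUCKETS = 4

def bucket_id(id_):
    # Branch 1: hash 'id' into a single 'bucket_id' column.
    return hash(id_) % NUM_BUCKETS

def classify(age):
    # Branch 2: classify 'age' into a category.
    if age < 13:
        return "child"
    if age < 20:
        return "teenager"
    return "adult"

rows = [
    {"id": "a", "age": 7,  "other": 1.0},
    {"id": "b", "age": 16, "other": 2.0},
    {"id": "c", "age": 42, "other": 3.0},
]

# Each branch processes the stream independently but in the original
# order, so zipping by position lines the rows back up correctly.
buckets = [bucket_id(r["id"]) for r in rows]
labels = [classify(r["age"]) for r in rows]
zipped = [
    {"bucket_id": b, "classify": c, "other": r["other"]}
    for b, c, r in zip(buckets, labels, rows)
]
```

This positional zip is only correct because both branches emit rows in the same order as the input, which is exactly the property I'm asking how to guarantee in Acero.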
I was expecting that both 'bucket_id' and 'classify' would keep the same
order as the original stream before they are zipped.
According to the documentation, "ordered execution" uses batch_index to
indicate the order of batches.
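My understanding of that mechanism, sketched in plain Python (the batch contents are made up; this only models the re-sequencing step, not Acero itself): parallel execution may deliver batches out of index order, and a sequencing stage buffers them until batch_index is contiguous:

```python
def sequence_batches(incoming):
    """Yield (batch_index, batch) in index order, buffering
    out-of-order arrivals until their turn comes."""
    pending = {}
    next_index = 0
    for index, batch in incoming:
        pending[index] = batch
        # Emit every batch whose index is next in sequence.
        while next_index in pending:
            yield next_index, pending.pop(next_index)
            next_index += 1

# Batches arriving out of order from parallel workers (illustration data).
arrivals = [(2, "batch-2"), (0, "batch-0"), (3, "batch-3"), (1, "batch-1")]
ordered = list(sequence_batches(arrivals))
```

If I understand correctly, this re-sequencing can only work when every batch actually carries a meaningful batch_index, which is where my question below comes in.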
However, acero::DeclarationToReader with a QueryOptions whose
sequence_output is set to true does not guarantee the order if the input
stream itself is not ordered. It also does not fail during execution
(neither the bucket_id branch nor the classify branch specifies any
ordering). How can I make Acero produce a stream that keeps the same
order as the original input?
-- 
---------------------
Best Regards,
Wenbo Hu,
