R-JunmingChen commented on issue #35268: URL: https://github.com/apache/arrow/issues/35268#issuecomment-1585477388
> If you only need to load specific batches of data then could you create a row group for each batch? Or a separate file for each batch?

I can't know the batch_size for reading when I write to disk, so I can't create a row group/file with a suitable size. Since the batch_size for reading is decided by spill_over_count and buffer_size (roughly buffer_size / spill_over_count), the spill_over_count can't be determined until all the inputs are finished.

> If you need random access to batches of data (e.g. you don't know the row group boundaries at write time but it isn't random access to rows) then we could maybe use the row skip feature that was recently added to parquet (I don't think it has been exposed yet).

Sorry for my confusing description. The real problem is that I want to make this work:

`Future<std::optional<ExecBatch>> FetchNextBatch(int spill_index);`

So, for a specific `example_spill_over_file_one.parquet`, I should first read with a row_offset of `batch_size * 0` and a batch size of `batch_size`; when I have used up that data for comparing, I should then read at a row_offset of `batch_size * 1` with a batch size of `batch_size`, and so on. The skip feature would solve my problem more easily.

Currently, I have used an AsyncGenerator, like **source_node.cc** does, to read the data back. I think that's enough to solve my problem?
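To make the offset arithmetic concrete, here is a minimal synchronous sketch of what `FetchNextBatch(spill_index)` would do: read rows `[batch_size * spill_index, batch_size * (spill_index + 1))` from one spill file. The class name, members, and the eager whole-table read are my own simplifications (a real version would return an `arrow::Future` and drive an AsyncGenerator like **source_node.cc**, and would avoid loading the full file up front):

```cpp
#include <memory>
#include <optional>
#include <string>

#include <arrow/api.h>
#include <arrow/compute/exec.h>   // arrow::compute::ExecBatch
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>

// Hypothetical reader over one spill-over parquet file.
class SpillFileReader {
 public:
  SpillFileReader(std::string path, int64_t batch_size)
      : path_(std::move(path)), batch_size_(batch_size) {}

  // Synchronous stand-in for
  //   Future<std::optional<ExecBatch>> FetchNextBatch(int spill_index);
  // Returns std::nullopt once the requested offset is past the end of the file.
  arrow::Result<std::optional<arrow::compute::ExecBatch>> FetchBatch(int spill_index) {
    ARROW_RETURN_NOT_OK(EnsureTableLoaded());

    const int64_t offset = batch_size_ * static_cast<int64_t>(spill_index);
    if (offset >= table_->num_rows()) {
      return std::nullopt;  // all spilled rows have been consumed
    }

    // Slice out rows [offset, offset + batch_size_) and flatten the chunks
    // so a single RecordBatch can be handed to ExecBatch.
    std::shared_ptr<arrow::Table> slice = table_->Slice(offset, batch_size_);
    ARROW_ASSIGN_OR_RAISE(slice, slice->CombineChunks());

    arrow::TableBatchReader batch_reader(*slice);
    std::shared_ptr<arrow::RecordBatch> batch;
    ARROW_RETURN_NOT_OK(batch_reader.ReadNext(&batch));
    return arrow::compute::ExecBatch(*batch);
  }

 private:
  // Simplification: load the whole spill file once and cache it as a Table.
  arrow::Status EnsureTableLoaded() {
    if (table_ != nullptr) return arrow::Status::OK();
    ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path_));
    std::unique_ptr<parquet::arrow::FileReader> reader;
    ARROW_RETURN_NOT_OK(
        parquet::arrow::OpenFile(file, arrow::default_memory_pool(), &reader));
    return reader->ReadTable(&table_);
  }

  std::string path_;
  int64_t batch_size_;
  std::shared_ptr<arrow::Table> table_;
};
```

With the row-skip feature, `EnsureTableLoaded()` plus `Table::Slice` could be replaced by a reader that skips `offset` rows and reads `batch_size_` rows directly, which is what would make random access to batches cheap.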
