R-JunmingChen commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1585477388

   >If you only need to load specific batches of data then could you create a 
row group for each batch? Or a separate file for each batch?
   
   I can't know the batch_size used for reading at the time I write the data to disk, so I can't create a row group/file with a suitable size. The read batch_size is decided by spill_over_count and buffer_size (roughly buffer_size / spill_over_count), and spill_over_count can't be determined until all the inputs are finished.
   
   > If you need random access to batches of data (e.g. you don't know the row 
group boundaries at write time but it isn't random access to rows) then we 
could maybe use the row skip feature that was recently added to parquet (I 
don't think it has been exposed yet).
   
   Sorry for my confusing description. The real problem is that I want to make this `Future<std::optional<ExecBatch>> FetchNextBatch(int spill_index);` work. So, for a specific `example_spill_over_file_one.parquet`, I should first read with a `row_offset` of `batch_size * 0` and a batch size of `batch_size`; when I have used up that data for comparing, I should then read with a `row_offset` of `batch_size * 1` and a batch size of `batch_size`, and so on.
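   For illustration, here is a minimal synchronous sketch of that access pattern. The helper name `FetchBatchAt` is made up, and loading the whole table before `Table::Slice` is only to show the `row_offset = batch_size * spill_index` arithmetic, not how the spill reader would actually avoid the extra I/O:
   
   ```cpp
   #include <arrow/api.h>
   #include <arrow/io/api.h>
   #include <parquet/arrow/reader.h>
   
   // Hypothetical helper: fetch the `spill_index`-th batch of `batch_size`
   // rows from one spill file. It reads the whole table and slices it,
   // purely to illustrate row_offset = batch_size * spill_index.
   arrow::Result<std::shared_ptr<arrow::Table>> FetchBatchAt(
       const std::string& path, int spill_index, int64_t batch_size) {
     ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));
     std::unique_ptr<parquet::arrow::FileReader> reader;
     ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
         input, arrow::default_memory_pool(), &reader));
     std::shared_ptr<arrow::Table> table;
     ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
     return table->Slice(spill_index * batch_size, batch_size);
   }
   ```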
   
   The skip feature could solve my problem more easily.
   Currently, I use an AsyncGenerator to read the data back, like what **source_node.cc** does. I think that is enough to solve my problem?
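   Roughly what I mean, as a hedged sketch (the `SpillReader` class and the per-file generators are hypothetical names, assuming `std::optional<ExecBatch>` as the generator item type like the exec plan uses):
   
   ```cpp
   #include <optional>
   #include <vector>
   #include <arrow/compute/exec.h>
   #include <arrow/util/async_generator.h>
   
   using arrow::compute::ExecBatch;
   // arrow::AsyncGenerator<T> is std::function<arrow::Future<T>()>, so one
   // call already yields Future<std::optional<ExecBatch>>.
   using BatchGenerator = arrow::AsyncGenerator<std::optional<ExecBatch>>;
   
   class SpillReader {
    public:
     explicit SpillReader(std::vector<BatchGenerator> per_file_generators)
         : generators_(std::move(per_file_generators)) {}
   
     // Each call pulls the next batch from the generator backing
     // `spill_index`; an empty optional signals that the file is exhausted.
     arrow::Future<std::optional<ExecBatch>> FetchNextBatch(int spill_index) {
       return generators_[spill_index]();
     }
   
    private:
     std::vector<BatchGenerator> generators_;
   };
   ```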
   