> I would like to know if it is possible to skip the specific set of
> batches, for example, the first 10 batches and read from the 11th Batch.
This sort of API does not exist today. You can skip files by making a
smaller dataset with fewer files (and I think, with parquet, there may
even be a way to skip row groups by creating a fragment per row group
with ParquetFileFragment). However, there is no existing datasets API
for skipping batches or rows. I've put rough sketches of both
workarounds at the end of this message.

> Also, what's the fragment_scan_options in dataset scanner and how do we
> make use of it?

fragment_scan_options is the spot for configuring format-specific scan
options. With parquet, you often don't need to bother with this and can
just use the defaults (I can't remember if nullptr is fine or if you
need to set this to FileFormat::default_fragment_scan_options, but I
would hope it's ok to just use nullptr).

On the other hand, formats like CSV tend to need more configuration and
tuning: setting the delimiter, skipping some header rows, etc. (see the
CSV sketch at the end of this message).

Parquet is pretty self-describing, and you would only need to use
fragment_scan_options if, for example, you need decryption or custom
control over which columns are dictionary-encoded.

On Mon, Jun 12, 2023 at 8:11 AM Jerald Alex <vminf...@gmail.com> wrote:

> Hi Experts,
>
> I have been using dataset.scanner to read the data with specific filter
> conditions and batch_size of 1000 to read the data.
>
> ds.scanner(filter=pc.field('a') != 3, batch_size=1000).to_batches()
>
> I would like to know if it is possible to skip the specific set of
> batches, for example, the first 10 batches and read from the 11th Batch.
>
> https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.scanner
>
> Also, what's the fragment_scan_options in dataset scanner and how do we
> make use of it?
>
> Really appreciate any input. thanks!
>
> Regards,
> Alex
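P.S. The promised sketches. First, skipping batches on the consumer
side: this is always possible, but the skipped batches are still read
and decoded, so it saves no I/O. A minimal sketch (the dataset path is
made up):

from itertools import islice

import pyarrow.compute as pc
import pyarrow.dataset as ds

dataset = ds.dataset("path/to/data", format="parquet")  # hypothetical path
scanner = dataset.scanner(filter=pc.field("a") != 3, batch_size=1000)

# islice drops the first 10 batches after they have been produced;
# the scan work for them still happens.
for batch in islice(scanner.to_batches(), 10, None):
    ...  # process from the 11th batch onward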
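Second, here is roughly what I meant by creating a fragment per row
group. I haven't run this, so treat it as a sketch of
split_by_row_group / FileSystemDataset rather than tested code (path is
again made up):

import pyarrow.dataset as ds

dataset = ds.dataset("path/to/data", format="parquet")  # hypothetical path

# One fragment per parquet row group, in file order.
fragments = [
    rg
    for file_fragment in dataset.get_fragments()
    for rg in file_fragment.split_by_row_group()
]

# Rebuild the dataset without the leading row groups. Note this skips
# row groups, not batches; how many rows a row group holds depends on
# how the files were written.
trimmed = ds.FileSystemDataset(
    fragments[10:],
    schema=dataset.schema,
    format=dataset.format,
)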
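Lastly, the CSV case. The delimiter is fixed on the format itself
(since it affects parsing), while per-scan knobs like skipping leading
rows go through fragment_scan_options. I haven't double-checked that
skip_rows is honored by the datasets streaming reader, and the path is
made up:

import pyarrow.csv as csv
import pyarrow.dataset as ds

# Parse options such as the delimiter belong to the format.
csv_format = ds.CsvFileFormat(parse_options=csv.ParseOptions(delimiter=";"))

# Scan-time options, e.g. skipping two leading rows, go in
# fragment_scan_options.
scan_options = ds.CsvFragmentScanOptions(
    read_options=csv.ReadOptions(skip_rows=2),
)

dataset = ds.dataset("path/to/csvs", format=csv_format)  # hypothetical path
for batch in dataset.scanner(fragment_scan_options=scan_options).to_batches():
    ...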