Hi Weston,

Thank you so much for taking the time to respond. Really appreciate it.
I'm using Parquet files, so would it be possible to elaborate on the below? I cannot seem to find any documentation for ParquetFileFragment.

"there may even be a way to skip row groups by creating a fragment per row group with ParquetFileFragment."

Are you referring to the subset method?

Regards,
Alex

On Mon, Jun 12, 2023 at 5:47 PM Weston Pace <weston.p...@gmail.com> wrote:

> > I would like to know if it is possible to skip the specific set of
> > batches, for example, the first 10 batches and read from the 11th batch.
>
> This sort of API does not exist today. You can skip files by making a
> smaller dataset with fewer files (and I think, with Parquet, there may
> even be a way to skip row groups by creating a fragment per row group
> with ParquetFileFragment). However, there is no existing datasets API for
> skipping batches or rows.
>
> > Also, what's the fragment_scan_options in the dataset scanner and how
> > do we make use of it?
>
> fragment_scan_options is the spot for configuring format-specific scan
> options. For example, with Parquet, you often don't need to bother with
> this and can just use the defaults (I can't remember if nullptr is fine
> or if you need to set this to FileFormat::default_fragment_scan_options,
> but I would hope it's OK to just use nullptr).
>
> On the other hand, formats like CSV tend to need more configuration and
> tuning, for example setting the delimiter, skipping some header rows,
> etc. Parquet is pretty self-describing, and you would only need to use
> fragment_scan_options if, for example, you need decryption or custom
> control over which columns are encoded as dictionary, etc.
>
> On Mon, Jun 12, 2023 at 8:11 AM Jerald Alex <vminf...@gmail.com> wrote:
>
> > Hi Experts,
> >
> > I have been using dataset.scanner to read the data with specific filter
> > conditions and a batch_size of 1000.
> >
> > ds.scanner(filter=pc.field('a') != 3, batch_size=1000).to_batches()
> >
> > I would like to know if it is possible to skip a specific set of
> > batches, for example, the first 10 batches and read from the 11th batch.
> >
> > https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.scanner
> >
> > Also, what's the fragment_scan_options in the dataset scanner and how
> > do we make use of it?
> >
> > Really appreciate any input. Thanks!
> >
> > Regards,
> > Alex