On Mon, 12 Jun 2023 at 21:30, Jerald Alex <vminf...@gmail.com> wrote:
>
> Hi Weston,
>
> Thank you so much for taking the time to respond. Really appreciate it.
>
> I'm using parquet files. So would it be possible to elaborate on the
> below? I cannot seem to find any documentation for ParquetFileFragment.
>
> "there may even be a way to skip row groups by creating a fragment per
> row group with ParquetFileFragment." Are you referring to the subset
> method?
Yes, this is not very well documented (ParquetFileFragment is missing in
the API docs, I see; I opened https://github.com/apache/arrow/issues/36044
for this). The user guide has a brief mention of the `split_by_row_group()`
method:
https://arrow.apache.org/docs/dev/python/dataset.html#working-with-parquet-datasets
(it's this method you need, not `subset()`, which still gives you a single
fragment, just with row groups dropped based on a filter).

So using `split_by_row_group()` you could split each fragment into multiple
fragments (one per row group), and then manually skip the first n fragments
and only scan the remaining ones (see the sketch at the bottom of this
mail). However, this of course works at the granularity of row groups, and
not exactly the batch size that you are using.

> Regards,
> Alex
>
> On Mon, Jun 12, 2023 at 5:47 PM Weston Pace <weston.p...@gmail.com> wrote:
> >
> > > I would like to know if it is possible to skip a specific set of
> > > batches, for example, the first 10 batches and read from the 11th
> > > batch.
> >
> > This sort of API does not exist today. You can skip files by making a
> > smaller dataset with fewer files (and I think, with parquet, there may
> > even be a way to skip row groups by creating a fragment per row group
> > with ParquetFileFragment). However, there is no existing datasets API
> > for skipping batches or rows.
> >
> > > Also, what's the fragment_scan_options in dataset scanner and how do
> > > we make use of it?
> >
> > fragment_scan_options is the spot for configuring format-specific scan
> > options. For example, with parquet, you often don't need to bother with
> > this and can just use the defaults (I can't remember if nullptr is fine
> > or if you need to set this to FileFormat::default_fragment_scan_options,
> > but I would hope it's ok to just use nullptr).
> >
> > On the other hand, formats like CSV tend to need more configuration and
> > tuning, for example, setting the delimiter, skipping some header rows,
> > etc. Parquet is pretty self-describing, and you would only need to use
> > the fragment_scan_options if, for example, you need decryption or custom
> > control over which columns are encoded as dictionary, etc.
> >
> > On Mon, Jun 12, 2023 at 8:11 AM Jerald Alex <vminf...@gmail.com> wrote:
> >
> > > Hi Experts,
> > >
> > > I have been using dataset.scanner with specific filter conditions and
> > > a batch_size of 1000 to read the data:
> > >
> > > ds.scanner(filter=pc.field('a') != 3, batch_size=1000).to_batches()
> > >
> > > I would like to know if it is possible to skip a specific set of
> > > batches, for example, the first 10 batches and read from the 11th
> > > batch.
> > >
> > > https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.scanner
> > >
> > > Also, what's the fragment_scan_options in dataset scanner and how do
> > > we make use of it?
> > >
> > > Really appreciate any input. Thanks!
> > >
> > > Regards,
> > > Alex
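To make this concrete, here is a minimal, untested sketch of the row-group
approach. The "data/" path, the column name 'a', and n = 10 are placeholders
for your setup; the idea is to rebuild a FileSystemDataset from the
remaining row-group fragments and scan that as usual:

    import pyarrow.compute as pc
    import pyarrow.dataset as ds

    dataset = ds.dataset("data/", format="parquet")

    # one fragment per row group instead of one per file
    fragments = [
        rg
        for fragment in dataset.get_fragments()
        for rg in fragment.split_by_row_group()
    ]

    # drop the first n row-group fragments and wrap the rest
    # in a new dataset
    n = 10
    remaining = ds.FileSystemDataset(
        fragments[n:], dataset.schema, dataset.format, dataset.filesystem
    )

    for batch in remaining.scanner(
        filter=pc.field("a") != 3, batch_size=1000
    ).to_batches():
        ...  # processing starts at the (n+1)th row group

Keep in mind that each skipped fragment is a whole row group, so the number
of rows you skip depends on the row group size of your files, not on
batch_size.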
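And to illustrate the fragment_scan_options point from Weston's mail quoted
above: in Python you pass a format-specific options object to the scanner.
A small sketch for CSV (the delimiter, path, and option values are made-up
examples):

    import pyarrow.csv
    import pyarrow.dataset as ds

    # format-level options, e.g. a non-default delimiter
    csv_format = ds.CsvFileFormat(
        parse_options=pyarrow.csv.ParseOptions(delimiter=";")
    )
    dataset = ds.dataset("data/", format=csv_format)

    # per-scan options go through fragment_scan_options
    scan_options = ds.CsvFragmentScanOptions(
        convert_options=pyarrow.csv.ConvertOptions(strings_can_be_null=True)
    )
    batches = dataset.scanner(
        fragment_scan_options=scan_options, batch_size=1000
    ).to_batches()

For parquet you can typically leave fragment_scan_options out entirely (the
Python equivalent of the nullptr case) and the defaults are used.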