On Mon, 12 Jun 2023 at 21:30, Jerald Alex <vminf...@gmail.com> wrote:
>
> Hi Weston,
>
> Thank you so much for taking the time to respond. Really appreciate it.
>
> I'm using parquet files. So would it be possible to elaborate on the
> below? I cannot seem to find any documentation for ParquetFileFragment.
>
> "there may even be a way to skip row groups by creating a fragment per
> row group with ParquetFileFragment." Are you referring to the subset
> method?
Yes, this is not very well documented (ParquetFileFragment is missing in
the API docs, I see; I opened https://github.com/apache/arrow/issues/36044
for this). The user guide has a brief mention of the `split_by_row_group()`
method:
https://arrow.apache.org/docs/dev/python/dataset.html#working-with-parquet-datasets
(it's this method you need, not `subset()`, which still gives you a single
fragment, just with row groups dropped based on a filter).

So using `split_by_row_group()` you could split each fragment into multiple
fragments (one per row group), and then manually skip the first n fragments
and only scan the remaining ones (see the sketch at the bottom of this
mail). However, this of course works at the granularity of row groups, and
not exactly the batch size that you are using.

> Regards,
> Alex
>
> On Mon, Jun 12, 2023 at 5:47 PM Weston Pace <weston.p...@gmail.com> wrote:
> >
> > > I would like to know if it is possible to skip a specific set of
> > > batches, for example, the first 10 batches and read from the 11th
> > > batch.
> >
> > This sort of API does not exist today. You can skip files by making a
> > smaller dataset with fewer files (and I think, with parquet, there may
> > even be a way to skip row groups by creating a fragment per row group
> > with ParquetFileFragment). However, there is no existing datasets API
> > for skipping batches or rows.
> >
> > > Also, what's the fragment_scan_options in dataset scanner and how do
> > > we make use of it?
> >
> > fragment_scan_options is the spot for configuring format-specific scan
> > options. For example, with parquet, you often don't need to bother with
> > this and can just use the defaults (I can't remember if nullptr is fine
> > or if you need to set this to FileFormat::default_fragment_scan_options,
> > but I would hope it's ok to just use nullptr).
> >
> > On the other hand, formats like CSV tend to need more configuration and
> > tuning, for example, setting the delimiter, skipping some header rows,
> > etc. Parquet is pretty self-describing, and you would only need to use
> > the fragment_scan_options if, for example, you need decryption or custom
> > control over which columns are encoded as dictionary, etc.
> >
> > On Mon, Jun 12, 2023 at 8:11 AM Jerald Alex <vminf...@gmail.com> wrote:
> >
> > > Hi Experts,
> > >
> > > I have been using dataset.scanner with specific filter conditions and
> > > a batch_size of 1000 to read the data:
> > >
> > > ds.scanner(filter=pc.field('a') != 3, batch_size=1000).to_batches()
> > >
> > > I would like to know if it is possible to skip a specific set of
> > > batches, for example, the first 10 batches and read from the 11th
> > > batch.
> > >
> > > https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.scanner
> > >
> > > Also, what's the fragment_scan_options in dataset scanner and how do
> > > we make use of it?
> > >
> > > Really appreciate any input. Thanks!
> > >
> > > Regards,
> > > Alex
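To make this concrete, here is a minimal, untested sketch of the row-group
approach. The "data/" path, the column name 'a', and n = 10 are placeholders
for your setup; the idea is to rebuild a FileSystemDataset from the
remaining row-group fragments and scan that as usual:

    import pyarrow.compute as pc
    import pyarrow.dataset as ds

    dataset = ds.dataset("data/", format="parquet")

    # one fragment per row group instead of one per file
    fragments = [
        rg
        for fragment in dataset.get_fragments()
        for rg in fragment.split_by_row_group()
    ]

    # drop the first n row-group fragments and wrap the rest
    # in a new dataset
    n = 10
    remaining = ds.FileSystemDataset(
        fragments[n:], dataset.schema, dataset.format, dataset.filesystem
    )

    for batch in remaining.scanner(
        filter=pc.field("a") != 3, batch_size=1000
    ).to_batches():
        ...  # processing starts at the (n+1)th row group

Keep in mind that each skipped fragment is a whole row group, so the number
of rows you skip depends on the row group size of your files, not on
batch_size.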
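And to illustrate the fragment_scan_options point from Weston's mail quoted
above: in Python you pass a format-specific options object to the scanner.
A small sketch for CSV (the delimiter, path, and option values are made-up
examples):

    import pyarrow.csv
    import pyarrow.dataset as ds

    # format-level options, e.g. a non-default delimiter
    csv_format = ds.CsvFileFormat(
        parse_options=pyarrow.csv.ParseOptions(delimiter=";")
    )
    dataset = ds.dataset("data/", format=csv_format)

    # per-scan options go through fragment_scan_options
    scan_options = ds.CsvFragmentScanOptions(
        convert_options=pyarrow.csv.ConvertOptions(strings_can_be_null=True)
    )
    batches = dataset.scanner(
        fragment_scan_options=scan_options, batch_size=1000
    ).to_batches()

For parquet you can typically leave fragment_scan_options out entirely (the
Python equivalent of the nullptr case) and the defaults are used.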