Hi Weston,

Thank you so much for taking the time to respond. Really appreciate it.

I'm using Parquet files. Would it be possible to elaborate on the point
below? I cannot seem to find any documentation for ParquetFileFragment.

"there may even be a way to skip row groups by creating a fragment per row
group with ParquetFileFragment." Are you referring to the subset method?

Regards,
Alex

On Mon, Jun 12, 2023 at 5:47 PM Weston Pace <weston.p...@gmail.com> wrote:

> > I would like to know if it is possible to skip the specific set of
> batches,
> > for example, the first 10 batches and read from the 11th Batch.
>
> This sort of API does not exist today.  You can skip files by making a
> smaller dataset with fewer files (and I think, with parquet, there may even
> be a way to skip row groups by creating a fragment per row group with
> ParquetFileFragment).  However, there is no existing datasets API for
> skipping batches or rows.
>
> > Also, what's the fragment_scan_options in dataset scanner and how do we
> > make use of it?
>
> fragment_scan_options is the spot for configuring format-specific scan
> options.  For example, with parquet, you often don't need to bother with
> this and can just use the defaults (I can't remember if nullptr is fine or
> if you need to set this to FileFormat::default_fragment_scan_options, but I
> would hope it's ok to just use nullptr).
>
> On the other hand, formats like CSV tend to need more configuration and
> tuning.  For example, setting the delimiter, skipping some header rows,
> etc.  Parquet is pretty self-describing and you would only need to use the
> fragment_scan_options if, for example, you need decryption or custom
> control over which columns are encoded as dictionaries, etc.
>
> On Mon, Jun 12, 2023 at 8:11 AM Jerald Alex <vminf...@gmail.com> wrote:
>
> > Hi Experts,
> >
> > I have been using dataset.scanner to read the data with specific filter
> > conditions and batch_size of 1000 to read the data.
> >
> > ds.scanner(filter=pc.field('a') != 3, batch_size=1000).to_batches()
> >
> > I would like to know if it is possible to skip the specific set of
> batches,
> > for example, the first 10 batches and read from the 11th Batch.
> >
> >
> >
> https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.scanner
> > Also, what's the fragment_scan_options in dataset scanner and how do we
> > make use of it?
> >
> > Really appreciate any input. Thanks!
> >
> > Regards,
> > Alex
> >
>