> I would like to know if it is possible to skip the specific set of
> batches, for example, the first 10 batches and read from the 11th Batch.
This sort of API does not exist today. You can skip files by making a
smaller dataset with fewer files (and I think, with parquet, there may
even be a way to skip row groups by creating a fragment per row group
with ParquetFileFragment). However, there is no existing datasets API
for skipping batches or rows. I've put rough sketches of both
workarounds at the end of this message.

> Also, what's the fragment_scan_options in dataset scanner and how do we
> make use of it?

fragment_scan_options is the spot for configuring format-specific scan
options. With parquet, you often don't need to bother with this and can
just use the defaults (I can't remember if nullptr is fine or if you
need to set this to FileFormat::default_fragment_scan_options, but I
would hope it's ok to just use nullptr).

On the other hand, formats like CSV tend to need more configuration and
tuning: setting the delimiter, skipping some header rows, etc. (see the
CSV sketch at the end of this message).

Parquet is pretty self-describing, and you would only need to use
fragment_scan_options if, for example, you need decryption or custom
control over which columns are dictionary-encoded.

On Mon, Jun 12, 2023 at 8:11 AM Jerald Alex <vminf...@gmail.com> wrote:

> Hi Experts,
>
> I have been using dataset.scanner to read the data with specific filter
> conditions and batch_size of 1000 to read the data.
>
> ds.scanner(filter=pc.field('a') != 3, batch_size=1000).to_batches()
>
> I would like to know if it is possible to skip the specific set of
> batches, for example, the first 10 batches and read from the 11th Batch.
>
> https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.scanner
>
> Also, what's the fragment_scan_options in dataset scanner and how do we
> make use of it?
>
> Really appreciate any input. thanks!
>
> Regards,
> Alex
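P.S. The promised sketches. First, skipping batches on the consumer
side: this is always possible, but the skipped batches are still read
and decoded, so it saves no I/O. A minimal sketch (the dataset path is
made up):

from itertools import islice

import pyarrow.compute as pc
import pyarrow.dataset as ds

dataset = ds.dataset("path/to/data", format="parquet")  # hypothetical path
scanner = dataset.scanner(filter=pc.field("a") != 3, batch_size=1000)

# islice drops the first 10 batches after they have been produced;
# the scan work for them still happens.
for batch in islice(scanner.to_batches(), 10, None):
    ...  # process from the 11th batch onward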
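Second, here is roughly what I meant by creating a fragment per row
group. I haven't run this, so treat it as a sketch of
split_by_row_group / FileSystemDataset rather than tested code (path is
again made up):

import pyarrow.dataset as ds

dataset = ds.dataset("path/to/data", format="parquet")  # hypothetical path

# One fragment per parquet row group, in file order.
fragments = [
    rg
    for file_fragment in dataset.get_fragments()
    for rg in file_fragment.split_by_row_group()
]

# Rebuild the dataset without the leading row groups. Note this skips
# row groups, not batches; how many rows a row group holds depends on
# how the files were written.
trimmed = ds.FileSystemDataset(
    fragments[10:],
    schema=dataset.schema,
    format=dataset.format,
)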
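Lastly, the CSV case. The delimiter is fixed on the format itself
(since it affects parsing), while per-scan knobs like skipping leading
rows go through fragment_scan_options. I haven't double-checked that
skip_rows is honored by the datasets streaming reader, and the path is
made up:

import pyarrow.csv as csv
import pyarrow.dataset as ds

# Parse options such as the delimiter belong to the format.
csv_format = ds.CsvFileFormat(parse_options=csv.ParseOptions(delimiter=";"))

# Scan-time options, e.g. skipping two leading rows, go in
# fragment_scan_options.
scan_options = ds.CsvFragmentScanOptions(
    read_options=csv.ReadOptions(skip_rows=2),
)

dataset = ds.dataset("path/to/csvs", format=csv_format)  # hypothetical path
for batch in dataset.scanner(fragment_scan_options=scan_options).to_batches():
    ...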