litao3rd commented on issue #37840:
URL: https://github.com/apache/arrow/issues/37840#issuecomment-1734062063
> ```c++
> ds::FinishOptions finishOptions;
> finishOptions.inspect_options.fragments = 30;
> auto dataset = factory->Finish(finishOptions).ValueOrDie();
> ```
>
> I think the factory's `Finish` call runs an `Inspect` step. By default,
> `inspect_options.fragments == 1`, so it inspects only one file and treats
> that file's schema as the final schema. We can increase
> `inspect_options.fragments` to collect more schemas.
>
> However, the most suitable way is to set the schema in `FinishOptions`:
>
> ```c++
> ::arrow::SchemaBuilder builder;
> builder.AddField(::arrow::field("dispatching_base_num", ::arrow::large_utf8()));
> builder.AddField(::arrow::field("pickup_datetime", ::arrow::timestamp(::arrow::TimeUnit::SECOND)));
> builder.AddField(::arrow::field("dropoff_datetime", ::arrow::timestamp(::arrow::TimeUnit::SECOND)));
> builder.AddField(::arrow::field("PULocationID", ::arrow::int32()));
> builder.AddField(::arrow::field("DOLocationID", ::arrow::int32()));
> builder.AddField(::arrow::field("SR_Flag", ::arrow::int32()));
> builder.AddField(::arrow::field("dispatching_base_number", ::arrow::large_utf8()));
> finishOptions.schema = builder.Finish().ValueOrDie();
> auto dataset = factory->Finish(finishOptions).ValueOrDie();
> ```
>
> Explicitly setting a schema is better in this case.
That's truly remarkable! The method you demonstrated is certainly helpful for
someone like me who is not very familiar with Arrow. I sincerely appreciate
all your kind replies.
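
For other readers, the effect of the inspection limit described in the quote can be pictured with a small stand-alone model. This is only a sketch under stated assumptions: `ToySchema`, `InspectToy`, and `SampleFiles` are hypothetical names, the sample data is made up, no Arrow headers are used, and it is not how Arrow implements inspection. Treating a negative `fragments` value as "inspect everything" is likewise an assumption of the sketch.

```c++
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for a schema: field name -> type name. Not an Arrow type.
using ToySchema = std::map<std::string, std::string>;

// Merge the schemas of the first `fragments` files.
// Assumption of this sketch: a negative value means "inspect every file".
ToySchema InspectToy(const std::vector<ToySchema>& files, int fragments) {
  std::size_t limit = files.size();
  if (fragments >= 0) {
    limit = std::min(limit, static_cast<std::size_t>(fragments));
  }
  ToySchema merged;
  for (std::size_t i = 0; i < limit; ++i) {
    for (const auto& field : files[i]) {
      merged.insert(field);  // insert() keeps the first type seen for a name
    }
  }
  return merged;
}

// Hypothetical data: the second file carries a column the first one lacks.
std::vector<ToySchema> SampleFiles() {
  ToySchema first = {{"dispatching_base_num", "large_utf8"},
                     {"PULocationID", "int32"}};
  ToySchema second = first;
  second["SR_Flag"] = "int32";
  return {first, second};
}
```

With this data, `InspectToy(SampleFiles(), 1)` never sees `SR_Flag`, while a limit of 30 picks it up. An explicit schema sidesteps the question because nothing depends on which files happen to be inspected.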