westonpace commented on code in PR #13782:
URL: https://github.com/apache/arrow/pull/13782#discussion_r937901786
##########
cpp/src/arrow/dataset/dataset.h:
##########
@@ -59,6 +156,17 @@ class ARROW_DS_EXPORT Fragment : public std::enable_shared_from_this<Fragment> {
virtual Result<RecordBatchGenerator> ScanBatchesAsync(
const std::shared_ptr<ScanOptions>& options) = 0;
+ /// \brief Inspect a fragment to learn basic information
Review Comment:
Yes, exactly. I'm not a huge fan of this because fragment scanning is now
broken into three fragment scanner steps, interleaved with two scanner steps:
 * FragmentScanner: Inspect fragment (new step)
 * Scanner: Create evolution
 * FragmentScanner: Open reader (where we previously determined column names /
   schema)
 * Scanner: Start iteration task
 * FragmentScanner: Scan
I could push the evolution creation into the fragment scanner, but that adds
a step that every fragment scanner implementation has to perform, so I think
that would be worse.
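
To make the ordering concrete, here is a rough sketch of how the scanner would drive those steps. Every type and method name below (`InspectedFragment`, `Evolution`, `Inspect`, `CreateEvolution`, `OpenReader`, `ScanFragment`) is a hypothetical stand-in chosen for illustration, not the API in this PR, and the steps only record themselves in a log instead of doing real I/O.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; not the Arrow API.
struct InspectedFragment {
  std::vector<std::string> column_names;
};

struct Evolution {
  std::vector<int> selected_columns;
};

class FragmentScanner {
 public:
  explicit FragmentScanner(std::vector<std::string>* log) : log_(log) {}

  // New step: learn basic information before an evolution exists.
  InspectedFragment Inspect() {
    log_->push_back("FragmentScanner: inspect fragment");
    InspectedFragment inspected;
    inspected.column_names = {"a", "b"};  // placeholder data
    return inspected;
  }

  // Previously the point where column names / schema were determined.
  void OpenReader(const Evolution& evolution) {
    log_->push_back("FragmentScanner: open reader");
    selected_ = evolution.selected_columns;
    opened_ = true;
  }

  void Scan() {
    assert(opened_ && "reader must be opened before scanning");
    log_->push_back("FragmentScanner: scan");
  }

 private:
  std::vector<std::string>* log_;
  std::vector<int> selected_;
  bool opened_ = false;
};

class Scanner {
 public:
  explicit Scanner(std::vector<std::string>* log) : log_(log) {}

  // The scanner owns evolution creation, so fragment scanners do not
  // each have to implement it.
  Evolution CreateEvolution(const InspectedFragment& inspected) {
    log_->push_back("Scanner: create evolution");
    Evolution evolution;
    for (int i = 0; i < static_cast<int>(inspected.column_names.size()); ++i) {
      evolution.selected_columns.push_back(i);
    }
    return evolution;
  }

  // Drives the five steps in the order listed above.
  void ScanFragment(FragmentScanner& fragment_scanner) {
    InspectedFragment inspected = fragment_scanner.Inspect();
    Evolution evolution = CreateEvolution(inspected);
    fragment_scanner.OpenReader(evolution);
    log_->push_back("Scanner: start iteration task");
    fragment_scanner.Scan();
  }

 private:
  std::vector<std::string>* log_;
};
```

In this sketch the fragment scanner only implements the three calls made on it; moving `CreateEvolution` into `FragmentScanner` would push that logic into every implementation, which is the trade-off described above.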
Your point on "actual parquet footer" and "the schema" is also correct. My
"prototypical future evolution strategy" that I am trying to make sure we plan
for is to resolve field references by parquet column id. So it would work in
this fashion:
* User specifies numeric field references against the dataset schema
* Evolution strategy has a lookup table from numeric position in dataset
  schema to column id
* On inspection, physical column ids are inserted into the schema and
  passed to the evolution
* Evolution uses the lookup table combined with the physical column ids to
  determine the physical columns that are being asked for
This way, the inspect step would need to return something that carries the
column IDs. That could be the parquet footer. However, it could also be the
schema extracted from the footer, since we already have a mechanism for
encoding those column IDs into the schema today. This would leave the door
open to someone adding column IDs to IPC files as well.
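
The column-id resolution described above could look roughly like the following. This is only a sketch under assumptions: `InspectedFragment`, `ColumnIdEvolution`, and `Resolve` are hypothetical names, the inspect result is reduced to a plain vector of column ids (one per physical column), and a missing column is reported as `std::nullopt` rather than being filled or rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical types for illustration only; not the Arrow API.

// What the inspect step might hand back: the column id of each physical
// column, indexed by the column's position in the fragment.
struct InspectedFragment {
  std::vector<int32_t> physical_column_ids;
};

// Hypothetical evolution strategy that resolves fields by column id.
struct ColumnIdEvolution {
  // Lookup table: position in the dataset schema -> column id.
  std::vector<int32_t> dataset_position_to_column_id;

  // For each requested dataset-schema position, returns the index of the
  // matching physical column, or std::nullopt if the fragment lacks it.
  std::vector<std::optional<int>> Resolve(
      const std::vector<int>& requested_positions,
      const InspectedFragment& inspected) const {
    // Invert the inspected ids: column id -> physical column index.
    std::unordered_map<int32_t, int> id_to_physical;
    for (std::size_t i = 0; i < inspected.physical_column_ids.size(); ++i) {
      id_to_physical.emplace(inspected.physical_column_ids[i],
                             static_cast<int>(i));
    }

    std::vector<std::optional<int>> resolved;
    resolved.reserve(requested_positions.size());
    for (int position : requested_positions) {
      assert(position >= 0 &&
             static_cast<std::size_t>(position) <
                 dataset_position_to_column_id.size());
      auto it = id_to_physical.find(dataset_position_to_column_id[position]);
      if (it == id_to_physical.end()) {
        resolved.push_back(std::nullopt);
      } else {
        resolved.push_back(it->second);
      }
    }
    return resolved;
  }
};
```

Because `Resolve` only needs the ids, it does not care whether they came from a parquet footer or from a schema with the ids encoded in it, which is what would keep the door open for other formats.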