[GitHub] [arrow] westonpace commented on a diff in pull request #13782: ARROW-17287: [C++] Create scan node that doesn't rely on the merged generator

GitBox Thu, 04 Aug 2022 08:01:53 -0700


westonpace commented on code in PR #13782:
URL: https://github.com/apache/arrow/pull/13782#discussion_r937901786



##########
cpp/src/arrow/dataset/dataset.h:
##########
@@ -59,6 +156,17 @@ class ARROW_DS_EXPORT Fragment : public std::enable_shared_from_this<Fragment> {
   virtual Result<RecordBatchGenerator> ScanBatchesAsync(
       const std::shared_ptr<ScanOptions>& options) = 0;
 
+  /// \brief Inspect a fragment to learn basic information

Review Comment:
   Yes, exactly.  I'm not a huge fan of this because fragment scanning is now 
broken into three fragment scanner steps, interleaved with two scanner steps:
   
   FragmentScanner: Inspect fragment (new step)
   Scanner: Create evolution
   FragmentScanner: Open reader (where we previously determined column names / 
schema)
   Scanner: Start iteration task
   FragmentScanner: Scan
   
   I could push the evolution creation into the fragment scanner, but that would 
add a step that every fragment scanner implementation has to perform, so I 
think that would be worse.
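   
   The sequence above can be sketched roughly as follows.  This is only an 
illustration of the call order: `InspectedFragment`, `FragmentEvolution`, the 
`FragmentScanner` methods, and `ScanOneFragment` are hypothetical stand-ins, 
not the classes in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not the Arrow classes.
struct InspectedFragment {
  std::vector<std::string> column_names;
};

struct FragmentEvolution {
  std::vector<std::size_t> physical_columns;
};

class FragmentScanner {
 public:
  explicit FragmentScanner(std::vector<std::string>* log) : log_(log) {}

  // New step: learn basic information before any reader is opened.
  InspectedFragment Inspect() {
    log_->push_back("FragmentScanner: inspect fragment");
    InspectedFragment inspected;
    inspected.column_names.push_back("a");
    inspected.column_names.push_back("b");
    return inspected;
  }

  void OpenReader(const FragmentEvolution& evolution) {
    log_->push_back("FragmentScanner: open reader");
    evolution_ = evolution;
  }

  void Scan() { log_->push_back("FragmentScanner: scan"); }

 private:
  std::vector<std::string>* log_;
  FragmentEvolution evolution_;
};

// The scanner drives the sequence and owns the two steps in between.
std::vector<std::string> ScanOneFragment() {
  std::vector<std::string> log;
  FragmentScanner fragment_scanner(&log);

  InspectedFragment inspected = fragment_scanner.Inspect();

  log.push_back("Scanner: create evolution");
  FragmentEvolution evolution;
  for (std::size_t i = 0; i < inspected.column_names.size(); ++i) {
    evolution.physical_columns.push_back(i);  // Select every column.
  }

  fragment_scanner.OpenReader(evolution);

  log.push_back("Scanner: start iteration task");
  fragment_scanner.Scan();
  return log;
}
```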
   
   Your point on "actual parquet footer" and "the schema" is also correct.  The 
"prototypical future evolution strategy" that I am trying to make sure we plan 
for is to resolve field references by Parquet column ID.  It would work like 
this:
   
    * User specifies numeric field references against the dataset schema
    * Evolution strategy has a lookup table from numeric position in dataset 
schema to column ID
    * On inspection, physical column ids are inserted into the schema and 
passed to the evolution
    * Evolution uses the lookup table combined with the physical column ids to 
determine the physical columns that are being asked for
   
   So the inspect step would need to return something that carries the column 
IDs.  This could be the Parquet footer.  However, it could also be the schema 
extracted from the footer, since we already have a mechanism for encoding 
column IDs into the schema.  That would leave the door open to someone adding 
column IDs to IPC files as well.
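
   As a rough sketch of the lookup described in the list above, assuming the 
inspect step hands back one column ID per physical column: 
`InspectedFragment`, `ColumnIdEvolution`, and `Resolve` are hypothetical names 
made up for this example and do not exist in Arrow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; none of these exist in Arrow.

// What the inspect step might return: the column ID of each physical column,
// in file order.
struct InspectedFragment {
  std::vector<std::int32_t> physical_column_ids;
};

class ColumnIdEvolution {
 public:
  // dataset_position_to_id[i] is the column ID of field i in the dataset
  // schema.
  explicit ColumnIdEvolution(std::vector<std::int32_t> dataset_position_to_id)
      : dataset_position_to_id_(std::move(dataset_position_to_id)) {}

  // For each requested dataset field position, return the index of the
  // physical column holding it, or std::nullopt if the fragment lacks it.
  std::vector<std::optional<std::size_t>> Resolve(
      const std::vector<std::size_t>& dataset_positions,
      const InspectedFragment& inspected) const {
    std::unordered_map<std::int32_t, std::size_t> id_to_physical;
    for (std::size_t i = 0; i < inspected.physical_column_ids.size(); ++i) {
      id_to_physical.emplace(inspected.physical_column_ids[i], i);
    }

    std::vector<std::optional<std::size_t>> resolved;
    resolved.reserve(dataset_positions.size());
    for (std::size_t position : dataset_positions) {
      assert(position < dataset_position_to_id_.size());
      auto found = id_to_physical.find(dataset_position_to_id_[position]);
      if (found == id_to_physical.end()) {
        resolved.push_back(std::nullopt);
      } else {
        resolved.push_back(found->second);
      }
    }
    return resolved;
  }

 private:
  std::vector<std::int32_t> dataset_position_to_id_;
};
```

   Because only column IDs cross the boundary, the same evolution would work 
whether they come from a footer or from a schema that carries them.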



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
