[ 
https://issues.apache.org/jira/browse/ARROW-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-12243:
---------------------------------
    Summary: [C++] Datasets/Fragment/ScanOptions should be immutable  (was: 
Datasets/Fragment/ScanOptions should be immutable)

> [C++] Datasets/Fragment/ScanOptions should be immutable
> -------------------------------------------------------
>
>                 Key: ARROW-12243
>                 URL: https://issues.apache.org/jira/browse/ARROW-12243
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>             Fix For: 5.0.0
>
>
> A dataset is a collection of fragments, a file format, and some partitioning 
> rules.
> A fragment is a description of some location to get data from.
> The ScanOptions describe how the user wants the scan to behave.
>  
> These things don't need mutable state.  Currently, datasets do not have any 
> state and their methods could be converted to const painlessly.
>  
> The parquet file fragment is a little tricky.  It caches statistics gleaned 
> lazily from the parquet metadata at scan time.  This is a little confusing 
> and feels like it should be a part of a scan (although that would mean we 
> lose the cached statistics across multiple scans of the same dataset).  
> Also, though unlikely, 
> what if the files were replaced in the meantime?
>  
> The scan options currently get populated as the scan runs based on expression 
> simplification and other information figured out at scan time.  Again, this 
> information feels like it should be part of a scan.
>  
> These different pieces make the code confusing (to me at least) to reason 
> about.  For example, there was some recent discussion in 
> [https://github.com/apache/arrow/pull/9802] about whether a fragment or a 
> dataset should be reusable (the conclusion was they should not but I still 
> think maybe they should be).
>  
> Just to be clear, I'm not stating these things as self-evident and obvious.  
> This is more of a "for discussion" issue.  I also don't have a concrete idea 
> of implementation.  It would seem there would end up being a "scan context" 
> again of some kind and it might even be persisted to use on future scans to 
> save time (e.g. the cached parquet statistics).
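
To make the "scan context" idea above concrete, here is a minimal sketch of
how the split might look.  It is an illustration only, not Arrow's actual
API: Fragment, ScanOptions, Statistics, ScanContext, Dataset,
StatisticsLoader and CountMatchingFragments are all hypothetical stand-ins
defined just for this example.  The dataset, fragments and options are only
ever read through const references, and the lazily loaded statistics live in
a separate context object that the caller may keep for a later scan.

```cpp
// Hypothetical sketch: none of these types are Arrow's real classes.
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Describes where to get data from; never modified by a scan.
struct Fragment {
  std::string path;
};

// Describes what the user asked for; never modified by a scan.
// A single lower bound stands in for a real filter expression.
struct ScanOptions {
  long lower_bound;
};

// Stand-in for statistics read from Parquet metadata.
struct Statistics {
  long min_value;
  long max_value;
};

// Assumed callback that reads statistics for one fragment.
using StatisticsLoader = std::function<Statistics(const Fragment&)>;

// Holds all state discovered at scan time, so nothing else has to mutate.
class ScanContext {
 public:
  // Returns the cached statistics for a fragment, loading them on first use.
  const Statistics& GetStatistics(const Fragment& fragment,
                                  const StatisticsLoader& load) {
    auto it = stats_.find(fragment.path);
    if (it == stats_.end()) {
      it = stats_.emplace(fragment.path, load(fragment)).first;
    }
    return it->second;
  }

  std::size_t cached_fragments() const { return stats_.size(); }

 private:
  std::map<std::string, Statistics> stats_;
};

// A dataset whose only accessor is const.
class Dataset {
 public:
  explicit Dataset(std::vector<Fragment> fragments)
      : fragments_(std::move(fragments)) {}

  const std::vector<Fragment>& fragments() const { return fragments_; }

 private:
  std::vector<Fragment> fragments_;
};

// Counts the fragments that might contain rows >= options.lower_bound.
// Only the context is passed by non-const reference.
int CountMatchingFragments(const Dataset& dataset, const ScanOptions& options,
                           ScanContext& context,
                           const StatisticsLoader& load) {
  int matching = 0;
  for (const Fragment& fragment : dataset.fragments()) {
    const Statistics& stats = context.GetStatistics(fragment, load);
    if (stats.max_value >= options.lower_bound) {
      ++matching;
    }
  }
  return matching;
}
```

Passing the same ScanContext to a second call reuses the cached statistics,
while passing a fresh one forces a reload, which would also cover the case
where the files were replaced between scans.  Keying the cache by path is an
assumption made for brevity.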



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
