[ https://issues.apache.org/jira/browse/ARROW-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alessandro Molina updated ARROW-12243:
--------------------------------------
Fix Version/s: 7.0.0 (was: 6.0.0)
> [C++] Datasets/Fragment/ScanOptions should be immutable
> -------------------------------------------------------
>
> Key: ARROW-12243
> URL: https://issues.apache.org/jira/browse/ARROW-12243
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Priority: Major
> Fix For: 7.0.0
>
>
> A dataset is a collection of fragments, a file format, and some partitioning
> rules.
> A fragment is a description of some location to get data from.
> The ScanOptions describe how the user wants the scan to behave.
>
> These things don't need mutable state. Currently, datasets do not have any
> state, and their methods could be converted to const painlessly.
>
> The parquet file fragment is a little tricky. It caches statistics gleaned
> lazily from the parquet metadata at scan time. This is a little confusing
> and feels like it should be part of a scan (although that would mean we
> lose the benefit of the cache across multiple scans of the same dataset).
> Also, though unlikely, what if the files were replaced in the meantime?
>
> The scan options currently get populated as the scan runs based on expression
> simplification and other information figured out at scan time. Again, this
> information feels like it should be part of a scan.
>
> These different pieces make the code confusing (to me at least) to reason
> about. For example, there was some recent discussion in
> [https://github.com/apache/arrow/pull/9802] about whether a fragment or a
> dataset should be reusable (the conclusion was that they should not be, but
> I still think maybe they should be).
>
> Just to be clear, I'm not stating these things as self-evident and obvious.
> This is more of a "for discussion" issue. I also don't have a concrete idea
> of the implementation. It seems there would end up being a "scan context"
> of some kind again, and it might even be persisted for use on future scans
> to save time (e.g. the cached parquet statistics).
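
To make the "scan context" idea in the description a bit more concrete, here is a minimal, self-contained C++ sketch. It is not taken from the issue or from the Arrow codebase: every name in it (this ScanOptions, Fragment, FileStatistics, ScanContext, StatisticsFor) is a hypothetical stand-in, and the "statistics" are faked rather than read from Parquet metadata. The only point is the split: the inputs stay const, and anything learned at scan time is kept in a separate object that could be dropped or reused for a later scan.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// NOTE: all types here are hypothetical illustrations, not Arrow APIs.

// What the user asked for. Never modified once a scan starts.
struct ScanOptions {
  std::string filter;                // e.g. "x > 5", kept as opaque text here
  std::vector<std::string> columns;  // projected column names
  std::int64_t batch_size = 1 << 20;
};

// Statistics a Parquet-like fragment might glean from file metadata.
struct FileStatistics {
  std::int64_t num_rows = 0;
};

// A fragment only describes where the data lives; it holds no cache.
class Fragment {
 public:
  explicit Fragment(std::string path) : path_(std::move(path)) {}
  const std::string& path() const { return path_; }

 private:
  std::string path_;
};

// Everything figured out while scanning lives here, not in the inputs.
class ScanContext {
 public:
  explicit ScanContext(std::shared_ptr<const ScanOptions> options)
      : options_(std::move(options)) {}

  const ScanOptions& options() const { return *options_; }

  // Returns the statistics recorded for a fragment, if any.
  std::optional<FileStatistics> statistics(const Fragment& fragment) const {
    auto it = statistics_.find(fragment.path());
    if (it == statistics_.end()) return std::nullopt;
    return it->second;
  }

  void RecordStatistics(const Fragment& fragment, FileStatistics stats) {
    statistics_[fragment.path()] = stats;
  }

 private:
  std::shared_ptr<const ScanOptions> options_;
  std::map<std::string, FileStatistics> statistics_;
};

// Stand-in for reading metadata: `loads` counts the simulated reads, and the
// row count is a fake value derived from the path length.
inline FileStatistics StatisticsFor(const Fragment& fragment,
                                    ScanContext& context, int& loads) {
  if (auto cached = context.statistics(fragment)) return *cached;
  ++loads;
  FileStatistics stats{static_cast<std::int64_t>(fragment.path().size())};
  context.RecordStatistics(fragment, stats);
  return stats;
}
```

With this shape, a second call to StatisticsFor against the same ScanContext is served from the cache, while a fresh ScanContext starts empty. That is one possible answer to the "what if the files were replaced in the meantime?" concern, at the cost of re-reading metadata unless the context is deliberately persisted.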
--
This message was sent by Atlassian Jira
(v8.3.4#803005)