[ https://issues.apache.org/jira/browse/ARROW-12243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney updated ARROW-12243:
---------------------------------
    Summary: [C++] Datasets/Fragment/ScanOptions should be immutable  (was: Datasets/Fragment/ScanOptions should be immutable)

> [C++] Datasets/Fragment/ScanOptions should be immutable
> -------------------------------------------------------
>
>                 Key: ARROW-12243
>                 URL: https://issues.apache.org/jira/browse/ARROW-12243
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>             Fix For: 5.0.0
>
>
> A dataset is a collection of fragments, a file format, and some partitioning
> rules.
> A fragment is a description of some location to get data from.
> The ScanOptions describe how the user wants the scan to behave.
>
> These things don't need mutable state. Currently, datasets do not have any
> state and their methods could be converted to const painlessly.
>
> The parquet file fragment is a little tricky. It caches statistics gleaned
> lazily from the parquet metadata at scan time. This is a little confusing
> and feels like it should be a part of a scan (although that would mean we
> lose out when scanning the same dataset multiple times). Also, though
> unlikely, what if the files were replaced in the meantime?
>
> The scan options currently get populated as the scan runs based on expression
> simplification and other information figured out at scan time. Again, this
> information feels like it should be part of a scan.
>
> These different pieces make the code confusing (to me at least) to reason
> about. For example, there was some recent discussion in
> [https://github.com/apache/arrow/pull/9802] about whether a fragment or a
> dataset should be reusable (the conclusion was they should not but I still
> think maybe they should be).
>
> Just to be clear, I'm not stating these things as self-evident and obvious.
> This is more of a "for discussion" issue. I also don't have a concrete idea
> of implementation.
> It would seem there would end up being a "scan context"
> again of some kind and it might even be persisted to use on future scans to
> save time (e.g. the cached parquet statistics).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
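
For discussion purposes, the "scan context" idea from the description could be sketched roughly as follows. This is only an illustration under stated assumptions, not Arrow's actual API: every class and member here (`Fragment`, `ScanOptions`, `ScanContext`, `Dataset::CountRows`, `path_prefix`, and the fake statistics loader) is hypothetical, and the "statistics" are a stand-in computed from the path rather than read from parquet metadata. The point is only that the dataset, fragments, and options are read through const interfaces, while everything discovered at scan time lives in a separate object that a caller may keep and reuse.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// NOTE: all names in this sketch are hypothetical, not Arrow classes.

// Immutable description of where to get data from.
class Fragment {
 public:
  explicit Fragment(std::string path) : path_(std::move(path)) {}
  const std::string& path() const { return path_; }

 private:
  std::string path_;
};

// How the user wants the scan to behave. Always passed by const reference,
// so a scan cannot write derived information back into it.
struct ScanOptions {
  std::string path_prefix;  // only scan fragments whose path starts with this
};

// Stand-in for statistics gleaned from file metadata.
struct FragmentStatistics {
  int64_t num_rows;
};

// The only mutable piece: state discovered while scanning. A caller can
// keep it around to speed up later scans, or discard it to start fresh.
class ScanContext {
 public:
  using Loader = std::function<FragmentStatistics(const Fragment&)>;

  // Returns cached statistics, invoking `load` only on the first request.
  const FragmentStatistics& GetStatistics(const Fragment& fragment,
                                          const Loader& load) {
    auto it = cache_.find(fragment.path());
    if (it == cache_.end()) {
      it = cache_.emplace(fragment.path(), load(fragment)).first;
      ++loads_;
    }
    return it->second;
  }

  // Number of times statistics actually had to be loaded.
  int loads() const { return loads_; }

 private:
  std::map<std::string, FragmentStatistics> cache_;
  int loads_ = 0;
};

class Dataset {
 public:
  explicit Dataset(std::vector<Fragment> fragments)
      : fragments_(std::move(fragments)) {}

  // const: scanning never modifies the dataset or its fragments.
  int64_t CountRows(const ScanOptions& options, ScanContext* context) const {
    assert(context != nullptr);
    int64_t total = 0;
    for (const Fragment& fragment : fragments_) {
      // Skip fragments that do not match the requested prefix.
      if (fragment.path().rfind(options.path_prefix, 0) != 0) continue;
      total += context->GetStatistics(fragment, LoadStatistics).num_rows;
    }
    return total;
  }

 private:
  // Fake loader: pretends the row count equals the path length. A real
  // implementation would read the file's metadata here.
  static FragmentStatistics LoadStatistics(const Fragment& fragment) {
    return FragmentStatistics{static_cast<int64_t>(fragment.path().size())};
  }

  std::vector<Fragment> fragments_;
};
```

With this shape, scanning the same `Dataset` twice with the same `ScanContext` loads each fragment's statistics once, while passing a fresh `ScanContext` forces a reload, which is one way to handle the "files were replaced in the meantime" concern.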