[ 
https://issues.apache.org/jira/browse/ARROW-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8065:
------------------------------------------
    Description: 
Currently: a fragment is a product of a scan; it is a lazy collection of scan 
tasks corresponding to a data source which is logically singular (like a single 
file, a single row group, ...). It would be more useful if instead a fragment 
were the direct object of a scan; one scans a fragment (or a collection of 
fragments):

 # Remove {{ScanOptions}} from Fragment's properties and move it into 
{{Fragment::Scan}} parameters.
 # Remove {{ScanOptions}} from {{Dataset::GetFragments}}. We can provide an 
overload to support predicate pushdown in FileSystemDataset and UnionDataset 
{{Dataset::GetFragments(std::shared_ptr<Expression> predicate)}}.
 # Expose a lazy accessor, {{Fragment::physical_schema()}}.

This will lessen the cognitive dissonance between fragments and files since 
fragments will no longer include references to scan properties.
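
A minimal sketch of the shape the three changes above describe. Every type here ({{Schema}}, {{ScanOptions}}, {{Predicate}}, {{Fragment}}, {{Dataset}}) is a simplified, hypothetical stand-in written for illustration, not the actual Arrow C++ class. A predicate over the fragment path stands in for {{Expression}}, and scan tasks are represented as plain strings.

```cpp
// Hypothetical, simplified stand-ins; NOT the real Arrow C++ API.
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a schema: just a list of column names.
using Schema = std::vector<std::string>;

// Stand-in for ScanOptions: scan-time settings only.
struct ScanOptions {
  std::vector<std::string> projected_columns;
};

// Stand-in for Expression: a predicate evaluated on a fragment's path.
using Predicate = std::function<bool(const std::string&)>;

// A fragment describes one logically singular data source (e.g. a file)
// and holds no scan properties.
class Fragment {
 public:
  Fragment(std::string path, Schema on_disk_schema)
      : path_(std::move(path)), on_disk_schema_(std::move(on_disk_schema)) {
    assert(!path_.empty());  // a fragment must name a data source
  }

  const std::string& path() const { return path_; }

  // Item 3: lazy accessor. The schema is "read" on first use, then cached.
  const Schema& physical_schema() {
    if (!schema_loaded_) {
      physical_schema_ = on_disk_schema_;  // pretend this inspects the file
      schema_loaded_ = true;
      ++schema_reads_;
    }
    return physical_schema_;
  }

  int schema_reads() const { return schema_reads_; }

  // Item 1: ScanOptions is a parameter of Scan, not fragment state.
  // Returns one task label per projected column present in the schema.
  std::vector<std::string> Scan(const ScanOptions& options) {
    std::vector<std::string> tasks;
    for (const auto& column : options.projected_columns) {
      for (const auto& name : physical_schema()) {
        if (name == column) tasks.push_back(path_ + ":" + column);
      }
    }
    return tasks;
  }

 private:
  std::string path_;
  Schema on_disk_schema_;
  Schema physical_schema_;
  bool schema_loaded_ = false;
  int schema_reads_ = 0;
};

class Dataset {
 public:
  explicit Dataset(std::vector<std::shared_ptr<Fragment>> fragments)
      : fragments_(std::move(fragments)) {}

  // Item 2: no ScanOptions needed to enumerate fragments.
  std::vector<std::shared_ptr<Fragment>> GetFragments() const {
    return fragments_;
  }

  // Item 2: overload that prunes fragments with a predicate.
  std::vector<std::shared_ptr<Fragment>> GetFragments(
      const Predicate& predicate) const {
    std::vector<std::shared_ptr<Fragment>> selected;
    for (const auto& fragment : fragments_) {
      if (predicate(fragment->path())) selected.push_back(fragment);
    }
    return selected;
  }

 private:
  std::vector<std::shared_ptr<Fragment>> fragments_;
};
```

In this sketch a caller first selects fragments from the dataset, then passes {{ScanOptions}} to each {{Fragment::Scan}} call, so the same fragment can be scanned with different options and its schema is only read once it is needed.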


  was:
Currently: a fragment is a product of a scan; it is a lazy collection of scan 
tasks corresponding to a data source which is logically singular (like a single 
file, a single row group, ...). It would be more useful if instead a fragment 
were the direct object of a scan; one scans a fragment (or a collection of 
fragments):

 # Remove {{ScanOptions}} from Fragment's properties and move it into 
{{Fragment::Scan}} parameters.
 # Remove {{ScanOptions}} from {{Dataset::GetFragments}}. We can provide an 
overload to support predicate pushdown in FileSystemDataset and UnionDataset 
{{Dataset::GetFragments(std::shared_ptr<Expression> predicate)}}.
 # {{Fragment::schema}} property should be set at construction time, usually 
extracted from the fragment's Dataset.

This will lessen the cognitive dissonance between fragments and files since 
fragments will no longer include references to scan properties.



> [C++][Dataset] Untangle Dataset, Fragment and ScanOptions
> ---------------------------------------------------------
>
>                 Key: ARROW-8065
>                 URL: https://issues.apache.org/jira/browse/ARROW-8065
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++ - Dataset
>            Reporter: Francois Saint-Jacques
>            Priority: Major
>
> Currently: a fragment is a product of a scan; it is a lazy collection of scan 
> tasks corresponding to a data source which is logically singular (like a 
> single file, a single row group, ...). It would be more useful if instead a 
> fragment were the direct object of a scan; one scans a fragment (or a 
> collection of fragments):
>  # Remove {{ScanOptions}} from Fragment's properties and move it into 
> {{Fragment::Scan}} parameters.
>  # Remove {{ScanOptions}} from {{Dataset::GetFragments}}. We can provide an 
> overload to support predicate pushdown in FileSystemDataset and UnionDataset 
> {{Dataset::GetFragments(std::shared_ptr<Expression> predicate)}}.
>  # Expose a lazy accessor, {{Fragment::physical_schema()}}.
> This will lessen the cognitive dissonance between fragments and files since 
> fragments will no longer include references to scan properties.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
