[
https://issues.apache.org/jira/browse/ORC-1172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
William Hyun closed ORC-1172.
-----------------------------
> add row count limit config for one stripe
> ------------------------------------------
>
> Key: ORC-1172
> URL: https://issues.apache.org/jira/browse/ORC-1172
> Project: ORC
> Issue Type: New Feature
> Components: Java
> Affects Versions: 1.8.0
> Reporter: wesleydeng
> Assignee: wesleydeng
> Priority: Major
> Fix For: 1.8.0
>
>
> For query engines such as Presto, the stripe is the base unit of query
> concurrency: one stripe can only be processed by one split.
> In the current implementation of the ORC writer, the only config that affects
> the row count of a stripe is "orc.stripe.size".
> But across different kinds of tables, it is difficult to control the row
> count through the stripe size:
> * for a table with many columns (e.g. 100 columns), 64MB may contain 5000 rows.
> * for a table with few columns (e.g. 5 columns), 64MB may contain 100000 rows.
> In Presto, a typical OLAP query reads only a subset of the table's columns,
> so the row count is the key factor in query performance. If one stripe
> contains too many rows, query performance may degrade considerably.
> So, besides the config "orc.stripe.size", we need another config such as
> "orc.stripe.row.count" to control the row count of one stripe.
> A similar config has been introduced in cuDF (a GPU DataFrame library based
> on Apache Arrow):
> [rapidsai/cudf#9261|https://github.com/rapidsai/cudf/issues/9261]
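
The effect of the proposed limit can be sketched with a little arithmetic. The following is a minimal illustration only, not the ORC writer's code: the class and method names are hypothetical, no ORC API is used, and the average row sizes are simply derived from the 64MB examples in the description. It assumes a stripe is closed by whichever limit is reached first, the byte budget ("orc.stripe.size") or the row cap (the proposed "orc.stripe.row.count").

```java
public class StripeRowLimitSketch {

    /**
     * Rows held by one stripe, assuming it is closed by whichever limit is hit
     * first: the byte budget or the row-count cap. Hypothetical helper.
     */
    public static long rowsPerStripe(long stripeSizeBytes, long avgRowBytes, long rowCountLimit) {
        // Rows that fit into the byte budget (at least one row per stripe).
        long rowsBySize = Math.max(1, stripeSizeBytes / avgRowBytes);
        return Math.min(rowsBySize, rowCountLimit);
    }

    /** Number of stripes (and so the number of units a split can work on), rounded up. */
    public static long stripeCount(long totalRows, long rowsPerStripe) {
        return (totalRows + rowsPerStripe - 1) / rowsPerStripe;
    }

    public static void main(String[] args) {
        long stripeSize = 64L * 1024 * 1024;     // 64MB byte budget
        long wideRowBytes = stripeSize / 5_000;     // ~100 columns: about 13 KB per row
        long narrowRowBytes = stripeSize / 100_000; // ~5 columns: about 671 bytes per row
        long rowLimit = 20_000;                     // illustrative value for the row cap
        long totalRows = 1_000_000;

        long narrowNoCap = rowsPerStripe(stripeSize, narrowRowBytes, Long.MAX_VALUE);
        long narrowCapped = rowsPerStripe(stripeSize, narrowRowBytes, rowLimit);
        long wideCapped = rowsPerStripe(stripeSize, wideRowBytes, rowLimit);

        System.out.println("narrow table, size only: " + narrowNoCap + " rows/stripe, "
                + stripeCount(totalRows, narrowNoCap) + " stripes");
        System.out.println("narrow table, with cap:  " + narrowCapped + " rows/stripe, "
                + stripeCount(totalRows, narrowCapped) + " stripes");
        System.out.println("wide table, with cap:    " + wideCapped + " rows/stripe, "
                + stripeCount(totalRows, wideCapped) + " stripes");
    }
}
```

Under these assumptions the cap only changes the narrow table: the wide table already closes its stripes on size well below the cap, while the narrow table is split into more, smaller stripes that an engine can process in parallel.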
--
This message was sent by Atlassian Jira
(v8.20.10#820010)