[
https://issues.apache.org/jira/browse/PARQUET-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584388#comment-17584388
]
Micah Kornfield commented on PARQUET-2175:
------------------------------------------
I think the current signature is
[Skip(num_rows_to_skip)|https://github.com/apache/arrow/blob/545b4313d6db2dfcc4ea0aa4ac23785d64450e1d/cpp/src/parquet/column_reader.h#L223]
which is why this is confusing. The docs seem accurate, although they can
probably be clarified. Given that, I think a new SkipRows method makes sense,
and we should rename the parameter as you suggested.
> Skip method skips levels and not rows for repeated fields
> ---------------------------------------------------------
>
> Key: PARQUET-2175
> URL: https://issues.apache.org/jira/browse/PARQUET-2175
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: fatemah
> Priority: Major
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> The TypedColumnReader::Skip method, with signature:
> virtual int64_t Skip(int64_t num_levels_to_skip) = 0;
> skips levels for both repeated and non-repeated fields. For repeated fields
> we want to be able to skip rows; skipping levels is not that useful.
> For example, for the following rows:
> message M { repeated int32 b = 1 }
> rows: {}, {[10, 10]}, {[20, 20, 20]}
> values = {10, 10, 20, 20, 20};
> def_levels = {0, 1, 1, 1, 1, 1};
> rep_levels = {0, 0, 1, 0, 1, 1};
> We want Skip(2) to skip the first two rows, so that the next value that we
> read is 20. However, it skips the first two levels (the empty row and the
> first 10), so the next value that we read is the second 10.
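To make the difference concrete, here is a minimal sketch of level-based
versus row-based skipping on the example data above. It is a toy model, not
the parquet-cpp implementation: ToyColumn, Advance, SkipLevels, SkipRows and
MakeExample are hypothetical names, and the sketch assumes a single repeated
field with a maximum definition level of 1, where a row starts at every level
whose repetition level is 0.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a decoded column chunk (hypothetical, not ColumnReader).
struct ToyColumn {
  std::vector<int16_t> def_levels;
  std::vector<int16_t> rep_levels;
  std::vector<int32_t> values;
  int16_t max_def_level = 1;
  std::size_t level_pos = 0;  // index of the next level to consume
  std::size_t value_pos = 0;  // index of the next value to consume

  // Consume one level, plus one value if that level carries a value.
  void Advance() {
    if (def_levels[level_pos] == max_def_level) ++value_pos;
    ++level_pos;
  }

  // Level-based skip, the behaviour described in the issue.
  int64_t SkipLevels(int64_t num_levels_to_skip) {
    int64_t skipped = 0;
    while (skipped < num_levels_to_skip && level_pos < def_levels.size()) {
      Advance();
      ++skipped;
    }
    return skipped;
  }

  // Row-based skip (assumed semantics): consume the level that opens a row,
  // then every following level with rep_level != 0, which continues that row.
  int64_t SkipRows(int64_t num_rows_to_skip) {
    int64_t skipped = 0;
    while (skipped < num_rows_to_skip && level_pos < rep_levels.size()) {
      Advance();
      while (level_pos < rep_levels.size() && rep_levels[level_pos] != 0) {
        Advance();
      }
      ++skipped;
    }
    return skipped;
  }
};

// Builds the three rows from the issue: {}, {[10, 10]}, {[20, 20, 20]}.
ToyColumn MakeExample() {
  ToyColumn c;
  c.def_levels = {0, 1, 1, 1, 1, 1};
  c.rep_levels = {0, 0, 1, 0, 1, 1};
  c.values = {10, 10, 20, 20, 20};
  return c;
}
```

On this data, SkipLevels(2) leaves value_pos at 1 (the second 10), while
SkipRows(2) leaves value_pos at 2 (the first 20), which is the behaviour the
issue asks for.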
--
This message was sent by Atlassian Jira
(v8.20.10#820010)