[ 
https://issues.apache.org/jira/browse/PARQUET-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584388#comment-17584388
 ] 

Micah Kornfield commented on PARQUET-2175:
------------------------------------------

I think the current signature is 
[Skip(num_rows_to_skip)|https://github.com/apache/arrow/blob/545b4313d6db2dfcc4ea0aa4ac23785d64450e1d/cpp/src/parquet/column_reader.h#L223]
 which is why this is confusing.  The docs seem accurate, although they can 
probably be clarified.  Given that, I think a new SkipRows method makes 
sense, and we should rename the parameter as you suggested.

> Skip method skips levels and not rows for repeated fields
> ---------------------------------------------------------
>
>                 Key: PARQUET-2175
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2175
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: fatemah
>            Priority: Major
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> The TypedColumnReader::Skip method, with signature:
> virtual int64_t Skip(int64_t num_levels_to_skip) = 0;
> skips levels for both repeated and non-repeated fields. For repeated fields 
> we want to be able to skip rows; skipping levels is not that useful 
> there.
> For example, for the following rows:
> message M { repeated int32 b = 1 }
> rows: {}, {[10, 10]}, {[20, 20, 20]}
> values = {10, 10, 20, 20, 20};
> def_levels = {0, 1, 1, 1, 1, 1};
> rep_levels = {0, 0, 1, 0, 1, 1};
> We want Skip(2) to skip the first two rows, so that the next value that we 
> read is 20. However, it skips the first two levels instead, so the next 
> value that we read is the second 10.
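
To make the level-versus-row distinction in the quoted example concrete, here 
is a minimal standalone sketch. It does not use the parquet-cpp API; the 
helper names LevelsSpanningRows and ValuesInLevels are hypothetical, and it 
assumes a single repeated field with max definition level 1, read from the 
start of the column chunk (a row boundary).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of parquet-cpp.
// Returns how many levels are covered by the first `num_rows` rows.
// A new row starts wherever the repetition level is 0. Assumes that
// position 0 of `rep_levels` is itself a row boundary.
std::size_t LevelsSpanningRows(const std::vector<int16_t>& rep_levels,
                               std::size_t num_rows) {
  std::size_t rows_seen = 0;
  for (std::size_t i = 0; i < rep_levels.size(); ++i) {
    if (rep_levels[i] == 0) {
      // Level i opens row number `rows_seen`; stop once we reach the
      // first row that should NOT be skipped.
      if (rows_seen == num_rows) return i;
      ++rows_seen;
    }
  }
  // Fewer than num_rows + 1 rows are present: every level is covered.
  return rep_levels.size();
}

// Hypothetical helper, not part of parquet-cpp.
// Counts the values stored for the first `num_levels` levels. Only levels
// whose definition level equals `max_def_level` carry an actual value.
std::size_t ValuesInLevels(const std::vector<int16_t>& def_levels,
                           std::size_t num_levels, int16_t max_def_level) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < num_levels && i < def_levels.size(); ++i) {
    if (def_levels[i] == max_def_level) ++count;
  }
  return count;
}
```

With the levels from the example, skipping 2 levels consumes 
ValuesInLevels(def_levels, 2, 1) == 1 value, so the next value read is 
values[1] == 10. Skipping 2 rows covers LevelsSpanningRows(rep_levels, 2) == 3 
levels, which hold 2 values, so the next value read is values[2] == 20. A 
SkipRows method along these lines would translate a row count into a level 
count before skipping.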



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
