[ 
https://issues.apache.org/jira/browse/PARQUET-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584171#comment-14584171
 ] 

Nezih Yigitbasi commented on PARQUET-299:
-----------------------------------------

[~dongc] Ah, now I see what you mean. You are right: currently the reader may 
return more than 1K rows for every page except the last one. The problem is 
that a data page holds its data as a `BytesInput`, so we cannot know the 
element boundaries without decoding the page. To load exactly 1K rows we would 
have to decode a page and then read only as many rows as necessary, which 
incurs a performance penalty from decoding on the read path (and it also 
conflicts with our plans to support lazy decoding). I think it is better to 
handle this on the Hive side: keep an internal pointer into the vector that 
the vectorized reader returns, and after each read push exactly 1K rows to the 
further stages of processing.
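A minimal sketch of the consumer-side approach described above. All names and the API here are invented for illustration (this is not parquet-mr or Hive code): the reader hands back a page-aligned vector that may hold more than the desired batch size, and the wrapper keeps an internal pointer into it so that downstream stages see at most 1K rows per call.

```java
import java.util.Arrays;

// Hypothetical consumer-side batcher: buffers over-full, page-aligned
// vectors and emits slices of at most `batchSize` rows.
public class RowBatcher {
    private final int batchSize;
    private long[] vector = new long[0]; // rows decoded from the last page(s)
    private int pos = 0;                 // internal pointer into `vector`

    public RowBatcher(int batchSize) { this.batchSize = batchSize; }

    /** Accept a (possibly over-full) vector as returned by the reader. */
    public void refill(long[] pageVector) {
        // Keep any rows not yet consumed, then append the new page's rows.
        int leftover = vector.length - pos;
        long[] merged = new long[leftover + pageVector.length];
        System.arraycopy(vector, pos, merged, 0, leftover);
        System.arraycopy(pageVector, 0, merged, leftover, pageVector.length);
        vector = merged;
        pos = 0;
    }

    /** Push at most `batchSize` rows to the next stage; empty when drained. */
    public long[] nextBatch() {
        int n = Math.min(batchSize, vector.length - pos);
        long[] out = Arrays.copyOfRange(vector, pos, pos + n);
        pos += n;
        return out;
    }
}
```

With a 1K batch size, a page that decodes into 1500 rows yields one full batch of 1024 rows and a tail of 476, instead of a single over-sized batch.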

> [Vectorized Reader] ColumnVector length should be in terms of rows, not 
> DataPages
> ---------------------------------------------------------------------------------
>
>                 Key: PARQUET-299
>                 URL: https://issues.apache.org/jira/browse/PARQUET-299
>             Project: Parquet
>          Issue Type: Sub-task
>          Components: parquet-mr
>            Reporter: Zhenxiao Luo
>
> In https://github.com/zhenxiao/incubator-parquet-mr/tree/vector
> ColumnVector length is in terms of DataPages, need to be in terms of rows



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)