[ https://issues.apache.org/jira/browse/PARQUET-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206579#comment-14206579 ]

Brock Noland commented on PARQUET-131:
--------------------------------------

Hi,

Thank you very much for creating this! I sincerely appreciate you taking the 
time to put this proposal together!

From the Hive side, I have the following feedback:

My understanding is that {{ColumnVector}} is an interface, so that each engine 
can provide its own implementation. This will be required for Hive, since we 
have our own {{ColumnVector}} implementation and it is extremely widely used. 
I don't think this version of the {{ColumnVector}} interface provides that 
pluggability, for the following reasons:

# Implementations such as {{LongVector}} expose public members. The same thing 
was done in Hive (public fields instead of getters and setters), but IMO for 
dubious reasons: no proof was ever provided that the JIT does not optimize the 
getters and setters away.
# Drill, Hive, etc. would be required to extend {{LongVector}} to make this 
work, but that would require massive changes on the Hive side. We should 
instead provide getters and setters for the data types on the interface, so 
that Hive can simply implement the {{ColumnVector}} interface with our 
existing classes. We might also need to provide methods such as 
{{isLongVector}} so we know the type of a given {{ColumnVector}}.
# I don't understand why {{ColumnVector}} has a {{getEncoding}} method. Isn't 
an encoding a storage feature rather than a column vector feature?
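To make the pluggability point concrete, here is a rough sketch of what a getter-based interface could look like, so that Hive could adapt its existing vector classes rather than extend a Parquet-provided {{LongVector}}. All names below ({{ColumnVector}}, {{isLongVector}}, {{LongArrayVector}}, etc.) are illustrative assumptions, not the actual proposed API:

```java
// Hypothetical getter-based ColumnVector interface. Engines implement it
// over their own storage rather than inheriting a concrete vector class.
interface ColumnVector {
    int size();                 // number of values in the batch
    boolean isNullAt(int row);  // per-row null information
    boolean isLongVector();     // type probe, as suggested above
    long getLong(int row);      // typed accessor for long vectors
}

// Example adapter: an engine's existing long-array-backed vector
// satisfies the interface without any inheritance from Parquet classes.
final class LongArrayVector implements ColumnVector {
    private final long[] values;
    private final boolean[] isNull;

    LongArrayVector(long[] values, boolean[] isNull) {
        this.values = values;
        this.isNull = isNull;
    }

    @Override public int size() { return values.length; }
    @Override public boolean isNullAt(int row) { return isNull[row]; }
    @Override public boolean isLongVector() { return true; }
    @Override public long getLong(int row) { return values[row]; }
}
```

With this shape, whether the JIT elides the accessor calls becomes an implementation detail of each engine's vector, not something the interface has to bake in via public fields.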

> Supporting Vectorized APIs in Parquet
> -------------------------------------
>
>                 Key: PARQUET-131
>                 URL: https://issues.apache.org/jira/browse/PARQUET-131
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Zhenxiao Luo
>            Assignee: Zhenxiao Luo
>
> Vectorized Query Execution can yield big performance improvements for SQL 
> engines like Hive, Drill, and Presto. Instead of processing one row at a 
> time, Vectorized Query Execution streamlines operations by processing a 
> batch of rows at a time. Within one batch, each column is represented as a 
> vector of a primitive data type. SQL engines can apply predicates very 
> efficiently on these vectors, rather than sending a single row through all 
> the operators before the next row can be processed.
> As an efficient columnar data representation, it would be nice if Parquet 
> could support Vectorized APIs, so that all SQL engines could read vectors 
> from Parquet files, and do vectorized execution for Parquet File Format.
>  
> Detail proposal:
> https://gist.github.com/zhenxiao/2728ce4fe0a7be2d3b30
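The batch-at-a-time pattern the issue describes can be sketched as follows. This is illustrative only (the class and method names are made up, not from the proposal): a predicate is evaluated over an entire primitive-typed column vector in one tight loop, producing a selection bitmap, instead of testing one row at a time as it flows through the operator tree:

```java
// Illustrative vectorized predicate evaluation over a column batch.
final class VectorizedFilter {
    // Returns a selection bitmap: selected[i] is true where column[i] > threshold.
    // The loop body is branch-light and operates on a primitive array,
    // which is the access pattern vectorized execution is built around.
    static boolean[] greaterThan(long[] column, long threshold) {
        boolean[] selected = new boolean[column.length];
        for (int i = 0; i < column.length; i++) {
            selected[i] = column[i] > threshold;
        }
        return selected;
    }
}
```

Downstream operators would then consume only the rows marked in the bitmap, amortizing per-row overhead across the whole batch.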



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
