[ 
https://issues.apache.org/jira/browse/HIVE-6234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Hanson updated HIVE-6234:
------------------------------

    Attachment: HIVE-6234.02.patch

> Implement fast vectorized InputFormat extension for text files
> --------------------------------------------------------------
>
>                 Key: HIVE-6234
>                 URL: https://issues.apache.org/jira/browse/HIVE-6234
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Eric Hanson
>            Assignee: Eric Hanson
>         Attachments: HIVE-6234.02.patch, Vectorized Text InputFormat 
> design.docx, Vectorized Text InputFormat design.pdf
>
>
> Implement support for vectorized scan input of text files (plain text with
> configurable record and field separators). This should work for CSV files,
> tab-delimited files, etc.
> The goal is to provide high-performance reading of these files using
> vectorized scans, and to do so as an extension of existing Hive. Then,
> if vectorized query execution is enabled, existing tables based on text
> files will benefit immediately without the need to use a different input
> format. After upgrading to a Hive release that supports this, faster,
> vectorized processing over existing text tables should just work when
> vectorization is enabled.
> Another goal is to go beyond simply layering a vectorized row batch
> iterator on top of the existing row iterator. It should be possible to,
> say, read a chunk of data into a byte buffer (several thousand or even
> millions of rows), and then read data from it directly into vectorized row
> batches. Object creation should be minimized to save allocation time and GC
> overhead. If it is possible to save CPU for values like dates and numbers by
> caching the translation from string to the final data type, that should
> ideally be implemented.
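
The byte-buffer approach described above can be sketched roughly as follows. This is a minimal illustration, not code from the attached patch and not Hive's vectorization API: the class DelimitedBatchSketch, the LongBatch type, and the fill method are hypothetical names, and the sketch assumes well-formed input in which every field is a decimal integer. It parses fields straight from a byte array into column-major arrays, so no String or per-row object is created.

```java
/** Hypothetical sketch; not Hive's VectorizedRowBatch or any Hive API. */
final class DelimitedBatchSketch {

    /** Column-major batch of long values, standing in for a vectorized row batch. */
    static final class LongBatch {
        final long[][] cols;   // cols[c][r] holds column c of row r
        final int capacity;    // maximum number of rows per batch
        int size;              // number of rows currently filled in

        LongBatch(int numCols, int capacity) {
            this.cols = new long[numCols][capacity];
            this.capacity = capacity;
        }
    }

    /**
     * Parses delimited decimal integers from buf[start, end) into the batch
     * without allocating per-record objects. Stops when the batch is full or
     * when only a partial record remains. Returns the offset of the first
     * unconsumed byte, so the caller can resume from there after a refill.
     * Assumes well-formed input (digits, an optional leading '-', separators).
     */
    static int fill(byte[] buf, int start, int end,
                    byte fieldSep, byte recordSep, LongBatch batch) {
        int numCols = batch.cols.length;
        batch.size = 0;
        int pos = start;
        while (pos < end && batch.size < batch.capacity) {
            int col = 0;
            long value = 0;
            boolean negative = false;
            boolean complete = false;
            int i = pos;
            for (; i < end; i++) {
                byte b = buf[i];
                if (b == fieldSep || b == recordSep) {
                    // End of a field: store it if the batch has a column for it.
                    if (col < numCols) {
                        batch.cols[col][batch.size] = negative ? -value : value;
                    }
                    col++;
                    value = 0;
                    negative = false;
                    if (b == recordSep) {
                        complete = true;
                        i++;           // step past the record separator
                        break;
                    }
                } else if (b == '-') {
                    negative = true;
                } else {
                    value = value * 10 + (b - '0');
                }
            }
            if (!complete) {
                break;                 // partial record: leave it for the next chunk
            }
            // Zero-fill any columns the record did not supply.
            for (int c = col; c < numCols; c++) {
                batch.cols[c][batch.size] = 0;
            }
            batch.size++;
            pos = i;
        }
        return pos;
    }
}
```

A caller would read a large chunk of the file into buf, call fill repeatedly until it stops advancing, then carry the unconsumed tail over into the next chunk. Handling of strings, dates, escapes, and the cached string-to-type translation mentioned above is left out of this sketch.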



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
