[ https://issues.apache.org/jira/browse/HIVE-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808548#comment-13808548 ]

Roshan Naik commented on HIVE-4196:
-----------------------------------

Moving the streaming work to a new jira HIVE-5687 since it will be based on a 
different design.

> Support for Streaming Partitions in Hive
> ----------------------------------------
>
>                 Key: HIVE-4196
>                 URL: https://issues.apache.org/jira/browse/HIVE-4196
>             Project: Hive
>          Issue Type: New Feature
>          Components: Database/Schema, HCatalog
>    Affects Versions: 0.10.1
>            Reporter: Roshan Naik
>            Assignee: Roshan Naik
>         Attachments: HCatalogStreamingIngestFunctionalSpecificationandDesign- 
> apr 29- patch1.docx, HCatalogStreamingIngestFunctionalSpecificationandDesign- 
> apr 29- patch1.pdf, HIVE-4196.v1.patch
>
>
> Motivation: Allow Hive users to immediately query data streaming in through 
> clients such as Flume.
> Currently Hive partitions must be created after all the data for the 
> partition is available. Thereafter, data in the partitions is considered 
> immutable. 
> This proposal introduces the notion of a streaming partition into which new 
> files can be committed periodically and made available for queries before the 
> partition is closed and converted into a standard partition.
> The admin enables streaming partitions on a table using DDL and provides the 
> following pieces of information (see the sketch below):
> - Name of the partition in the table on which streaming is enabled
> - Frequency at which the streaming partition should be closed and converted 
> into a standard partition.
> Tables with streaming partitions enabled will be partitioned by one and only 
> one column. It is assumed that this column will contain a timestamp.
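>
> A minimal sketch of what enabling streaming might look like, assuming PyHive 
> against HiveServer2. The proposal does not pin down the DDL syntax, so the 
> table name, property names, and connection details below are illustrative 
> assumptions rather than the eventual API:
> {code}
> # Hypothetical DDL sketch: property names and connection details are assumptions;
> # the proposal does not specify the actual syntax.
> from pyhive import hive
>
> cursor = hive.connect(host="hiveserver2-host", port=10000).cursor()
> # Stream into the single (timestamp) partition column 'ts' and roll the open
> # streaming partition every 15 minutes.
> cursor.execute("""
>     ALTER TABLE web_logs SET TBLPROPERTIES (
>         'streaming.partition.column' = 'ts',
>         'streaming.roll.frequency.minutes' = '15'
>     )
> """)
> {code}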
> Closing the current streaming partition converts it into a standard 
> partition. Based on the specified frequency, the current streaming partition 
> is closed and a new one is created for future writes. This is referred to as 
> 'rolling the partition'.
> A streaming partition's life cycle is as follows (a client-side sketch follows 
> this list):
>  - A new streaming partition is instantiated for writes
>  - Streaming clients request (via webhcat) an HDFS file name into which 
> they can write a chunk of records for a specific table.
>  - Streaming clients write a chunk (via webhdfs) to that file and commit 
> it (via webhcat). Committing merely indicates that the chunk has been written 
> completely and is ready to serve queries.
>  - When the partition is rolled, all committed chunks are swept into a single 
> directory and a standard partition pointing to that directory is created. The 
> streaming partition is closed and a new streaming partition is created. Rolling 
> the partition is atomic. Streaming clients are agnostic of partition rolling. 
>  
>  - Hive queries will be able to query the partition that is currently open 
> for streaming. Only committed chunks will be visible. Read consistency will 
> be ensured so that repeated reads of the same partition will be idempotent 
> for the lifespan of the query.
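>
> A minimal client-side sketch of this flow, assuming Python with the requests 
> library. The webhcat streaming endpoints used here do not exist and are purely 
> illustrative stand-ins for the proposed API; only the CREATE call follows the 
> existing WebHDFS REST API.
> {code}
> import requests
>
> WEBHCAT = "http://webhcat-host:50111/templeton/v1"   # assumed WebHCat base URL
> WEBHDFS = "http://namenode-host:50070/webhdfs/v1"    # assumed WebHDFS base URL
> USER = "flume"
>
> # 1. Ask webhcat for an HDFS file name to write a chunk into (hypothetical API).
> resp = requests.post(WEBHCAT + "/streaming/web_logs/chunk",
>                      params={"user.name": USER})
> chunk_path = resp.json()["path"]   # HDFS path under the open streaming partition
>
> # 2. Write the chunk via webhdfs. CREATE is a two-step call in the real
> #    WebHDFS API: the namenode redirects to a datanode, which accepts the data.
> records = b"row1\nrow2\nrow3\n"
> r = requests.put(WEBHDFS + chunk_path,
>                  params={"op": "CREATE", "user.name": USER, "overwrite": "false"},
>                  allow_redirects=False)
> requests.put(r.headers["Location"], data=records)
>
> # 3. Commit the chunk via webhcat (hypothetical API). After this the chunk is
> #    visible to queries against the open streaming partition.
> requests.post(WEBHCAT + "/streaming/web_logs/chunk/commit",
>               params={"user.name": USER},
>               json={"path": chunk_path})
> {code}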
> Partition rolling requires an active agent/thread running to check when it is 
> time to roll and to trigger the roll. This could be achieved either by using 
> an external agent such as Oozie (preferably) or an internal agent.
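>
> A minimal sketch of such an external roll agent, again in Python with requests; 
> the roll endpoint is a hypothetical placeholder for whatever mechanism (e.g. an 
> Oozie coordinator action) ends up triggering the roll:
> {code}
> import time
> import requests
>
> WEBHCAT = "http://webhcat-host:50111/templeton/v1"   # assumed WebHCat base URL
> ROLL_FREQUENCY_SECONDS = 15 * 60   # should match the frequency set in the DDL
>
> while True:
>     time.sleep(ROLL_FREQUENCY_SECONDS)
>     # Hypothetical endpoint: atomically close the current streaming partition
>     # of 'web_logs' and open a new one for future writes.
>     requests.post(WEBHCAT + "/streaming/web_logs/roll",
>                   params={"user.name": "hive"})
> {code}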


