[jira] [Commented] (HIVE-4196) Support for Streaming Partitions in Hive

Alan Gates (JIRA) Fri, 19 Jul 2013 16:47:34 -0700

    [ https://issues.apache.org/jira/browse/HIVE-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714235#comment-13714235 ]


Alan Gates commented on HIVE-4196:
----------------------------------

Some comments:

According to the Hive coding conventions lines should be bounded at 100 
characters.  Many lines in this patch exceed that.

In ObjectStore.java:
* I'm surprised to see that streamingStatus sets the chunk id for the table.  
This seems to be a status call.  Why should it be setting chunk id?
* The logic at the end of these functions doesn't look right.  Take 
getNextChunkID for example.  If commitTransaction fails (line 2132) rollback 
will be called but the next chunk id will still be returned.  It seems you need 
a check on success after commit.  I realize many of the calls in the class 
follow this, but it doesn't seem right.
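
A minimal sketch of the kind of check being asked for here. The Txn interface and 
ChunkIdAllocator class are hypothetical stand-ins, not the actual ObjectStore 
code from the patch; the only point is that the id is returned only when the 
commit reports success.

```java
/** Hypothetical stand-in for the transaction calls; not the real ObjectStore API. */
interface Txn {
    void open();
    boolean commit();   // assumed to return false when the commit did not take effect
    void rollback();
}

class ChunkIdAllocator {
    private final Txn txn;
    private long nextChunkId = 0;   // stands in for the persisted per-table counter

    ChunkIdAllocator(Txn txn) {
        this.txn = txn;
    }

    /** Returns the allocated chunk id, or throws if the commit failed. */
    long getNextChunkId() {
        boolean committed = false;
        long allocated = nextChunkId;
        try {
            txn.open();
            nextChunkId = allocated + 1;
            committed = txn.commit();
        } finally {
            if (!committed) {
                nextChunkId = allocated;   // undo the in-memory change
                txn.rollback();
            }
        }
        // The check on success after commit: never hand out an id that was not persisted.
        if (!committed) {
            throw new IllegalStateException(
                "commit failed; chunk id " + allocated + " was not allocated");
        }
        return allocated;
    }
}
```

With this shape a failed commit surfaces to the caller as an exception instead of 
a chunk id that was never recorded.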

In HiveMetaStoreClient.java, is assert what you want?  Are you ok with the 
validity of the arguments not being checked most of the time?
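
To illustrate the difference: Java assert statements are skipped unless the JVM 
is started with -ea (or -enableassertions), so an assert-based check is usually 
inactive in production. The sketch below contrasts that with an explicit check. 
The class, method, and parameter names are hypothetical and are not taken from 
HiveMetaStoreClient.java.

```java
class ArgChecks {
    // Skipped entirely unless assertions are enabled with -ea / -enableassertions.
    static void withAssert(String tableName) {
        assert tableName != null : "tableName must not be null";
    }

    // Always enforced, regardless of JVM flags.
    static void withExplicitCheck(String tableName) {
        if (tableName == null) {
            throw new IllegalArgumentException("tableName must not be null");
        }
        if (tableName.isEmpty()) {
            throw new IllegalArgumentException("tableName must not be empty");
        }
    }
}
```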

I'm trying to figure out whether the chunk files are moved, deleted, or left 
alone during the partition rolling.  From examining the code and playing with 
Hive it looks like the files will be left alone.  But have you tested this?

Which leads to my next point: I don't see any tests in this patch.  This code 
needs a lot of tests.

                
> Support for Streaming Partitions in Hive
> ----------------------------------------
>
>                 Key: HIVE-4196
>                 URL: https://issues.apache.org/jira/browse/HIVE-4196
>             Project: Hive
>          Issue Type: New Feature
>          Components: Database/Schema, HCatalog
>    Affects Versions: 0.10.1
>            Reporter: Roshan Naik
>            Assignee: Roshan Naik
>         Attachments: HCatalogStreamingIngestFunctionalSpecificationandDesign- 
> apr 29- patch1.docx, HCatalogStreamingIngestFunctionalSpecificationandDesign- 
> apr 29- patch1.pdf, HIVE-4196.v1.patch
>
>
> Motivation: Allow Hive users to immediately query data streaming in through 
> clients such as Flume.
> Currently Hive partitions must be created after all the data for the 
> partition is available. Thereafter, data in the partitions is considered 
> immutable. 
> This proposal introduces the notion of a streaming partition into which new 
> files can be committed periodically and made available for queries before the 
> partition is closed and converted into a standard partition.
> The admin enables streaming partition on a table using DDL. He provides the 
> following pieces of information:
> - Name of the partition in the table on which streaming is enabled
> - Frequency at which the streaming partition should be closed and converted 
> into a standard partition.
> Tables with streaming partition enabled will be partitioned by one and only 
> one column. It is assumed that this column will contain a timestamp.
> Closing the current streaming partition converts it into a standard 
> partition. Based on the specified frequency, the current streaming partition  
> is closed and a new one created for future writes. This is referred to as 
> 'rolling the partition'.
> A streaming partition's life cycle is as follows:
>  - A new streaming partition is instantiated for writes
>  - Streaming clients request (via webhcat) an HDFS file name into which 
> they can write a chunk of records for a specific table.
>  - Streaming clients write a chunk (via webhdfs) to that file and commit 
> it (via webhcat). Committing merely indicates that the chunk has been written 
> completely and is ready for serving queries.  
>  - When the partition is rolled, all committed chunks are swept into a single 
> directory and a standard partition pointing to that directory is created. The 
> streaming partition is closed and a new streaming partition is created. Rolling 
> the partition is atomic. Streaming clients are agnostic of partition rolling. 
>  
>  - Hive queries will be able to query the partition that is currently open 
> for streaming. Only committed chunks will be visible. Read consistency will 
> be ensured so that repeated reads of the same partition will be idempotent 
> for the lifespan of the query.
> Partition rolling requires an active agent/thread running to check when it is 
> time to roll and trigger the roll. This could either be achieved by using 
> an external agent such as Oozie (preferably) or an internal agent.
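
The life cycle described in the quoted proposal can be modeled with a small 
in-memory sketch. This is only an illustration under stated assumptions: the 
class and method names are hypothetical, nothing here touches HDFS, webhcat, or 
the metastore, and the behavior of chunks that are requested but not yet 
committed when a roll happens (they are assumed to land in the next streaming 
partition) is a guess rather than something the proposal specifies.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory model of one table's streaming partition. */
class StreamingPartitionModel {
    private final Set<String> pending = new HashSet<>();       // requested, not yet committed
    private final List<String> committed = new ArrayList<>();  // visible to queries
    private final List<List<String>> standardPartitions = new ArrayList<>();
    private long nextChunk = 0;

    /** A client asks for a file name to write a chunk into. */
    synchronized String requestChunkFile() {
        String name = "chunk_" + (nextChunk++);
        pending.add(name);
        return name;
    }

    /** A client signals that the chunk has been written completely. */
    synchronized void commitChunk(String name) {
        if (!pending.remove(name)) {
            throw new IllegalArgumentException("unknown or already committed chunk: " + name);
        }
        committed.add(name);
    }

    /** A query takes a snapshot, so later commits or rolls do not change what it reads. */
    synchronized List<String> visibleChunks() {
        return new ArrayList<>(committed);
    }

    /** Rolling: committed chunks become a standard partition; a new streaming partition starts empty. */
    synchronized void roll() {
        standardPartitions.add(new ArrayList<>(committed));
        committed.clear();
    }

    synchronized int standardPartitionCount() {
        return standardPartitions.size();
    }
}
```

In this model the roll is atomic only because every method is synchronized on 
the same object; a real implementation would need the metastore transaction to 
provide that guarantee.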

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
