[jira] [Commented] (HIVE-4196) Support for Streaming Partitions in Hive

2013-10-29 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808548#comment-13808548
 ] 

Roshan Naik commented on HIVE-4196:
---

Moving the streaming work to a new jira HIVE-5687 since it will be based on a 
different design.

 Support for Streaming Partitions in Hive
 

 Key: HIVE-4196
 URL: https://issues.apache.org/jira/browse/HIVE-4196
 Project: Hive
  Issue Type: New Feature
  Components: Database/Schema, HCatalog
Affects Versions: 0.10.1
Reporter: Roshan Naik
Assignee: Roshan Naik
 Attachments: HCatalogStreamingIngestFunctionalSpecificationandDesign- 
 apr 29- patch1.docx, HCatalogStreamingIngestFunctionalSpecificationandDesign- 
 apr 29- patch1.pdf, HIVE-4196.v1.patch


 Motivation: Allow Hive users to immediately query data streaming in through 
 clients such as Flume.
 Currently, Hive partitions must be created after all the data for the 
 partition is available. Thereafter, data in the partition is considered 
 immutable.
 This proposal introduces the notion of a streaming partition into which new 
 files can be committed periodically and made available for queries before the 
 partition is closed and converted into a standard partition.
 The admin enables streaming partitions on a table using DDL, providing the 
 following pieces of information:
 - Name of the partition in the table on which streaming is enabled
 - Frequency at which the streaming partition should be closed and converted 
 into a standard partition
 Tables with streaming partitions enabled will be partitioned by one and only 
 one column. It is assumed that this column will contain a timestamp.
 Closing the current streaming partition converts it into a standard 
 partition. Based on the specified frequency, the current streaming partition 
 is closed and a new one is created for future writes. This is referred to as 
 'rolling the partition'.
 A streaming partition's life cycle is as follows:
  - A new streaming partition is instantiated for writes.
  - Streaming clients request (via webhcat) an HDFS file name into which 
 they can write a chunk of records for a specific table.
  - Streaming clients write a chunk (via webhdfs) to that file and commit 
 it (via webhcat). Committing merely indicates that the chunk has been written 
 completely and is ready for serving queries.
  - When the partition is rolled, all committed chunks are swept into a single 
 directory and a standard partition pointing to that directory is created. The 
 streaming partition is closed and a new streaming partition is created. Rolling 
 the partition is atomic, and streaming clients are agnostic of it.
  - Hive queries will be able to query the partition that is currently open 
 for streaming. Only committed chunks will be visible, and read consistency will 
 be ensured so that repeated reads of the same partition are idempotent 
 for the lifespan of the query.
 Partition rolling requires an active agent/thread to check when it is 
 time to roll and to trigger the roll. This could be achieved either by using 
 an external agent such as Oozie (preferably) or an internal agent.
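
 The life cycle above can be sketched as a toy in-memory model. This is purely
 illustrative: the class and method names (StreamingPartition, allocateChunk,
 commitChunk, roll) are made up for the sketch and are not actual Hive or
 HCatalog APIs, and real clients would go through webhcat/webhdfs rather than
 local method calls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of the chunk life cycle described above.
class StreamingPartition {
    private final List<String> open = new ArrayList<>();
    private final List<String> committed = new ArrayList<>();
    private int counter = 0;

    // Client requests a file name into which it can write its next chunk.
    String allocateChunk() {
        String name = "chunk-" + counter++;
        open.add(name);
        return name;
    }

    // Committing marks the chunk as fully written and visible to queries.
    void commitChunk(String name) {
        if (open.remove(name)) {
            committed.add(name);
        }
    }

    // Queries see only committed chunks, never open ones.
    List<String> visibleChunks() {
        return new ArrayList<>(committed);
    }

    // Rolling sweeps committed chunks into a standard partition and leaves
    // the streaming partition empty for future writes.
    List<String> roll() {
        List<String> swept = new ArrayList<>(committed);
        committed.clear();
        return swept;
    }

    public static void main(String[] args) {
        StreamingPartition p = new StreamingPartition();
        String c = p.allocateChunk();
        System.out.println(p.visibleChunks()); // open chunk not yet visible
        p.commitChunk(c);
        System.out.println(p.visibleChunks()); // committed chunk is visible
        System.out.println(p.roll());          // swept into a standard partition
    }
}
```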



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-4196) Support for Streaming Partitions in Hive

2013-09-24 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13776952#comment-13776952
 ] 

Roshan Naik commented on HIVE-4196:
---

Thanks Ashutosh. Since your recommendations apply to subtask HIVE-5138, I have 
copied your comments over to it and will address them there.



[jira] [Commented] (HIVE-4196) Support for Streaming Partitions in Hive

2013-09-17 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13770314#comment-13770314
 ] 

Ashutosh Chauhan commented on HIVE-4196:


A few high-level comments:
* We should try to eliminate the need for an intermediate staging area while 
rolling to new partitions. There should not be any gotchas in moving data from 
the streaming dir to the partition dir directly.
* We should make the Thrift APIs in the metastore forward compatible. One way to 
do that is to use a struct (which contains all parameters) instead of passing in 
a list of arguments.
* We should try to leave the TBLS table untouched in the backend db. That will 
simplify the upgrade story. One way to do that is to put all new columns in a new 
table and then add constraints for this new table.
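
The struct-based API style suggested above can be sketched as follows. The class
and field names here (CreateStreamingPartitionRequest, rollIntervalSeconds, etc.)
are hypothetical stand-ins, not the actual metastore Thrift interface:

```java
// Hypothetical request object: one struct argument instead of a growing
// positional parameter list keeps the method signature stable over time.
class CreateStreamingPartitionRequest {
    String dbName;
    String tableName;
    int rollIntervalSeconds;
    // Later releases can append optional fields here without changing the
    // method signature, so older clients keep compiling and working.
}

class MetastoreApiSketch {
    // Adding a new optional field to the request never breaks this signature.
    static String createStreamingPartition(CreateStreamingPartitionRequest req) {
        return req.dbName + "." + req.tableName + " (roll every "
                + req.rollIntervalSeconds + "s)";
    }

    public static void main(String[] args) {
        CreateStreamingPartitionRequest req = new CreateStreamingPartitionRequest();
        req.dbName = "default";
        req.tableName = "web_logs";
        req.rollIntervalSeconds = 300;
        System.out.println(createStreamingPartition(req));
    }
}
```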



[jira] [Commented] (HIVE-4196) Support for Streaming Partitions in Hive

2013-08-28 Thread Roshan Naik (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13752946#comment-13752946
 ] 

Roshan Naik commented on HIVE-4196:
---


{quote}  According to the Hive coding conventions lines should be bounded at 
100 characters. Many lines in this patch exceed that. {quote}

Will fix the ones that are not in the Thrift-generated files.

{quote} I'm surprised to see that streamingStatus sets the chunk id for the 
table. {quote}

Seems like a bug. Will fix.

{quote}  The logic at the end of these functions doesn't look right. Take 
getNextChunkID for example. If commitTransaction fails (line 2132) rollback 
will be called but the next chunk id will still be returned. It seems you need 
a check on success after commit. I realize many of the calls in the class 
follow this, but it doesn't seem right. {quote}

Good catch. At the time I thought commitTxn() would only fail with an exception 
and never return false. But on closer inspection there is indeed a corner case 
(if rollback was called) where it also returns false. It's bizarre for a 
function to be able to fail without throwing an exception, but for now I will 
fix my code to live with it.
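
The control flow under discussion can be illustrated with a toy transaction
object. The names here (ChunkIdStore, commitShouldFail) are simplified stand-ins
for the ObjectStore code, not the real methods; the point is only that the id is
returned solely when commitTransaction() actually succeeded:

```java
// Toy transaction illustrating the success check after commit.
class ChunkIdStore {
    private final boolean commitShouldFail; // simulates the corner case
    private long nextChunkId = 1;

    ChunkIdStore(boolean commitShouldFail) {
        this.commitShouldFail = commitShouldFail;
    }

    boolean commitTransaction() { return !commitShouldFail; }
    void rollbackTransaction() { /* a real store would undo the increment */ }

    // Returns -1 rather than handing out an id whose commit was rolled back.
    long getNextChunkID() {
        long id = nextChunkId++;
        boolean success = commitTransaction();
        if (!success) {
            rollbackTransaction();
            return -1;
        }
        return id;
    }

    public static void main(String[] args) {
        System.out.println(new ChunkIdStore(false).getNextChunkID()); // 1
        System.out.println(new ChunkIdStore(true).getNextChunkID());  // -1
    }
}
```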

{quote} In HiveMetaStoreClient.java, is assert what you want? Are you ok with 
the validity of the arguments not being checked most of the time?{quote}

Not all checks are in place yet. Some checks will happen at lower layers, some 
at higher ones. I will be adding more checks.


{quote} I'm trying to figure out whether the chunk files are moved, deleted, or 
left alone during the partition rolling. {quote}

That depends on whether the table is defined as an external or internal table. 
It is essentially an add_partition of the new partition: it calls 
HiveMetastore.add_partition_core_notxn() inside a transaction.




[jira] [Commented] (HIVE-4196) Support for Streaming Partitions in Hive

2013-07-19 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13714235#comment-13714235
 ] 

Alan Gates commented on HIVE-4196:
--

Some comments:

According to the Hive coding conventions lines should be bounded at 100 
characters.  Many lines in this patch exceed that.

In ObjectStore.java:
* I'm surprised to see that streamingStatus sets the chunk id for the table.  
This seems to be a status call.  Why should it be setting chunk id?
* The logic at the end of these functions doesn't look right.  Take 
getNextChunkID for example.  If commitTransaction fails (line 2132), rollback 
will be called but the next chunk id will still be returned.  It seems you need 
a check on success after commit.  I realize many of the calls in the class 
follow this pattern, but it doesn't seem right.

In HiveMetaStoreClient.java, is assert what you want?  Are you ok with the 
validity of the arguments not being checked most of the time?

I'm trying to figure out whether the chunk files are moved, deleted, or left 
alone during the partition rolling.  From examining the code and playing with 
Hive it looks like the files will be left alone.  But have you tested this?

Which leads to: I don't see any tests in this patch.  This code needs a lot of 
tests.




[jira] [Commented] (HIVE-4196) Support for Streaming Partitions in Hive

2013-05-29 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13669853#comment-13669853
 ] 

Alan Gates commented on HIVE-4196:
--

Roshan,

Could you post a pdf version of the doc so users without MS Word can read it?  
Also, could you post a version of the patch without the thrift generated code 
(anything under src/gen or src-gen) so it's easier for the reviewers to 
determine what to review?



[jira] [Commented] (HIVE-4196) Support for Streaming Partitions in Hive

2013-03-21 Thread eric baldeschwieler (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13608800#comment-13608800
 ] 

eric baldeschwieler commented on HIVE-4196:
---

Maybe we should just return both?







[jira] [Commented] (HIVE-4196) Support for Streaming Partitions in Hive

2013-03-19 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13606960#comment-13606960
 ] 

Brock Noland commented on HIVE-4196:


Hi Roshan,

Looks like a good proposal and a great place for Flume to integrate with Hive!  
In the proposal, why do we have the clients using webhdfs to write a chunk of 
data? Couldn't the client use any HDFS API?

Brock
