thomasrebele commented on code in PR #62:
URL: https://github.com/apache/hive-site/pull/62#discussion_r2348029193


##########
content/docs/latest/user/streaming-data-ingest-v2.md:
##########
@@ -79,15 +79,15 @@ HiveStreamingConnection API also supports 2 partitioning mode (static vs dynamic
 
Transactions are implemented slightly differently than in traditional database 
systems. Each transaction has an id, and multiple transactions are grouped into 
a “Transaction Batch”. This helps group records from multiple transactions into 
fewer files (rather than one file per transaction). During Hive streaming 
connection creation, the transaction batch size can be specified via the 
builder API. Transaction management is completely hidden behind the API; in 
most cases users do not have to worry about tuning the transaction batch size 
(which is an expert-level setting and might not be honored in a future 
release). The API also automatically rolls over to the next transaction batch 
on beginTransaction() invocation if the current transaction batch is exhausted. 
The recommendation is to leave the transaction batch size at the default value 
of 1 and group several thousand records together under each transaction. Since 
each transaction corresponds to a delta directory in the filesystem, committing 
transactions too often can end up creating too many small directories. 
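
A minimal sketch of this flow in Java (not part of the patch): the database, 
table, agent name, and record format below are illustrative assumptions, and 
the transaction batch size is left at its default of 1 as the text recommends.

```java
// Sketch only: database, table, agent name, and delimiter are illustrative
// assumptions, not values taken from this page.
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hive.streaming.HiveStreamingConnection;
import org.apache.hive.streaming.StrictDelimitedInputWriter;

public class StreamingWriteSketch {
  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf();

    StrictDelimitedInputWriter writer = StrictDelimitedInputWriter.newBuilder()
        .withFieldDelimiter(',')          // assumed record format
        .build();

    HiveStreamingConnection connection = HiveStreamingConnection.newBuilder()
        .withDatabase("default")          // assumed database
        .withTable("alerts")              // assumed transactional table
        .withAgentInfo("example-agent")
        .withRecordWriter(writer)
        .withHiveConf(conf)               // transaction batch size left at its default of 1
        .connect();

    // Group several thousand records under one transaction, as recommended above.
    connection.beginTransaction();
    for (int i = 0; i < 10_000; i++) {
      connection.write(("key-" + i + ",value-" + i).getBytes());
    }
    connection.commitTransaction();

    connection.close();
  }
}
```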
 
-Transactions in a TransactionBatch are eventually expired by the Metastore if 
not committed or aborted within 
[hive.txn.timeout](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-NewConfigurationParametersforTransactions)
 seconds. To keep the transactions alive, HiveStreamingConnection has a 
heartbeater thread which by default sends a heartbeat at (hive.txn.timeout/2) 
intervals for all the open transactions. 
+Transactions in a TransactionBatch are eventually expired by the Metastore if 
not committed or aborted within 
[hive.txn.timeout](https://hive.apache.org/docs/latest/user/hive-transactions#new-configuration-parameters-for-transactions)
 seconds. To keep the transactions alive, HiveStreamingConnection has a 
heartbeater thread which by default sends a heartbeat at (hive.txn.timeout/2) 
intervals for all the open transactions. 
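
For illustration only (not part of the patch), the timeout that drives this 
heartbeat interval can be set on the HiveConf handed to the builder; the 
heartbeater itself requires no user code:

```java
HiveConf conf = new HiveConf();
// Open transactions expire if neither committed nor aborted within
// hive.txn.timeout; the connection's heartbeater thread keeps them alive by
// pinging every (hive.txn.timeout / 2). "300s" is an illustrative value,
// not a recommendation.
conf.set("hive.txn.timeout", "300s");
// ... pass 'conf' to HiveStreamingConnection.newBuilder().withHiveConf(conf) ...
```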
 
 See the [Javadoc for 
HiveStreamingConnection](http://hive.apache.org/javadocs/r3.0.0/api/org/apache/hive/streaming/HiveStreamingConnection.html)
 for more information. 
 
 #### Usage Guidelines
 
Generally, the more records are included in each transaction, the more 
throughput can be achieved.  It's common to commit either after a certain 
number of records or after a certain time interval, whichever comes first.  The 
latter ensures that when the event flow rate is variable, transactions don't 
stay open too long.  There is no practical limit on how much data can be 
included in a single transaction; the only concern is the amount of data that 
will need to be replayed if the transaction fails. The concept of a 
TransactionBatch serves to reduce the number of files (and delta directories) 
created by the HiveStreamingConnection API in the filesystem. Since all 
transactions in a given transaction batch write to the same physical file (per 
bucket), a partition can only be compacted up to the level of the earliest 
transaction of any batch which contains an open transaction.  Thus 
TransactionBatches should not be made excessively large.  It makes sense to 
include a timer to close a TransactionBatch (even if it has unused 
transactions) after some amount of time.
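
A sketch of the "number of records or time interval, whichever comes first" 
policy described above (the thresholds, the record source, and the helper name 
are all illustrative assumptions, not part of the patch):

```java
import java.util.Iterator;
import org.apache.hive.streaming.StreamingConnection;

// Sketch of a commit policy: commit after maxRecords records or maxMillis
// milliseconds, whichever comes first. The caller is assumed to have opened
// 'connection' and to supply 'source'; thresholds are illustrative.
public class CommitPolicySketch {
  static void stream(StreamingConnection connection, Iterator<byte[]> source,
                     long maxRecords, long maxMillis) throws Exception {
    long count = 0;
    long start = System.currentTimeMillis();
    connection.beginTransaction();
    while (source.hasNext()) {
      connection.write(source.next());
      count++;
      if (count >= maxRecords || System.currentTimeMillis() - start >= maxMillis) {
        connection.commitTransaction();
        connection.beginTransaction(); // rolls to the next batch automatically if exhausted
        count = 0;
        start = System.currentTimeMillis();
      }
    }
    connection.commitTransaction();    // commit whatever remains
  }
}
```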
 
-The HiveStreamingConnection is highly optimized for write throughput ([Delta 
Streaming 
Optimizations](http://hive.apache.org/javadocs/r3.0.0/api/org/apache/hive/streaming/HiveStreamingConnection.Builder.html#withStreamingOptimizations-boolean-)),
 and as a result the delta files generated by Hive streaming ingest have many 
ORC features disabled (dictionary encoding, indexes, compression, etc.) to 
facilitate high-throughput writes. When the compactor kicks in, these delta 
files are rewritten into a read- and storage-optimized ORC format (enabling 
dictionary encoding, indexes, and compression). It is therefore recommended to 
configure the compactor to run more aggressively/frequently (refer to 
[Compactor](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-Compactor))
 so that it generates compacted and optimized ORC files.
+The HiveStreamingConnection is highly optimized for write throughput ([Delta 
Streaming 
Optimizations](http://hive.apache.org/javadocs/r3.0.0/api/org/apache/hive/streaming/HiveStreamingConnection.Builder.html#withStreamingOptimizations-boolean-)),
 and as a result the delta files generated by Hive streaming ingest have many 
ORC features disabled (dictionary encoding, indexes, compression, etc.) to 
facilitate high-throughput writes. When the compactor kicks in, these delta 
files are rewritten into a read- and storage-optimized ORC format (enabling 
dictionary encoding, indexes, and compression). It is therefore recommended to 
configure the compactor to run more aggressively/frequently (refer to 
[Compactor](https://hive.apache.org/docs/latest/user/hive-transactions#compactor))
 so that it generates compacted and optimized ORC files.
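
Beyond the automatic compactor, a compaction can also be requested manually 
with `ALTER TABLE ... COMPACT`. A sketch over JDBC (not part of the patch; the 
connection URL and table name are assumptions):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: request a major compaction for a table written by streaming ingest.
// Assumes HiveServer2 is reachable at the URL below; URL and table name are
// illustrative, not values taken from this page.
public class CompactionSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default");
         Statement stmt = conn.createStatement()) {
      stmt.execute("ALTER TABLE alerts COMPACT 'major'");
    }
  }
}
```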

Review Comment:
   I've created [HIVE-29199](https://issues.apache.org/jira/browse/HIVE-29199).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

