lamber-ken commented on a change in pull request #1261: [HUDI-403] Adds 
guidelines on deployment/upgrading
URL: https://github.com/apache/incubator-hudi/pull/1261#discussion_r368788488
 
 

 ##########
 File path: docs/_docs/2_6_deployment.md
 ##########
 @@ -1,51 +1,87 @@
 ---
-title: Administering Hudi Pipelines
-keywords: hudi, administration, operation, devops
-permalink: /docs/admin_guide.html
-summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi datasets
+title: Deployment Guide
+keywords: hudi, administration, operation, devops, deployment
+permalink: /docs/deployment.html
+summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi
 toc: true
 last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
-Admins/ops can gain visibility into Hudi datasets/pipelines in the following 
ways
+This section provides all the help you need to deploy and operate Hudi tables 
at scale. 
+Specifically, we will cover the following aspects.
 
- - [Administering via the Admin CLI](#admin-cli)
- - [Graphite metrics](#metrics)
- - [Spark UI of the Hudi Application](#spark-ui)
+ - [Deployment Model](#deploying) : How various Hudi components are deployed and managed.
+ - [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, with guidelines and general best practices.
+ - [Migrating to Hudi](#migrating) : How to migrate your existing tables to Apache Hudi.
+ - [Interacting via CLI](#cli) : Using the CLI to perform maintenance or deeper introspection.
+ - [Monitoring](#monitoring) : Tracking metrics from your Hudi tables using popular tools.
+ - [Troubleshooting](#troubleshooting) : Uncovering, triaging and resolving issues in production.
+ 
+## Deploying
 
-This section provides a glimpse into each of these, with some general guidance 
on [troubleshooting](#troubleshooting)
+All in all, Hudi deploys with no long-running servers or additional infrastructure cost to your data lake. In fact, Hudi pioneered this model of building a transactional distributed storage layer
+using existing infrastructure, and it's heartening to see other systems adopting similar approaches as well. Hudi writing is done via Spark jobs (DeltaStreamer or custom Spark datasource jobs), deployed per standard Apache Spark [recommendations](https://spark.apache.org/docs/latest/cluster-overview.html).
+Querying Hudi tables happens via libraries installed into Apache Hive, Apache Spark or Presto, and hence no additional infrastructure is necessary.
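
As an illustration, a DeltaStreamer ingestion job launches like any other Spark application via `spark-submit`; the jar path, base path, table name and source/properties values below are placeholders, not recommendations:

```sh
spark-submit \
  --master yarn \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  /path/to/hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --target-base-path hdfs:///data/hudi/my_table \
  --target-table my_table \
  --props /path/to/kafka-source.properties
```

Since this is a plain Spark job, the usual Spark sizing and scheduling practices (executors, memory, queues) apply unchanged.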
 
-## Admin CLI
 
-Once hudi has been built, the shell can be fired by via  `cd hudi-cli && 
./hudi-cli.sh`.
-A hudi dataset resides on DFS, in a location referred to as the **basePath** 
and we would need this location in order to connect to a Hudi dataset.
-Hudi library effectively manages this dataset internally, using .hoodie 
subfolder to track all metadata
+## Upgrading 
+
+New Hudi releases are listed on the [releases page](/releases), with detailed notes that list all the changes, along with highlights for each release.
+At the end of the day, Hudi is a storage system, and with that comes a lot of responsibility, which we take seriously.
+
+As general guidelines, 
+
+ - We strive to keep all changes backwards compatible (i.e., new code can read old data/timeline files), and where we cannot, we will provide upgrade/downgrade tools via the CLI.
+ - We cannot always guarantee forward compatibility (i.e., old code being able to read data/timeline files written by a greater version). This is generally the norm, since no new features can be built otherwise.
+   However, any such large changes will be turned off by default, for a smooth transition to the newer release. After a few releases, and once enough users deem the feature stable in production, we will flip the defaults in a subsequent release.
+ - Always upgrade the query bundles (mr-bundle, presto-bundle, spark-bundle) first, and then upgrade the writers (deltastreamer, spark jobs using the datasource). This often provides the best experience, and it's easy to fix
+   any issues by rolling the writer code forward/back (which you typically have more control over).
+ - With large, feature-rich releases, we recommend migrating slowly, by first testing in staging environments and running your own tests. Upgrading Hudi is no different than upgrading any database system.
+
+Note that release notes can override this information with specific instructions, applicable on a case-by-case basis.
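
The "readers before writers" ordering above can be sketched as a simple version check. This is purely illustrative (not a Hudi API): because forward compatibility is not guaranteed, every query-side bundle should already be on at least the writer's target version before the writer is upgraded.

```python
def safe_to_upgrade_writer(new_writer_version, reader_versions):
    # Backwards compatibility means new readers can read old data, so a
    # reader ahead of the writer is fine; a reader behind the writer may
    # encounter data/timeline formats it cannot parse.
    return all(reader >= new_writer_version for reader in reader_versions)

# Readers on 0.5.1 and 0.6.0 can safely consume output of a 0.5.1 writer...
print(safe_to_upgrade_writer((0, 5, 1), [(0, 5, 1), (0, 6, 0)]))  # True
# ...but upgrading the writer to 0.6.0 first risks breaking the 0.5.1 reader.
print(safe_to_upgrade_writer((0, 6, 0), [(0, 5, 1), (0, 6, 0)]))  # False
```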
+
+## Migrating
+
+Currently, migrating to Hudi can be done using one of two approaches.
+
+- **Convert newer partitions to Hudi** : This model is suitable for large event tables (e.g., click streams, ad impressions), which also typically receive writes for the last few days alone. You can convert the last
+   N partitions to Hudi and proceed writing as if it were a Hudi table to begin with. The Hudi query side code is able to correctly handle both Hudi and non-Hudi data partitions.
+- **Full conversion to Hudi** : This model is suitable if you are currently bulk/full loading the table a few times a day (e.g., database ingestion). The full conversion to Hudi is simply a one-time step (akin to one run of your existing job),
+   which moves all of the data into the Hudi format and provides the ability to incrementally update for future writes.
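
For the first approach, picking the "last N partitions" to convert reduces to a date cutoff. The helper below is a hypothetical sketch, not part of Hudi: newer partitions are selected for conversion while older ones stay as plain non-Hudi data.

```python
from datetime import date, timedelta

def partitions_to_convert(all_partitions, days_to_migrate, today):
    # Illustrative only: partitions on or after the cutoff are converted
    # to Hudi; everything older remains in the original dataset.
    cutoff = today - timedelta(days=days_to_migrate)
    return sorted(p for p in all_partitions if p >= cutoff)

parts = [date(2020, 1, d) for d in range(10, 21)]
print(partitions_to_convert(parts, days_to_migrate=3, today=date(2020, 1, 20)))
# [datetime.date(2020, 1, 17), ..., datetime.date(2020, 1, 20)]
```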
+
+For more details, refer to the detailed [migration 
guide](/docs/migration_guide.html). In the future, we will be supporting 
seamless zero-copy bootstrap of existing tables with all the upsert/incremental 
query capabilities fully supported.
+
+## CLI
+
+Once hudi has been built, the shell can be fired up via `cd hudi-cli && ./hudi-cli.sh`. A Hudi table resides on DFS, in a location referred to as the
+`basePath`, and we would need this location in order to connect to a Hudi table. The Hudi library
+effectively manages this table internally, using the `.hoodie` subfolder to track
+all metadata
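
For example, a session typically connects to a table by its base path and then inspects commit activity using the CLI's `connect` and `commits show` commands (the base path below is a placeholder):

```sh
$ cd hudi-cli && ./hudi-cli.sh
hudi->connect --path /user/hive/warehouse/my_hudi_table
hudi:my_hudi_table->commits show
```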
 
 Review comment:
  Hi, missing `.` at the end of the statement.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 