This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 94f64a7 Revised community, contributing pages
94f64a7 is described below
commit 94f64a7c7e8c8ec0307fbd78c5ec13ec0a7b9175
Author: Vinoth Chandar <[email protected]>
AuthorDate: Mon Feb 25 07:01:53 2019 -0800
Revised community, contributing pages
- Community engagement instructions
- Strawman contribution guide, to get us going
- Fixed broken image urls from the hudi renames
- Fixed broken code formatting on couple pages
- Removed api_setup, roadmap pages and cleaned up structure
---
.gitignore | 1 +
docs/README.md | 5 +
docs/_config.yml | 2 +-
docs/_data/topnav.yml | 24 ++-
docs/_includes/footer.html | 6 +
docs/_posts/2019-01-18-asf-incubation.md | 10 ++
docs/admin_guide.md | 22 ++-
docs/api_docs.md | 10 --
docs/code_and_design.md | 38 -----
docs/community.md | 38 +++--
docs/concepts.md | 28 ++--
docs/configurations.md | 38 +++--
docs/contributing.md | 101 +++++++++++++
docs/dev_setup.md | 13 --
docs/images/hoodie_cow.png | Bin 31136 -> 0 bytes
docs/images/hoodie_mor.png | Bin 56002 -> 0 bytes
docs/images/hudi_cow.png | Bin 0 -> 48994 bytes
docs/images/hudi_mor.png | Bin 0 -> 92073 bytes
.../{hoodie_timeline.png => hudi_timeline.png} | Bin
docs/implementation.md | 165 +++++++++++----------
docs/index.md | 7 +-
docs/migration_guide.md | 70 ++++-----
docs/quickstart.md | 89 +++++------
docs/roadmap.md | 14 --
docs/sql_queries.md | 5 +-
25 files changed, 383 insertions(+), 303 deletions(-)
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..e43b0f9
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1 @@
+.DS_Store
diff --git a/docs/README.md b/docs/README.md
index 0995250..8593206 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -11,6 +11,11 @@ The site is based on a [Jekyll](https://jekyllrb.com/) theme
hosted [here](idrat
Simply run `docker-compose build --no-cache && docker-compose up` from the
`docs` folder and the site should be up & running at `http://localhost:4000`
+To see edits reflect on the site, you may have to bounce the container
+
+ - Stop existing container by `ctrl+c` the docker-compose program
+ - (or) alternatively via `docker stop docs_server_1`
+ - Bring up container again using `docker-compose up`
#### Host OS
diff --git a/docs/_config.yml b/docs/_config.yml
index 781bdb6..9f0effd 100644
--- a/docs/_config.yml
+++ b/docs/_config.yml
@@ -77,7 +77,7 @@ defaults:
sidebars:
- mydoc_sidebar
-description: "Apache Hudi (pronounced “Hoodie”) is a Spark Library, that provides upserts and incremental processing capaibilities on Hadoop datasets"
+description: "Apache Hudi (pronounced “Hoodie”) provides upserts and incremental processing capaibilities on Big Data"
# the description is used in the feed.xml file
# needed for sitemap.xml file only
diff --git a/docs/_data/topnav.yml b/docs/_data/topnav.yml
index 190573a..0042feb 100644
--- a/docs/_data/topnav.yml
+++ b/docs/_data/topnav.yml
@@ -7,24 +7,22 @@ topnav:
url: /news
- title: Community
url: /community.html
- - title: Github
+ - title: Code
external_url: https://github.com/uber/hoodie
#Topnav dropdowns
topnav_dropdowns:
- title: Topnav dropdowns
folders:
- - title: Developer Resources
+ - title: Developers
folderitems:
- - title: Setup
- url: /dev_setup.html
- output: web
- - title: API Docs
- url: /api_docs.html
- output: web
- - title: Code Structure
- url: /code_and_design.html
- output: web
- - title: Roadmap
- url: /roadmap.html
+ - title: Contributing
+ url: /contributing.html
output: web
+ - title: Wiki/Designs
+ external_url: https://cwiki.apache.org/confluence/display/HUDI
+ - title: Issues
+ external_url: https://issues.apache.org/jira/projects/HUDI/summary
+ - title: Blog
+ external_url: https://cwiki.apache.org/confluence/pages/viewrecentblogposts.action?key=HUDI
+
diff --git a/docs/_includes/footer.html b/docs/_includes/footer.html
index 00605db..c920c5c 100755
--- a/docs/_includes/footer.html
+++ b/docs/_includes/footer.html
@@ -8,6 +8,12 @@
<a class="footer-link-img" href="https://apache.org">
<img src="images/asf_logo.svg" alt="The Apache Software
Foundation" height="100px" widh="50px"></a>
</p>
+ <p>
+ Apache Hudi is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the name of <a href="http://incubator.apache.org/">Apache Incubator</a>.
+ Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have
+ stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a
+ reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
+ </p>
</div>
</div>
</footer>
diff --git a/docs/_posts/2019-01-18-asf-incubation.md b/docs/_posts/2019-01-18-asf-incubation.md
new file mode 100644
index 0000000..79de37c
--- /dev/null
+++ b/docs/_posts/2019-01-18-asf-incubation.md
@@ -0,0 +1,10 @@
+---
+title: "Hudi entered Apache Incubator"
+categories: update
+permalink: strata-talk.html
+tags: [news]
+---
+
+In the coming weeks, we will be moving in our new home on the Apache Incubator.
+
+{% include links.html %}
diff --git a/docs/admin_guide.md b/docs/admin_guide.md
index 7f7e610..3d37d22 100644
--- a/docs/admin_guide.md
+++ b/docs/admin_guide.md
@@ -43,7 +43,9 @@ hoodie->create --path /user/hive/warehouse/table1 --tableName
hoodie_table_1 --t
```
To see the description of hoodie table, use the command:
+
```
+
hoodie:hoodie_table_1->desc
18/09/06 15:57:19 INFO timeline.HoodieActiveTimeline: Loaded instants []
_________________________________________________________
@@ -55,6 +57,7 @@ hoodie:hoodie_table_1->desc
| hoodie.table.name | hoodie_table_1 |
| hoodie.table.type | COPY_ON_WRITE |
| hoodie.archivelog.folder| |
+
```
Following is a sample command to connect to a Hoodie dataset contains uber
trips.
@@ -183,7 +186,7 @@ order (See Concepts). The below commands allow users to
view the file-slices for
| Partition | FileId | Base-Instant | Data-File | Data-File Size| Num Delta
Files| Total Delta Size| Delta Size - compaction scheduled| Delta Size -
compaction unscheduled| Delta To Base Ratio - compaction scheduled| Delta To
Base Ratio - compaction unscheduled| Delta Files - compaction scheduled | Delta
Files - compaction unscheduled|
|==========================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================
[...]
| 2018/08/31| 111415c3-f26d-4639-86c8-f9956f245ac3| 20181002180759|
hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/111415c3-f26d-4639-86c8-f9956f245ac3_0_20181002180759.parquet|
432.5 KB | 1 | 20.8 KB | 20.8 KB | 0.0 B | 0.0 B | 0.0 B | [HoodieLogFile
{hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/.111415c3-f26d-4639-86c8-f9956f245ac3_20181002180759.log.1}]|
[] |
-
+
hoodie:stock_ticks_mor->
```
@@ -224,7 +227,7 @@ This is a sequence file that contains a mapping from
commitNumber => json with r
#### Compactions
-To get an idea of the lag between compaction and writer applications, use the below command to list down all
+To get an idea of the lag between compaction and writer applications, use the below command to list down all
pending compactions.
```
@@ -316,7 +319,7 @@ hoodie:stock_ticks_mor->compaction validate --instant
20181005222611
...
COMPACTION PLAN VALID
-
+
___________________________________________________________________________________________________________________________________________________________________________________________________________________________
| File Id | Base Instant Time| Base Data File
| Num Delta Files| Valid| Error|
|==========================================================================================================================================================================================================================|
@@ -340,14 +343,15 @@ hoodie:stock_ticks_mor->compaction validate --instant
20181005222601
The following commands must be executed without any other writer/ingestion
application running.
-Sometimes, it becomes necessary to remove a fileId from a compaction-plan inorder to speed-up or unblock compaction
-operation. Any new log-files that happened on this file after the compaction got scheduled will be safely renamed
+Sometimes, it becomes necessary to remove a fileId from a compaction-plan inorder to speed-up or unblock compaction
+operation. Any new log-files that happened on this file after the compaction got scheduled will be safely renamed
so that are preserved. Hudi provides the following CLI to support it
##### UnScheduling Compaction
```
+
hoodie:trips->compaction unscheduleFileId --fileId <FileUUID>
....
No File renames needed to unschedule file from pending compaction. Operation
successful.
@@ -356,24 +360,28 @@ No File renames needed to unschedule file from pending
compaction. Operation suc
In other cases, an entire compaction plan needs to be reverted. This is
supported by the following CLI
```
+
hoodie:trips->compaction unschedule --compactionInstant <compactionInstant>
.....
No File renames needed to unschedule pending compaction. Operation successful.
+
```
-
+
##### Repair Compaction
The above compaction unscheduling operations could sometimes fail partially
(e:g -> HDFS temporarily unavailable). With
-partial failures, the compaction operation could become inconsistent with the state of file-slices. When you run
+partial failures, the compaction operation could become inconsistent with the state of file-slices. When you run
`compaction validate`, you can notice invalid compaction operations if there
is one. In these cases, the repair
command comes to the rescue, it will rearrange the file-slices so that there
is no loss and the file-slices are
consistent with the compaction plan
```
+
hoodie:stock_ticks_mor->compaction repair --instant 20181005222611
......
Compaction successfully repaired
.....
+
```
diff --git a/docs/api_docs.md b/docs/api_docs.md
deleted file mode 100644
index 24bfd6b..0000000
--- a/docs/api_docs.md
+++ /dev/null
@@ -1,10 +0,0 @@
----
-title: API Docs
-keywords: usecases
-sidebar: mydoc_sidebar
-permalink: api_docs.html
----
-
-Work In Progress
-
-
diff --git a/docs/code_and_design.md b/docs/code_and_design.md
deleted file mode 100644
index 3baaa97..0000000
--- a/docs/code_and_design.md
+++ /dev/null
@@ -1,38 +0,0 @@
----
-title: Code Structure
-keywords: usecases
-sidebar: mydoc_sidebar
-permalink: code_and_design.html
----
-
-## Code & Project Structure
-
- * hoodie-client : Spark client library to take a bunch of inserts + updates and apply them to a Hoodie table
- * hoodie-common : Common code shared between different artifacts of Hoodie
-
- ## HoodieLogFormat
-
- The following diagram depicts the LogFormat for Hoodie MergeOnRead. Each logfile consists of one or more log blocks.
- Each logblock follows the format shown below.
-
- | Field | Description |
- |-------------- |------------------|
- | MAGIC | A magic header that marks the start of a block |
- | VERSION | The version of the LogFormat, this helps define how to switch between different log format as it evolves |
- | TYPE | The type of the log block |
- | HEADER LENGTH | The length of the headers, 0 if no headers |
- | HEADER | Metadata needed for a log block. For eg. INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA etc. |
- | CONTENT LENGTH | The length of the content of the log block |
- | CONTENT | The content of the log block, for example, for a DATA_BLOCK, the content is (number of records + actual records) in byte [] |
- | FOOTER LENGTH | The length of the footers, 0 if no footers |
- | FOOTER | Metadata needed for a log block. For eg. index entries, a bloom filter for records in a DATA_BLOCK etc. |
- | LOGBLOCK LENGTH | The total number of bytes written for a log block, typically the SUM(everything_above). This is a LONG. This acts as a reverse pointer to be able to traverse the log in reverse.|
-
-
- {% include image.html file="hoodie_log_format_v2.png" alt="hoodie_log_format_v2.png" %}
-
-
-
-
-
-
diff --git a/docs/community.md b/docs/community.md
index c508191..c16dc92 100644
--- a/docs/community.md
+++ b/docs/community.md
@@ -6,17 +6,35 @@ toc: false
permalink: community.html
---
+## Engage with us
+
+There are several ways to get in touch with the Hudi community.
+
+| When? | Channel to use |
+|-------|--------|
+| For any general questions, user support, development discussions | Dev Mailing list ([Subscribe](mailto:[email protected]), [Unsubscribe](mailto:[email protected]), [Archives](https://lists.apache.org/[email protected])). Empty email works for subscribe/unsubscribe |
+| For reporting bugs or issues or discover known issues | Please use [ASF Hudi JIRA](https://issues.apache.org/jira/projects/HUDI/summary) |
+| For quick pings & 1-1 chats | Join our [slack group](https://join.slack.com/t/apache-hudi/signup) |
+| For proposing large features, changes | Start a Hudi Improvement Process (HIP). Instructions coming soon.|
+| For stream of commits, pull requests etc | Commits Mailing list ([Subscribe](mailto:[email protected]), [Unsubscribe](mailto:[email protected]), [Archives](https://lists.apache.org/[email protected])) |
+
+If you wish to report a security vulnerability, please contact [[email protected]](mailto:[email protected]).
+Apache Hudi follows the typical Apache vulnerability handling [process](https://apache.org/security/committers.html#vulnerability-handling).
+
## Contributing
-We :heart: contributions. If you find a bug in the library or would like to add new features, go ahead and open
-issues or pull requests against this repo. Before you do so, please sign the
-[Apache CLA](https://www.apache.org/licenses/icla.pdf).
-Also, be sure to write unit tests for your bug fix or feature to show that it works as expected.
-If the reviewer feels this contributions needs to be in the release notes, please add it to CHANGELOG.md as well.
-If you want to participate in day-day conversations, please join our [slack group](https://join.slack.com/t/apache-hudi/signup).
-If you are from select pre-listed email domains, you can self signup. Others, please subscribe to [email protected]
+Apache Hudi community welcomes contributions from anyone!
+
+Here are few ways, you can get involved.
+
+ - Ask (and/or) answer questions on our support channels listed above.
+ - Review code or HIPs
+ - Help improve documentation
+ - Testing; Improving out-of-box experience by reporting bugs
+ - Share new ideas/directions to pursue or propose a new HIP
+ - Contributing code to the project
-## Becoming a Committer
+#### Code Contributions
-Hoodie has adopted a lot of guidelines set forth in [Google Chromium project](https://www.chromium.org/getting-involved/become-a-committer), to determine committership proposals. However, given this is a much younger project, we would have the contribution bar to be 10-15 non-trivial patches instead.
-Additionally, we expect active engagement with the community over a few months, in terms of conference/meetup talks, helping out with issues/questions on slack/github.
+Useful resources for contributing can be found under the "Developers" top menu.
+Specifically, please refer to the detailed [contribution guide](contributing.html).
diff --git a/docs/concepts.md b/docs/concepts.md
index 5ce3fc6..845228a 100644
--- a/docs/concepts.md
+++ b/docs/concepts.md
@@ -20,7 +20,7 @@ Such key activities include
* `COMMITS` - A single commit captures information about an **atomic write**
of a batch of records into a dataset.
Commits are identified by a monotonically increasing timestamp,
denoting the start of the write operation.
* `CLEANS` - Background activity that gets rid of older versions of files in
the dataset, that are no longer needed.
- * `DELTA_COMMITS` - A single commit captures information about an **atomic write** of a batch of records into a
+ * `DELTA_COMMITS` - A single commit captures information about an **atomic write** of a batch of records into a
MergeOnRead storage type of dataset
* `COMPACTIONS` - Background activity to reconcile differential data
structures within Hudi e.g: moving updates from row based log files to columnar
formats.
@@ -37,15 +37,15 @@ only the changed files without say scanning all the time
buckets > 07:00.
## Terminologies
- * `Hudi Dataset`
- A structured hive/spark dataset managed by Hudi. Hudi supports both partitioned and non-partitioned Hive tables.
- * `Commit`
- A commit marks a new batch of data applied to a dataset. Hudi maintains monotonically increasing timestamps to track commits and guarantees that a commit is atomically
+ * `Hudi Dataset`
+ A structured hive/spark dataset managed by Hudi. Hudi supports both partitioned and non-partitioned Hive tables.
+ * `Commit`
+ A commit marks a new batch of data applied to a dataset. Hudi maintains monotonically increasing timestamps to track commits and guarantees that a commit is atomically
published.
* `Commit Timeline`
- Commit Timeline refers to the sequence of Commits that was applied in order on a dataset over its lifetime.
- * `File Slice`
- Hudi provides efficient handling of updates by having a fixed mapping between record key to a logical file Id.
+ Commit Timeline refers to the sequence of Commits that was applied in order on a dataset over its lifetime.
+ * `File Slice`
+ Hudi provides efficient handling of updates by having a fixed mapping between record key to a logical file Id.
Hudi uses MVCC to provide atomicity and isolation of readers from a
writer. This means that a logical fileId will
have many physical versions of it. Each of these physical version of a
file represents a complete view of the
file as of a commit and is called File Slice
@@ -69,8 +69,6 @@ Hudi (will) supports the following storage types.
- Copy On Write : A heavily read optimized storage type, that simply creates
new versions of files corresponding to the records that changed.
- Merge On Read : Also provides a near-real time datasets in the order of 5
mins, by shifting some of the write cost, to the reads and merging incoming and
on-disk data on-the-fly
-{% include callout.html content="Hudi is a young project. merge-on-read is currently underway. Get involved [here](https://github.com/uber/Hudi/projects/1)" type="info" %}
-
Regardless of the storage type, Hudi organizes a datasets into a directory
structure under a `basepath`,
very similar to Hive tables. Dataset is broken up into partitions, which are
folders containing files for that partition.
Each partition uniquely identified by its `partitionpath`, which is relative
to the basepath.
@@ -92,12 +90,12 @@ commit, such that only columnar data exists. As a result,
the write amplificatio
Following illustrates how this works conceptually, when data written into
copy-on-write storage and two queries running on top of it.
-{% include image.html file="Hudi_cow.png" alt="Hudi_cow.png" %}
+{% include image.html file="hudi_cow.png" alt="hudi_cow.png" %}
As data gets written, updates to existing file ids, produce a new version for
that file id stamped with the commit and
inserts allocate a new file id and write its first version for that file id.
These file versions and their commits are color coded above.
-Normal SQL queries running against such dataset (eg: select count(*) counting the total records in that partition), first checks the timeline for latest commit
+Normal SQL queries running against such dataset (eg: `select count(*)` counting the total records in that partition), first checks the timeline for latest commit
and filters all but latest versions of each file id. As you can see, an old
query does not see the current inflight commit's files colored in pink,
but a new query starting after the commit picks up the new data. Thus queries
are immune to any write failures/partial writes and only run on committed data.
@@ -118,7 +116,7 @@ their columnar base data, to keep the query performance in
check (larger append
Following illustrates how the storage works, and shows queries on both
near-real time table and read optimized table.
-{% include image.html file="Hudi_mor.png" alt="Hudi_mor.png" max-width="1000" %}
+{% include image.html file="hudi_mor.png" alt="hudi_mor.png" max-width="1000" %}
There are lot of interesting things happening in this example, which bring out
the subleties in the approach.
@@ -135,8 +133,6 @@ There are lot of interesting things happening in this
example, which bring out t
strategy, where we aggressively compact the latest partitions compared to
older partitions, we could ensure the RO Table sees data
published within X minutes in a consistent fashion.
-{% include callout.html content="Hudi is a young project. merge-on-read is currently underway. Get involved [here](https://github.com/uber/hoodie/projects/1)" type="info" %}
-
The intention of merge on read storage, is to enable near real-time processing
directly on top of Hadoop, as opposed to copying
data out to specialized systems, which may not be able to handle the data
volume.
@@ -156,4 +152,4 @@ data out to specialized systems, which may not be able to
handle the data volume
| Trade-off | ReadOptimized | RealTime |
|-------------- |------------------| ------------------|
| Data Latency | Higher | Lower |
-| Query Latency | Lower (raw columnar performance) | Higher (merge columnar + row based delta) |
\ No newline at end of file
+| Query Latency | Lower (raw columnar performance) | Higher (merge columnar + row based delta) |
diff --git a/docs/configurations.md b/docs/configurations.md
index 50a7e5f..e6602e6 100644
--- a/docs/configurations.md
+++ b/docs/configurations.md
@@ -136,7 +136,7 @@ summary: "Here we list all possible configurations and what
they mean"
Actual value ontained by invoking .toString()</span>
- [KEYGENERATOR_CLASS_OPT_KEY](#KEYGENERATOR_CLASS_OPT_KEY) (Default:
com.uber.hoodie.SimpleKeyGenerator) <br/>
<span style="color:grey">Key generator class, that implements will
extract the key out of incoming `Row` object</span>
- - [COMMIT_METADATA_KEYPREFIX_OPT_KEY](#COMMIT_METADATA_KEYPREFIX_OPT_KEY) (Default: _) <br/>
+ - [COMMIT_METADATA_KEYPREFIX_OPT_KEY](#COMMIT_METADATA_KEYPREFIX_OPT_KEY) (Default: `_`) <br/>
<span style="color:grey">Option keys beginning with this prefix, are
automatically added to the commit/deltacommit metadata.
This is useful to store checkpointing information, in a consistent way
with the hoodie timeline</span>
@@ -160,22 +160,33 @@ summary: "Here we list all possible configurations and
what they mean"
Writing data via Hudi happens as a Spark job and thus general rules of spark
debugging applies here too. Below is a list of things to keep in mind, if you
are looking to improving performance or reliability.
- - **Write operations** : Use `bulkinsert` to load new data into a table, and there on use `upsert`/`insert`.
+**Write operations** : Use `bulkinsert` to load new data into a table, and there on use `upsert`/`insert`.
Difference between them is that bulk insert uses a disk based write path to
scale to load large inputs without need to cache it.
- - **Input Parallelism** : By default, Hoodie tends to over-partition input (i.e `withParallelism(1500)`), to ensure each Spark partition stays within the 2GB limit for inputs upto 500GB. Bump this up accordingly if you have larger inputs. We recommend having shuffle parallelism `hoodie.[insert|upsert|bulkinsert].shuffle.parallelism` such that its atleast input_data_size/500MB
- - **Off-heap memory** : Hoodie writes parquet files and that needs good amount of off-heap memory proportional to schema width. Consider setting something like `spark.yarn.executor.memoryOverhead` or `spark.yarn.driver.memoryOverhead`, if you are running into such failures.
- - **Spark Memory** : Typically, hoodie needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.storage.memoryFraction` will generally help boost performance.
- - **Sizing files** : Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it.
- - **Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more volumnious with lot more records per partition. In such cases
+
+**Input Parallelism** : By default, Hoodie tends to over-partition input (i.e `withParallelism(1500)`), to ensure each Spark partition stays within the 2GB limit for inputs upto 500GB. Bump this up accordingly if you have larger inputs. We recommend having shuffle parallelism `hoodie.[insert|upsert|bulkinsert].shuffle.parallelism` such that its atleast input_data_size/500MB
+
+**Off-heap memory** : Hoodie writes parquet files and that needs good amount of off-heap memory proportional to schema width. Consider setting something like `spark.yarn.executor.memoryOverhead` or `spark.yarn.driver.memoryOverhead`, if you are running into such failures.
+
+**Spark Memory** : Typically, hoodie needs to be able to read a single file into memory to perform merges or compactions and thus the executor memory should be sufficient to accomodate this. In addition, Hoodie caches the input to be able to intelligently place data and thus leaving some `spark.storage.memoryFraction` will generally help boost performance.
+
+**Sizing files** : Set `limitFileSize` above judiciously, to balance ingest/write latency vs number of files & consequently metadata overhead associated with it.
+
+**Timeseries/Log data** : Default configs are tuned for database/nosql changelogs where individual record sizes are large. Another very popular class of data is timeseries/event/log data that tends to be more volumnious with lot more records per partition. In such cases
- Consider tuning the bloom filter accuracy via
`.bloomFilterFPP()/bloomFilterNumEntries()` to achieve your target index look
up time
- Consider making a key that is prefixed with time of the event, which
will enable range pruning & significantly speeding up index lookup.
- - **GC Tuning** : Please be sure to follow garbage collection tuning tips from Spark tuning guide to avoid OutOfMemory errors
- - [Must] Use G1/CMS Collector. Sample CMS Flags to add to spark.executor.extraJavaOptions : ``-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/ho [...]
- - If it keeps OOMing still, reduce spark memory conservatively: `spark.memory.fraction=0.2, spark.memory.storageFraction=0.2` allowing it to spill rather than OOM. (reliably slow vs crashing intermittently)
- Below is a full working production config
+**GC Tuning** : Please be sure to follow garbage collection tuning tips from Spark tuning guide to avoid OutOfMemory errors
+[Must] Use G1/CMS Collector. Sample CMS Flags to add to spark.executor.extraJavaOptions :
- ```
+```
+-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
+````
+
+If it keeps OOMing still, reduce spark memory conservatively: `spark.memory.fraction=0.2, spark.memory.storageFraction=0.2` allowing it to spill rather than OOM. (reliably slow vs crashing intermittently)
+
+Below is a full working production config
+
+```
spark.driver.extraClassPath /etc/hive/conf
spark.driver.extraJavaOptions -XX:+PrintTenuringDistribution
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof
spark.driver.maxResultSize 2g
@@ -200,4 +211,5 @@ Writing data via Hudi happens as a Spark job and thus
general rules of spark deb
spark.yarn.driver.memoryOverhead 1024
spark.yarn.executor.memoryOverhead 3072
spark.yarn.max.executor.failures 100
- ```
+
+````
diff --git a/docs/contributing.md b/docs/contributing.md
new file mode 100644
index 0000000..a93ba54
--- /dev/null
+++ b/docs/contributing.md
@@ -0,0 +1,101 @@
+---
+title: Developer Setup
+keywords: developer setup
+sidebar: mydoc_sidebar
+toc: false
+permalink: contributing.html
+---
+## Pre-requisites
+
+To contribute code, you need
+
+ - a GitHub account
+ - a Linux (or) macOS development environment with Java JDK 8, Apache Maven (3.x+) installed
+ - [Docker](https://www.docker.com/) installed for running demo, integ tests or building website
+ - for large contributions, a signed [Individual Contributor License
+ Agreement](https://www.apache.org/licenses/icla.pdf) (ICLA) to the Apache
+ Software Foundation (ASF).
+ - (Recommended) Create an account on [JIRA](https://issues.apache.org/jira/projects/HUDI/summary) to open issues/find similar issues.
+ - (Recommended) Join our dev mailing list & slack channel, listed on [community](community.html) page.
+
+
+## IDE Setup
+
+To contribute, you would need to fork the Hudi code on Github & then clone your own fork locally. Once cloned, we recommend building as per instructions on [quickstart](quickstart.html)
+
+We have embraced the code style largely based on [google format](https://google.github.io/styleguide/javaguide.html). Please setup your IDE with style files from [here](../style/).
+These instructions have been tested on IntelliJ. We also recommend setting up the [Save Action Plugin](https://plugins.jetbrains.com/plugin/7642-save-actions) to auto format & organize imports on save. The Maven Compilation life-cycle will fail if there are checkstyle violations.
+
+
+## Lifecycle
+
+Here's a typical lifecycle of events to contribute to Hudi.
+
+ - [Recommended] Share your intent on the mailing list, so that community can
provide early feedback, point out any similar JIRAs or HIPs.
+ - [Optional] If you want to get involved, but don't have a project in mind,
please check JIRA for small, quick-starters.
+ - [Optional] Familiarize yourself with internals of Hudi using content on
this page, as well as [wiki](https://cwiki.apache.org/confluence/display/HUDI)
+ - Once you finalize on a project/task, please open a new JIRA or assign an
existing one to yourself. (If you don't have perms to do this, please email the
dev mailing list with your JIRA id and a small intro for yourself. We'd be
happy to add you as a contributor)
+ - Make your code change
+ - Every source file needs to include the Apache license header. Every new
dependency needs to
+ have an open source license
[compatible](https://www.apache.org/legal/resolved.html#criteria) with Apache.
+ - Get existing tests to pass using `mvn clean install -DskipITs`
+ - Add adequate tests for your new functionality
+ - [Optional] For involved changes, it's best to also run the entire
integration test suite using `mvn clean install`
+ - For website changes, please build the site locally & test navigation,
formatting & links thoroughly
+ - Format commit messages and the pull request title like `[HUDI-XXX] Fixes
bug in Spark Datasource`,
+ where you replace HUDI-XXX with the appropriate JIRA issue.
+ - Push your commit to your own fork/branch & create a pull request (PR)
against the Hudi repo.
+ - If you don't hear back within 3 days on the PR, please send an email to the
dev@ mailing list.
+ - Address code review comments & keep pushing changes to your fork/branch,
which automatically updates the PR
+ - Before your change can be merged, it should be squashed into a single
commit for cleaner commit history.
+
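The `[HUDI-XXX]` title convention above can be checked mechanically before pushing. The following is a minimal Java sketch and not part of the Hudi build; the class name `CommitTitleCheck` and the exact pattern (a bracketed `HUDI-` key with a numeric id, one space, then a non-empty summary) are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hudi: checks that a commit or PR title
// follows the "[HUDI-123] Summary" convention described above.
final class CommitTitleCheck {
    // Assumed pattern: "[HUDI-<digits>]", one space, then a summary that
    // starts with a non-whitespace character.
    private static final Pattern TITLE = Pattern.compile("^\\[HUDI-\\d+\\] \\S.*$");

    private CommitTitleCheck() {}

    static boolean isValid(String title) {
        return title != null && TITLE.matcher(title).matches();
    }

    public static void main(String[] args) {
        // Prints whether each sample title matches the assumed convention.
        System.out.println(isValid("[HUDI-42] Fixes bug in Spark Datasource"));
        System.out.println(isValid("Fixes bug in Spark Datasource"));
    }
}
```

A check like this could be wired into a local git hook; whether the project would want to enforce it is a separate decision.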
+
+## Releases
+
+ - The Apache Hudi community plans to do minor version releases every 6 weeks or
so.
+ - If your contribution is merged onto the `master` branch after the last release, it
will become part of the next release.
+ - Website changes are regenerated once a week (until automation is in place to
reflect them immediately)
+
+
+## Accounts and Permissions
+
+ - [Hudi issue tracker
(JIRA)](https://issues.apache.org/jira/projects/HUDI/issues):
+ Anyone can access it and browse issues. Anyone can register an account and
log in
+ to create issues or add comments. Only contributors can be assigned issues.
If
+ you want to be assigned issues, a PMC member can add you to the project
contributor
+ group. Email the dev mailing list to ask to be added as a contributor, and
include your ASF Jira username.
+
+ - [Hudi Wiki Space](https://cwiki.apache.org/confluence/display/HUDI):
+ Anyone has read access. If you wish to contribute changes, please create an
account and
+ request edit access on the dev@ mailing list (include your Wiki account
user ID).
+
+ - Pull requests can only be merged by a HUDI committer, listed
[here](https://incubator.apache.org/projects/hudi.html)
+
+ - [Voting on a release](https://www.apache.org/foundation/voting.html):
Everyone can vote.
+ Only Hudi PMC members should mark their votes as binding.
+
+## Communication
+
+All communication is expected to align with the [Code of
Conduct](https://www.apache.org/foundation/policies/conduct).
+Discussion about contributing code to Hudi happens on the [dev@ mailing
list](community.html). Introduce yourself!
+
+
+## Code & Project Structure
+
+ * `docker` : Docker containers used by demo and integration tests. Brings up
a mini data ecosystem locally
+ * `hoodie-cli` : CLI to inspect, manage and administer datasets
+ * `hoodie-client` : Spark client library to take a bunch of inserts +
updates and apply them to a Hoodie table
+ * `hoodie-common` : Common classes used across modules
+ * `hoodie-hadoop-mr` : InputFormat implementations for ReadOptimized,
Incremental, Realtime views
+ * `hoodie-hive` : Manages Hive tables for Hudi datasets and houses the
HiveSyncTool
+ * `hoodie-integ-test` : Longer running integration test processes
+ * `hoodie-spark` : Spark datasource for writing and reading Hudi datasets.
Streaming sink.
+ * `hoodie-utilities` : Houses tools like DeltaStreamer, SnapshotCopier
+ * `packaging` : Poms for building out bundles for easier drop in to Spark,
Hive, Presto
+ * `style` : Code formatting, checkstyle files
+
+
+## Website
+
+[Apache Hudi site](https://hudi.apache.org) is hosted on a special `asf-site`
branch. Please follow the `README` file under `docs` on that branch for
+instructions on making changes to the website.
diff --git a/docs/dev_setup.md b/docs/dev_setup.md
deleted file mode 100644
index 1bdeec7..0000000
--- a/docs/dev_setup.md
+++ /dev/null
@@ -1,13 +0,0 @@
----
-title: Developer Setup
-keywords: developer setup
-sidebar: mydoc_sidebar
-permalink: dev_setup.html
----
-
-### Code Style
-
- We have embraced the code style largely based on [google
format](https://google.github.io/styleguide/javaguide.html).
- Please setup your IDE with style files from [here](../style/)
- We also recommend setting up the [Save Action
Plugin](https://plugins.jetbrains.com/plugin/7642-save-actions) to auto format
& organize imports on save.
- The Maven Compilation life-cycle will fail if there are checkstyle violations.
diff --git a/docs/images/hoodie_cow.png b/docs/images/hoodie_cow.png
deleted file mode 100644
index bad15a8..0000000
Binary files a/docs/images/hoodie_cow.png and /dev/null differ
diff --git a/docs/images/hoodie_mor.png b/docs/images/hoodie_mor.png
deleted file mode 100644
index 8d7d902..0000000
Binary files a/docs/images/hoodie_mor.png and /dev/null differ
diff --git a/docs/images/hudi_cow.png b/docs/images/hudi_cow.png
new file mode 100644
index 0000000..40aca71
Binary files /dev/null and b/docs/images/hudi_cow.png differ
diff --git a/docs/images/hudi_mor.png b/docs/images/hudi_mor.png
new file mode 100644
index 0000000..100b8f0
Binary files /dev/null and b/docs/images/hudi_mor.png differ
diff --git a/docs/images/hoodie_timeline.png b/docs/images/hudi_timeline.png
similarity index 100%
rename from docs/images/hoodie_timeline.png
rename to docs/images/hudi_timeline.png
diff --git a/docs/implementation.md b/docs/implementation.md
index 6215155..e87a541 100644
--- a/docs/implementation.md
+++ b/docs/implementation.md
@@ -23,7 +23,7 @@ Hudi upsert/insert is merely a Spark DAG, that can be broken
into two big pieces
Hudi currently provides two choices for indexes : `BloomIndex` and
`HBaseIndex` to map a record key into the file id to which it belongs to. This
enables
us to speed up upserts significantly, without scanning over every record in
the dataset. Hudi Indices can be classified based on
-their ability to lookup records across partition. A `global` index does not
need partition information for finding the file-id for a record key
+their ability to lookup records across partition. A `global` index does not
need partition information for finding the file-id for a record key
but a `non-global` does.
#### HBase Index (global)
@@ -63,8 +63,8 @@ records such that
In this storage, index updation is a no-op, since the bloom filters are
already written as a part of committing data.
-In the case of Copy-On-Write, a single parquet file constitutes one `file
slice` which contains one complete version of
-the file
+In the case of Copy-On-Write, a single parquet file constitutes one `file
slice` which contains one complete version of
+the file
{% include image.html file="hoodie_log_format_v2.png"
alt="hoodie_log_format_v2.png" max-width="1000" %}
@@ -73,27 +73,27 @@ the file
In the Merge-On-Read storage model, there are 2 logical components - one for
ingesting data (both inserts/updates) into the dataset
and another for creating compacted views. The former is hereby referred to as
`Writer` while the later
is referred as `Compactor`.
-
+
##### Merge On Read Writer
-
+
At a high level, Merge-On-Read Writer goes through same stages as
Copy-On-Write writer in ingesting data.
- The key difference here is that updates are appended to latest log (delta)
file belonging to the latest file slice
+ The key difference here is that updates are appended to latest log (delta)
file belonging to the latest file slice
without merging. For inserts, Hudi supports 2 modes:
1. Inserts to Log Files - This is done for datasets that have an indexable
log files (for eg global index)
2. Inserts to parquet files - This is done for datasets that do not have
indexable log files, for eg bloom index
embedded in parquer files. Hudi treats writing new records in the same
way as inserting to Copy-On-Write files.
-As in the case of Copy-On-Write, the input tagged records are partitioned such
that all upserts destined to
+As in the case of Copy-On-Write, the input tagged records are partitioned such
that all upserts destined to
a `file id` are grouped together. This upsert-batch is written as one or more
log-blocks written to log-files.
Hudi allows clients to control log file sizes (See [Storage
Configs](../configurations))
The WriteClient API is same for both Copy-On-Write and Merge-On-Read writers.
-
+
With Merge-On-Read, several rounds of data-writes would have resulted in
accumulation of one or more log-files.
All these log-files along with base-parquet (if exists) constitute a `file
slice` which represents one complete version
-of the file.
-
+of the file.
+
#### Compactor
Realtime Readers will perform in-situ merge of these delta log-files to
provide the most recent (committed) view of
@@ -106,48 +106,52 @@ Asynchronous Compaction involves 2 steps:
to be compacted atomically in a single compaction commit. Hudi allows
pluggable strategies for choosing
file slices for each compaction runs. This step is typically done inline
by Writer process as Hudi expects
only one schedule is being generated at a time which allows Hudi to
enforce the constraint that pending compaction
- plans do not step on each other file-slices. This constraint allows for
multiple concurrent `Compactors` to run at
+ plans do not step on each other file-slices. This constraint allows for
multiple concurrent `Compactors` to run at
the same time. Some of the common strategies used for choosing `file
slice` for compaction are:
- * BoundedIO - Limit the number of file slices chosen for a compaction plan
by expected total IO (read + write)
- needed to complete compaction run
+ * BoundedIO - Limit the number of file slices chosen for a compaction plan
by expected total IO (read + write)
+ needed to complete compaction run
* Log File Size - Prefer file-slices with larger amounts of delta log data
to be merged
* Day Based - Prefer file slice belonging to latest day partitions
- ```
- API for scheduling compaction
- /**
- * Schedules a new compaction instant
- * @param extraMetadata
- * @return Compaction Instant timestamp if a new compaction plan is
scheduled
- */
- Optional<String> scheduleCompaction(Optional<Map<String, String>>
extraMetadata) throws IOException;
- ```
+
* `Compactor` : Hudi provides a separate API in Write Client to execute a
compaction plan. The compaction
plan (just like a commit) is identified by a timestamp. Most of the design
and implementation complexities for Async
Compaction is for guaranteeing snapshot isolation to readers and writer
when
multiple concurrent compactors are running. Typical compactor deployment
involves launching a separate
spark application which executes pending compactions when they become
available. The core logic of compacting
file slices in the Compactor is very similar to that of merging updates in
a Copy-On-Write table. The only
- difference being in the case of compaction, there is an additional step of
merging the records in delta log-files.
-
- Here are the main API to lookup and execute a compaction plan.
- ```
- Main API in HoodieWriteClient for running Compaction:
- /**
- * Performs Compaction corresponding to instant-time
- * @param compactionInstantTime Compaction Instant Time
- * @return
- * @throws IOException
- */
- public JavaRDD<WriteStatus> compact(String compactionInstantTime)
throws IOException;
-
- To lookup all pending compactions, use the API defined in
HoodieReadClient
-
- /**
- * Return all pending compactions with instant time for clients to
decide what to compact next.
- * @return
- */
- public List<Pair<String, HoodieCompactionPlan>> getPendingCompactions();
- ```
+ difference being in the case of compaction, there is an additional step of
merging the records in delta log-files.
+
+Here are the main APIs to look up and execute a compaction plan.
+
+```
+ Main API in HoodieWriteClient for running Compaction:
+ /**
+ * Performs Compaction corresponding to instant-time
+ * @param compactionInstantTime Compaction Instant Time
+ * @return
+ * @throws IOException
+ */
+ public JavaRDD<WriteStatus> compact(String compactionInstantTime) throws
IOException;
+
+ To lookup all pending compactions, use the API defined in HoodieReadClient
+
+ /**
+ * Return all pending compactions with instant time for clients to decide
what to compact next.
+ * @return
+ */
+ public List<Pair<String, HoodieCompactionPlan>> getPendingCompactions();
+```
+API for scheduling compaction
+
+```
+
+ /**
+ * Schedules a new compaction instant
+ * @param extraMetadata
+ * @return Compaction Instant timestamp if a new compaction plan is
scheduled
+ */
+ Optional<String> scheduleCompaction(Optional<Map<String, String>>
extraMetadata) throws IOException;
+```
Refer to __hoodie-client/src/test/java/HoodieClientExample.java__ class for
an example of how compaction
is scheduled and executed.
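
The two-step flow above (schedule a plan, then execute it by instant time) can be sketched without Hudi or Spark on the classpath. This is an illustration under stated assumptions, not the project's implementation: `CompactionClient` is a hypothetical stand-in that mirrors the two signatures quoted above, `java.util.Optional` is used in place of whichever `Optional` type the real client uses, and `List<String>` replaces `JavaRDD<WriteStatus>`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the write-client calls quoted above.
interface CompactionClient {
    Optional<String> scheduleCompaction(Optional<Map<String, String>> extraMetadata) throws IOException;

    // The real method returns JavaRDD<WriteStatus>; List<String> is an assumption for this sketch.
    List<String> compact(String compactionInstantTime) throws IOException;
}

// In-memory fake: hands out instants "001", "002", ... and tracks their state.
final class InMemoryCompactionClient implements CompactionClient {
    private int nextInstant = 1;
    final List<String> pending = new ArrayList<>();
    final List<String> completed = new ArrayList<>();

    @Override
    public Optional<String> scheduleCompaction(Optional<Map<String, String>> extraMetadata) {
        String instant = String.format("%03d", nextInstant++);
        pending.add(instant);
        return Optional.of(instant);
    }

    @Override
    public List<String> compact(String compactionInstantTime) throws IOException {
        // Only a previously scheduled (pending) instant can be compacted.
        if (!pending.remove(compactionInstantTime)) {
            throw new IOException("No pending compaction at " + compactionInstantTime);
        }
        completed.add(compactionInstantTime);
        return Collections.singletonList("compacted@" + compactionInstantTime);
    }
}

final class CompactionFlow {
    private CompactionFlow() {}

    // Schedules a plan and, if one was created, executes it by its instant time.
    // In a real deployment the two calls would typically run in different processes.
    static List<String> scheduleThenCompact(CompactionClient client) {
        try {
            Optional<String> instant = client.scheduleCompaction(Optional.empty());
            if (!instant.isPresent()) {
                return Collections.emptyList(); // nothing was scheduled
            }
            return client.compact(instant.get());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of the sketch is the handoff: the instant time returned by scheduling is the only thing the compactor needs in order to find and run the plan.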
@@ -172,65 +176,65 @@ plan to be run to figure out the number of file slices
being compacted and choos
## Async Compaction Design Deep-Dive (Optional)
-For the purpose of this section, it is important to distinguish between 2
types of commits as pertaining to the file-group:
+For the purpose of this section, it is important to distinguish between 2
types of commits as pertaining to the file-group:
A commit which generates a merged and read-optimized file-slice is called
`snapshot commit` (SC) with respect to that file-group.
-A commit which merely appended the new/updated records assigned to the
file-group into a new log block is called `delta commit` (DC)
+A commit which merely appended the new/updated records assigned to the
file-group into a new log block is called `delta commit` (DC)
with respect to that file-group.
### Algorithm
The algorithm is described with an illustration. Let us assume a scenario
where there are commits SC1, DC2, DC3 that have
-already completed on a data-set. Commit DC4 is currently ongoing with the
writer (ingestion) process using it to upsert data.
-Let us also imagine there are a set of file-groups (FG1 … FGn) in the data-set
whose latest version (`File-Slice`)
-contains the base file created by commit SC1 (snapshot-commit in columnar
format) and a log file containing row-based
-log blocks of 2 delta-commits (DC2 and DC3).
+already completed on a data-set. Commit DC4 is currently ongoing with the
writer (ingestion) process using it to upsert data.
+Let us also imagine there are a set of file-groups (FG1 … FGn) in the data-set
whose latest version (`File-Slice`)
+contains the base file created by commit SC1 (snapshot-commit in columnar
format) and a log file containing row-based
+log blocks of 2 delta-commits (DC2 and DC3).
{% include image.html file="async_compac_1.png" alt="async_compac_1.png"
max-width="1000" %}
- * Writer (Ingestion) that is going to commit "DC4" starts. The record updates
in this batch are grouped by file-groups
- and appended in row formats to the corresponding log file as delta commit.
Let us imagine a subset of file-groups has
+ * Writer (Ingestion) that is going to commit "DC4" starts. The record updates
in this batch are grouped by file-groups
+ and appended in row formats to the corresponding log file as delta commit.
Let us imagine a subset of file-groups has
this new log block (delta commit) DC4 added.
- * Before the writer job completes, it runs the compaction strategy to decide
which file-group to compact by compactor
- and creates a new compaction-request commit SC5. This commit file is marked
as “requested” with metadata denoting
- which fileIds to compact (based on selection policy). Writer completes
without running compaction (will be run async).
-
+ * Before the writer job completes, it runs the compaction strategy to decide
which file-group to compact by compactor
+ and creates a new compaction-request commit SC5. This commit file is marked
as “requested” with metadata denoting
+ which fileIds to compact (based on selection policy). Writer completes
without running compaction (will be run async).
+
{% include image.html file="async_compac_2.png" alt="async_compac_2.png"
max-width="1000" %}
-
- * Writer job runs again ingesting next batch. It starts with commit DC6. It
reads the earliest inflight compaction
- request marker commit in timeline order and collects the (fileId,
Compaction Commit Id “CcId” ) pairs from meta-data.
- Ingestion DC6 ensures a new file-slice with base-commit “CcId” gets
allocated for the file-group.
- The Writer will simply append records in row-format to the first log-file
(as delta-commit) assuming the
+
+ * Writer job runs again ingesting next batch. It starts with commit DC6. It
reads the earliest inflight compaction
+ request marker commit in timeline order and collects the (fileId,
Compaction Commit Id “CcId” ) pairs from meta-data.
+ Ingestion DC6 ensures a new file-slice with base-commit “CcId” gets
allocated for the file-group.
+ The Writer will simply append records in row-format to the first log-file
(as delta-commit) assuming the
base-file (“Phantom-Base-File”) will be created eventually by the compactor.
-
+
{% include image.html file="async_compac_3.png" alt="async_compac_3.png"
max-width="1000" %}
-
- * Compactor runs at some time and commits at “Tc” (concurrently or
before/after Ingestion DC6). It reads the commit-timeline
- and finds the first unprocessed compaction request marker commit. Compactor
reads the commit’s metadata finding the
- file-slices to be compacted. It compacts the file-slice and creates the
missing base-file (“Phantom-Base-File”)
- with “CCId” as the commit-timestamp. Compactor then marks the compaction
commit timestamp as completed.
- It is important to realize that at data-set level, there could be different
file-groups requesting compaction at
+
+ * Compactor runs at some time and commits at “Tc” (concurrently or
before/after Ingestion DC6). It reads the commit-timeline
+ and finds the first unprocessed compaction request marker commit. Compactor
reads the commit’s metadata finding the
+ file-slices to be compacted. It compacts the file-slice and creates the
missing base-file (“Phantom-Base-File”)
+ with “CCId” as the commit-timestamp. Compactor then marks the compaction
commit timestamp as completed.
+ It is important to realize that at data-set level, there could be different
file-groups requesting compaction at
different commit timestamps.
-
+
{% include image.html file="async_compac_4.png" alt="async_compac_4.png"
max-width="1000" %}
- * Near Real-time reader interested in getting the latest snapshot will have 2
cases. Let us assume that the
+ * Near Real-time reader interested in getting the latest snapshot will have 2
cases. Let us assume that the
incremental ingestion (writer at DC6) happened before the compaction (some
time “Tc”’).
- The below description is with regards to compaction from file-group
perspective.
- * `Reader querying at time between ingestion completion time for DC6 and
compaction finish “Tc”`:
- Hoodie’s implementation will be changed to become aware of file-groups
currently waiting for compaction and
- merge log-files corresponding to DC2-DC6 with the base-file corresponding
to SC1. In essence, Hudi will create
- a pseudo file-slice by combining the 2 file-slices starting at
base-commits SC1 and SC5 to one.
- For file-groups not waiting for compaction, the reader behavior is
essentially the same - read latest file-slice
+ The below description is with regards to compaction from file-group
perspective.
+ * `Reader querying at time between ingestion completion time for DC6 and
compaction finish “Tc”`:
+ Hoodie’s implementation will be changed to become aware of file-groups
currently waiting for compaction and
+ merge log-files corresponding to DC2-DC6 with the base-file corresponding
to SC1. In essence, Hudi will create
+ a pseudo file-slice by combining the 2 file-slices starting at
base-commits SC1 and SC5 to one.
+ For file-groups not waiting for compaction, the reader behavior is
essentially the same - read latest file-slice
and merge on the fly.
- * `Reader querying at time after compaction finished (> “Tc”)` : In this
case, reader will not find any pending
- compactions in the timeline and will simply have the current behavior of
reading the latest file-slice and
+ * `Reader querying at time after compaction finished (> “Tc”)` : In this
case, reader will not find any pending
+ compactions in the timeline and will simply have the current behavior of
reading the latest file-slice and
merging on-the-fly.
-
- * Read-Optimized View readers will query against the latest columnar
base-file for each file-groups.
+
+ * Read-Optimized View readers will query against the latest columnar
base-file for each file-groups.
The above algorithm explains Async compaction w.r.t a single compaction run on
a single file-group. It is important
-to note that multiple compaction plans can be run concurrently as they are
essentially operating on different
+to note that multiple compaction plans can be run concurrently as they are
essentially operating on different
file-groups.
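
The reader-side rule in the algorithm above can be sketched as a small model. It is a simplification under stated assumptions, not Hudi code: `FileSlice` and `RealtimeReaderSketch` are hypothetical names, a slice is reduced to its base commit, a flag for whether the base file exists yet, and its delta commits, and commits are plain strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of a file slice: a base commit, whether its base file has
// been written yet, and the delta commits appended to its log files.
final class FileSlice {
    final String baseCommit;
    final boolean baseFileExists;
    final List<String> deltaCommits;

    FileSlice(String baseCommit, boolean baseFileExists, List<String> deltaCommits) {
        this.baseCommit = baseCommit;
        this.baseFileExists = baseFileExists;
        this.deltaCommits = Collections.unmodifiableList(new ArrayList<>(deltaCommits));
    }
}

final class RealtimeReaderSketch {
    private RealtimeReaderSketch() {}

    // Returns what a near-real-time reader would merge for one file group:
    // the base commit to read (if any), followed by delta commits in order.
    static List<String> filesToMerge(List<FileSlice> slicesOldestFirst) {
        List<String> result = new ArrayList<>();
        if (slicesOldestFirst.isEmpty()) {
            return result;
        }
        int last = slicesOldestFirst.size() - 1;
        FileSlice latest = slicesOldestFirst.get(last);
        if (!latest.baseFileExists && last > 0) {
            // Compaction is still pending, so the latest base file is a phantom.
            // Build a pseudo file-slice from the previous base file and its logs.
            FileSlice previous = slicesOldestFirst.get(last - 1);
            if (previous.baseFileExists) {
                result.add(previous.baseCommit);
            }
            result.addAll(previous.deltaCommits);
        } else if (latest.baseFileExists) {
            // Compaction finished (or was never needed): read the latest base file.
            result.add(latest.baseCommit);
        }
        result.addAll(latest.deltaCommits);
        return result;
    }
}
```

With slices `SC1 + [DC2, DC3, DC4]` and a pending `SC5 + [DC6]`, the reader merges the SC1 base with all four delta commits; once the SC5 base file exists, it reads only the SC5 base plus DC6.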
## Performance
@@ -272,4 +276,3 @@ with no impact on queries. Following charts compare the
Hudi vs non-Hudi dataset
**Presto**
{% include image.html file="hoodie_query_perf_presto.png"
alt="hoodie_query_perf_presto.png" max-width="1000" %}
-
diff --git a/docs/index.md b/docs/index.md
index b5b9da7..ad87933 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -4,12 +4,9 @@ keywords: homepage
tags: [getting_started]
sidebar: mydoc_sidebar
permalink: index.html
-summary: "Hudi lowers data latency across the board, while simultaneously
achieving orders of magnitude of efficiency over traditional batch processing."
+summary: "Hudi brings stream processing to big data, providing fresh data
while being an order of magnitude more efficient than traditional batch processing."
---
-
-
-
Hudi (pronounced “Hoodie”) ingests & manages storage of large analytical
datasets on
[HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html)
or cloud stores and provides three logical views for query access.
* **Read Optimized View** - Provides excellent query performance on pure
columnar storage, much like plain [Parquet](https://parquet.apache.org/) tables.
@@ -21,4 +18,4 @@ Hudi (pronounced “Hoodie”) ingests & manages storage of large
analytical dat
By carefully managing how data is laid out in storage & how it’s exposed to
queries, Hudi is able to power a rich data ecosystem where external sources can
be ingested in near real-time and made available for interactive SQL Engines
like [Presto](https://prestodb.io) & [Spark](https://spark.apache.org/sql/),
while at the same time capable of being consumed incrementally from
processing/ETL frameworks like [Hive](https://hive.apache.org/) &
[Spark](https://spark.apache.org/docs/latest/) t [...]
-Hudi broadly consists of a self contained Spark library to build datasets and
integrations with existing query engines for data access.
+Hudi broadly consists of a self contained Spark library to build datasets and
integrations with existing query engines for data access. See
[quickstart](quickstart.html) for a demo.
diff --git a/docs/migration_guide.md b/docs/migration_guide.md
index a5d5506..13c27ac 100644
--- a/docs/migration_guide.md
+++ b/docs/migration_guide.md
@@ -4,9 +4,8 @@ keywords: migration guide
sidebar: mydoc_sidebar
permalink: migration_guide.html
toc: false
-summary: In this page, we will discuss some available tools for migrating your
existing dataset into a Hudi managed
-dataset
-
+summary: In this page, we will discuss some available tools for migrating your
existing dataset into a Hudi dataset
+---
Hudi maintains metadata such as commit timeline and indexes to manage a
dataset. The commit timelines helps to understand the actions happening on a
dataset as well as the current state of a dataset. Indexes are used by Hudi to
maintain a record key to file id mapping to efficiently locate a record. At the
moment, Hudi supports writing only parquet columnar formats.
To be able to start using Hudi for your existing dataset, you will need to
migrate your existing dataset into a Hudi managed dataset. There are a couple
of ways to achieve this.
@@ -15,57 +14,60 @@ To be able to start using Hudi for your existing dataset,
you will need to migra
## Approaches
-### Approach 1
+#### Use Hudi for new partitions alone
-Hudi can be used to manage an existing dataset without affecting/altering the
historical data already present in the
-dataset. Hudi has been implemented to be compatible with such a mixed dataset
with a caveat that either the complete
-Hive partition is Hudi managed or not. Thus the lowest granularity at which
Hudi manages a dataset is a Hive
-partition. Start using the datasource API or the WriteClient to write to the
dataset and make sure you start writing
+Hudi can be used to manage an existing dataset without affecting/altering the
historical data already present in the
+dataset. Hudi has been implemented to be compatible with such a mixed dataset
with a caveat that either the complete
+Hive partition is Hudi managed or not. Thus the lowest granularity at which
Hudi manages a dataset is a Hive
+partition. Start using the datasource API or the WriteClient to write to the
dataset and make sure you start writing
to a new partition or convert your last N partitions into Hudi instead of the
entire table. Note, since the historical
- partitions are not managed by HUDI, none of the primitives provided by HUDI
work on the data in those partitions. More concretely, one cannot perform
upserts or incremental pull on such older partitions not managed by the HUDI
dataset.
+ partitions are not managed by HUDI, none of the primitives provided by HUDI
work on the data in those partitions. More concretely, one cannot perform
upserts or incremental pull on such older partitions not managed by the HUDI
dataset.
Take this approach if your dataset is an append only type of dataset and you
do not expect to perform any updates to existing (or non Hudi managed)
partitions.
-### Approach 2
+#### Convert existing dataset to Hudi
Import your existing dataset into a Hudi managed dataset. Since all the data
is Hudi managed, none of the limitations
- of Approach 1 apply here. Updates spanning any partitions can be applied to
this dataset and Hudi will efficiently
- make the update available to queries. Note that not only do you get to use
all Hoodie primitives on this dataset,
+ of Approach 1 apply here. Updates spanning any partitions can be applied to
this dataset and Hudi will efficiently
+ make the update available to queries. Note that not only do you get to use
all Hoodie primitives on this dataset,
there are other additional advantages of doing this. Hudi automatically
manages file sizes of a Hudi managed dataset
- . You can define the desired file size when converting this dataset and Hudi
will ensure it writes out files
- adhering to the config. It will also ensure that smaller files later get
corrected by routing some new inserts into
+ . You can define the desired file size when converting this dataset and Hudi
will ensure it writes out files
+ adhering to the config. It will also ensure that smaller files later get
corrected by routing some new inserts into
small files rather than writing new small ones thus maintaining the health of
your cluster.
There are a few options when choosing this approach.
+
#### Option 1
-Use the HDFSParquetImporter tool. As the name suggests, this only works if
your existing dataset is in
-parquet file
-format. This tool essentially starts a Spark Job to read the existing parquet
dataset and converts it into a HUDI managed dataset by re-writing all the data.
-#### Option 2
+Use the HDFSParquetImporter tool. As the name suggests, this only works if
your existing dataset is in
+parquet file
+format. This tool essentially starts a Spark Job to read the existing parquet
dataset and converts it into a HUDI managed dataset by re-writing all the data.
+
+#### Option 2
For huge datasets, this could be as simple as : for partition in [list of
partitions in source dataset] {
val inputDF =
spark.read.format("any_input_format").load("partition_path")
inputDF.write.format("com.uber.hoodie").option()....save("basePath")
}
+
#### Option 3
Write your own custom logic of how to load an existing dataset into a Hudi
managed one. Please read about the RDD API
- [here](quickstart.md).
+ [here](quickstart.html).
```
-Using the HDFSParquetImporter Tool. Once hoodie has been built via `mvn clean
install -DskipTests`, the shell can be
+Using the HDFSParquetImporter Tool. Once hoodie has been built via `mvn clean
install -DskipTests`, the shell can be
fired by via `cd hoodie-cli && ./hoodie-cli.sh`.
-hoodie->hdfsparquetimport
- --upsert false
- --srcPath /user/parquet/dataset/basepath
- --targetPath
- /user/hoodie/dataset/basepath
- --tableName hoodie_table
- --tableType COPY_ON_WRITE
- --rowKeyField _row_key
- --partitionPathField partitionStr
- --parallelism 1500
- --schemaFilePath /user/table/schema
- --format parquet
- --sparkMemory 6g
+hoodie->hdfsparquetimport
+ --upsert false
+ --srcPath /user/parquet/dataset/basepath
+ --targetPath
+ /user/hoodie/dataset/basepath
+ --tableName hoodie_table
+ --tableType COPY_ON_WRITE
+ --rowKeyField _row_key
+ --partitionPathField partitionStr
+ --parallelism 1500
+ --schemaFilePath /user/table/schema
+ --format parquet
+ --sparkMemory 6g
--retry 2
-```
\ No newline at end of file
+```
diff --git a/docs/quickstart.md b/docs/quickstart.md
index f1516ae..1e6fa49 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -13,13 +13,14 @@ permalink: quickstart.html
Check out code and pull it into Intellij as a normal maven project.
Normally build the maven project, from command line
+
```
$ mvn clean install -DskipTests -DskipITs
+```
To work with older version of Hive (pre Hive-1.2.1), use
-
+```
$ mvn clean install -DskipTests -DskipITs -Dhive11
-
```
{% include callout.html content="You might want to add your spark jars folder
to project dependencies under 'Module Setttings', to be able to run Spark from
IDE" type="info" %}
@@ -31,13 +32,13 @@ $ mvn clean install -DskipTests -DskipITs -Dhive11
Hudi requires Java 8 to be installed. Hudi works with Spark-2.x versions. We
have verified that Hudi works with the following combination of
Hadoop/Hive/Spark.
-| Hadoop | Hive | Spark | Instructions to Build Hudi |
+| Hadoop | Hive | Spark | Instructions to Build Hudi |
| ---- | ----- | ---- | ---- |
| 2.6.0-cdh5.7.2 | 1.1.0-cdh5.7.2 | spark-2.[1-3].x | Use “mvn clean install
-DskipTests -Dhadoop.version=2.6.0-cdh5.7.2 -Dhive.version=1.1.0-cdh5.7.2” |
| Apache hadoop-2.8.4 | Apache hive-2.3.3 | spark-2.[1-3].x | Use "mvn clean
install -DskipTests" |
| Apache hadoop-2.7.3 | Apache hive-1.2.1 | spark-2.[1-3].x | Use "mvn clean
install -DskipTests" |
-If your environment has other versions of hadoop/hive/spark, please try out
Hudi and let us know if there are any issues. We are limited by our bandwidth
to certify other combinations.
+If your environment has other versions of hadoop/hive/spark, please try out
Hudi and let us know if there are any issues. We are limited by our bandwidth
to certify other combinations.
It would be of great help if you can reach out to us with your setup and
experience with hoodie.
## Generate a Hudi Dataset
@@ -60,7 +61,7 @@ export
PATH=$JAVA_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$SPARK_INSTALL/bin:$P
### Supported API's
-Use the DataSource API to quickly start reading or writing Hudi datasets in
few lines of code. Ideal for most
+Use the DataSource API to quickly start reading or writing Hudi datasets in
few lines of code. Ideal for most
ingestion use-cases.
Use the RDD API to perform more involved actions on a Hudi dataset
@@ -132,11 +133,11 @@ This can be run as frequently as the ingestion pipeline
to make sure new partiti
cd hoodie-hive
./run_sync_tool.sh
--user hive
- --pass hive
- --database default
- --jdbc-url "jdbc:hive2://localhost:10010/"
- --base-path tmp/hoodie/sample-table/
- --table hoodie_test
+ --pass hive
+ --database default
+ --jdbc-url "jdbc:hive2://localhost:10010/"
+ --base-path tmp/hoodie/sample-table/
+ --table hoodie_test
--partitioned-by field1,field2
```
@@ -304,7 +305,7 @@ hive>
## A Demo using docker containers
Lets use a real world example to see how hudi works end to end. For this
purpose, a self contained
-data infrastructure is brought up in a local docker cluster within your
computer.
+data infrastructure is brought up in a local docker cluster within your
computer.
The steps assume you are using a Mac laptop
@@ -313,7 +314,7 @@ The steps assume you are using Mac laptop
* Docker Setup : For Mac, please follow the steps defined in
[https://docs.docker.com/v17.12/docker-for-mac/install/]. For running Spark-SQL
queries, please ensure at least 6 GB of memory and 4 CPUs are allocated to Docker (See
Docker -> Preferences -> Advanced). Otherwise, Spark-SQL queries could be
killed because of memory issues.
* kafkacat : A command-line utility to publish/consume from kafka topics.
Use `brew install kafkacat` to install kafkacat
* /etc/hosts : The demo references many services running in containers by
hostname. Add the following settings to /etc/hosts
-
+
```
127.0.0.1 adhoc-1
127.0.0.1 adhoc-2
@@ -378,15 +379,15 @@ At this point, the docker cluster will be up and running.
The demo cluster bring
* HDFS Services (NameNode, DataNode)
* Spark Master and Worker
* Hive Services (Metastore, HiveServer2 along with PostgresDB)
- * Kafka Broker and a Zookeeper Node (Kakfa will be used as upstream source
for the demo)
+ * Kafka Broker and a Zookeeper Node (Kafka will be used as the upstream source
for the demo)
* Adhoc containers to run Hudi/Hive CLI commands
### Demo
-Stock Tracker data will be used to showcase both different Hudi Views and the
effects of Compaction.
+Stock Tracker data will be used to showcase both different Hudi Views and the
effects of Compaction.
-Take a look at the directory `docker/demo/data`. There are 2 batches of stock
data - each at 1 minute granularity.
-The first batch contains stocker tracker data for some stock symbols during
the first hour of trading window
+Take a look at the directory `docker/demo/data`. There are 2 batches of stock
data - each at 1 minute granularity.
+The first batch contains stock tracker data for some stock symbols during
the first hour of the trading window
(9:30 a.m to 10:30 a.m). The second batch contains tracker data for the next 30
mins (10:30 - 11 a.m). Hudi will
be used to ingest these batches to a dataset which will contain the latest
stock tracker data at hour level granularity.
The batches are windowed intentionally so that the second batch contains
updates to some of the rows in the first batch.
@@ -396,7 +397,7 @@ The batches are windowed intentionally so that the second
batch contains updates
Upload the first batch to the Kafka topic 'stock_ticks'
```
-cat docker/demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks -P
+cat docker/demo/data/batch_1.json | kafkacat -b kafkabroker -t stock_ticks -P
To check if the new topic shows up, use
kafkacat -b kafkabroker -L -J | jq .
@@ -443,7 +444,7 @@ kafkacat -b kafkabroker -L -J | jq .
Hudi comes with a tool named DeltaStreamer. This tool can connect to a variety
of data sources (including Kafka) to
pull changes and apply them to a Hudi dataset using upsert/insert primitives. Here,
we will use the tool to download
-json data from kafka topic and ingest to both COW and MOR tables we
initialized in the previous step. This tool
+JSON data from the Kafka topic and ingest it into both the COW and MOR tables we
initialized in the previous step. This tool
automatically initializes the datasets in the file-system if they do not exist
yet.
```
@@ -468,8 +469,8 @@ spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
exit
```
-You can use HDFS web-browser to look at the datasets
-`http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow`.
+You can use HDFS web-browser to look at the datasets
+`http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow`.
You can explore the new partition folder created in the dataset along with a
"deltacommit"
file under .hoodie which signals a successful commit.
@@ -501,7 +502,7 @@ docker exec -it adhoc-2 /bin/bash
....
exit
```
-After executing the above command, you will notice
+After executing the above command, you will notice
1. A hive table named `stock_ticks_cow` created which provides Read-Optimized
view for the Copy On Write dataset.
2. Two new tables `stock_ticks_mor` and `stock_ticks_mor_rt` created for the
Merge On Read dataset. The former
@@ -511,7 +512,7 @@ provides the ReadOptimized view for the Hudi dataset and
the later provides the
#### Step 4 (a): Run Hive Queries
Run a hive query to find the latest timestamp ingested for stock symbol
'GOOG'. You will notice that both read-optimized
-(for both COW and MOR dataset)and realtime views (for MOR dataset)give the
same value "10:29 a.m" as Hudi create a
+(for both COW and MOR datasets) and realtime views (for MOR dataset) give the
same value "10:29 a.m" as Hudi creates a
parquet file for the first batch of data.
```
@@ -565,7 +566,7 @@ Now, run a projection query:
# Merge-On-Read Queries:
==========================
-Lets run similar queries against M-O-R dataset. Lets look at both
+Let's run similar queries against the M-O-R dataset. Let's look at both
ReadOptimized and Realtime views supported by the M-O-R dataset
# Run against ReadOptimized View. Notice that the latest timestamp is 10:29
@@ -670,7 +671,7 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol, ts,
volume, open, close
# Merge-On-Read Queries:
==========================
-Lets run similar queries against M-O-R dataset. Lets look at both
+Let's run similar queries against the M-O-R dataset. Let's look at both
ReadOptimized and Realtime views supported by the M-O-R dataset
# Run against ReadOptimized View. Notice that the latest timestamp is 10:29
@@ -718,7 +719,7 @@ Upload the second batch of data and ingest this batch using
delta-streamer. As t
partitions, there is no need to run hive-sync
```
-cat docker/demo/data/batch_2.json | kafkacat -b kafkabroker -t stock_ticks -P
+cat docker/demo/data/batch_2.json | kafkacat -b kafkabroker -t stock_ticks -P
# Within Docker container, run the ingestion command
docker exec -it adhoc-2 /bin/bash
@@ -734,15 +735,15 @@ exit
With Copy-On-Write table, the second ingestion by DeltaStreamer resulted in a
new version of Parquet file getting created.
See
`http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow/2018/08/31`
-With Merge-On-Read table, the second ingestion merely appended the batch to an
unmerged delta (log) file.
+With Merge-On-Read table, the second ingestion merely appended the batch to an
unmerged delta (log) file.
Take a look at the HDFS filesystem to get an idea:
`http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_mor/2018/08/31`
#### Step 6(a): Run Hive Queries
-With Copy-On-Write table, the read-optimized view immediately sees the changes
as part of second batch once the batch
-got committed as each ingestion creates newer versions of parquet files.
+With Copy-On-Write table, the read-optimized view immediately sees the changes
as part of second batch once the batch
+got committed as each ingestion creates newer versions of parquet files.
-With Merge-On-Read table, the second ingestion merely appended the batch to an
unmerged delta (log) file.
+With Merge-On-Read table, the second ingestion merely appended the batch to an
unmerged delta (log) file.
This is when the ReadOptimized and Realtime views will provide different
results. The ReadOptimized view will still
return "10:29 am" as it will only read from the Parquet file. The Realtime view
will do an on-the-fly merge and return the
latest committed data which is "10:59 a.m".
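The difference between the two views can be illustrated with a toy model. The sketch below is not Hudi code: the record layout, the `key` field and the function names are assumptions made for illustration. It only mimics the idea that the ReadOptimized view reads the Parquet data alone, while the Realtime view merges the unmerged delta log on the fly.

```python
from typing import Dict, List

Record = Dict[str, str]


def read_optimized_view(base_file: List[Record]) -> Dict[str, Record]:
    """Only the records already written to the (Parquet) base file are visible."""
    return {rec["key"]: rec for rec in base_file}


def realtime_view(base_file: List[Record], delta_log: List[Record]) -> Dict[str, Record]:
    """Base file merged with the delta log; a later log entry wins per key."""
    merged = read_optimized_view(base_file)
    for rec in delta_log:
        merged[rec["key"]] = rec
    return merged


def latest_ts(view: Dict[str, Record], symbol: str) -> str:
    """Equivalent of: select max(ts) ... group by symbol having symbol = <symbol>."""
    return max(rec["ts"] for rec in view.values() if rec["symbol"] == symbol)


# Hypothetical sample rows: the first batch sits in the base file,
# the second batch was only appended to the delta log.
base_file = [
    {"key": "GOOG-10:28", "symbol": "GOOG", "ts": "10:28"},
    {"key": "GOOG-10:29", "symbol": "GOOG", "ts": "10:29"},
]
delta_log = [
    {"key": "GOOG-10:29", "symbol": "GOOG", "ts": "10:29"},  # update to an existing row
    {"key": "GOOG-10:59", "symbol": "GOOG", "ts": "10:59"},  # new row
]

ro_latest = latest_ts(read_optimized_view(base_file), "GOOG")
rt_latest = latest_ts(realtime_view(base_file, delta_log), "GOOG")
print(ro_latest, rt_latest)  # prints: 10:29 10:59
```

Once compaction folds the delta log into a new base file, both views would return the same answer again in this model.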
@@ -773,7 +774,7 @@ WARNING: Hive-on-MR is deprecated in Hive 2 and may not be
available in the futu
As you can notice, the above queries now reflect the changes that came as part
of ingesting second batch.
-# Merge On Read Table:
+# Merge On Read Table:
# Read Optimized View
0: jdbc:hive2://hiveserver:10000> select symbol, max(ts) from stock_ticks_mor
group by symbol HAVING symbol = 'GOOG';
@@ -843,7 +844,7 @@ scala> spark.sql("select `_hoodie_commit_time`, symbol, ts,
volume, open, close
As you can notice, the above queries now reflect the changes that came as part
of ingesting second batch.
-# Merge On Read Table:
+# Merge On Read Table:
# Read Optimized View
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor group by symbol
HAVING symbol = 'GOOG'").show(100, false)
@@ -909,8 +910,8 @@ To show the effects of incremental-query, let us assume
that a reader has alread
ingesting first batch. Now, for the reader to see the effect of the second batch,
they have to set the start timestamp to
the commit time of the first batch (20180924064621) and run an incremental query.
-`Hudi incremental mode` provides efficient scanning for incremental queries by
filtering out files that do not have any
-candidate rows using hudi-managed metadata.
+`Hudi incremental mode` provides efficient scanning for incremental queries by
filtering out files that do not have any
+candidate rows using hudi-managed metadata.
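As a rough illustration of that filtering idea, the sketch below models each file with a piece of metadata (the latest commit that touched it) and skips files that cannot contain candidate rows. This is a simplified model, not the Hudi implementation: the `max_commit` field, the function name and the second commit time are assumptions; only the first commit time and the `_hoodie_commit_time` column come from the text.

```python
from typing import Dict, List

FIRST_COMMIT = "20180924064621"   # commit time of the first batch (from the text)
SECOND_COMMIT = "20180924065039"  # hypothetical commit time for the second batch

# Each "file" holds rows plus metadata recording the latest commit that wrote to it.
files: List[Dict] = [
    {
        "max_commit": FIRST_COMMIT,
        "rows": [
            {"_hoodie_commit_time": FIRST_COMMIT, "symbol": "MSFT", "ts": "10:29"},
        ],
    },
    {
        "max_commit": SECOND_COMMIT,
        "rows": [
            {"_hoodie_commit_time": FIRST_COMMIT, "symbol": "GOOG", "ts": "10:29"},
            {"_hoodie_commit_time": SECOND_COMMIT, "symbol": "GOOG", "ts": "10:59"},
        ],
    },
]


def incremental_read(files: List[Dict], start_commit: str) -> List[Dict]:
    """Return rows committed strictly after start_commit.

    Commit times are fixed-width digit strings, so string comparison
    orders them chronologically.
    """
    # Step 1: prune whole files using metadata only, without reading their rows.
    candidates = [f for f in files if f["max_commit"] > start_commit]
    # Step 2: filter rows inside the remaining files.
    return [
        row
        for f in candidates
        for row in f["rows"]
        if row["_hoodie_commit_time"] > start_commit
    ]


changed = incremental_read(files, FIRST_COMMIT)
print(changed)
```

In this model the first file is never opened, which is where the scan savings would come from.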
```
docker exec -it adhoc-2 /bin/bash
@@ -1008,7 +1009,7 @@ hoodie:stock_ticks_mor->compactions show all
___________________________________________________________________
| Compaction Instant Time| State | Total FileIds to be Compacted|
|==================================================================|
-
+
# Schedule a compaction. This will use Spark Launcher to schedule compaction
hoodie:stock_ticks_mor->compaction schedule
....
@@ -1028,7 +1029,7 @@ hoodie:stock_ticks_mor->compactions show all
___________________________________________________________________
| Compaction Instant Time| State | Total FileIds to be Compacted|
|==================================================================|
- | 20180924070031 | REQUESTED| 1 |
+ | 20180924070031 | REQUESTED| 1 |
# Execute the compaction. The compaction instant value passed below must be the one displayed in the above "compactions show all" query
hoodie:stock_ticks_mor->compaction run --compactionInstant 20180924070031 --parallelism 2 --sparkMemory 1G --schemaFilePath /var/demo/config/schema.avsc --retry 1
@@ -1052,7 +1053,7 @@ hoodie:stock_ticks->compactions show all
|==================================================================|
| 20180924070031 | COMPLETED| 1 |
-```
+```
#### Step 9: Run Hive Queries including incremental queries
@@ -1169,9 +1170,9 @@ You can bring up a hadoop docker environment containing
Hadoop, Hive and Spark s
```
$ mvn pre-integration-test -DskipTests
```
-The above command builds docker images for all the services with
-current Hudi source installed at /var/hoodie/ws and also brings up the
services using a compose file. We
-currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.3.1) in docker
images.
+The above command builds docker images for all the services with
+current Hudi source installed at /var/hoodie/ws and also brings up the
services using a compose file. We
+currently use Hadoop (v2.8.4), Hive (v2.3.3) and Spark (v2.3.1) in docker
images.
To bring down the containers
```
@@ -1185,9 +1186,9 @@ $ cd hoodie-integ-test
$ mvn docker-compose:up -DdetachedMode=true
```
-Hudi is a library that is operated in a broader data analytics/ingestion
environment
+Hudi is a library that is operated in a broader data analytics/ingestion
environment
involving Hadoop, Hive and Spark. Interoperability with all these systems is a
key objective for us. We are
-actively adding integration-tests under __hoodie-integ-test/src/test/java__
that makes use of this
+actively adding integration-tests under __hoodie-integ-test/src/test/java__
that make use of this
docker environment (See
__hoodie-integ-test/src/test/java/com/uber/hoodie/integ/ITTestHoodieSanity.java__
)
@@ -1202,10 +1203,10 @@ and compose scripts are carefully implemented so that
they serve dual-purpose
inbuilt jars by mounting local HUDI workspace over the docker location
This helps avoid maintaining separate docker images and avoids the costly step
of building HUDI docker images locally.
-But if users want to test hudi from locations with lower network bandwidth,
they can still build local images
-run the script
+But if users want to test Hudi from locations with lower network bandwidth,
they can still build local images.
+Run the script
`docker/build_local_docker_images.sh` to build local docker images before
running `docker/setup_demo.sh`.
-
+
Here are the commands:
```
diff --git a/docs/roadmap.md b/docs/roadmap.md
deleted file mode 100644
index c65c3a9..0000000
--- a/docs/roadmap.md
+++ /dev/null
@@ -1,14 +0,0 @@
----
-title: Roadmap
-keywords: usecases
-sidebar: mydoc_sidebar
-permalink: roadmap.html
----
-
-## Planned Features
-
-* Support for Self Joins - As of now, you cannot incrementally consume the
same table more than once, since the InputFormat does not understand the
QueryPlan.
-* Hudi Spark Datasource - Allows for reading and writing data back using
Apache Spark natively (without falling back to InputFormat), which can be more
performant
-* Hudi Presto Connector - Allows for querying data managed by Hudi using
Presto natively, which can again boost
[performance](https://prestodb.io/docs/current/release/release-0.138.html)
-
-
diff --git a/docs/sql_queries.md b/docs/sql_queries.md
index 955e794..44848eb 100644
--- a/docs/sql_queries.md
+++ b/docs/sql_queries.md
@@ -62,7 +62,4 @@
spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.clas
## Presto
-Presto requires a [patch](https://github.com/prestodb/presto/pull/7002) (until
the PR is merged) and the hoodie-hadoop-mr-bundle jar to be placed
-into `<presto_install>/plugin/hive-hadoop2/`.
-
-{% include callout.html content="Get involved to improve this integration
[here](https://github.com/uber/hoodie/issues/81)" type="info" %}
+Presto requires the `hoodie-presto-bundle` jar to be placed into
`<presto_install>/plugin/hive-hadoop2/`, across the installation.