[GitHub] [hudi] bhasudha commented on a diff in pull request #9231: [HUDI-6552][DOCS] Restructure and reorganize FAQ

via GitHub Wed, 19 Jul 2023 11:58:16 -0700


bhasudha commented on code in PR #9231:
URL: https://github.com/apache/hudi/pull/9231#discussion_r1268541093



##########
website/docs/faq.md:
##########
@@ -225,179 +225,276 @@ inputDF.write().format("org.apache.hudi")
   ...
 ```
 
- - When using `HoodieWriteClient` directly, you can simply construct 
HoodieWriteConfig object with the configs in the link you mentioned.
-
- - When using HoodieStreamer tool to ingest, you can set the configs in 
properties file and pass the file as the cmdline argument "*--props*"
+*   When using `HoodieWriteClient` directly, you can simply construct 
HoodieWriteConfig object with the configs in the link you mentioned.
+*   When using HoodieDeltaStreamer tool to ingest, you can set the configs in 
properties file and pass the file as the cmdline argument "_\--props_"
 
 ### How to create Hive style partition folder structure?
 
 By default Hudi creates the partition folders with just the partition values, 
but if would like to create partition folders similar to the way Hive will 
generate the structure, with paths that contain key value pairs, like 
country=us/… or datestr=2021-04-20. This is Hive style (or format) 
partitioning. The paths include both the names of the partition keys and the 
values that each path represents.
 
 To enable hive style partitioning, you need to add this hoodie config when you 
write your data:
-```java
+
+```plain
 hoodie.datasource.write.hive_style_partitioning: true
 ```
 
+### Can I register my Hudi table with Apache Hive metastore?
+
+Yes. This can be performed either via the standalone [Hive Sync 
tool](https://hudi.apache.org/docs/syncing_metastore#hive-sync-tool) or using 
options in [Hudi 
Streamer](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/docker/demo/sparksql-incremental.commands#L50)
 tool or 
[datasource](https://hudi.apache.org/docs/configurations#hoodiedatasourcehive_syncenable).
+
+### What's Hudi's schema evolution story?
+
+Hudi uses Avro as the internal canonical representation for records, primarily 
due to its nice [schema compatibility & 
evolution](https://docs.confluent.io/platform/current/schema-registry/avro.html)
 properties. This is a key aspect of having reliability in your ingestion or 
ETL pipelines. As long as the schema passed to Hudi (either explicitly in Hudi 
Streamer schema provider configs or implicitly by Spark Datasource's Dataset 
schemas) is backwards compatible (e.g no field deletes, only appending new 
fields to schema), Hudi will seamlessly handle read/write of old and new data 
and also keep the Hive schema up-to date.
+
+Starting 0.11.0, Spark SQL DDL support (experimental) was added for Spark 
3.1.x and Spark 3.2.1 via ALTER TABLE syntax. Please refer to the [schema 
evolution guide](https://hudi.apache.org/docs/schema_evolution) for more 
details on  Schema-on-read for Spark..
+
+### What performance/ingest latency can I expect for Hudi writing?
+
+The speed at which you can write into Hudi depends on the [write 
operation](https://hudi.apache.org/docs/write_operations) and some trade-offs 
you make along the way like file sizing. Just like how databases incur overhead 
over direct/raw file I/O on disks, Hudi operations may have overhead from 
supporting database like features compared to reading/writing raw DFS files. 
That said, Hudi implements advanced techniques from database literature to keep 
these minimal. User is encouraged to have this perspective when trying to 
reason about Hudi performance. As the saying goes : there is no free lunch (not 
yet atleast)
+
+| Storage Type | Type of workload | Performance | Tips |
+| ---| ---| ---| --- |
+| copy on write | bulk\_insert | Should match vanilla spark writing + an 
additional sort to properly size files | properly size [bulk insert 
parallelism](https://hudi.apache.org/docs/configurations#hoodiebulkinsertshuffleparallelism)
 to get right number of files. use insert if you want this auto tuned . 
Configure 
[hoodie.bulkinsert.sort.mode](https://hudi.apache.org/docs/configurations#hoodiebulkinsertsortmode)
 for better file sizes at the cost of memory. The default value NONE offers the 
fastest performance and matches `spark.write.parquet()` in terms of number of 
files, overheads. |
+| copy on write | insert | Similar to bulk insert, except the file sizes are 
auto tuned requiring input to be cached into memory and custom partitioned. | 
Performance would be bound by how parallel you can write the ingested data. 
Tune [this 
limit](https://hudi.apache.org/docs/configurations#hoodieinsertshuffleparallelism)
 up, if you see that writes are happening from only a few executors. |
+| copy on write | upsert/ de-duplicate & insert | Both of these would involve 
index lookup. Compared to naively using Spark (or similar framework)'s JOIN to 
identify the affected records, Hudi indexing is often 7-10x faster as long as 
you have ordered keys (discussed below) or <50% updates. Compared to naively 
overwriting entire partitions, Hudi write can be several magnitudes faster 
depending on how many files in a given partition is actually updated. For e.g, 
if a partition has 1000 files out of which only 100 is dirtied every ingestion 
run, then Hudi would only read/merge a total of 100 files and thus 10x faster 
than naively rewriting entire partition. | Ultimately performance would be 
bound by how quickly we can read and write a parquet file and that depends on 
the size of the parquet file, configured 
[here](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize). 
Also be sure to properly tune your [bloom 
filters](https://hudi.apache.org/docs/configurations#INDEX).
  [HUDI-56](https://issues.apache.org/jira/browse/HUDI-56) will auto-tune this. 
|
+| merge on read | bulk insert | Currently new data only goes to parquet files 
and thus performance here should be similar to copy\_on\_write bulk insert. 
This has the nice side-effect of getting data into parquet directly for query 
performance. [HUDI-86](https://issues.apache.org/jira/browse/HUDI-86) will add 
support for logging inserts directly and this up drastically. |  |
+| merge on read | insert | Similar to above |  |
+| merge on read | upsert/ de-duplicate & insert | Indexing performance would 
remain the same as copy-on-write, while ingest latency for updates (costliest 
I/O operation in copy\_on\_write) are sent to log files and thus with 
asynchronous compaction provides very very good ingest performance with low 
write amplification. |  |
+
+Like with many typical system that manage time-series data, Hudi performs much 
better if your keys have a timestamp prefix or monotonically 
increasing/decreasing. You can almost always achieve this. Even if you have 
UUID keys, you can follow tricks like 
[this](https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/) to 
get keys that are ordered. See also [Tuning 
Guide](https://hudi.apache.org/docs/tuning-guide) for more tips on JVM and 
other configurations.
+
+### What performance can I expect for Hudi reading/queries?
+
+*   For ReadOptimized views, you can expect the same best in-class columnar 
query performance as a standard parquet table in Hive/Spark/Presto
+*   For incremental views, you can expect speed up relative to how much data 
usually changes in a given time window and how much time your entire scan 
takes. For e.g, if only 100 files changed in the last hour in a partition of 
1000 files, then you can expect a speed of 10x using incremental pull in Hudi 
compared to full scanning the partition to find out new data.
+*   For real time views, you can expect performance similar to the same avro 
backed table in Hive/Spark/Presto
+
+### How do I to avoid creating tons of small files?
+
+A key design decision in Hudi was to avoid creating small files and always 
write properly sized files.
+
+There are 2 ways to avoid creating tons of small files in Hudi and both of 
them have different trade-offs:
+
+a) **Auto Size small files during ingestion**: This solution trades 
ingest/writing time to keep queries always efficient. Common approaches to 
writing very small files and then later stitching them together only solve for 
system scalability issues posed by small files and also let queries slow down 
by exposing small files to them anyway.
+
+Hudi has the ability to maintain a configured target file size, when 
performing **upsert/insert** operations. (Note: **bulk\_insert** operation does 
not provide this functionality and is designed as a simpler replacement for 
normal `spark.write.parquet` )
+
+For **copy-on-write**, this is as simple as configuring the [maximum size for 
a base/parquet 
file](https://hudi.apache.org/docs/configurations#hoodieparquetmaxfilesize) and 
the [soft 
limit](https://hudi.apache.org/docs/configurations#hoodieparquetsmallfilelimit) 
below which a file should be considered a small file. For the initial bootstrap 
to Hudi table, tuning record size estimate is also important to ensure 
sufficient records are bin-packed in a parquet file. For subsequent writes, 
Hudi automatically uses average record size based on previous commit. Hudi will 
try to add enough records to a small file at write time to get it to the 
configured maximum limit. For e.g , with `compactionSmallFileSize=100MB` and 
limitFileSize=120MB, Hudi will pick all files < 100MB and try to get them upto 
120MB.
+
+For **merge-on-read**, there are few more configs to set. MergeOnRead works 
differently for different INDEX choices.
+
+*   Indexes with **canIndexLogFiles = true** : Inserts of new data go directly 
to log files. In this case, you can configure the [maximum log 
size](https://hudi.apache.org/docs/configurations#hoodielogfilemaxsize) and a 
[factor](https://hudi.apache.org/docs/configurations#hoodielogfiletoparquetcompressionratio)
 that denotes reduction in size when data moves from avro to parquet files.
+*   Indexes with **canIndexLogFiles = false** : Inserts of new data go only to 
parquet files. In this case, the same configurations as above for the 
COPY\_ON\_WRITE case applies.
+
+NOTE : In either case, small files will be auto sized only if there is no 
PENDING compaction or associated log file for that particular file slice. For 
example, for case 1: If you had a log file and a compaction C1 was scheduled to 
convert that log file to parquet, no more inserts can go into that log file. 
For case 2: If you had a parquet file and an update ended up creating an 
associated delta log file, no more inserts can go into that parquet file. Only 
after the compaction has been performed and there are NO log files associated 
with the base parquet file, can new inserts be sent to auto size that parquet 
file.
+
+b) 
[**Clustering**](https://hudi.apache.org/blog/2021/01/27/hudi-clustering-intro) 
: This is a feature in Hudi to group small files into larger ones either 
synchronously or asynchronously. Since first solution of auto-sizing small 
files has a tradeoff on ingestion speed (since the small files are sized during 
ingestion), if your use-case is very sensitive to ingestion latency where you 
don't want to compromise on ingestion speed which may end up creating a lot of 
small files, clustering comes to the rescue. Clustering can be scheduled 
through the ingestion job and an asynchronus job can stitch small files 
together in the background to generate larger files. NOTE that during this, 
ingestion can continue to run concurrently.
+
+_Please note that Hudi always creates immutable files on disk. To be able to 
do auto-sizing or clustering, Hudi will always create a newer version of the 
smaller file, resulting in 2 versions of the same file. The cleaner service 
will later kick in and delte the older version small file and keep the latest 
one._
+
+### How do I use DeltaStreamer or Spark DataSource API to write to a 
Non-partitioned Hudi table ?
+
+Hudi supports writing to non-partitioned tables. For writing to a 
non-partitioned Hudi table and performing hive table syncing, you need to set 
the below configurations in the properties passed:
+
+```plain
+hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
+hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
+```
+
+### How can I reduce table versions created by Hudi in AWS Glue Data Catalog/ 
metastore?
+
+With each commit, Hudi creates a new table version in the metastore. This can 
be reduced by setting the option
+
+[hoodie.datasource.meta\_sync.condition.sync](https://hudi.apache.org/docs/configurations#hoodiedatasourcemeta_syncconditionsync)
 to true.
+
+This will ensure that hive sync is triggered on schema or partitions changes.
+
+### If there are failed writes in my timeline, do I see duplicates?
+
+No, Hudi does not expose uncommitted files/blocks to the readers. Further, 
Hudi strives to automatically manage the table for the user, by actively 
cleaning up files created from failed/aborted writes. See [marker 
mechanism](https://hudi.apache.org/blog/2021/08/18/improving-marker-mechanism/).
+
+### How are conflicts detected in Hudi between multiple writers?
+
+Hudi employs [optimistic concurrency 
control](https://hudi.apache.org/docs/concurrency_control#supported-concurrency-controls)
 between writers, while implementing MVCC based concurrency control between 
writers and the table services. Concurrent writers to the same table need to be 
configured with the same lock provider configuration, to safely perform writes. 
By default (implemented in 
“[SimpleConcurrentFileWritesConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/SimpleConcurrentFileWritesConflictResolutionStrategy.java)”),
 Hudi allows multiple writers to concurrently write data and commit to the 
timeline if there is no conflicting writes to the same underlying file group 
IDs. This is achieved by holding a lock, checking for changes that modified the 
same file IDs. Hudi then supports a pluggable interface 
“[ConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/h
 
udi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/ConflictResolutionStrategy.java)”
 that determines how conflicts are handled. By default, the later conflicting 
write is aborted. Hudi also support eager conflict detection to help speed up 
conflict detection and release cluster resources back early to reduce costs.
+
+### Can single-writer inserts have duplicates?
+
+By default, Hudi turns off key based de-duplication for INSERT/BULK\_INSERT 
operations and thus the table could contain duplicates. If users believe, they 
have duplicates in inserts, they can either issue UPSERT or consider specifying 
the option to de-duplicate input in either 
[datasource](https://hudi.apache.org/docs/configurations#hoodiedatasourcewriteinsertdropduplicates)
 or 
[deltastreamer](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L229).
+
+### Can concurrent inserts cause duplicates?
+
+Yes. As mentioned before, the default conflict detection strategy only check 
for conflicting updates to the same file group IDs. In the case of concurrent 
inserts, inserted records end up creating new file groups and thus can go 
undetected. Most common workload patterns use multi-writer capability in the 
case of running ingestion of new data and concurrently backfilling/deleting 
older data, with NO overlap in the primary keys of the records. However, this 
can be implemented (or better yet contributed) by a new 
“[ConflictResolutionStrategy](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/ConflictResolutionStrategy.java)”,
 that reads out keys of new conflicting operations, to check the uncommitted 
data against other concurrent writes and then decide whether or not to 
commit/abort. This is rather a fine tradeoff between saving the additional cost 
of reading keys on most common workloads. Historically, user
 s have preferred to take this into their control to save costs e.g we turned 
off de-duplication for inserts due to the same feedback. Hudi supports a 
pre-commit validator mechanism already where such tests can be authored as well.
+
+## Querying Tables
+
+### Does deleted records appear in Hudi's incremental query results?
+
+Soft Deletes (unlike hard deletes) do appear in the incremental pull query 
results. So, if you need a mechanism to propagate deletes to downstream tables, 
you can use Soft deletes.
+
 ### How do I pass hudi configurations to my beeline Hive queries?
 
 If Hudi's input format is not picked the returned results may be incorrect. To 
ensure correct inputformat is picked, please use 
`org.apache.hadoop.hive.ql.io.HiveInputFormat` or 
`org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat` for 
`hive.input.format` config. This can be set like shown below:
-```java
+
+```plain
 set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
 ```
 
 or
 
-```java
+```plain
 set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat
 ```
 
-### Can I register my Hudi dataset with Apache Hive metastore?
+### Why do we have to set 2 different ways of configuring Spark Queries to 
work with Hudi?

Review Comment:
   I removed this question since this is outdated. Different query engines use 
different methods based on query types. So we can revisit again with a broad 
answer not specifically to Spark in a later PR. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] bhasudha commented on a diff in pull request #9231: [HUDI-6552][DOCS] Restructure and reorganize FAQ

Reply via email to