This is an automated email from the ASF dual-hosted git repository.
bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new e258bfe787 [DOCS] Edit quickstart (#7120)
e258bfe787 is described below
commit e258bfe7878a823b5c490b788f453c87a0f10649
Author: nfarah86 <[email protected]>
AuthorDate: Wed Nov 2 12:04:25 2022 -0700
[DOCS] Edit quickstart (#7120)
* updated python time travel query to fix error
* updated the tabs to see if it helps with copying - there is an error in how the copy command works
* fixed python hard and soft deletes copying - there was weird behavior occurring.
* made some stylistic changes
* updated code overview to be under subsection
* fixed the misalignment of text in hard and soft deletes
Co-authored-by: nadine <[email protected]>
---
website/docs/quick-start-guide.md | 45 ++++++++++++++++++++++++++-------------
1 file changed, 30 insertions(+), 15 deletions(-)
diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md
index c610964f6c..e00f6cdb25 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -635,7 +635,7 @@ spark.read. \
spark.read. \
format("hudi"). \
- option("as.of.instant", "2021-07-28 14: 11: 08"). \
+ option("as.of.instant", "2021-07-28 14:11:08.000"). \
load(basePath)
# It is equal to "as.of.instant = 2021-07-28 00:00:00"
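The corrected `as.of.instant` value can be sanity-checked outside Spark. The sketch below is a plain-Python illustration (not Hudi's actual timestamp parser) of why the fixed string matches a standard timestamp pattern while the old value, with spaces after the colons, does not:

```python
# Plain-Python sketch: check a timestamp string against the
# "yyyy-MM-dd HH:mm:ss.SSS"-style pattern used in the fixed query.
from datetime import datetime

def parses(ts: str, fmt: str) -> bool:
    """Return True if ts matches the given strptime format."""
    try:
        datetime.strptime(ts, fmt)
        return True
    except ValueError:
        return False

fmt = "%Y-%m-%d %H:%M:%S.%f"
print(parses("2021-07-28 14:11:08.000", fmt))  # fixed value parses
print(parses("2021-07-28 14: 11: 08", fmt))    # old, malformed value does not
```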
@@ -959,13 +959,15 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hud
## Delete data {#deletes}
-Apache Hudi supports two types of deletes: (1) **Soft Deletes**: retaining the record key and just nulling out the values
-for all the other fields (records with nulls in soft deletes are always persisted in storage and never removed);
-(2) **Hard Deletes**: physically removing any trace of the record from the table. See the
-[deletion section](/docs/writing_data#deletes) of the writing data page for more details.
+Apache Hudi supports two types of deletes: <br/>
+1. **Soft Deletes**: This retains the record key and just nulls out the values for all the other fields. The records with nulls in soft deletes are always persisted in storage and never removed.
+2. **Hard Deletes**: This physically removes any trace of the record from the table. Check out the [deletion section](/docs/writing_data#deletes) for more details.
### Soft Deletes
+Soft deletes retain the record key and null out the values for all the other fields. That is, records with nulls in soft deletes are always persisted in storage and never removed.<br/><br/>
+
<Tabs
defaultValue="scala"
values={[
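The soft-delete semantics described above can be sketched in plain Python. This is a hypothetical illustration, not Hudi's implementation: a soft delete keeps the key fields and nulls out everything else.

```python
# Hypothetical sketch (plain Python, not Hudi): a soft delete keeps the
# record key fields and replaces every other data field with None.
def soft_delete(record: dict, key_fields: set) -> dict:
    return {k: (v if k in key_fields else None) for k, v in record.items()}

rec = {"uuid": "trip-1", "partitionpath": "americas", "rider": "rider-A", "fare": 27.7}
softly_deleted = soft_delete(rec, {"uuid", "partitionpath"})
print(softly_deleted)
```

The record itself stays in storage; only its non-key values are nulled, which is why the snapshot count stays the same after a soft delete while the `rider is not null` count drops.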
@@ -1028,6 +1030,10 @@ Notice that the save mode is `Append`.
</TabItem>
<TabItem value="python">
+:::note
+Notice that the save mode is `Append`.
+:::
+
```python
# pyspark
from pyspark.sql.functions import lit
@@ -1036,9 +1042,11 @@ from functools import reduce
spark.read.format("hudi"). \
load(basePath). \
createOrReplaceTempView("hudi_trips_snapshot")
+
# fetch total records count
spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
spark.sql("select uuid, partitionpath from hudi_trips_snapshot where rider is not null").count()
+
# fetch two records for soft deletes
soft_delete_ds = spark.sql("select * from hudi_trips_snapshot").limit(2)
@@ -1046,6 +1054,8 @@ soft_delete_ds = spark.sql("select * from hudi_trips_snapshot").limit(2)
meta_columns = ["_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key", \
"_hoodie_partition_path", "_hoodie_file_name"]
excluded_columns = meta_columns + ["ts", "uuid", "partitionpath"]
+```
+```python
nullify_columns = list(filter(lambda field: field[0] not in excluded_columns, \
list(map(lambda field: (field.name, field.dataType), soft_delete_ds.schema.fields))))
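The `nullify_columns` expression in the hunk above can be mirrored in plain Python without Spark. The `schema_fields` list below is a made-up stand-in for `soft_delete_ds.schema.fields`:

```python
# Plain-Python mirror of the pyspark snippet: from a list of
# (name, dataType) pairs, keep only the columns that should be nulled,
# i.e. those not in the excluded (meta + key) columns.
schema_fields = [("_hoodie_commit_time", "string"), ("uuid", "string"),
                 ("rider", "string"), ("fare", "double"), ("ts", "long")]

meta_columns = ["_hoodie_commit_time", "_hoodie_commit_seqno",
                "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name"]
excluded_columns = meta_columns + ["ts", "uuid", "partitionpath"]

nullify_columns = list(filter(lambda field: field[0] not in excluded_columns,
                              schema_fields))
print(nullify_columns)  # [('rider', 'string'), ('fare', 'double')]
```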
@@ -1079,16 +1089,14 @@ spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
# This should return (total - 2) count as two records are updated with nulls
spark.sql("select uuid, partitionpath from hudi_trips_snapshot where rider is not null").count()
```
-:::note
-Notice that the save mode is `Append`.
-:::
+
</TabItem>
</Tabs
>
-
### Hard Deletes
+Hard deletes physically remove any trace of the record from the table. The examples below delete records for the HoodieKeys passed in.<br/><br/>
<Tabs
defaultValue="scala"
@@ -1155,7 +1163,11 @@ delete from hudi_cow_pt_tbl where name = 'a1';
</TabItem>
<TabItem value="python">
-Delete records for the HoodieKeys passed in.<br/>
+
+
+:::note
+Only `Append` mode is supported for delete operation.
+:::
```python
# pyspark
@@ -1188,19 +1200,22 @@ hard_delete_df.write.format("hudi"). \
roAfterDeleteViewDF = spark. \
read. \
format("hudi"). \
- load(basePath)
-roAfterDeleteViewDF.createOrReplaceTempView("hudi_trips_snapshot")
+ load(basePath)
+```
+
+```python
+roAfterDeleteViewDF.createOrReplaceTempView("hudi_trips_snapshot")
+
# fetch should return (total - 2) records
spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
```
-:::note
-Only `Append` mode is supported for delete operation.
-:::
+
</TabItem>
</Tabs
>
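The hard-delete flow above can be sketched in plain Python (a hypothetical illustration, not Hudi's implementation): records whose HoodieKey, i.e. record key plus partition path, is passed in are physically removed, so the remaining count drops by the number of deleted keys.

```python
# Hypothetical sketch (plain Python, not Hudi): a hard delete physically
# removes every record whose (record key, partition path) pair is passed in.
def hard_delete(records: list, keys_to_delete: set) -> list:
    return [r for r in records
            if (r["uuid"], r["partitionpath"]) not in keys_to_delete]

trips = [
    {"uuid": "t1", "partitionpath": "americas"},
    {"uuid": "t2", "partitionpath": "asia"},
    {"uuid": "t3", "partitionpath": "americas"},
]
remaining = hard_delete(trips, {("t1", "americas"), ("t2", "asia")})
print(len(remaining))  # total - 2, matching the count check in the doc
```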
+
## Insert Overwrite
Generate some new trips, overwrite all the partitions that are present in the input. This operation can be faster