This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new e258bfe787 [DOCS] Edit quickstart (#7120)
e258bfe787 is described below

commit e258bfe7878a823b5c490b788f453c87a0f10649
Author: nfarah86 <nfara...@gmail.com>
AuthorDate: Wed Nov 2 12:04:25 2022 -0700

    [DOCS] Edit quickstart (#7120)

    * updated python time travel query to fix error
    * updated the tabs to see if it helps with the copying- there is an error on how the copy command is working
    * fixed python hard and soft deletes copying- there was weird behavior occuring.
    * made some stylistic changes
    * updated code overview to be under subsection
    * fixed the misalignment of text in hard and soft deletes

    Co-authored-by: nadine <nfarah@nadines-MacBook-Pro.local>
---
 website/docs/quick-start-guide.md | 45 ++++++++++++++++++++++++++-------------
 1 file changed, 30 insertions(+), 15 deletions(-)

diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md
index c610964f6c..e00f6cdb25 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -635,7 +635,7 @@ spark.read. \
 spark.read. \
   format("hudi"). \
-  option("as.of.instant", "2021-07-28 14: 11: 08"). \
+  option("as.of.instant", "2021-07-28 14:11:08.000"). \
   load(basePath)
 # It is equal to "as.of.instant = 2021-07-28 00:00:00"
@@ -959,13 +959,15 @@ spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hud
 ## Delete data {#deletes}
-Apache Hudi supports two types of deletes: (1) **Soft Deletes**: retaining the record key and just nulling out the values
-for all the other fields (records with nulls in soft deletes are always persisted in storage and never removed);
-(2) **Hard Deletes**: physically removing any trace of the record from the table. See the
-[deletion section](/docs/writing_data#deletes) of the writing data page for more details.
+Apache Hudi supports two types of deletes: <br/>
+1. **Soft Deletes**: This retains the record key and just nulls out the values for all the other fields. The records with nulls in soft deletes are always persisted in storage and never removed.
+2. **Hard Deletes**: This physically removes any trace of the record from the table. Check out the
+[deletion section](/docs/writing_data#deletes) for more details.
 ### Soft Deletes
+Soft deletes retain the record key and null out the values for all the other fields. For example, records with nulls in soft deletes are always persisted in storage and never removed.<br/><br/>
+
 <Tabs
 defaultValue="scala"
 values={[
@@ -1028,6 +1030,10 @@ Notice that the save mode is `Append`.
 </TabItem>
 <TabItem value="python">
+:::note
+Notice that the save mode is `Append`.
+:::
+
 ```python
 # pyspark
 from pyspark.sql.functions import lit
@@ -1036,9 +1042,11 @@ from functools import reduce
 spark.read.format("hudi"). \
   load(basePath). \
   createOrReplaceTempView("hudi_trips_snapshot")
+
 # fetch total records count
 spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
 spark.sql("select uuid, partitionpath from hudi_trips_snapshot where rider is not null").count()
+
 # fetch two records for soft deletes
 soft_delete_ds = spark.sql("select * from hudi_trips_snapshot").limit(2)
@@ -1046,6 +1054,8 @@ soft_delete_ds = spark.sql("select * from hudi_trips_snapshot").limit(2)
 meta_columns = ["_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key", \
   "_hoodie_partition_path", "_hoodie_file_name"]
 excluded_columns = meta_columns + ["ts", "uuid", "partitionpath"]
+```
+```python
 nullify_columns = list(filter(lambda field: field[0] not in excluded_columns, \
   list(map(lambda field: (field.name, field.dataType), soft_delete_ds.schema.fields))))
@@ -1079,16 +1089,14 @@ spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
 # This should return (total - 2) count as two records are updated with nulls
 spark.sql("select uuid, partitionpath from hudi_trips_snapshot where rider is not null").count()
 ```
-:::note
-Notice that the save mode is `Append`.
-:::
+
 </TabItem>
 </Tabs
 >
-
 ### Hard Deletes
+Hard deletes physically remove any trace of the record from the table. For example, this deletes records for the HoodieKeys passed in.<br/><br/>
 <Tabs
 defaultValue="scala"
@@ -1155,7 +1163,11 @@ delete from hudi_cow_pt_tbl where name = 'a1';
 </TabItem>
 <TabItem value="python">
-Delete records for the HoodieKeys passed in.<br/>
+
+
+:::note
+Only `Append` mode is supported for delete operation.
+:::
 ```python
 # pyspark
@@ -1188,19 +1200,22 @@ hard_delete_df.write.format("hudi"). \
 roAfterDeleteViewDF = spark. \
   read. \
   format("hudi"). \
-  load(basePath)
-roAfterDeleteViewDF.createOrReplaceTempView("hudi_trips_snapshot")
+  load(basePath)
+```
+
+```python
+roAfterDeleteViewDF.createOrReplaceTempView("hudi_trips_snapshot")
+
 # fetch should return (total - 2) records
 spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
 ```
-:::note
-Only `Append` mode is supported for delete operation.
-:::
+
 </TabItem>
 </Tabs
 >
+
 ## Insert Overwrite
 Generate some new trips, overwrite the all the partitions that are present in the input. This operation can be faster