[
https://issues.apache.org/jira/browse/HUDI-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405891#comment-17405891
]
ASF GitHub Bot commented on HUDI-2317:
--------------------------------------
vinothchandar commented on a change in pull request #3497:
URL: https://github.com/apache/hudi/pull/3497#discussion_r697543536
##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+---
+title: "Virtual keys support in Hudi"
+excerpt: "Supporting Virtual keys in Hudi by reducing storage overhead"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi helps you build and manage data lakes with different table types,
config knobs to cater to everyone's need.
+Hudi adds per record metadata like the record key, partition path, commit time
etc which serves multiple purpose.
+This assists in avoiding re-computing the record key, partition path during
merges, compaction and other table operations
+and also assists in supporting incremental queries. But one of the repeated
asks from the community is to leverage
+existing fields and not to add additional meta fields. So, Hudi is adding
Virtual keys support to cater to such needs.
+<!--truncate-->
+
+# Virtual key support
+Hudi now supports Virtual keys, where Hudi meta fields can be computed on
demand from existing user
+fields for all records. In regular path, these are computed once and stored as
per record metadata and re-used during
+various operations like merging incoming records to those in storage,
compaction, etc. Hudi also stores commit time at
+record level to support incremental queries. If one does not need incremental
support, they can start leverageing
+Hudi's Virutal key support and still go about using Hudi to build and manage
their data lake to reduce the storage
+overhead due to per record metadata.
+
+## Configurations
+Virtual keys can be enabled for a given table using the below config. When
disabled,
+Hudi will enforce virtual keys for the corresponding table. Default value for
this config is true, which means, all
+meta fields will be added by default. <br/> <br/>
+`"hoodie.populate.meta.fields"`
+
+Note:
+Once virtual keys are enabled, it can't be disabled for a given hudi table,
because already stored records may not have
+the meta fields populated. But if you have an existing table from an older
version of hudi, virtual keys can be enabled.
+Just that going back is not feasible.
+Another constraint wrt virtual key support is that, Key generator properties
for a given table cannot be changed through
+the course of the lifecycle of a given hudi table.
+For instance, if you configure record key to point to field5 for few batches
of write and later switch to field10,
+it may not pan out well with hudi table where virtual keys are enabled.
+
+As its evident, record keys and partition path will have to be re-computed
everytime when in need (merges, compaction,
+MOR snapshot read). Hence we are supporting only built-in key generators with
Virtual Keys for COW table type. Incase of
+MOR, we support only SimpleKeyGenerator (i.e. both record key and partition
path has to refer
+to an existing user field ) for now. If we zoom into Merge On Read table's
snapshot query, hudi does real time merging of base
+data file with records from delta log files and hence query latencies will
shoot up if we were to support all different
+types of key generators.
+
+### Supported Key Generators with CopyOnWrite(COW) table:
+SimpleKeyGenerator, ComplexKeyGenerator, CustomKeyGenerator,
TimestampBasedKeyGenerator and NonPartitionedKeyGenerator.
+
+### Supported Key Generators with MergeOnRead(MOR) table:
+SimpleKeyGenerator
+
+### Supported Index types:
+Only "SIMPLE" and "GLOBAL_SIMPLE" index types are supported in the first cut.
We plan to add support for other index
+(BLOOM, etc) in future releases.
+
+## Supported Operations
+Good news is that, all existing operations are supported for a hudi table with
virtual keys except the incremental
+query support. Which means, cleaning, archiving, metadata table, clustering,
etc can be enabled for a hudi table with
+virtual keys enabled. So, if one's requirement fits into this model, would
recommend using virtual keys as it reduces
+the storage overhead.
+
+## Code snippet
+We can go through our quick start and see how it plays out when virtual keys
are enabled.
+
+### Inserts
+```
+// spark-shell
+import org.apache.hudi.QuickstartUtils._
+import scala.collection.JavaConversions._
+import org.apache.spark.sql.SaveMode._
+import org.apache.hudi.DataSourceReadOptions._
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.config.HoodieWriteConfig._
+
+val tableName = "hudi_trips_cow"
+val basePath = "file:///tmp/hudi_trips_cow"
+val dataGen = new DataGenerator
+
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("hudi").
+ options(getQuickstartWriteConfigs).
+ option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+ option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+ option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+ option(TABLE_NAME.key(), tableName).
+ option("hoodie.populate.meta.fields", "false").
+ option("hoodie.index.type","SIMPLE").
+ mode(Overwrite).
+ save(basePath)
+```
+
+### Query
Review comment:
Are we just trying to show that queries work? If so, let's remove this.
##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+---
+title: "Virtual keys support in Hudi"
+excerpt: "Supporting Virtual keys in Hudi by reducing storage overhead"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi helps you build and manage data lakes with different table types,
config knobs to cater to everyone's need.
+Hudi adds per record metadata like the record key, partition path, commit time
etc which serves multiple purpose.
+This assists in avoiding re-computing the record key, partition path during
merges, compaction and other table operations
+and also assists in supporting incremental queries. But one of the repeated
asks from the community is to leverage
+existing fields and not to add additional meta fields. So, Hudi is adding
Virtual keys support to cater to such needs.
+<!--truncate-->
+
+# Virtual key support
+Hudi now supports Virtual keys, where Hudi meta fields can be computed on
demand from existing user
+fields for all records. In regular path, these are computed once and stored as
per record metadata and re-used during
+various operations like merging incoming records to those in storage,
compaction, etc. Hudi also stores commit time at
+record level to support incremental queries. If one does not need incremental
support, they can start leverageing
+Hudi's Virutal key support and still go about using Hudi to build and manage
their data lake to reduce the storage
+overhead due to per record metadata.
+
+## Configurations
+Virtual keys can be enabled for a given table using the below config. When
disabled,
+Hudi will enforce virtual keys for the corresponding table. Default value for
this config is true, which means, all
+meta fields will be added by default. <br/> <br/>
+`"hoodie.populate.meta.fields"`
+
+Note:
+Once virtual keys are enabled, it can't be disabled for a given hudi table,
because already stored records may not have
+the meta fields populated. But if you have an existing table from an older
version of hudi, virtual keys can be enabled.
+Just that going back is not feasible.
+Another constraint wrt virtual key support is that, Key generator properties
for a given table cannot be changed through
+the course of the lifecycle of a given hudi table.
+For instance, if you configure record key to point to field5 for few batches
of write and later switch to field10,
+it may not pan out well with hudi table where virtual keys are enabled.
+
+As its evident, record keys and partition path will have to be re-computed
everytime when in need (merges, compaction,
+MOR snapshot read). Hence we are supporting only built-in key generators with
Virtual Keys for COW table type. Incase of
+MOR, we support only SimpleKeyGenerator (i.e. both record key and partition
path has to refer
+to an existing user field ) for now. If we zoom into Merge On Read table's
snapshot query, hudi does real time merging of base
+data file with records from delta log files and hence query latencies will
shoot up if we were to support all different
+types of key generators.
+
+### Supported Key Generators with CopyOnWrite(COW) table:
+SimpleKeyGenerator, ComplexKeyGenerator, CustomKeyGenerator,
TimestampBasedKeyGenerator and NonPartitionedKeyGenerator.
+
+### Supported Key Generators with MergeOnRead(MOR) table:
+SimpleKeyGenerator
+
+### Supported Index types:
+Only "SIMPLE" and "GLOBAL_SIMPLE" index types are supported in the first cut.
We plan to add support for other index
+(BLOOM, etc) in future releases.
+
+## Supported Operations
+Good news is that, all existing operations are supported for a hudi table with
virtual keys except the incremental
+query support. Which means, cleaning, archiving, metadata table, clustering,
etc can be enabled for a hudi table with
+virtual keys enabled. So, if one's requirement fits into this model, would
recommend using virtual keys as it reduces
+the storage overhead.
+
+## Code snippet
Review comment:
Can we use a much simpler quickstart example, using some other schema?
##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+---
+title: "Virtual keys support in Hudi"
+excerpt: "Supporting Virtual keys in Hudi by reducing storage overhead"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi helps you build and manage data lakes with different table types,
config knobs to cater to everyone's need.
+Hudi adds per record metadata like the record key, partition path, commit time
etc which serves multiple purpose.
+This assists in avoiding re-computing the record key, partition path during
merges, compaction and other table operations
+and also assists in supporting incremental queries. But one of the repeated
asks from the community is to leverage
+existing fields and not to add additional meta fields. So, Hudi is adding
Virtual keys support to cater to such needs.
+<!--truncate-->
+
+# Virtual key support
+Hudi now supports Virtual keys, where Hudi meta fields can be computed on
demand from existing user
+fields for all records. In regular path, these are computed once and stored as
per record metadata and re-used during
+various operations like merging incoming records to those in storage,
compaction, etc. Hudi also stores commit time at
+record level to support incremental queries. If one does not need incremental
support, they can start leverageing
+Hudi's Virutal key support and still go about using Hudi to build and manage
their data lake to reduce the storage
+overhead due to per record metadata.
+
+## Configurations
+Virtual keys can be enabled for a given table using the below config. When
disabled,
+Hudi will enforce virtual keys for the corresponding table. Default value for
this config is true, which means, all
+meta fields will be added by default. <br/> <br/>
+`"hoodie.populate.meta.fields"`
+
+Note:
+Once virtual keys are enabled, it can't be disabled for a given hudi table,
because already stored records may not have
+the meta fields populated. But if you have an existing table from an older
version of hudi, virtual keys can be enabled.
+Just that going back is not feasible.
+Another constraint wrt virtual key support is that, Key generator properties
for a given table cannot be changed through
+the course of the lifecycle of a given hudi table.
+For instance, if you configure record key to point to field5 for few batches
of write and later switch to field10,
+it may not pan out well with hudi table where virtual keys are enabled.
+
+As its evident, record keys and partition path will have to be re-computed
everytime when in need (merges, compaction,
+MOR snapshot read). Hence we are supporting only built-in key generators with
Virtual Keys for COW table type. Incase of
+MOR, we support only SimpleKeyGenerator (i.e. both record key and partition
path has to refer
+to an existing user field ) for now. If we zoom into Merge On Read table's
snapshot query, hudi does real time merging of base
+data file with records from delta log files and hence query latencies will
shoot up if we were to support all different
+types of key generators.
+
+### Supported Key Generators with CopyOnWrite(COW) table:
+SimpleKeyGenerator, ComplexKeyGenerator, CustomKeyGenerator,
TimestampBasedKeyGenerator and NonPartitionedKeyGenerator.
+
+### Supported Key Generators with MergeOnRead(MOR) table:
+SimpleKeyGenerator
Review comment:
Let's please use actual config names and values, and avoid referring
loosely to class names out of context. I would argue these are downright
hostile to reader friendliness :)
##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+---
+title: "Virtual keys support in Hudi"
+excerpt: "Supporting Virtual keys in Hudi by reducing storage overhead"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi helps you build and manage data lakes with different table types,
config knobs to cater to everyone's need.
+Hudi adds per record metadata like the record key, partition path, commit time
etc which serves multiple purpose.
+This assists in avoiding re-computing the record key, partition path during
merges, compaction and other table operations
+and also assists in supporting incremental queries. But one of the repeated
asks from the community is to leverage
+existing fields and not to add additional meta fields. So, Hudi is adding
Virtual keys support to cater to such needs.
+<!--truncate-->
+
+# Virtual key support
+Hudi now supports Virtual keys, where Hudi meta fields can be computed on
demand from existing user
+fields for all records. In regular path, these are computed once and stored as
per record metadata and re-used during
+various operations like merging incoming records to those in storage,
compaction, etc. Hudi also stores commit time at
+record level to support incremental queries. If one does not need incremental
support, they can start leverageing
+Hudi's Virutal key support and still go about using Hudi to build and manage
their data lake to reduce the storage
+overhead due to per record metadata.
+
+## Configurations
+Virtual keys can be enabled for a given table using the below config. When
disabled,
+Hudi will enforce virtual keys for the corresponding table. Default value for
this config is true, which means, all
+meta fields will be added by default. <br/> <br/>
+`"hoodie.populate.meta.fields"`
+
+Note:
+Once virtual keys are enabled, it can't be disabled for a given hudi table,
because already stored records may not have
+the meta fields populated. But if you have an existing table from an older
version of hudi, virtual keys can be enabled.
+Just that going back is not feasible.
+Another constraint wrt virtual key support is that, Key generator properties
for a given table cannot be changed through
+the course of the lifecycle of a given hudi table.
+For instance, if you configure record key to point to field5 for few batches
of write and later switch to field10,
+it may not pan out well with hudi table where virtual keys are enabled.
+
+As its evident, record keys and partition path will have to be re-computed
everytime when in need (merges, compaction,
+MOR snapshot read). Hence we are supporting only built-in key generators with
Virtual Keys for COW table type. Incase of
+MOR, we support only SimpleKeyGenerator (i.e. both record key and partition
path has to refer
+to an existing user field ) for now. If we zoom into Merge On Read table's
snapshot query, hudi does real time merging of base
+data file with records from delta log files and hence query latencies will
shoot up if we were to support all different
+types of key generators.
+
+### Supported Key Generators with CopyOnWrite(COW) table:
+SimpleKeyGenerator, ComplexKeyGenerator, CustomKeyGenerator,
TimestampBasedKeyGenerator and NonPartitionedKeyGenerator.
+
+### Supported Key Generators with MergeOnRead(MOR) table:
+SimpleKeyGenerator
+
+### Supported Index types:
+Only "SIMPLE" and "GLOBAL_SIMPLE" index types are supported in the first cut.
We plan to add support for other index
+(BLOOM, etc) in future releases.
+
+## Supported Operations
+Good news is that, all existing operations are supported for a hudi table with
virtual keys except the incremental
+query support. Which means, cleaning, archiving, metadata table, clustering,
etc can be enabled for a hudi table with
+virtual keys enabled. So, if one's requirement fits into this model, would
recommend using virtual keys as it reduces
+the storage overhead.
+
+## Code snippet
+We can go through our quick start and see how it plays out when virtual keys
are enabled.
+
+### Inserts
+```
+// spark-shell
+import org.apache.hudi.QuickstartUtils._
+import scala.collection.JavaConversions._
+import org.apache.spark.sql.SaveMode._
+import org.apache.hudi.DataSourceReadOptions._
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.config.HoodieWriteConfig._
+
+val tableName = "hudi_trips_cow"
+val basePath = "file:///tmp/hudi_trips_cow"
+val dataGen = new DataGenerator
+
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("hudi").
+ options(getQuickstartWriteConfigs).
+ option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+ option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+ option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+ option(TABLE_NAME.key(), tableName).
+ option("hoodie.populate.meta.fields", "false").
+ option("hoodie.index.type","SIMPLE").
+ mode(Overwrite).
+ save(basePath)
+```
+
+### Query
+```
+val tripsSnapshotDF = spark.
+ read.
+ format("hudi").
+ load(basePath + "/*/*/*/*")
+//load(basePath) use "/partitionKey=partitionValue" folder structure for Spark auto partition discovery
+tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
+```
+
+```
+spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show()
+```
+
+#### Output
Review comment:
Can we remove this `Output` subheading?
##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+---
+title: "Virtual keys support in Hudi"
+excerpt: "Supporting Virtual keys in Hudi by reducing storage overhead"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi helps you build and manage data lakes with different table types,
config knobs to cater to everyone's need.
+Hudi adds per record metadata like the record key, partition path, commit time
etc which serves multiple purpose.
+This assists in avoiding re-computing the record key, partition path during
merges, compaction and other table operations
+and also assists in supporting incremental queries. But one of the repeated
asks from the community is to leverage
+existing fields and not to add additional meta fields. So, Hudi is adding
Virtual keys support to cater to such needs.
+<!--truncate-->
+
+# Virtual key support
+Hudi now supports Virtual keys, where Hudi meta fields can be computed on
demand from existing user
+fields for all records. In regular path, these are computed once and stored as
per record metadata and re-used during
+various operations like merging incoming records to those in storage,
compaction, etc. Hudi also stores commit time at
+record level to support incremental queries. If one does not need incremental
support, they can start leverageing
+Hudi's Virutal key support and still go about using Hudi to build and manage
their data lake to reduce the storage
+overhead due to per record metadata.
+
+## Configurations
+Virtual keys can be enabled for a given table using the below config. When
disabled,
+Hudi will enforce virtual keys for the corresponding table. Default value for
this config is true, which means, all
+meta fields will be added by default. <br/> <br/>
+`"hoodie.populate.meta.fields"`
+
+Note:
+Once virtual keys are enabled, it can't be disabled for a given hudi table,
because already stored records may not have
+the meta fields populated. But if you have an existing table from an older
version of hudi, virtual keys can be enabled.
+Just that going back is not feasible.
+Another constraint wrt virtual key support is that, Key generator properties
for a given table cannot be changed through
+the course of the lifecycle of a given hudi table.
+For instance, if you configure record key to point to field5 for few batches
of write and later switch to field10,
+it may not pan out well with hudi table where virtual keys are enabled.
+
+As its evident, record keys and partition path will have to be re-computed
everytime when in need (merges, compaction,
+MOR snapshot read). Hence we are supporting only built-in key generators with
Virtual Keys for COW table type. Incase of
+MOR, we support only SimpleKeyGenerator (i.e. both record key and partition
path has to refer
+to an existing user field ) for now. If we zoom into Merge On Read table's
snapshot query, hudi does real time merging of base
+data file with records from delta log files and hence query latencies will
shoot up if we were to support all different
+types of key generators.
+
+### Supported Key Generators with CopyOnWrite(COW) table:
Review comment:
Do they need to be sections? Can we do a table? It's easy to convey these
things in a table.
##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+---
+title: "Virtual keys support in Hudi"
+excerpt: "Supporting Virtual keys in Hudi by reducing storage overhead"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi helps you build and manage data lakes with different table types,
config knobs to cater to everyone's need.
+Hudi adds per record metadata like the record key, partition path, commit time
etc which serves multiple purpose.
+This assists in avoiding re-computing the record key, partition path during
merges, compaction and other table operations
+and also assists in supporting incremental queries. But one of the repeated
asks from the community is to leverage
+existing fields and not to add additional meta fields. So, Hudi is adding
Virtual keys support to cater to such needs.
+<!--truncate-->
+
+# Virtual key support
+Hudi now supports Virtual keys, where Hudi meta fields can be computed on
demand from existing user
+fields for all records. In regular path, these are computed once and stored as
per record metadata and re-used during
+various operations like merging incoming records to those in storage,
compaction, etc. Hudi also stores commit time at
+record level to support incremental queries. If one does not need incremental
support, they can start leverageing
+Hudi's Virutal key support and still go about using Hudi to build and manage
their data lake to reduce the storage
+overhead due to per record metadata.
+
+## Configurations
+Virtual keys can be enabled for a given table using the below config. When
disabled,
+Hudi will enforce virtual keys for the corresponding table. Default value for
this config is true, which means, all
+meta fields will be added by default. <br/> <br/>
+`"hoodie.populate.meta.fields"`
+
+Note:
+Once virtual keys are enabled, it can't be disabled for a given hudi table,
because already stored records may not have
+the meta fields populated. But if you have an existing table from an older
version of hudi, virtual keys can be enabled.
+Just that going back is not feasible.
+Another constraint wrt virtual key support is that, Key generator properties
for a given table cannot be changed through
+the course of the lifecycle of a given hudi table.
+For instance, if you configure record key to point to field5 for few batches
of write and later switch to field10,
+it may not pan out well with hudi table where virtual keys are enabled.
+
+As its evident, record keys and partition path will have to be re-computed
everytime when in need (merges, compaction,
+MOR snapshot read). Hence we are supporting only built-in key generators with
Virtual Keys for COW table type. Incase of
+MOR, we support only SimpleKeyGenerator (i.e. both record key and partition
path has to refer
+to an existing user field ) for now. If we zoom into Merge On Read table's
snapshot query, hudi does real time merging of base
+data file with records from delta log files and hence query latencies will
shoot up if we were to support all different
+types of key generators.
+
+### Supported Key Generators with CopyOnWrite(COW) table:
+SimpleKeyGenerator, ComplexKeyGenerator, CustomKeyGenerator,
TimestampBasedKeyGenerator and NonPartitionedKeyGenerator.
+
+### Supported Key Generators with MergeOnRead(MOR) table:
+SimpleKeyGenerator
+
+### Supported Index types:
+Only "SIMPLE" and "GLOBAL_SIMPLE" index types are supported in the first cut.
We plan to add support for other index
+(BLOOM, etc) in future releases.
+
+## Supported Operations
+Good news is that, all existing operations are supported for a hudi table with
virtual keys except the incremental
+query support. Which means, cleaning, archiving, metadata table, clustering,
etc can be enabled for a hudi table with
+virtual keys enabled. So, if one's requirement fits into this model, would
recommend using virtual keys as it reduces
+the storage overhead.
+
+## Code snippet
+We can go through our quick start and see how it plays out when virtual keys
are enabled.
+
+### Inserts
+```
+// spark-shell
+import org.apache.hudi.QuickstartUtils._
+import scala.collection.JavaConversions._
+import org.apache.spark.sql.SaveMode._
+import org.apache.hudi.DataSourceReadOptions._
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.config.HoodieWriteConfig._
+
+val tableName = "hudi_trips_cow"
+val basePath = "file:///tmp/hudi_trips_cow"
+val dataGen = new DataGenerator
+
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("hudi").
+ options(getQuickstartWriteConfigs).
+ option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+ option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+ option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+ option(TABLE_NAME.key(), tableName).
+ option("hoodie.populate.meta.fields", "false").
+ option("hoodie.index.type","SIMPLE").
+ mode(Overwrite).
+ save(basePath)
+```
+
+### Query
+```
+val tripsSnapshotDF = spark.
+ read.
+ format("hudi").
+ load(basePath + "/*/*/*/*")
+//load(basePath) use "/partitionKey=partitionValue" folder structure for Spark auto partition discovery
+tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
+```
+
+```
+spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show()
+```
+
+#### Output
+```
++------------------+-------------------+-------------------+-------------+
+| fare| begin_lon| begin_lat| ts|
++------------------+-------------------+-------------------+-------------+
+| 27.79478688582596| 0.6273212202489661|0.11488393157088261|1628951609798|
+| 93.56018115236618|0.14285051259466197|0.21624150367601136|1629012489526|
+| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|1629163264651|
+| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|1628701606278|
+| 43.4923811219014| 0.8779402295427752| 0.6100070562136587|1628787101240|
+| 66.62084366450246|0.03844104444445928| 0.0750588760043035|1628802740084|
+|34.158284716382845|0.46157858450465483| 0.4726905879569653|1629018593339|
+| 41.06290929046368| 0.8192868687714224| 0.651058505660742|1629131594334|
++------------------+-------------------+-------------------+-------------+
+```
+
+```
+spark.sql("select uuid, partitionpath, rider, driver, fare from hudi_trips_snapshot").show(false)
+```
+
+#### Output
+```
++------------------------------------+------------------------------------+---------+----------+------------------+
+|uuid                                |partitionpath                       |rider    |driver    |fare              |
++------------------------------------+------------------------------------+---------+----------+------------------+
+|eb7819f1-6f04-429d-8371-df77620b9527|americas/united_states/san_francisco|rider-213|driver-213|27.79478688582596 |
+|37ea44f1-fda7-4ec4-84de-f43f5b5a4d84|americas/united_states/san_francisco|rider-213|driver-213|19.179139106643607|
+|aa601d6b-7cc5-4b82-9687-675d0081616e|americas/united_states/san_francisco|rider-213|driver-213|93.56018115236618 |
+|494bc080-881c-48be-8f8a-8f1739781816|americas/united_states/san_francisco|rider-213|driver-213|33.92216483948643 |
+|09573277-e1c1-4cdd-9b45-57176f184d4d|americas/united_states/san_francisco|rider-213|driver-213|64.27696295884016 |
+|c9b055ed-cd28-4397-9704-93da8b2e601f|americas/brazil/sao_paulo           |rider-213|driver-213|43.4923811219014  |
+|e707355a-b8c0-432d-a80f-723b93dc13a8|americas/brazil/sao_paulo           |rider-213|driver-213|66.62084366450246 |
+|d3c39c9e-d128-497a-bf3e-368882f45c28|americas/brazil/sao_paulo           |rider-213|driver-213|34.158284716382845|
+|159441b0-545b-460a-b671-7cc2d509f47b|asia/india/chennai                  |rider-213|driver-213|41.06290929046368 |
+|16031faf-ad8d-4968-90ff-16cead211d3c|asia/india/chennai                  |rider-213|driver-213|17.851135255091155|
++------------------------------------+------------------------------------+---------+----------+------------------+
+```
+
+```
+spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
+```
+
+#### Output
+```
++-------------------+------------------+----------------------+---------+----------+------------------+
+|_hoodie_commit_time|_hoodie_record_key|_hoodie_partition_path|    rider|    driver|              fare|
++-------------------+------------------+----------------------+---------+----------+------------------+
+|               null|              null|                  null|rider-213|driver-213|19.179139106643607|
+|               null|              null|                  null|rider-213|driver-213| 33.92216483948643|
+|               null|              null|                  null|rider-213|driver-213| 27.79478688582596|
+|               null|              null|                  null|rider-213|driver-213| 64.27696295884016|
+|               null|              null|                  null|rider-213|driver-213| 93.56018115236618|
+|               null|              null|                  null|rider-213|driver-213| 66.62084366450246|
+|               null|              null|                  null|rider-213|driver-213|  43.4923811219014|
+|               null|              null|                  null|rider-213|driver-213|34.158284716382845|
+|               null|              null|                  null|rider-213|driver-213|17.851135255091155|
+|               null|              null|                  null|rider-213|driver-213| 41.06290929046368|
++-------------------+------------------+----------------------+---------+----------+------------------+
+```
+Note: all meta fields are null in storage.
Review comment:
These `Note:`-style offhand comments actually interfere a fair bit with
reading flow. :)
##########
File path: website/blog/2021-08-18-virtual-keys.md
##########
@@ -0,0 +1,299 @@
+---
+title: "Virtual keys support in Hudi"
+excerpt: "Supporting Virtual keys in Hudi by reducing storage overhead"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi helps you build and manage data lakes with different table types,
config knobs to cater to everyone's need.
+Hudi adds per record metadata like the record key, partition path, commit time
etc which serves multiple purpose.
+This assists in avoiding re-computing the record key, partition path during
merges, compaction and other table operations
+and also assists in supporting incremental queries. But one of the repeated
asks from the community is to leverage
+existing fields and not to add additional meta fields. So, Hudi is adding
Virtual keys support to cater to such needs.
+<!--truncate-->
+
+# Virtual key support
+Hudi now supports virtual keys, where Hudi meta fields can be computed on demand from existing user
+fields for all records. In the regular path, these are computed once, stored as per-record metadata, and re-used during
+various operations like merging incoming records with those in storage, compaction, etc. Hudi also stores commit time at
+the record level to support incremental queries. If one does not need incremental support, they can start leveraging
+Hudi's virtual key support and still go about using Hudi to build and manage their data lake, reducing the storage
+overhead due to per-record metadata.
+
+## Configurations
+Virtual keys can be enabled for a given table using the below config. When this config is set to false,
+Hudi will enforce virtual keys for the corresponding table. The default value for this config is true, which means all
+meta fields will be added by default. <br/> <br/>
+`"hoodie.populate.meta.fields"`
+
+Note:
+Once virtual keys are enabled, they can't be disabled for a given Hudi table, because already stored records may not have
+the meta fields populated. But if you have an existing table from an older version of Hudi, virtual keys can be enabled;
+it is only going back that is not feasible.
+Another constraint with virtual key support is that key generator properties for a given table cannot be changed over
+the lifecycle of that Hudi table.
+For instance, if you configure the record key to point to field5 for a few batches of writes and later switch to field10,
+it may not pan out well for a Hudi table where virtual keys are enabled.
+
+As is evident, record keys and partition paths will have to be re-computed every time they are needed (merges, compaction,
+MOR snapshot read). Hence we are supporting only built-in key generators with virtual keys for the COW table type. In case of
+MOR, we support only SimpleKeyGenerator (i.e. both record key and partition path have to refer
+to an existing user field) for now. If we zoom into a Merge On Read table's snapshot query, Hudi does real-time merging of the base
+data file with records from delta log files, and hence query latencies would shoot up if we were to support all the different
+types of key generators.
+
+### Supported Key Generators with CopyOnWrite(COW) table:
+SimpleKeyGenerator, ComplexKeyGenerator, CustomKeyGenerator, TimestampBasedKeyGenerator and NonPartitionedKeyGenerator.
+
+### Supported Key Generators with MergeOnRead(MOR) table:
+SimpleKeyGenerator
+
+### Supported Index types:
+Only "SIMPLE" and "GLOBAL_SIMPLE" index types are supported in the first cut. We plan to add support for other index
+types (BLOOM, etc.) in future releases.
+
+## Supported Operations
+The good news is that all existing operations are supported for a Hudi table with virtual keys, except
+incremental queries. This means cleaning, archiving, metadata table, clustering, etc. can be enabled for a Hudi table with
+virtual keys enabled. So, if one's requirements fit into this model, we recommend using virtual keys as it reduces
+the storage overhead.
+
+## Code snippet
+We can go through our quick start and see how it plays out when virtual keys are enabled.
+
+### Inserts
+```
+// spark-shell
+import org.apache.hudi.QuickstartUtils._
+import scala.collection.JavaConversions._
+import org.apache.spark.sql.SaveMode._
+import org.apache.hudi.DataSourceReadOptions._
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.config.HoodieWriteConfig._
+
+val tableName = "hudi_trips_cow"
+val basePath = "file:///tmp/hudi_trips_cow"
+val dataGen = new DataGenerator
+
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("hudi").
+ options(getQuickstartWriteConfigs).
+ option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+ option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+ option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+ option(TABLE_NAME.key(), tableName).
+ option("hoodie.populate.meta.fields", "false").
+ option("hoodie.index.type","SIMPLE").
+ mode(Overwrite).
+ save(basePath)
+```
+
+### Query
Review comment:
I feel we can just show that fields are null and incremental queries
will fail. why go over the entire quickstart? it feels like adding little
value, while increasing the length of the blog.
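
A minimal sketch of what that shorter example might look like, continuing in the same spark-shell session as the Inserts snippet. It assumes `spark` and `basePath` are still in scope, that the quickstart columns are named `rider`, `driver` and `fare`, and that loading from `basePath` without partition globs works on the Hudi version in use; the exact failure raised by the incremental read is not asserted here, only captured and printed.

```scala
// spark-shell, after the Inserts snippet has written the table
// (assumption: `spark` and `basePath` are already defined in this session)
import org.apache.hudi.DataSourceReadOptions._
import scala.util.Try

// Snapshot read: with hoodie.populate.meta.fields=false, the meta columns
// are expected to come back as null.
val tripsDF = spark.read.format("hudi").load(basePath)
tripsDF.select(
  "_hoodie_commit_time", "_hoodie_record_key", "_hoodie_partition_path",
  "rider", "driver", "fare"   // assumed quickstart column names
).show()

// Incremental read: expected to fail, since commit time is not stored per record.
// Wrapped in Try so the shell session survives and the outcome is printed.
val incremental = Try {
  spark.read.format("hudi").
    option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
    option(BEGIN_INSTANTTIME_OPT_KEY, "0").
    load(basePath).
    count()
}
println(incremental)
```

Two short reads like these would cover both points (null meta fields, no incremental support) without repeating the rest of the quickstart.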
> Publish a blog on virtual keys
> ------------------------------
>
> Key: HUDI-2317
> URL: https://issues.apache.org/jira/browse/HUDI-2317
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Docs
> Reporter: sivabalan narayanan
> Priority: Major
> Labels: pull-request-available
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)