nsivabalan commented on code in PR #9705: URL: https://github.com/apache/hudi/pull/9705#discussion_r1325292223
##########
website/src/pages/tech-specs.md:
##########
@@ -154,9 +154,36 @@ By reconciling all the actions in the timeline, the state of the Hudi table can
 Hudi automatically extracts the physical data statistics and stores the metadata along with the data to improve write and query performance. Hudi Metadata is an internally-managed table which organizes the table metadata under the base path *.hoodie/metadata.* The metadata is in itself a Hudi table, organized with the Hudi merge-on-read storage format. Every record stored in the metadata table is a Hudi record and hence has partitioning key and record key specified. Following are the metadata table partitions
-- **files** - Partition path to file name index. Key for the Hudi record is the partition path and the actual record is a map of file name to an instance of [HoodieMetadataFileInfo][15]. The files index can be used to do file listing and do filter based pruning of the scanset during query
-- **bloom\_filters** - Bloom filter index to help map a record key to the actual file. The Hudi key is `str_concat(hash(partition name), hash(file name))` and the actual payload is an instance of [HudiMetadataBloomFilter][16]. Bloom filter is used to accelerate 'presence checks' validating whether particular record is present in the file, which is used during merging, hash-based joins, point-lookup queries, etc.
-- **column\_stats** - contains statistics of columns for all the records in the table. This enables fine grained file pruning for filters and join conditions in the query. The actual payload is an instance of [HoodieMetadataColumnStats][17].
+- **files** - Partition path to file name index. Key for the Hudi record is the partition path and the
+actual record is a map of file name to an instance of [HoodieMetadataFileInfo][15]. HoodieMetadataFileInfo has
+fields `size` and `isDeleted` which provide information about size of the file and whether file has been deleted.
+The files index can be used to do file listing and do filter based pruning of the scanset during query.
+- **bloom\_filters** - Bloom filter index to help map a record key to the actual file. The Hudi key is
+`str_concat(hash(partition name), hash(file name))` and the actual payload is an instance of
+[HudiMetadataBloomFilter][16]. HudiMetadataBloomFilter has fields `type`(type code of the bloom filter),
+`timestamp`(timestamp when the bloom filter was created/updated), `bloomFilter`(the actual bloom filter for
+the data file) and `isDeleted`(whether the bloom filter entr is valid). Bloom filter is used to accelerate

Review Comment:
Let's try to represent this pictorially instead of explaining it verbally. Something like a table with field name, data type, and description should also work.

##########
website/src/pages/tech-specs.md:
##########
@@ -154,9 +154,36 @@ By reconciling all the actions in the timeline, the state of the Hudi table can
 Hudi automatically extracts the physical data statistics and stores the metadata along with the data to improve write and query performance. Hudi Metadata is an internally-managed table which organizes the table metadata under the base path *.hoodie/metadata.* The metadata is in itself a Hudi table, organized with the Hudi merge-on-read storage format. Every record stored in the metadata table is a Hudi record and hence has partitioning key and record key specified. Following are the metadata table partitions
-- **files** - Partition path to file name index. Key for the Hudi record is the partition path and the actual record is a map of file name to an instance of [HoodieMetadataFileInfo][15]. The files index can be used to do file listing and do filter based pruning of the scanset during query
-- **bloom\_filters** - Bloom filter index to help map a record key to the actual file. The Hudi key is `str_concat(hash(partition name), hash(file name))` and the actual payload is an instance of [HudiMetadataBloomFilter][16]. Bloom filter is used to accelerate 'presence checks' validating whether particular record is present in the file, which is used during merging, hash-based joins, point-lookup queries, etc.
-- **column\_stats** - contains statistics of columns for all the records in the table. This enables fine grained file pruning for filters and join conditions in the query. The actual payload is an instance of [HoodieMetadataColumnStats][17].
+- **files** - Partition path to file name index. Key for the Hudi record is the partition path and the
+actual record is a map of file name to an instance of [HoodieMetadataFileInfo][15]. HoodieMetadataFileInfo has
+fields `size` and `isDeleted` which provide information about size of the file and whether file has been deleted.

Review Comment:
Instead of explaining this verbally, can you add it as a bullet list or a table? (Check out how we call out the log file format in our tech specs.)

##########
website/src/pages/tech-specs.md:
##########
@@ -154,9 +154,36 @@ By reconciling all the actions in the timeline, the state of the Hudi table can
 Hudi automatically extracts the physical data statistics and stores the metadata along with the data to improve write and query performance. Hudi Metadata is an internally-managed table which organizes the table metadata under the base path *.hoodie/metadata.* The metadata is in itself a Hudi table, organized with the Hudi merge-on-read storage format. Every record stored in the metadata table is a Hudi record and hence has partitioning key and record key specified. Following are the metadata table partitions
-- **files** - Partition path to file name index. Key for the Hudi record is the partition path and the actual record is a map of file name to an instance of [HoodieMetadataFileInfo][15]. The files index can be used to do file listing and do filter based pruning of the scanset during query
-- **bloom\_filters** - Bloom filter index to help map a record key to the actual file. The Hudi key is `str_concat(hash(partition name), hash(file name))` and the actual payload is an instance of [HudiMetadataBloomFilter][16]. Bloom filter is used to accelerate 'presence checks' validating whether particular record is present in the file, which is used during merging, hash-based joins, point-lookup queries, etc.
-- **column\_stats** - contains statistics of columns for all the records in the table. This enables fine grained file pruning for filters and join conditions in the query. The actual payload is an instance of [HoodieMetadataColumnStats][17].
+- **files** - Partition path to file name index. Key for the Hudi record is the partition path and the
+actual record is a map of file name to an instance of [HoodieMetadataFileInfo][15]. HoodieMetadataFileInfo has
+fields `size` and `isDeleted` which provide information about size of the file and whether file has been deleted.
+The files index can be used to do file listing and do filter based pruning of the scanset during query.
+- **bloom\_filters** - Bloom filter index to help map a record key to the actual file. The Hudi key is
+`str_concat(hash(partition name), hash(file name))` and the actual payload is an instance of
+[HudiMetadataBloomFilter][16]. HudiMetadataBloomFilter has fields `type`(type code of the bloom filter),
+`timestamp`(timestamp when the bloom filter was created/updated), `bloomFilter`(the actual bloom filter for
+the data file) and `isDeleted`(whether the bloom filter entr is valid). Bloom filter is used to accelerate
+'presence checks' validating whether particular record is present in the file, which is used during merging,
+hash-based joins, point-lookup queries, etc.
+- **column\_stats** - contains statistics of columns for all the records in the table. This enables fine
+grained file pruning for filters and join conditions in the query. The actual payload is an instance of
+[HoodieMetadataColumnStats][17].
+HoodieMetadataColumnStats has fields `fileName`(file name for which the
+column stat applies), `columnName`(column name for which the column stat apples), `minValue`(minimum value
+of the column in the file), `maxValue`(maximum value of the column in the file), `valueCount`(total count of
+values), `nullCount`(total count of null values), `totalSize`(total storage size on disk), `totalUncompressedSize`
+(total uncompressed storage size on disk) and `isDeleted`(whether the column stat entry is valid).
+- **record\_index** - contains information about record keys and their location in the dataset. This improves
+performance of updates since it provides file locations for the updated records and also enables fine grained
+file pruning for filters and join conditions in the query. The payload is an instance of
+[HoodieRecordIndexInfo][18].
+HoodieRecordIndexInfo has fields `partitionName`(partition name to which the

Review Comment:
Same comment as above.
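
For illustration only, here is a minimal sketch of what the requested table could look like for the `bloom_filters` payload. The field names and descriptions are taken from the quoted diff. The helper name `render_field_table` and the two-column layout are assumptions made for this example, and the data type column is left out because the diff does not state the types.

```python
# Field names and descriptions come from the diff quoted in this review.
# The helper name and the two-column layout are illustrative assumptions;
# data types are omitted because the diff does not state them.
BLOOM_FILTER_FIELDS = [
    ("type", "type code of the bloom filter"),
    ("timestamp", "timestamp when the bloom filter was created/updated"),
    ("bloomFilter", "the actual bloom filter for the data file"),
    ("isDeleted", "whether the bloom filter entry is valid"),
]


def render_field_table(fields):
    """Render (field name, description) pairs as a markdown table."""
    lines = ["| Field | Description |", "| --- | --- |"]
    for name, description in fields:
        lines.append(f"| `{name}` | {description} |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_field_table(BLOOM_FILTER_FIELDS))
```

The same helper could be pointed at the `files`, `column_stats`, and `record_index` field lists to keep the four sections consistent.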
