[GitHub] [hudi] nsivabalan commented on a diff in pull request #9705: [DOCS] Add Record Index Metadata partition documentation and other schema details

via GitHub Wed, 13 Sep 2023 19:47:30 -0700


nsivabalan commented on code in PR #9705:
URL: https://github.com/apache/hudi/pull/9705#discussion_r1325292223



##########
website/src/pages/tech-specs.md:
##########
@@ -154,9 +154,36 @@ By reconciling all the actions in the timeline, the state 
of the Hudi table can
 
 Hudi automatically extracts the physical data statistics and stores the 
metadata along with the data to improve write and query performance. Hudi 
Metadata is an internally-managed table which organizes the table metadata 
under the base path *.hoodie/metadata.* The metadata is in itself a Hudi table, 
organized with the Hudi merge-on-read storage format. Every record stored in 
the metadata table is a Hudi record and hence has partitioning key and record 
key specified. Following are the metadata table partitions
 
-- **files** - Partition path to file name index. Key for the Hudi record is 
the partition path and the actual record is a map of file name to an instance 
of [HoodieMetadataFileInfo][15]. The files index can be used to do file listing 
and do filter based pruning of the scanset during query
-- **bloom\_filters** - Bloom filter index to help map a record key to the 
actual file. The Hudi key is `str_concat(hash(partition name), hash(file 
name))` and the actual payload is an instance of [HudiMetadataBloomFilter][16]. 
Bloom filter is used to accelerate 'presence checks' validating whether 
particular record is present in the file, which is used during merging, 
hash-based joins, point-lookup queries, etc.
-- **column\_stats** - contains statistics of columns for all the records in 
the table. This enables fine grained file pruning for filters and join 
conditions in the query. The actual payload is an instance of 
[HoodieMetadataColumnStats][17]. 
+- **files** - Partition path to file name index. Key for the Hudi record is 
the partition path and the 
+actual record is a map of file name to an instance of 
[HoodieMetadataFileInfo][15]. HoodieMetadataFileInfo has 
+fields `size` and `isDeleted` which provide information about size of the file 
and whether file has been deleted. 
+The files index can be used to do file listing and do filter based pruning of 
the scanset during query.
+- **bloom\_filters** - Bloom filter index to help map a record key to the 
actual file. The Hudi key is 
+`str_concat(hash(partition name), hash(file name))` and the actual payload is 
an instance of 
+[HudiMetadataBloomFilter][16]. HudiMetadataBloomFilter has fields `type`(type 
code of the bloom filter), 
+`timestamp`(timestamp when the bloom filter was created/updated), 
`bloomFilter`(the actual bloom filter for 
+the data file) and `isDeleted`(whether the bloom filter entr is valid). Bloom 
filter is used to accelerate 

Review Comment:
   Let's try to represent this pictorially instead of explaining it verbally. 
Something like a table with field name, data type, and description should also work.
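
   A minimal sketch of the kind of table meant here, assuming only the field 
names and descriptions quoted in the diff above. The helper `render_field_table` 
and the `BLOOM_FILTER_FIELDS` list are hypothetical names used for illustration 
and are not part of Hudi. A data type column is left out because the diff does 
not state the types.

```python
def render_field_table(fields):
    """Render (field name, description) pairs as a Markdown table string."""
    lines = ["| Field | Description |", "|---|---|"]
    for name, description in fields:
        lines.append(f"| `{name}` | {description} |")
    return "\n".join(lines)


# Field names and descriptions taken from the bloom_filters text in the diff.
# The list name itself is a hypothetical, illustration-only identifier.
BLOOM_FILTER_FIELDS = [
    ("type", "type code of the bloom filter"),
    ("timestamp", "timestamp when the bloom filter was created/updated"),
    ("bloomFilter", "the actual bloom filter for the data file"),
    ("isDeleted", "whether the bloom filter entry is valid"),
]

if __name__ == "__main__":
    print(render_field_table(BLOOM_FILTER_FIELDS))
```

   The same helper could be reused for the other payload types in this diff by 
passing a different list of pairs.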



##########
website/src/pages/tech-specs.md:
##########
@@ -154,9 +154,36 @@ By reconciling all the actions in the timeline, the state 
of the Hudi table can
 
 Hudi automatically extracts the physical data statistics and stores the 
metadata along with the data to improve write and query performance. Hudi 
Metadata is an internally-managed table which organizes the table metadata 
under the base path *.hoodie/metadata.* The metadata is in itself a Hudi table, 
organized with the Hudi merge-on-read storage format. Every record stored in 
the metadata table is a Hudi record and hence has partitioning key and record 
key specified. Following are the metadata table partitions
 
-- **files** - Partition path to file name index. Key for the Hudi record is 
the partition path and the actual record is a map of file name to an instance 
of [HoodieMetadataFileInfo][15]. The files index can be used to do file listing 
and do filter based pruning of the scanset during query
-- **bloom\_filters** - Bloom filter index to help map a record key to the 
actual file. The Hudi key is `str_concat(hash(partition name), hash(file 
name))` and the actual payload is an instance of [HudiMetadataBloomFilter][16]. 
Bloom filter is used to accelerate 'presence checks' validating whether 
particular record is present in the file, which is used during merging, 
hash-based joins, point-lookup queries, etc.
-- **column\_stats** - contains statistics of columns for all the records in 
the table. This enables fine grained file pruning for filters and join 
conditions in the query. The actual payload is an instance of 
[HoodieMetadataColumnStats][17]. 
+- **files** - Partition path to file name index. Key for the Hudi record is 
the partition path and the 
+actual record is a map of file name to an instance of 
[HoodieMetadataFileInfo][15]. HoodieMetadataFileInfo has 
+fields `size` and `isDeleted` which provide information about size of the file 
and whether file has been deleted. 

Review Comment:
   Instead of explaining this verbally, can you add it as a bullet list or a 
table? (Check out how we call out the log file format in our tech specs.)



##########
website/src/pages/tech-specs.md:
##########
@@ -154,9 +154,36 @@ By reconciling all the actions in the timeline, the state 
of the Hudi table can
 
 Hudi automatically extracts the physical data statistics and stores the 
metadata along with the data to improve write and query performance. Hudi 
Metadata is an internally-managed table which organizes the table metadata 
under the base path *.hoodie/metadata.* The metadata is in itself a Hudi table, 
organized with the Hudi merge-on-read storage format. Every record stored in 
the metadata table is a Hudi record and hence has partitioning key and record 
key specified. Following are the metadata table partitions
 
-- **files** - Partition path to file name index. Key for the Hudi record is 
the partition path and the actual record is a map of file name to an instance 
of [HoodieMetadataFileInfo][15]. The files index can be used to do file listing 
and do filter based pruning of the scanset during query
-- **bloom\_filters** - Bloom filter index to help map a record key to the 
actual file. The Hudi key is `str_concat(hash(partition name), hash(file 
name))` and the actual payload is an instance of [HudiMetadataBloomFilter][16]. 
Bloom filter is used to accelerate 'presence checks' validating whether 
particular record is present in the file, which is used during merging, 
hash-based joins, point-lookup queries, etc.
-- **column\_stats** - contains statistics of columns for all the records in 
the table. This enables fine grained file pruning for filters and join 
conditions in the query. The actual payload is an instance of 
[HoodieMetadataColumnStats][17]. 
+- **files** - Partition path to file name index. Key for the Hudi record is 
the partition path and the 
+actual record is a map of file name to an instance of 
[HoodieMetadataFileInfo][15]. HoodieMetadataFileInfo has 
+fields `size` and `isDeleted` which provide information about size of the file 
and whether file has been deleted. 
+The files index can be used to do file listing and do filter based pruning of 
the scanset during query.
+- **bloom\_filters** - Bloom filter index to help map a record key to the 
actual file. The Hudi key is 
+`str_concat(hash(partition name), hash(file name))` and the actual payload is 
an instance of 
+[HudiMetadataBloomFilter][16]. HudiMetadataBloomFilter has fields `type`(type 
code of the bloom filter), 
+`timestamp`(timestamp when the bloom filter was created/updated), 
`bloomFilter`(the actual bloom filter for 
+the data file) and `isDeleted`(whether the bloom filter entr is valid). Bloom 
filter is used to accelerate 
+'presence checks' validating whether particular record is present in the file, 
which is used during merging, 
+hash-based joins, point-lookup queries, etc.
+- **column\_stats** - contains statistics of columns for all the records in 
the table. This enables fine 
+grained file pruning for filters and join conditions in the query. The actual 
payload is an instance of 
+[HoodieMetadataColumnStats][17].  
+HoodieMetadataColumnStats has fields `fileName`(file name for which the 
+column stat applies), `columnName`(column name for which the column stat 
apples), `minValue`(minimum value 
+of the column in the file), `maxValue`(maximum value of the column in the 
file), `valueCount`(total count of 
+values), `nullCount`(total count of null values), `totalSize`(total storage 
size on disk), `totalUncompressedSize`
+(total uncompressed storage size on disk) and `isDeleted`(whether the column 
stat entry is valid).
+- **record\_index** - contains information about record keys and their 
location in the dataset. This improves 
+performance of updates since it provides file locations for the updated 
records and also enables fine grained 
+file pruning for filters and join conditions in the query. The payload is an 
instance of 
+[HoodieRecordIndexInfo][18].  
+HoodieRecordIndexInfo has fields `partitionName`(partition name to which the 

Review Comment:
   Same comment as above.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
