KarthickAN opened a new issue #2178:
URL: https://github.com/apache/hudi/issues/2178
Hi,
I tried inspecting the parquet files produced by Hudi using parquet-tools. Each parquet file contains around 10MB of data for the footer field **extra: org.apache.hudi.bloomfilter**, even though the actual row data is only in the KB range. As per the docs, every 50000 bloom entries should take about 4KB. Is this expected behavior, or am I missing something here? Below are the configs I am currently using.
SmallFileSize = 104857600
MaxFileSize = 125829120
RecordSize = 35
CompressionRatio = 5
InsertSplitSize = 3500000
IndexBloomNumEntries = 1500000
KeyGenClass = org.apache.hudi.keygen.ComplexKeyGenerator
RecordKeyFields = sourceid,sourceassetid,sourceeventid,value,timestamp
TableType = COPY_ON_WRITE
PartitionPathFields = date,sourceid
HiveStylePartitioning = True
WriteOperation = insert
CompressionCodec = snappy
CommitsRetained = 1
CombineBeforeInsert = True
PrecombineField = timestamp
InsertDropDuplicates = True
InsertShuffleParallelism = 100
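For what it's worth, the footer size can be sanity-checked with the standard Bloom filter sizing formula. This is a rough sketch, assuming Hudi's documented default false-positive rate (`hoodie.index.bloom.fpp` = 1e-9) applies, since no FPP is set in the configs above:

```python
import math

def bloom_filter_bytes(num_entries: int, fpp: float) -> int:
    """Optimal Bloom filter size in bytes: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -num_entries * math.log(fpp) / (math.log(2) ** 2)
    return math.ceil(bits / 8)

# With IndexBloomNumEntries = 1500000 and the assumed default fpp = 1e-9:
size = bloom_filter_bytes(1_500_000, 1e-9)
print(f"{size / (1024 * 1024):.1f} MiB")  # roughly 7-8 MiB
```

Under these assumptions the filter alone comes to roughly 8MB before any serialization overhead, which is in the same ballpark as the observed ~10MB, so sizing the filter for 1,500,000 entries instead of the default 60,000 would account for the growth.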
**Environment Description**
Hudi version : 0.6.0
Spark version : 2.4.3
Hadoop version : 2.8.5-amzn-1
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : No. Running on AWS Glue
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]