KarthickAN opened a new issue #2178:
URL: https://github.com/apache/hudi/issues/2178


   Hi,
      I tried inspecting the Parquet files produced by Hudi using parquet-tools. 
Each Parquet file produced by Hudi contains around 10 MB of data for the field 
**extra: org.apache.hudi.bloomfilter**, while the actual data is only in the 
KBs. As per the docs, every 50000 bloom entries should take about 4 KB. Is 
this expected behavior, or am I missing something here? Below are the configs 
I am currently using.
   
   
   SmallFileSize = 104857600
   MaxFileSize = 125829120
   RecordSize = 35
   CompressionRatio = 5
   InsertSplitSize = 3500000
   IndexBloomNumEntries = 1500000
   KeyGenClass = org.apache.hudi.keygen.ComplexKeyGenerator
   RecordKeyFields = sourceid,sourceassetid,sourceeventid,value,timestamp
   TableType = COPY_ON_WRITE
   PartitionPathFields = date,sourceid
   HiveStylePartitioning = True
   WriteOperation = insert
   CompressionCodec = snappy
   CommitsRetained = 1
   CombineBeforeInsert = True
   PrecombineField = timestamp
   InsertDropDuplicates = True
   InsertShuffleParallelism = 100 
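   For what it's worth, the ~10 MB footprint may simply follow from the 
standard bloom filter sizing formula applied to `IndexBloomNumEntries = 1500000`. 
A minimal sketch below, assuming the usual formula m = -n * ln(p) / (ln 2)^2 
and a false-positive probability of 1e-9 (an assumption based on Hudi's 
documented default for `hoodie.index.bloom.fpp`; it is not set in the configs 
above):

   ```python
   import math

   def bloom_filter_size_bytes(num_entries: int, fpp: float) -> int:
       """Approximate size in bytes of a bloom filter sized for
       num_entries keys at false-positive probability fpp."""
       # Standard sizing formula: m = -n * ln(p) / (ln 2)^2 bits
       bits = -num_entries * math.log(fpp) / (math.log(2) ** 2)
       return math.ceil(bits / 8)

   # With the IndexBloomNumEntries from this issue and an assumed fpp of 1e-9
   size = bloom_filter_size_bytes(1_500_000, 1e-9)
   print(f"{size / (1024 * 1024):.1f} MiB")
   ```

   With these inputs the raw filter alone comes out to several MiB, so a 
per-file footer entry in the 10 MB range is plausible before any 
serialization overhead is even counted.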
   
   **Environment Description**
   
   Hudi version : 0.6.0
   
   Spark version : 2.4.3
   
   Hadoop version : 2.8.5-amzn-1
   
   Storage (HDFS/S3/GCS..) : S3
   
   Running on Docker? (yes/no) : No. Running on AWS Glue


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
