Hello everyone,
Thank you for the continued support for NiFi.
As we are planning to move a couple of our DataFlows to production, we
need some help with best practices.
This is what the flow looks like:
1. NiFi-Kafka consumer to get data from Kafka cluster
2. Convert data from JSON to Avro
3. Merge Avro FlowFiles to around 60 MB size
4. Store 60 MB FlowFile in Google Cloud Storage
5. We use the following expression to date-partition the data while
storing it:
trackerClicks/${now():format('yyyy')}/${now():format('MM')}/${now():format('dd')}/${filename}
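To spell out the layout that expression produces, here is a small Python
sketch of the same path. The function name and the sample filename are only
for illustration; NiFi evaluates the expression itself, and now() there is
the time at which the processor runs, not a timestamp taken from the data.

```python
from datetime import datetime


def partitioned_key(filename: str, now: datetime) -> str:
    """Return the object key in the trackerClicks/yyyy/MM/dd/filename layout."""
    # %Y, %m and %d correspond to 'yyyy', 'MM' and 'dd' in the expression.
    return f"trackerClicks/{now:%Y}/{now:%m}/{now:%d}/{filename}"


if __name__ == "__main__":
    # Hypothetical merged FlowFile name, written shortly before midnight.
    print(partitioned_key("part-0001.avro", datetime(2017, 3, 5, 23, 59)))
```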
Regarding this, we have a few questions:
1. Can we access a column from the Avro schema to partition the data?
2. I have used the *Kite SDK to create a HIVE dataset* with partitions
(Parquet format and SNAPPY compression) and the "StoreInKiteDataset"
processor to write to it. However, if the data is already merged and we
use a partition strategy based on a column, writing is very slow, and in
the end the stored data is not merged (many small files), for obvious
reasons. Do you suggest writing to partitions without merging the data
and using *Kite Compaction* later?
3. If we store data on GCS with normal date partitioning and create a HIVE
external table on top of it, is there a way to refresh the metadata when
a new partition is added? Or do we need to run "MSCK REPAIR ..." through
some script?
4. Lastly, for the Kafka consumer, is there any log feed or similar that
our monitoring tool can read in order to show the offsets consumed, the
lag, etc.?
We would really appreciate any answers.
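To make question 3 concrete, this is roughly the kind of script we have in
mind as the alternative to "MSCK REPAIR ...": register each new day's folder
explicitly. It is only a sketch. The table name, bucket name, and the
partition columns (year, month, day) are placeholders, and the function only
builds the HiveQL statement; running it against Hive is left to the caller.

```python
from datetime import date


def add_partition_statement(table: str, bucket: str, day: date) -> str:
    """Build an ALTER TABLE ... ADD PARTITION statement for one day's folder."""
    # Same trackerClicks/yyyy/MM/dd layout as the storage expression above.
    location = f"gs://{bucket}/trackerClicks/{day:%Y/%m/%d}"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year={day.year}, month={day.month}, day={day.day}) "
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical table and bucket names.
    print(add_partition_statement("tracker_clicks", "my-bucket", date(2017, 3, 5)))
```

IF NOT EXISTS keeps the statement safe to re-run, so the script could be
scheduled without tracking which days were already registered.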
Thank you!
______________________
*Kind Regards,*
*Anshuman Ghosh*
*Contact - +49 179 9090964*