Need help with NiFi use-case on Partition and merge

Anshuman Ghosh Thu, 04 May 2017 02:46:51 -0700

Hello everyone,

Thank you for the continued support for NiFi.
As we are planning to move a couple of our dataflows to production, we need
some help with best practices.


This is what the flow looks like:

   1. NiFi-Kafka consumer to get data from Kafka cluster
   2. Convert data from JSON to Avro
   3. Merge Avro FlowFiles to around 60 MB size
   4. Store 60 MB FlowFile in Google Cloud Storage
   5. We use an expression to date-partition the data while storing:
      trackerClicks/${now():format('yyyy')}/${now():format('MM')}/${now():format('dd')}/${filename}
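To make the resulting layout concrete, the expression above can be mirrored in a few lines of Python. This is only a sketch, not NiFi code: `partitioned_path` and its parameters are hypothetical names, and it assumes the partition comes from the wall-clock time at write time, which is what `now()` yields.

```python
from datetime import datetime


def partitioned_path(filename, now=None, prefix="trackerClicks"):
    """Build prefix/yyyy/MM/dd/filename from a single timestamp.

    Hypothetical helper that mirrors the NiFi expression for illustration.
    """
    # Take the clock reading once so year, month and day always agree.
    now = now or datetime.now()
    return f"{prefix}/{now:%Y}/{now:%m}/{now:%d}/{filename}"


print(partitioned_path("part-0001.avro", datetime(2017, 5, 4)))
# trackerClicks/2017/05/04/part-0001.avro
```

Because the date is the storage time rather than a time taken from the data, a FlowFile merged shortly before midnight and written shortly after would land in the next day's directory.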


Regarding this, we have a few questions:

   1. Can we access a column from the Avro schema to partition the data?
   2. I have used the *Kite SDK to create a HIVE dataset* with partitions (Parquet
   format and SNAPPY compression) and the "StoreInKiteDataset" processor to write
   to it. However, if the data is already merged and we use a partition
   strategy based on a column, writing is very slow, and the stored output
   ends up unmerged (small files), for obvious reasons. Do you suggest writing
   to partitions without merging the data and running *Kite Compaction* later?
   3. If we store data on GCS with normal date partitioning and create a HIVE
   external table on top of it, is there a way to refresh the metadata when
   a new partition is added? Or do we need to run "MSCK REPAIR ..." through
   some script?
   4. Lastly, for the Kafka consumer, is there a log feed or something similar
   that our monitoring tool can use to show the offsets consumed, the lag,
   etc.?
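To illustrate what questions 1 and 3 are after, here is a rough Python sketch of deriving the partition from a record column and building the Hive statement that would register it. Everything named here is an assumption for illustration: the `event_ts` field (epoch milliseconds), the `tracker_clicks` table, the `year`/`month`/`day` partition columns, and the `gs://my-bucket/trackerClicks` location are all hypothetical.

```python
from datetime import datetime, timezone


def partition_from_record(record, ts_field="event_ts"):
    """Return (year, month, day) strings from an epoch-millisecond field.

    `record` is an already-decoded Avro record represented as a dict;
    the field name is a hypothetical example.
    """
    ts = datetime.fromtimestamp(record[ts_field] / 1000, tz=timezone.utc)
    return f"{ts:%Y}", f"{ts:%m}", f"{ts:%d}"


def add_partition_sql(table, base_uri, year, month, day):
    """Build a HiveQL statement that registers one date partition."""
    location = f"{base_uri}/{year}/{month}/{day}"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{year}', month='{month}', day='{day}') "
        f"LOCATION '{location}'"
    )


year, month, day = partition_from_record({"event_ts": 1493899200000})
print(add_partition_sql("tracker_clicks", "gs://my-bucket/trackerClicks",
                        year, month, day))
```

An explicit `ADD PARTITION` with a `LOCATION` is used here because, as far as I understand, "MSCK REPAIR TABLE" only discovers directories named in the `key=value` style, which the plain `yyyy/MM/dd` layout above does not follow.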


We would really appreciate any answers to these questions.



Thank you!

______________________

*Kind Regards,*
*Anshuman Ghosh*
*Contact - +49 179 9090964*
