I think that is at least partly doable within NiFi (e.g., yes, you can
restrict processors to the primary node in a cluster), but I would
recommend you consider a different approach for NiFi. Unlike Spark,
NiFi is perfectly content to stream in huge amounts of data
continuously without any temporary storage besides its repositories
(FlowFile, content, etc.). Therefore, I think a potentially easier
solution would be for your team to explore creating a connector
between NiFi and Azure Data Explorer that simply lets NiFi firehose
data into ADX as it comes in, letting the chips fall where they may.
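
On the primary-node point: if you do end up needing that behavior, here
is a rough, from-memory sketch of what it looks like in a custom
processor. The class name is made up, and the annotations assume a
reasonably recent 1.x NiFi, so double-check them against the version
you target; the same effect is also available by setting Execution to
"Primary node" in the processor's scheduling tab.

import org.apache.nifi.annotation.behavior.PrimaryNodeOnly;
import org.apache.nifi.annotation.behavior.TriggerSerially;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.exception.ProcessException;

// Hypothetical processor name; the annotations tell the framework to
// schedule this processor only on the elected primary node, one thread
// at a time.
@PrimaryNodeOnly
@TriggerSerially
public class AdxCoordinatorProcessor extends AbstractProcessor {

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session)
            throws ProcessException {
        // This only ever runs on the current primary node, so "driver-like"
        // coordination work (aggregating status, deciding to commit or abort)
        // can live here.
    }
}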

You might find some useful concepts in the Kafka processors for things
like managing a continuous flow of record data from a stream and
converting it to blocks of NiFi record data (see the Record API in our
documentation for details).
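
For flavor, here is a very rough sketch of the record-reading side of
that pattern inside a processor's onTrigger. The processor, property,
and relationship names are purely illustrative, and the sketch is
written from memory against the 1.x Record API, so verify the exact
signatures in the javadocs before leaning on it:

import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.Set;

import org.apache.nifi.components.PropertyDescriptor;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.serialization.RecordReader;
import org.apache.nifi.serialization.RecordReaderFactory;
import org.apache.nifi.serialization.record.Record;

// Illustrative processor showing the usual Record Reader pattern: a
// controller service turns the incoming byte stream into Records, which
// the processor can batch into whatever block size the downstream sink
// (ADX in your case) wants.
public class RecordStreamingSketch extends AbstractProcessor {

    static final PropertyDescriptor RECORD_READER = new PropertyDescriptor.Builder()
            .name("record-reader")
            .displayName("Record Reader")
            .identifiesControllerService(RecordReaderFactory.class)
            .required(true)
            .build();

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .build();

    @Override
    protected List<PropertyDescriptor> getSupportedPropertyDescriptors() {
        return List.of(RECORD_READER);
    }

    @Override
    public Set<Relationship> getRelationships() {
        return Set.of(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session)
            throws ProcessException {
        final FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }

        final RecordReaderFactory readerFactory =
                context.getProperty(RECORD_READER).asControllerService(RecordReaderFactory.class);

        session.read(flowFile, (final InputStream in) -> {
            try (RecordReader reader = readerFactory.createRecordReader(flowFile, in, getLogger())) {
                Record record;
                while ((record = reader.nextRecord()) != null) {
                    // Accumulate records into blocks and hand each block to the
                    // ingest client; no staging copy of the whole data set needed.
                }
            } catch (final Exception e) {
                throw new IOException("Failed to read records", e);
            }
        });

        session.transfer(flowFile, REL_SUCCESS);
    }
}

The point is that records flow through in blocks, so the processor never
needs a staging copy of the whole data set.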

FWIW, I manage a data set in AWS w/ NiFi that is over 50 TB compressed,
and a fairly generic 5-node NiFi cluster crushes that data on cheap
EC2 instances without issue. So if scale is what you're worried about,
handling TBs of data is something NiFi does fairly well out of the box.

On Sat, Oct 1, 2022 at 12:05 AM Tanmaya Panda
<tanmayapa...@microsoft.com.invalid> wrote:
>
> Hi Team,
>
> We at Microsoft open source are developing a custom Azure Data Explorer 
> sink connector for Apache NiFi. What we want to achieve is transactionality 
> with data ingestion. The source of the processor can be TBs of telemetry data 
> as well as CDC logs. What that means is that, in case of any failure while 
> ingesting the data of a particular partition to ADX, we will delete/clean up 
> any data already ingested for the other partitions. Since Azure Data Explorer 
> is an append-only database, unfortunately we can't perform deletes on data 
> already ingested into the same or another partition. So to achieve this kind 
> of transactionality for large ingests, we are thinking of implementing 
> something similar to what we have done for the Apache Spark ADX connector. We 
> will be creating temporary tables inside Azure Data Explorer before ingesting 
> into the actual tables. The worker nodes in Apache Spark create these 
> temporary tables and report the ingestion status to the driver node. The 
> driver node, on receiving a success status from all the worker nodes, 
> performs the ingestion into the actual table; otherwise the ingestion is 
> aborted and the temporary tables are cleaned up. So basically, we aggregate 
> the worker node task statuses on the driver node in Spark to decide whether 
> or not to ingest the data into the ADX table.
> Question 1: Is this possible in Apache NiFi, which follows a zero-master 
> cluster strategy as opposed to the master-slave architecture of Apache Spark?
> Question 2: In our custom NiFi processor, is it possible to run custom code 
> for a particular partition on, say, the cluster coordinator node? Also, is it 
> possible to get the details of the partition inside the processor?
> Question 3: Is it possible to get the details of the tasks executed on the 
> various partitions and take decisions based on their task status? Can all of 
> this be done inside the same processor?
>
> Thanks,
> Tanmaya
