If you increase the concurrent tasks on PublishKafka, then you are right that you could publish multiple records at the same time, but I suspect the overhead of doing the split would cancel out any gains from publishing in parallel.
Assuming the flow file has a decent number of records (thousands), then you
could do any of the following...

- Keep all the records in one flow file and use PublishKafkaRecord. This is
the most efficient option for NiFi in terms of I/O and heap usage, but it only
sends one record at a time to Kafka.
- Split to one record per flow file. This is generally discouraged as it puts
significant stress on NiFi's repositories and heap, but it could publish
individual records in parallel once they reach PublishKafka.
- Split to smaller batches. Say you start with 10k records in the original
flow file, split to 5 flow files with 2k records each, and run PublishKafka
with 5 concurrent tasks. You would have to determine whether this actually
works out better than the first option.

On Fri, Mar 1, 2019 at 12:47 PM Kumara M S, Hemantha (Nokia - IN/Bangalore)
<[email protected]> wrote:
>
> Thanks Bryan, I got your point. Yes, we could try PublishKafkaRecord; in
> some other cases we have already used PublishKafkaRecord (CSV data to Avro)
> to send out records.
> In the use case mentioned below we thought of sending out a bunch of records
> (as we are not doing anything with the data) in one shot instead of sending
> one record at a time.
>
> Thanks,
> Hemantha
>
> -----Original Message-----
> From: Bryan Bende <[email protected]>
> Sent: Friday, March 1, 2019 7:52 PM
> To: [email protected]
> Subject: Re: SplitRecord behaviour
>
> Hello,
>
> Flow files are not transferred until the session they came from is committed.
> So imagine we committed periodically and some of the splits were transferred,
> and then halfway through a failure was encountered: the entire original flow
> file would be reprocessed, producing some of the same splits that were
> already sent out. The way it is implemented now, it is either completely
> successful or not, but never partially successful and producing duplicates.
>
> Based on the description of your flow with the three processors you
> mentioned, I wouldn't bother using SplitRecord, just have ListenHTTP
> -> PublishKafkaRecord. PublishKafkaRecord can be configured with the
> same reader and writer you were using in SplitRecord, and it will read each
> record and send it to Kafka, without having to produce unnecessary flow files.
>
> Thanks,
>
> Bryan
>
> On Fri, Mar 1, 2019 at 3:44 AM Kumara M S, Hemantha (Nokia -
> IN/Bangalore) <[email protected]> wrote:
> >
> > Hi All,
> >
> > We have a use case where we receive huge JSON (file size might vary from
> > 1GB to 50GB) via HTTP, convert it to XML (the XML format is not fixed;
> > any other format is fine) and send it out using Kafka. The restriction
> > here is that the CPU & RAM usage, once fixed, should not change based on
> > incoming file size; it should handle files of all sizes.
> >
> > We used ListenHTTP --> SplitRecord --> PublishKafka, but we have observed
> > that SplitRecord sends data to PublishKafka only after the whole flow
> > file has been processed. Is there any reason it was designed this way?
> > Would it not be better to send splits to the next processor after each
> > configured number of records instead of sending all splits in one shot?
> >
> >
> > Regards,
> > Hemantha
> >
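P.S. For anyone wondering why the splits only show up downstream after the
whole flow file has been processed, here is a rough sketch of the session
semantics involved. This is not the actual SplitRecord source; the REL_SPLITS
relationship and the readRecords() helper are simplified stand-ins for the
real relationship and the configured record reader. The point is just that
flow files transferred within a ProcessSession are not visible to the next
processor until the framework commits the session, which only happens after
onTrigger() returns successfully.

// Rough sketch only -- NOT the real SplitRecord. REL_SPLITS and readRecords()
// are simplified stand-ins used purely for illustration.
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitSketch extends AbstractProcessor {

    static final Relationship REL_SPLITS = new Relationship.Builder()
            .name("splits")
            .description("One flow file per record")
            .build();

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) {
        final FlowFile original = session.get();
        if (original == null) {
            return;
        }
        try {
            for (final byte[] record : readRecords(session, original)) {
                FlowFile split = session.create(original);               // child flow file
                split = session.write(split, out -> out.write(record));  // write one record
                session.transfer(split, REL_SPLITS);                     // queued, but NOT yet visible downstream
            }
            session.remove(original);
        } catch (final Exception e) {
            session.rollback();  // discards every queued split; the original goes back to the input queue
            throw new ProcessException("Split failed", e);
        }
        // The framework commits the session only after onTrigger returns without
        // error, so the next processor sees either all of the splits at once, or
        // none of them.
    }

    // Stand-in for the configured record reader: treat each line as a record.
    private List<byte[]> readRecords(final ProcessSession session, final FlowFile flowFile) {
        final List<byte[]> records = new ArrayList<>();
        session.read(flowFile, in -> {
            final BufferedReader reader =
                    new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                records.add(line.getBytes(StandardCharsets.UTF_8));
            }
        });
        return records;
    }
}

That all-or-nothing commit is also why a failure halfway through doesn't leave
partial output behind: the session rolls back, the original flow file is
reprocessed from scratch, and no duplicate splits are sent downstream.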
