If you increase the concurrent tasks on PublishKafka, then you are
right that you could publish multiple records at the same time, but I
suspect that the overhead of doing the split will cancel out any gains
from publishing in parallel.

Assuming the flow file has a decent number of records (thousands),
you could do any of the following...

- Keep all the records in one flow file and use PublishKafkaRecord.
This will be the most efficient for NiFi in terms of I/O and heap usage,
but it sends only one record at a time to Kafka (see the sketch after this list)

- Split to one record per flow file. This is generally discouraged as it
puts significant stress on NiFi's repositories and heap, but it would let
you publish individual records in parallel once they reach PublishKafka

- Split into smaller batches. Say you start with 10k records in the
original flow file, split them into 5 flow files with 2k records each,
and then run PublishKafka with 5 concurrent tasks. You would have to
determine whether this actually works out better than the first option
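
For reference, the first option boils down to a pattern like the one below.
This is only a minimal sketch using the plain Kafka producer client and
newline-delimited records, with an assumed broker address, file name and
topic; the real PublishKafkaRecord processor uses NiFi's record readers and
writers and handles batching, transactions, error routing, etc.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class StreamRecordsToKafka {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            // Read one large input record-by-record and publish each record as
            // its own Kafka message; the input itself is never split into
            // smaller files, so memory use stays flat regardless of input size.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
                 BufferedReader reader = new BufferedReader(new FileReader("records.jsonl"))) {
                String record;
                while ((record = reader.readLine()) != null) {
                    producer.send(new ProducerRecord<>("my-topic", record));
                }
            }
        }
    }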

On Fri, Mar 1, 2019 at 12:47 PM Kumara M S, Hemantha (Nokia -
IN/Bangalore) <[email protected]> wrote:
>
> Thanks Bryan, I got your point. Yeah, we could try PublishKafkaRecord, as in 
> some other cases we have already used PublishKafkaRecord (CSV data to Avro) 
> to send out records.
> In the below-mentioned use case we thought of sending out a bunch of records 
> (as we are not doing anything with the data) in one shot instead of sending 
> one record at a time.
>
> Thanks,
> Hemantha
>
> -----Original Message-----
> From: Bryan Bende <[email protected]>
> Sent: Friday, March 1, 2019 7:52 PM
> To: [email protected]
> Subject: Re: SplitRecord behaviour
>
> Hello,
>
> Flow files are not transferred until the session they came from is committed. 
> So imagine we committed periodically and some of the splits were transferred, 
> and then halfway through a failure was encountered: the entire original flow 
> file would be reprocessed, producing some of the same splits that were already 
> sent out. The way it is implemented now, it is either completely successful or 
> not, but never partially successful and producing duplicates.
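>
> To make the all-or-nothing behaviour concrete, here is a minimal sketch of the
> session pattern a split-style processor follows (names are simplified and
> assumed; this is not the actual SplitRecord source):
>
>     import java.util.ArrayList;
>     import java.util.List;
>     import org.apache.nifi.flowfile.FlowFile;
>     import org.apache.nifi.processor.AbstractProcessor;
>     import org.apache.nifi.processor.ProcessContext;
>     import org.apache.nifi.processor.ProcessSession;
>     import org.apache.nifi.processor.Relationship;
>     import org.apache.nifi.processor.exception.ProcessException;
>
>     public class SplitSketch extends AbstractProcessor {
>         static final Relationship REL_SPLITS =
>                 new Relationship.Builder().name("splits").build();
>         static final Relationship REL_ORIGINAL =
>                 new Relationship.Builder().name("original").build();
>
>         @Override
>         public void onTrigger(ProcessContext context, ProcessSession session)
>                 throws ProcessException {
>             FlowFile original = session.get();
>             if (original == null) {
>                 return;
>             }
>             List<FlowFile> splits = new ArrayList<>();
>             // ... read records from 'original' and write each batch of records
>             // into a child flow file created with session.create(original) ...
>             for (FlowFile split : splits) {
>                 // queued inside the session, not yet visible to the next processor
>                 session.transfer(split, REL_SPLITS);
>             }
>             session.transfer(original, REL_ORIGINAL);
>             // the framework commits the session after onTrigger returns; only then
>             // do all the splits move downstream, and a failure before the commit
>             // rolls everything back so the original is reprocessed as a whole
>         }
>     }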
>
> Based on the description of your flow with the three processors you 
> mentioned, I wouldn't bother using SplitRecord; just have ListenHTTP
> -> PublishKafkaRecord. PublishKafkaRecord can be configured with the
> same reader and writer you were using in SplitRecord, and it will read each 
> record and send it to Kafka without having to produce unnecessary flow files.
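>
> For the JSON-in / XML-out case you described, a possible configuration sketch
> (controller service and property names are assumptions to verify against the
> docs for your NiFi version):
>
>     ListenHTTP -> PublishKafkaRecord
>       Record Reader : JsonTreeReader       (parses the incoming JSON records)
>       Record Writer : XMLRecordSetWriter   (writes each record out as XML)
>       Topic Name    : <your topic>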
>
> Thanks,
>
> Bryan
>
> On Fri, Mar 1, 2019 at 3:44 AM Kumara M S, Hemantha (Nokia -
> IN/Bangalore) <[email protected]> wrote:
> >
> > Hi All,
> >
> > We have a use case where we receive huge JSON (file size might vary from 1 GB 
> > to 50 GB) via HTTP, convert it to XML (the XML format is not fixed; any other 
> > format is fine), and send it out using Kafka. The restriction is that the CPU 
> > & RAM usage (once it is fixed, the flow should handle files of any size) 
> > should not change based on the incoming file size.
> >
> > We used ListenHTTP --> SplitRecord --> PublishKafka, but we have observed 
> > that SplitRecord sends data to PublishKafka only after the whole FlowFile has 
> > been processed. Is there any reason it was designed this way? Would it not be 
> > better to send splits to the next processor after each configured number of 
> > records instead of sending all splits in one shot?
> >
> >
> > Regards,
> > Hemantha
> >
