Thanks for following up, Russ, and glad you're making progress.

On Oct 13, 2016 5:36 PM, "Russell Bateman" <[email protected]> wrote:
> As promised, and as embarrassing as it seems now, I'm reporting what
> happened...
>
> It appears that one of our IT guys failed to type "G" when he created the
> swap partition on this staging server, and it ended up sized at 128M
> instead of 128G. (Fortunately, it's not a production server, and I think
> we have safeties in place to guard against similarly screwing up those
> installations.)
>
> This turned up visibly using htop. Unfortunately, though htop was an
> early tool in our quest for what wasn't right, we were thinking it was
> something in NiFi, one of our processors in the flow, etc., concentrating
> on that angle, and we weren't looking at the top section of htop's output
> until after poring through NiFi logs and eliminating all other suspicions.
>
> Live and learn.
>
> Russ
>
> On 10/05/2016 06:21 PM, Andrew Grande wrote:
>
>> Just a sanity check: was the number of open file handles increased as per
>> the quickstart document? You might need many more for your flow.
>>
>> Another tip: when your server experiences undesired hiccups like that,
>> try running 'nifi.sh dump save-in-this-file.txt' and investigate/share
>> where NiFi threads are being held back.
>>
>> Andrew
>>
>> On Tue, Oct 4, 2016, 10:54 AM Russell Bateman <[email protected]> wrote:
>>
>>> We use the templating to create FHIR XML, in this case, a
>>>
>>>     <Binary>
>>>       ...
>>>       <content value="$flowfile_contents" />
>>>     </Binary>
>>>
>>> construct that includes a base-64 encoding of a PDF, the flowfile
>>> contents coming into the templating processor. These can get to be
>>> megabytes in size, though our sample data was just under 1 MB.
>>>
>>> Yesterday, I built a new, reduced flow restricting the use of my
>>> VelocityTemplating processor to perform only the part of that task
>>> that I suspected was taking so much time, that is, copying the
>>> base-64 data into the template in place of the VTL macro.
>>> However, I could not reproduce the problem, though I did this on the
>>> very production server (actually, more of a staging server, but it was
>>> the very server where the trouble was detected in the first place).
>>>
>>> Predictably (that is, if, like me, you believe Murphy reigns supreme
>>> in this universe), the action using the very files in question took
>>> virtually no time at all, just as had been my experience running on my
>>> local development host. I then slightly expanded the new flow to take
>>> in some of the other trappings of the original one (but it was the
>>> templating that was reported as being the bottleneck--minutes to fill
>>> out the template instead of milliseconds). In short, I could not
>>> replicate the problem. True, the moon is in a different phase than
>>> late last week when this was reported.
>>>
>>> For the benefit of the community, I will come back here and report if
>>> and when we stumble upon this again, it recurs, and/or we take a
>>> decision about anything. At present, we're looking to force
>>> re-ingestion of the run, using the original flow design, including the
>>> documents that reportedly experienced this trouble, to see if it
>>> happens yet again.
>>>
>>> In the meantime, I can say:
>>>
>>> - I keep no state in this processor (indeed, I try not to and don't
>>>   think I have anything stateful in any of our custom processors).
>>> - The server has some 40 cores and 128 GB RAM on 12 TB of disk:
>>>   dedicated hardware, CentOS 7, recently built and installed.
>>> - Reportedly, I learned, little else was going on on the server at
>>>   the same time, either in NiFi or elsewhere.
>>> - The NiFi heap is configured to be 12 GB.
>>> - I'm not so far along yet as to understand thread usage or garbage
>>>   collection state.
>>>
>>> Again, thanks for the suggestions from both of you.
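For a sense of the payload sizes involved in the templating step above, base-64 encoding inflates the PDF by a third before it is ever substituted into the template. A minimal stdlib-only sketch (the class name and the `String.replace` substitution are illustrative stand-ins, not the actual Velocity-based processor):

```java
import java.util.Base64;

public class TemplateSketch {
    // Stand-in for the VTL substitution step. The real processor uses
    // Apache Velocity; plain String.replace is used here only to show
    // the shape and cost of the operation.
    static String fill(String template, String payload) {
        return template.replace("$flowfile_contents", payload);
    }

    public static void main(String[] args) {
        // Pretend this is the PDF carried in the flowfile; the real
        // samples ran just under 1 MB.
        byte[] pdfBytes = new byte[900_000];

        // Base64 inflates the payload by 4/3: 900,000 bytes become
        // 1,200,000 characters before padding is even considered.
        String encoded = Base64.getEncoder().encodeToString(pdfBytes);

        String template =
            "<Binary>\n  <content value=\"$flowfile_contents\" />\n</Binary>";
        String filled = fill(template, encoded);

        System.out.println("encoded chars: " + encoded.length());
        System.out.println("filled chars:  " + filled.length());
    }
}
```

Even at these sizes, a single substitution is a memory copy and should take milliseconds, which is consistent with the local timings reported in the thread.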
>>>
>>> Russ
>>>
>>> On 10/03/2016 06:28 PM, Joe Witt wrote:
>>>
>>>> Russ,
>>>>
>>>> As Jeff points out, a lack of available threads could be a factor in
>>>> slower processing times, but this would manifest itself in your
>>>> seeing that the processor isn't running very often. If the process
>>>> itself, when executing, takes much longer than on the other box, then
>>>> it is probably best to look at some other culprits. To check this
>>>> out, you can view the status history and look at the average number
>>>> of tasks and average task time for this processor. Does it look right
>>>> to you in terms of how often it runs and how long it takes, and is
>>>> the amount of time it takes growing?
>>>>
>>>> If you find that the performance of this processor itself is slowing,
>>>> then consider a few things.
>>>> 1) Does it maintain some internal state, and if so, is the data
>>>> structure it is using efficient for lookups?
>>>> 2) How does your heap look? Is there a lot of garbage-collection
>>>> activity? Are there any full garbage collections, and if so, how
>>>> often? It should generally be the case in a well-configured and
>>>> well-designed system that full garbage collections never occur
>>>> (ever).
>>>> 3) Attaching a remote debugger and/or running profilers on it can be
>>>> really illuminating.
>>>>
>>>> Joe
>>>>
>>>> On Mon, Oct 3, 2016 at 11:26 AM, Jeff <[email protected]> wrote:
>>>>
>>>>> Russell,
>>>>>
>>>>> This sounds like it's an environmental issue. Are you able to see
>>>>> the heap usage on the production machine? Are there enough available
>>>>> threads to get the throughput you are observing when you run
>>>>> locally? Have you double-checked the scheduling tab in the processor
>>>>> config to make sure it is running as aggressively as it runs
>>>>> locally?
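Joe's garbage-collection question can also be answered from inside the JVM itself, without a profiler. A minimal sketch using the standard `java.lang.management` API (the class name here is illustrative; the MXBean calls are standard JDK):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCheck {
    public static void main(String[] args) {
        // One MXBean per collector (typically a young-generation and an
        // old-generation collector). A climbing count/time on the
        // old-gen collector points at the full-GC activity Joe warns about.
        for (GarbageCollectorMXBean gc
                : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(),
                    gc.getCollectionCount(),
                    gc.getCollectionTime());
        }
    }
}
```

Sampling this twice, a few minutes apart, shows whether collection counts and times are growing during the slow templating runs.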
>>>>>
>>>>> I have run into this sort of thing before, and it was because of
>>>>> flowfile congestion in other areas of the flow; there were no
>>>>> threads available for other processors to get through their own
>>>>> queues.
>>>>>
>>>>> Just trying to think through some of the obvious/high-level things
>>>>> that might be affecting your flow...
>>>>>
>>>>> - Jeff
>>>>>
>>>>> On Mon, Oct 3, 2016 at 9:43 AM Russell Bateman <[email protected]> wrote:
>>>>>
>>>>>> We use NiFi for an ETL feed. On one of the lines, we use a custom
>>>>>> processor, VelocityTemplating (which calls Apache Velocity), and it
>>>>>> works very well and indeed is imperceptibly fast when run locally
>>>>>> on the same data (template, VTL macros, substitution fodder).
>>>>>> However, in production it's another matter. What takes no time at
>>>>>> all in local runs takes minutes in that environment.
>>>>>>
>>>>>> I'm looking for suggestions as to a) why this might be and b) how
>>>>>> best to go about examining/debugging it. I think I will soon have
>>>>>> remote access to the production machine (a VPN must be set up).
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Russ
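Jeff's heap and thread-availability questions can likewise be snapshotted programmatically from within the JVM. A minimal stdlib-only sketch (illustrative class name, not part of NiFi), complementary to the `nifi.sh dump` output Andrew suggests:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

public class JvmSnapshot {
    public static void main(String[] args) {
        // Heap headroom: a 'used' figure pressing against 'max' would
        // explain the GC churn discussed elsewhere in this thread.
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();
        System.out.printf("heap used %d MB of %d MB max%n",
                heap.getUsed() >> 20, heap.getMax() >> 20);

        // Thread availability: a high live count with many BLOCKED
        // threads in the dump suggests the flowfile congestion Jeff
        // describes, where no threads are free for other processors.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("live threads: " + threads.getThreadCount());

        // Deadlocks show up here as a non-null array of thread IDs.
        long[] deadlocked = threads.findDeadlockedThreads();
        System.out.println("deadlocked: "
                + (deadlocked == null ? 0 : deadlocked.length));
    }
}
```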
