[GitHub] nifi-minifi pull request: MINIFI-5: Base MiNiFi executable

2016-04-05 Thread apiri
GitHub user apiri opened a pull request:

https://github.com/apache/nifi-minifi/pull/5

MINIFI-5: Base MiNiFi executable

MINIFI-5:  Creating a base MiNiFi project to serve as a basis for further 
extension and design reusing NiFi libraries

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apiri/nifi-minifi minifi-5

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi-minifi/pull/5.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5


commit e94b6dae3dcb18bd60c9234de62543c897e83146
Author: Aldrin Piri 
Date:   2016-04-06T01:19:57Z

MINIFI-5:  Creating a base MiNiFi project to serve as a basis for further 
extension and design reusing NiFi libraries




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi pull request: NIFI-1582 added state to UpdateAttribute as wel...

2016-04-05 Thread mattyb149
Github user mattyb149 commented on the pull request:

https://github.com/apache/nifi/pull/319#issuecomment-206094743
  
I think you can re-push the PR (no new commits) and it might pick that up 
and rebuild.




[GitHub] nifi pull request: NIFI-1582 added state to UpdateAttribute as wel...

2016-04-05 Thread JPercivall
Github user JPercivall commented on the pull request:

https://github.com/apache/nifi/pull/319#issuecomment-206094554
  
Not sure why it failed on something unrelated on Travis; it builds locally. 
Also, I don't believe there is an easy way to rebuild the PR in Travis.




Re: Re: Re: Filtering large CSV files

2016-04-05 Thread Dmitry Goldenberg
Uwe,

The Velocity-based transformer sounds like a cool feature.  As for the
splitter, I'm not quite grokking why it treats its input as a single row to
split.  Shouldn't the input be a full CSV which you'd want to split?  I guess
you already have a splitter, perhaps based on SplitText.  What I want to do
is implement a SplitCSV (and GetCSV) which uses OpenCSV to split a full CSV
into individual rows.

- Dmitry
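The distinction between line splitting and CSV-aware splitting matters because a quoted cell can contain embedded newlines. Below is a minimal sketch of the difference, using Python's standard `csv` module as a stand-in for OpenCSV (illustration only, not NiFi processor code):

```python
import csv
import io

data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Naive line-based splitting (roughly what SplitText does):
naive_rows = data.strip().split("\n")
# The quoted cell spanning two lines is torn into two bogus "records".
print(len(naive_rows))  # 4 physical lines, but only 3 logical records

# Quote-aware CSV parsing keeps the embedded newline inside the cell:
rows = list(csv.reader(io.StringIO(data)))
print(len(rows))        # 3
print(rows[1][1])       # the cell keeps its embedded newline intact
```

Line splitting produces four fragments from three logical records, while the quote-aware parser yields exactly three rows, with the multi-line cell preserved as a single value.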



[GitHub] nifi pull request: Second update getting-started.adoc

2016-04-05 Thread andrewmlim
GitHub user andrewmlim opened a pull request:

https://github.com/apache/nifi/pull/329

Second update getting-started.adoc

Corrected button, menu item and icon inconsistencies/errors.  Fixed 
bulleted list formatting error in "Working with Templates" section.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/andrewmlim/nifi patch-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi/pull/329.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #329


commit 372f6476b037ed401fde05adf45a21999b773070
Author: Andrew Lim 
Date:   2016-04-05T20:34:21Z

Update getting-started.adoc

Corrected button, menu item and icon inconsistencies/errors.  Fixed 
bulleted list formatting error in "Working with Templates" section.






[GitHub] nifi pull request: NIFI-1690 Changed MonitorMemory to use allowabl...

2016-04-05 Thread olegz
GitHub user olegz opened a pull request:

https://github.com/apache/nifi/pull/328

NIFI-1690 Changed MonitorMemory to use allowable values for pool names

- removed dead code from MonitorMemory
- added MonitorMemoryTest
- minor refactoring in MonitorMemory
- initial fix for NIFI-1731 (WARN logging) that was required by 
MonitorMemoryTest

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/olegz/nifi NIFI-1690

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi/pull/328.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #328


commit 837e17547c40005bf867ffccfd211d2a8ef8ffed
Author: Oleg Zhurakousky 
Date:   2016-04-05T18:24:46Z

NIFI-1690 Changed MonitorMemory to use allowable values for pool names
- removed dead code from MonitorMemory
- added MonitorMemoryTest
- minor refactoring in MonitorMemory
- initial fix for NIFI-1731 (WARN logging) that was required by 
MonitorMemoryTest






Aw: Re: Re: Filtering large CSV files

2016-04-05 Thread Uwe Geercken
Dmitry,

what I have is at the moment this:

https://github.com/uwegeercken/nifi_processors

Two processors: one that splits one CSV row and assigns the values to flowfile 
attributes. And one that merges the attributes with a template (apache 
velocity) to produce a different output.

I wanted to start with opencsv but ran into problems and got no time afterwards.

Rgds,

Uwe
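
Uwe's two-processor pattern (split a CSV row into flowfile attributes, then merge them with a template) can be sketched compactly. Velocity is a Java templating library; the Python snippet below uses `string.Template` purely to illustrate the merge step, and the attribute names are invented for the example:

```python
import csv
import io
from string import Template

# Split one CSV row and assign the values to named "attributes",
# as the splitter processor does.
row = next(csv.reader(io.StringIO("Smith,John,42\n")))
attributes = {"lastname": row[0], "firstname": row[1], "age": row[2]}

# Merge the attributes with a template to produce a different output format.
template = Template('{"name": "$firstname $lastname", "age": $age}')
print(template.substitute(attributes))
# {"name": "John Smith", "age": 42}
```

The same two steps (row to attributes, attributes into template) map onto the two processors described above.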



Re: Re: Filtering large CSV files

2016-04-05 Thread Dmitry Goldenberg
Hi Uwe,

Yes, that is what I was thinking of using for the CSV processor.  Will you
be committing your version?

- Dmitry



Re: [DISCUSS] git branching model

2016-04-05 Thread Andy LoPresto
I wrote up a quick guide for reviewers to apply the patches to multiple support 
branches on the wiki [1].

[1] 
https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide#ContributorGuide-Stepstomerge/closepullrequestswithtwomainbranches
 


Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Apr 4, 2016, at 4:01 PM, Sean Busbey  wrote:
> 
> On Mon, Apr 4, 2016 at 4:40 PM, Adam Lamar  wrote:
>> On Mon, Apr 4, 2016 at 11:14 AM, Sean Busbey  wrote:
>>> You're correct, a github PR only targets a single branch and Travis-CI
>>> only checks how the PR does at its own commit hash (that is, it
>>> doesn't even check what the target branch would look like post-merge).
>>> 
>> 
>> Hey Sean, it looks like Travis-CI does merge into the specified merge
>> branch before running. For example, my ListS3 PR [1] has commit
>> 2f7e89e, but the Travis-CI status page [1] shows commit e0868c2. If
>> you find that in github [2] you'll see it is the merge of 2f7e89e and
>> 6f5fb59, where 6f5fb59 is the location of master when I submitted the
>> PR.
>> 
> 
> Excellent. I am very happy to find that I'm incorrect. I must just be
> used to using Travis-CI with very old PRs. :)





[GitHub] nifi pull request: Update getting-started.adoc

2016-04-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nifi/pull/327


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: Question concerning jetty server timeout in 0.5.1 HandleHttpRequest

2016-04-05 Thread Mark Payne
Luke,

I have not yet heard of anyone else running into this, personally. Typically, 
when I've used these processors, I am expecting sub-second response times, not 
30+ second response times. Your use case, though, is perfectly valid - just not 
something I've ever run into myself.

I have submitted a new JIRA [1] to address this issue.

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-1732 



> On Apr 1, 2016, at 12:32 PM, Stephen Coder - US  wrote:
> 
> Hi Nifi Team,
> 
> 
> I've been experimenting with the HandleHttpRequest/Response processors in 
> Nifi 0.5.1, and have noticed an issue that I've not been able to resolve. I'm 
> hoping that I'm simply missing a configuration item, but I've been unable to 
> find the solution.
> 
> 
> The scenario is this: HandleHttpRequest --> Long Processing (> 30 seconds) 
> --> HandleHttpResponse. It appears that the Jetty server backing 
> HandleHttpRequest has a built-in idle timeout of 30,000 ms (see the 
> _idleTimeout value in 
> jetty-server/src/main/java/org/eclipse/jetty/server/AbstractConnector.java). 
> In my test flow, 30 seconds after an HTTP request comes in, a second request 
> comes into the flow. It has the same information, except that the 
> http.context.identifier and the FlowFile UUID have changed, and the 
> http.dispatcher.type has changed from REQUEST to ERROR. From my online 
> research 
> (http://stackoverflow.com/questions/30786939/jetty-replay-request-on-timeout?),
>  this re-request with a type of ERROR comes in after Jetty determines that a 
> request has timed out.
> 
> 
> This would not normally be a big deal. I was able to RouteOnAttribute and 
> capture all ERROR requests without responding. However, those requests are 
> never cleared from the StandardHttpContextMap. I've tested this by setting 
> the number of requests allowed by the StandardHttpContextMap to 4 and running 
> 4 of my long Request/Response tests. Each request is eventually responded to 
> correctly in my test, but because each takes over 30 seconds, each also 
> generates an ERROR request that is stored in the StandardHttpContextMap. If I 
> then leave the system alone for much longer than the Request Timeout 
> parameter in the StandardHttpContextMap and then attempt a request, I get a 
> 503 response saying that the queue is full and no requests are allowed. No 
> requests are allowed at all until I delete and recreate the Map.
> 
> 
> It seems unlikely to me that no one has attempted to use these processors in 
> this fashion. However, looking through the unit tests for this processor, it 
> seems that nowhere was a timeout over 30 seconds tested, so I thought it 
> worth a conversation.
> 
> 
> So finally, is there a configuration item to extend the jetty server's idle 
> timeout? Or is there a better way to ensure that the bogus requests don't get 
> stuck permanently in the StandardHttpContextMap? I appreciate any pointers 
> you can give.
> 
> 
> Thanks,
> Luke Coder
> BIT Systems
> CACI - NCS
> 941-907-8803 x705
> 6851 Professional Pkwy W
> Sarasota, FL 34240
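The behavior described above amounts to map entries that are inserted but never completed. One way such a map can protect itself is to evict entries older than its request timeout before admitting new ones. The sketch below is a hand-rolled illustration of that idea only; it is not the actual StandardHttpContextMap implementation, and all names are invented:

```python
import time

class ExpiringContextMap:
    """Bounded request/response map that drops entries older than a TTL."""

    def __init__(self, max_size, ttl_seconds):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._entries = {}  # context id -> (payload, insertion time)

    def register(self, context_id, payload, now=None):
        now = time.monotonic() if now is None else now
        # Evict expired entries first, so stale ERROR requests cannot
        # permanently occupy a slot the way the reported bug describes.
        self._entries = {k: v for k, v in self._entries.items()
                         if now - v[1] < self.ttl}
        if len(self._entries) >= self.max_size:
            raise RuntimeError("503: context map full")
        self._entries[context_id] = (payload, now)

    def complete(self, context_id):
        # Responding removes the entry, freeing its slot.
        return self._entries.pop(context_id)[0]

m = ExpiringContextMap(max_size=2, ttl_seconds=60)
m.register("a", "request-a", now=0)
m.register("b", "request-b", now=1)
# Both slots are taken, but at now=100 both have expired, so a new request fits.
m.register("c", "request-c", now=100)
print(m.complete("c"))  # request-c
```

With eviction on insert, orphaned entries can only block the map until the TTL elapses, instead of forever.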



Aw: Re: Filtering large CSV files

2016-04-05 Thread Uwe Geercken
Dmitry,

I was working on a processor for CSV files and one remark came up that we might 
want to use the opencsv library for parsing the file.

Here is the link: http://opencsv.sourceforge.net/

Greetings,

Uwe

> Sent: Tuesday, 05 April 2016 at 13:00
> From: "Dmitry Goldenberg" 
> To: dev@nifi.apache.org
> Subject: Re: Filtering large CSV files
>
> Hi Eric,
> 
> Thinking about exactly these use-cases, I filed the following JIRA ticket:
> NIFI-1716 . It asks for a
> SplitCSV processor, and actually for a GetCSV ingress which would address
> the issue of reading out of a large CSV treating it as a "data source".  I
> was thinking of actually implementing both and committing them.
> 
> NIFI-1280  is asking for a
> way to filter the CSV columns.  I believe this is best achieved as the CSV
> is getting parsed, in other words, on the GetCSV/SplitCSV, and not as a
> separate step.
> 
> I'm not sure that SplitText is the best way to process CSV data to begin
> with, because with a CSV, there's a chance that a given cell may spill over
> into multiple lines. Such would be the case of embedded newlines within a
> single, quoted cell. I don't think SplitText addresses that and that would
> be one reason to implement GetCSV/SplitCSV using proper CSV parsing
> semantics, the other reason being efficiency of reading.
> 
> As far as the limit on the capturing groups, that seems arbitrary. I think
> that on GetCSV/SplitCSV, if you have a way to identify the filtered out
> columns by their number (index) that should go a long way; perhaps a regex
> is also a good option.  I know it may seem that filtering should be a
> separate step in a given dataflow but from the point of view of efficiency,
> I believe it belongs right in the GetCSV/SplitCSV processors as the CSV
> records are being read and processed.
> 
> - Dmitry
> 
> 
> 
> 
> On Tue, Apr 5, 2016 at 6:36 AM, Eric FALK  wrote:
> 
> > Dear all,
> >
> > I need to filter large csv files in a data flow. By filtering I
> > mean: scale down the file in terms of columns, and looking for a particular
> > value to match a parameter. I looked into the example, of csv to JSON. I do
> > have a couple of questions:
> >
> > -First I use a SplitText control to get each line of the file. It makes
> > things slow, as it seems to generate a flow file for each line. Do I have
> > to proceed this way, or is there an alternative? My csv files are really
> > large and can have millions of lines.
> >
> > -In a second step I am extracting the values with the (.+),(.+),….,(.+)
> > technique, before using a processor to check for a match, on ${csv.146} for
> > instance. Now I have a problem: my csv has 233 fields, so I am getting the
> > message: “ReGex is required to have between 1 and 40 capturing groups but
> > has 233”. Again, is there another way to proceed, am I missing something?
> >
> > Best regards,
> > Eric
>
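For the 233-column case above, the capturing-group approach is the bottleneck; selecting columns by index while parsing, as suggested, sidesteps the 40-group limit entirely. A small sketch of that idea (plain Python with invented column names, not a NiFi processor):

```python
import csv
import io

data = "c0,c1,c2,c3,c4\nv0,v1,v2,v3,v4\nw0,w1,w2,w3,w4\n"

keep = [0, 3]                     # column indices to retain
match_col, match_val = 3, "v3"    # keep only rows where column 3 equals "v3"

out = []
for row in csv.reader(io.StringIO(data)):
    # Keep the header row plus any row matching the filter value.
    if row[0] == "c0" or row[match_col] == match_val:
        out.append([row[i] for i in keep])

print(out)  # [['c0', 'c3'], ['v0', 'v3']]
```

Because filtering happens while the rows are read, no regex is needed and the column count is irrelevant.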


How to pass variables to nifi Processor

2016-04-05 Thread Rajeswari Raghunathan - Contractor
Hi Team,

We are using NiFi in our company to migrate all data from an SFTP server to HDFS.
We have different environments (dev, test, prod) and want to achieve continuous 
integration with NiFi. To do so, we need to keep sensitive data such as the 
server hostname, username, and password out of hardcoded processor 
configuration (e.g., GetSFTP).
Can anyone help by suggesting the best solution for this problem?

Regards,
Rajeswari
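
One general pattern for this is to keep placeholders in the flow definition and resolve environment-specific values (hostname, username, password) at deploy time from environment variables, so each of dev/test/prod supplies its own. The sketch below illustrates only that substitution idea; the variable names are invented, and this is not NiFi's own configuration mechanism:

```python
import os
from string import Template

# The flow/template definition keeps placeholders instead of hardcoded secrets.
getsftp_config = Template("sftp://$SFTP_USER@$SFTP_HOST:22/inbound")

# Per-environment values come from the environment at deploy time;
# the defaults here are purely illustrative.
env = {
    "SFTP_HOST": os.environ.get("SFTP_HOST", "dev-sftp.example.com"),
    "SFTP_USER": os.environ.get("SFTP_USER", "loader"),
}

print(getsftp_config.substitute(env))
```

The same flow definition then promotes unchanged from dev to prod, with only the environment changing.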


[GitHub] nifi pull request: Update getting-started.adoc

2016-04-05 Thread alopresto
Github user alopresto commented on the pull request:

https://github.com/apache/nifi/pull/327#issuecomment-205869630
  
This is excellent. Thanks Drew. 

+1. 




[GitHub] nifi pull request: Update getting-started.adoc

2016-04-05 Thread andrewmlim
GitHub user andrewmlim opened a pull request:

https://github.com/apache/nifi/pull/327

Update getting-started.adoc

Corrected spelling/grammatical errors

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/andrewmlim/nifi patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi/pull/327.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #327


commit 7d46142635a94940750b1cbbe666747ea5782b48
Author: andrewmlim 
Date:   2016-04-05T14:23:54Z

Update getting-started.adoc

Corrected spelling/grammatical errors






Re: Error setting up environment for Custom Processor

2016-04-05 Thread idioma
Thanks Bryan,
over the weekend I had another look at the issue and, yes, I managed to get it
to work by changing it to the latest NiFi, 0.6.0.

Thank you for your help,

I. 



--
View this message in context: 
http://apache-nifi-developer-list.39713.n7.nabble.com/Error-setting-up-environment-for-Custom-Processor-tp8703p8805.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.


[GitHub] nifi pull request: NIFI-1689 ListFile test inconsistent on various...

2016-04-05 Thread apiri
Github user apiri closed the pull request at:

https://github.com/apache/nifi/pull/326




[GitHub] nifi pull request: NIFI-1668 modified TestProcessorLifecycle to en...

2016-04-05 Thread apiri
Github user apiri commented on the pull request:

https://github.com/apache/nifi/pull/324#issuecomment-205818733
  
Closing as it has been merged into the 0.6-support, 0.x and master.




[GitHub] nifi pull request: NIFI-1689 ListFile test inconsistent on various...

2016-04-05 Thread apiri
Github user apiri commented on the pull request:

https://github.com/apache/nifi/pull/326#issuecomment-205818958
  
Closing as it has been merged into the 0.6-support, 0.x and master.




[GitHub] nifi pull request: NIFI-1728

2016-04-05 Thread apiri
Github user apiri closed the pull request at:

https://github.com/apache/nifi/pull/325




[GitHub] nifi pull request: NIFI-1728

2016-04-05 Thread apiri
Github user apiri commented on the pull request:

https://github.com/apache/nifi/pull/325#issuecomment-205818587
  
Closing as it has been merged into the 0.6-support, 0.x and master.




Re: NiFi Installation : Subscribtion required

2016-04-05 Thread Sean Busbey
For help with vendor specific questions, please use the appropriate
vendor support channel.

FYI, the Hortonworks community support track that covers NiFi is:

https://community.hortonworks.com/spaces/63/data-flow-track.html

On Tue, Apr 5, 2016 at 4:25 AM, Aankita Kaur  wrote:
>  I am unable to install NiFi in a Hortonworks 2.3.2 environment. I tried the
> https://github.com/abajwa-hw/ambari-nifi-service tutorial, but its services
> have not yet started. Also, since it is half-way installed, I don't know how
> to remove its services first and start a new installation in the Hortonworks
> environment. Please help.
> --
> *Warm Regards*
>
> *Dr. AANKITA KAUR*



-- 
busbey


RE: Feature Requests: 1) Embedded FTP/SFTP server 2) Email-ingest-source

2016-04-05 Thread manoj.seshan
On email, while an SMTP server would work, I was thinking more along the lines of
Spring Integration's POP/IMAP Mail-Receiving Channel Adapter.

Regards

Manoj Seshan - Senior Architect
Platform Content Technology, Bangalore

Voice: +91-9686578756  +91-80-67492572


-Original Message-
From: Joe Witt [mailto:joe.w...@gmail.com] 
Sent: Tuesday, April 05, 2016 7:05 PM
To: dev@nifi.apache.org
Cc: V, Rohini (TR Technology & Ops); Sundara, Shyama (TR Technology & Ops)
Subject: Re: Feature Requests: 1) Embedded FTP/SFTP server 2) 
Email-ingest-source

Manoj,

Sounds good and am on board.  Have been wanting to do those for a long time 
frankly.  We do have a JIRA in for GetEmail and it will be quite easy to do now 
that we have state management as a feature.  Although, on the email side, are
you thinking that NiFi would act as an SMTP server?

Thanks
Joe

On Tue, Apr 5, 2016 at 9:30 AM,   wrote:
> Yes Joe - ListenFTP and ListenSFTP capabilities, embedded into NiFi. 
> Rather than external daemons, followed by NiFi GetFTP and such. I'll 
> get back to you on developing and contributing the same myself or from 
> our team. Regards
>
> Manoj Seshan - Senior Architect
> Platform Content Technology, Bangalore
>
> Voice: +91-9686578756  +91-80-67492572
>
>
> -Original Message-
> From: Joe Witt [mailto:joe.w...@gmail.com]
> Sent: Tuesday, April 05, 2016 6:56 PM
> To: dev@nifi.apache.org
> Cc: V, Rohini (TR Technology & Ops); Sundara, Shyama (TR Technology & 
> Ops)
> Subject: Re: Feature Requests: 1) Embedded FTP/SFTP server 2) 
> Email-ingest-source
>
> Manoj,
>
> Are you saying you'd like NiFi to be able to directly accept/act as an 
> SFTP/FTP server so that when data is pushed to a NiFi server the 
> administrators do not need to set up additional FTP/SFTP daemons on those 
> systems?  Such processors would be straightforward to develop and contribute. 
>  Are you interested in doing so?
>
> Thanks
> Joe
>
> On Tue, Apr 5, 2016 at 9:22 AM,   wrote:
>> We have several hundred content providers who PUSH CONTENT into our systems 
>> via FTP/SFTP. And then some where we grab the body and/or attachments of 
>> emails sent to specific mailboxes as the source of content. It would augment 
>> and round out NiFi's ingest capabilities significantly if processors for 
>> these sources were added to the NiFi stable.
>>
>> Regards
>>
>> Manoj Seshan - Senior Architect
>> Platform Content Technology, Bangalore
>> Voice: +91-9686578756  +91-80-67492572
>>


Re: Feature Requests: 1) Embedded FTP/SFTP server 2) Email-ingest-source

2016-04-05 Thread Joe Witt
Manoj,

Sounds good and am on board.  Have been wanting to do those for a long
time frankly.  We do have a JIRA in for GetEmail and it will be quite
easy to do now that we have state management as a feature.  Although, on the
email side, are you thinking that NiFi would act as an SMTP server?

Thanks
Joe

On Tue, Apr 5, 2016 at 9:30 AM,   wrote:
> Yes Joe - ListenFTP and ListenSFTP capabilities, embedded into NiFi. Rather 
> than external daemons, followed by NiFi GetFTP and such. I'll get back to you 
> on developing and contributing the same myself or from our team. Regards
>
> Manoj Seshan - Senior Architect
> Platform Content Technology, Bangalore
>
> Voice: +91-9686578756  +91-80-67492572
>
>
> -Original Message-
> From: Joe Witt [mailto:joe.w...@gmail.com]
> Sent: Tuesday, April 05, 2016 6:56 PM
> To: dev@nifi.apache.org
> Cc: V, Rohini (TR Technology & Ops); Sundara, Shyama (TR Technology & Ops)
> Subject: Re: Feature Requests: 1) Embedded FTP/SFTP server 2) 
> Email-ingest-source
>
> Manoj,
>
> Are you saying you'd like NiFi to be able to directly accept/act as an 
> SFTP/FTP server so that when data is pushed to a NiFi server the 
> administrators do not need to set up additional FTP/SFTP daemons on those 
> systems?  Such processors would be straightforward to develop and contribute. 
>  Are you interested in doing so?
>
> Thanks
> Joe
>
> On Tue, Apr 5, 2016 at 9:22 AM,   wrote:
>> We have several hundred content providers who PUSH CONTENT into our systems 
>> via FTP/SFTP. And then some where we grab the body and/or attachments of 
>> emails sent to specific mailboxes as the source of content. It would augment 
>> and round out NiFi's ingest capabilities significantly if processors for 
>> these sources were added to the NiFi stable.
>>
>> Regards
>>
>> Manoj Seshan - Senior Architect
>> Platform Content Technology, Bangalore
>> Voice: +91-9686578756  +91-80-67492572
>>


RE: Feature Requests: 1) Embedded FTP/SFTP server 2) Email-ingest-source

2016-04-05 Thread manoj.seshan
Yes Joe - ListenFTP and ListenSFTP capabilities, embedded into NiFi. Rather 
than external daemons, followed by NiFi GetFTP and such. I'll get back to you 
on developing and contributing the same myself or from our team. Regards

Manoj Seshan - Senior Architect
Platform Content Technology, Bangalore

Voice: +91-9686578756  +91-80-67492572


-Original Message-
From: Joe Witt [mailto:joe.w...@gmail.com] 
Sent: Tuesday, April 05, 2016 6:56 PM
To: dev@nifi.apache.org
Cc: V, Rohini (TR Technology & Ops); Sundara, Shyama (TR Technology & Ops)
Subject: Re: Feature Requests: 1) Embedded FTP/SFTP server 2) 
Email-ingest-source

Manoj,

Are you saying you'd like NiFi to be able to directly accept/act as an SFTP/FTP 
server so that when data is pushed to a NiFi server the administrators do not 
need to set up additional FTP/SFTP daemons on those systems?  Such processors 
would be straightforward to develop and contribute.  Are you interested in 
doing so?

Thanks
Joe

On Tue, Apr 5, 2016 at 9:22 AM,   wrote:
> We have several hundred content providers who PUSH CONTENT into our systems 
> via FTP/SFTP. And then some where we grab the body and/or attachments of 
> emails sent to specific mailboxes as the source of content. It would augment 
> and round out NiFi's ingest capabilities significantly if processors for 
> these sources were added to the NiFi stable.
>
> Regards
>
> Manoj Seshan - Senior Architect
> Platform Content Technology, Bangalore
> Voice: +91-9686578756  +91-80-67492572
>


Re: Feature Request: Isolated Processors on ANY ONE node rather than on Primary node alone

2016-04-05 Thread Alan Jackoway
We have begun using a workaround for a problem similar to this, but it is
fairly ugly. In many cases we really want to run something like an
ingestion process from an external system at a specific time on one node.
Without https://issues.apache.org/jira/browse/NIFI-401 you can't quite do
it.

What we do instead is run a cron-scheduled GenerateFlowFile processor
and then pipe it into RouteOnAttribute, where the attribute expressions
look like this:
* host1 - ${hostname():equals('host1.cloudera.com')}
* host2 - ${hostname():equals('host2.cloudera.com')}
...

Then we only connect the node we want the code to run on.
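
For anyone unfamiliar with the Expression Language above, the guard amounts to a
simple hostname comparison. A minimal Python sketch of the same idea (hypothetical,
not NiFi code — `should_run_here` is an invented name) would be:

```python
import socket

def should_run_here(designated_host: str) -> bool:
    """Equivalent of ${hostname():equals('host1.cloudera.com')}:
    returns True only on the node designated to do the work."""
    return socket.gethostname() == designated_host
```

The Expression Language check runs inside NiFi, of course; this is only to show
the routing logic.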

The downside is that it is fragile. If one of those hosts goes down we have
to find everywhere that we chose it as the running node and change them to
some other host. Additionally there is no concept of choosing the node
based on load, so we have to make sure we spread out the work appropriately.

Scheduling Group - as long as it supports cron - sounds wonderful to me,
and I am looking forward to the solution. But if you need a way to do it
now and are willing to do the scheduling manually there is a way to do it.
Alan

On Tue, Apr 5, 2016 at 8:46 AM,  wrote:

> Hi Mark - thanks for your prompt response. A few thoughts ..
>
> a) Currently, when Processor A is configured to run on the Primary Node,
> in the absence of special configuration (e.g. to the rest of the flow
> configured as a Process Group), the downstream Processors in the flow seem
> to automatically run on the Primary Node too. So in a sense, we have the
> affinity or grouping of processors to a given node already, except this is
> limited to the Primary Node. Could we not allow the scheduling of the
> Isolated Processor to occur on ANY single node, rather than just the
> Primary node? That would suffice for our current use case - i.e. we would
> be perfectly load balanced on initial ingest ACROSS the entire cluster,
> even though the entire downstream flow would run on whichever node the
> isolated processor was (randomly) scheduled on.
>
> b) That said, the "Scheduling Group" paradigm sounds very promising, if
> that includes the ability to Group Processors/Flows, as well as restrict
> their running to Groups-of-nodes. It is even more interesting if the
> concept can be coupled with Multi-tenancy, so cluster-resources (viz. the
> nodes) can be partitioned/isolated-to particular tenants.
>
> Regards, Manoj
>
> Manoj Seshan - Senior Architect
> Platform Content Technology, Bangalore
>
> Voice: +91-9686578756  +91-80-67492572
>
> -Original Message-
> From: Mark Payne [mailto:marka...@hotmail.com]
> Sent: Tuesday, April 05, 2016 6:01 PM
> To: dev@nifi.apache.org
> Subject: Re: Feature Request: Isolated Processors on ANY ONE node rather
> than on Primary node alone
>
> Manoj,
>
> That is a very good point, and it is something that we are working toward.
> However, it does get a little bit more complicated than this. If you have
> some Processor, say Processor A running on some arbitrary node, there will
> often be times that you will also need another Processor, Processor B,
> running on that same node.
>
> Using a Primary Node means that we are able to accomplish this easily, but
> as you are noting here, it is quite limiting. In version 1.0.0 of NiFi, one
> of the big changes is a Zero-Master clustering design, whereby the Primary
> Node is automatically elected and fails over to a different node whenever
> the Primary Node leaves the cluster. This improves the overall
> functionality of Primary Node but does not address the issue here, of
> avoiding scheduling all "singleton" processors on the same node.
>
> I think the path that we'd like to take moving forward, post-1.0.0, is to
> provide a mechanism that allows the user to schedule a Processor to run in
> some sort of named "Scheduling Group". So, for instance, you could say
> Processor A and B should both run in "Group A" but Processor C should run
> in "Group C". This way, we can ensure that Processors that need to run
> together can do so while at the same time avoiding the need for all such
> processors to run on the same node.
>
> Does this sound like a reasonable approach for your use case?
>
> Thanks
> -Mark
>
> > On Apr 5, 2016, at 3:08 AM,  <
> manoj.ses...@thomsonreuters.com> wrote:
> >
> > For the purposes of symmetry of the NiFi Cluster, and so that the
> initial ingest of content is not limited to just one primary node in the
> NiFi cluster, would it not be beneficial  for the framework to have the
> ability to schedule an Isolated Processor on ANY ONE of available nodes in
> the NiFi Cluster?
> >
> > Regards, Manoj
> >
> > Manoj Seshan - Senior Architect
> > Platform Content Technology, Bangalore
> >
> > Voice: +91-9686578756  +91-80-67492572
>
>


Re: Feature Requests: 1) Embedded FTP/SFTP server 2) Email-ingest-source

2016-04-05 Thread Joe Witt
Manoj,

Are you saying you'd like NiFi to be able to directly accept/act as an
SFTP/FTP server so that when data is pushed to a NiFi server the
administrators do not need to set up additional FTP/SFTP daemons on
those systems?  Such processors would be straightforward to develop
and contribute.  Are you interested in doing so?

Thanks
Joe

On Tue, Apr 5, 2016 at 9:22 AM,   wrote:
> We have several hundred content providers who PUSH CONTENT into our systems 
> via FTP/SFTP. And then some where we grab the body and/or attachments of 
> emails sent to specific mailboxes as the source of content. It would augment 
> and round out NiFi's ingest capabilities significantly if processors for 
> these sources were added to the NiFi stable.
>
> Regards
>
> Manoj Seshan - Senior Architect
> Platform Content Technology, Bangalore
> Voice: +91-9686578756  +91-80-67492572
>


Feature Requests: 1) Embedded FTP/SFTP server 2) Email-ingest-source

2016-04-05 Thread manoj.seshan
We have several hundred content providers who PUSH CONTENT into our systems via 
FTP/SFTP. And then some where we grab the body and/or attachments of emails 
sent to specific mailboxes as the source of content. It would augment and round 
out NiFi's ingest capabilities significantly if processors for these sources 
were added to the NiFi stable.

Regards

Manoj Seshan - Senior Architect
Platform Content Technology, Bangalore
Voice: +91-9686578756  +91-80-67492572



RE: Feature Request: Isolated Processors on ANY ONE node rather than on Primary node alone

2016-04-05 Thread manoj.seshan
Hi Mark - thanks for your prompt response. A few thoughts .. 

a) Currently, when Processor A is configured to run on the Primary Node, in the 
absence of special configuration (e.g. to the rest of the flow configured as a 
Process Group), the downstream Processors in the flow seem to automatically run 
on the Primary Node too. So in a sense, we have the affinity or grouping of 
processors to a given node already, except this is limited to the Primary Node. 
Could we not allow the scheduling of the Isolated Processor to occur on ANY 
single node, rather than just the Primary node? That would suffice for our 
current use case - i.e. we would be perfectly load balanced on initial ingest 
ACROSS the entire cluster, even though the entire downstream flow would run on 
whichever node the isolated processor was (randomly) scheduled on.

b) That said, the "Scheduling Group" paradigm sounds very promising, if that 
includes the ability to Group Processors/Flows, as well as restrict their 
running to Groups-of-nodes. It is even more interesting if the concept can be 
coupled with Multi-tenancy, so cluster-resources (viz. the nodes) can be 
partitioned/isolated-to particular tenants.

Regards, Manoj 

Manoj Seshan - Senior Architect
Platform Content Technology, Bangalore

Voice: +91-9686578756  +91-80-67492572

-Original Message-
From: Mark Payne [mailto:marka...@hotmail.com] 
Sent: Tuesday, April 05, 2016 6:01 PM
To: dev@nifi.apache.org
Subject: Re: Feature Request: Isolated Processors on ANY ONE node rather than 
on Primary node alone

Manoj,

That is a very good point, and it is something that we are working toward.
However, it does get a little bit more complicated than this. If you have some 
Processor, say Processor A running on some arbitrary node, there will often be 
times that you will also need another Processor, Processor B, running on that 
same node.

Using a Primary Node means that we are able to accomplish this easily, but as 
you are noting here, it is quite limiting. In version 1.0.0 of NiFi, one of the 
big changes is a Zero-Master clustering design, whereby the Primary Node is 
automatically elected and fails over to a different node whenever the Primary 
Node leaves the cluster. This improves the overall functionality of Primary 
Node but does not address the issue here, of avoiding scheduling all 
"singleton" processors on the same node.

I think the path that we'd like to take moving forward, post-1.0.0, is to 
provide a mechanism that allows the user to schedule a Processor to run in some 
sort of named "Scheduling Group". So, for instance, you could say Processor A 
and B should both run in "Group A" but Processor C should run in "Group C". 
This way, we can ensure that Processors that need to run together can do so 
while at the same time avoiding the need for all such processors to run on the 
same node.

Does this sound like a reasonable approach for your use case?

Thanks
-Mark

> On Apr 5, 2016, at 3:08 AM,  
>  wrote:
> 
> For the purposes of symmetry of the NiFi Cluster, and so that the initial 
> ingest of content is not limited to just one primary node in the NiFi 
> cluster, would it not be beneficial  for the framework to have the ability to 
> schedule an Isolated Processor on ANY ONE of available nodes in the NiFi 
> Cluster?
>  
> Regards, Manoj
>  
> Manoj Seshan - Senior Architect
> Platform Content Technology, Bangalore
> 
> Voice: +91-9686578756  +91-80-67492572



Re: Feature Request: Isolated Processors on ANY ONE node rather than on Primary node alone

2016-04-05 Thread Mark Payne
Manoj,

That is a very good point, and it is something that we are working toward.
However, it does get a little bit more complicated than this. If you have some 
Processor,
say Processor A running on some arbitrary node, there will often be times that 
you will
also need another Processor, Processor B, running on that same node.

Using a Primary Node means that we are able to accomplish this easily, but as 
you are
noting here, it is quite limiting. In version 1.0.0 of NiFi, one of the big 
changes is a Zero-Master
clustering design, whereby the Primary Node is automatically elected and fails 
over to
a different node whenever the Primary Node leaves the cluster. This improves 
the overall
functionality of Primary Node but does not address the issue here, of avoiding 
scheduling all
"singleton" processors on the same node.

I think the path that we'd like to take moving forward, post-1.0.0, is to 
provide a mechanism that
allows the user to schedule a Processor to run in some sort of named 
"Scheduling Group". So,
for instance, you could say Processor A and B should both run in "Group A" but 
Processor C
should run in "Group C". This way, we can ensure that Processors that need to 
run together can
do so while at the same time avoiding the need for all such processors to run 
on the same node.

Does this sound like a reasonable approach for your use case?

Thanks
-Mark

> On Apr 5, 2016, at 3:08 AM,  
>  wrote:
> 
> For the purposes of symmetry of the NiFi Cluster, and so that the initial 
> ingest of content is not limited to just one primary node in the NiFi 
> cluster, would it not be beneficial  for the framework to have the ability to 
> schedule an Isolated Processor on ANY ONE of available nodes in the NiFi 
> Cluster?
>  
> Regards, Manoj
>  
> Manoj Seshan - Senior Architect
> Platform Content Technology, Bangalore
> 
> Voice: +91-9686578756  +91-80-67492572



Re: NiFi 0.4.1 Very slow processing of flow files using PutKafka

2016-04-05 Thread Tim Reardon
My main point was to caution you to not use the PutKafka from 0.6.0. If
time is critical and you cannot wait for an 0.6.1 release, you could try an
intermediate upgrade to 0.5.x, but you would need to use an 0.4.x Kafka nar
to be compatible with Kafka 0.8.x. I listed 0.4.1 as that is what my team
is using.

On Mon, Apr 4, 2016 at 8:47 PM, Oscar dela Pena  wrote:

> Hi,
>
> Thanks for the replies. Your recommendation is to use 0.4.1 Kafka nar,
> correct? Do we need to upgrade NiFi from 0.4.0 to 0.4.1? Or just a nar
> upgrade should be sufficient?
> Does this also mean that 0.4.0 isn't a stable version to use as Kafka
> producer on high data rate?
>
> Thanks,
> Oscar
>
>
> - Original Message -
>
> From: "Tim Reardon" 
> To: dev@nifi.apache.org
> Cc: d...@nifi.incubator.apache.org
> Sent: Monday, April 4, 2016 8:26:23 PM
> Subject: Re: NiFi 0.4.1 Very slow processing of flow files using PutKafka
>
> I wouldn't advise upgrading to 0.6.0 to address PutKafka issues, as there
> is an outstanding bug (NIFI-1701) that truncates messages.
> NiFi 0.5.1 with the 0.4.1 Kafka nar is at least functional.
>
> On Mon, Apr 4, 2016 at 6:48 AM, Oleg Zhurakousky <
> ozhurakou...@hortonworks.com> wrote:
>
> > Oscar
> >
> > Would you mind upgrading to NiFi 0.6.0? There were significant
> > improvements to Kafka module
> >
> > Thanks
> > Oleg
> >
> > On Apr 4, 2016, at 04:21, Oscar dela Pena  wrote:
> >
> > Hi NiFi team,
> >
> > This is our current NiFi flow:
> > Our Kafka is version 0.8.2 and NiFi is version 0.4.0. The two versions
> > should be a match according to source codes and here.
> > 
> > However, PutKafka is extremely slow when processing queued flow files,
> > coming in at a 40GB/hour rate.
> > We had to add a dynamic property *block.on.buffer.full = true* to get rid
> > of "BufferExhaustedException", and set the buffer size to 4GB.
> >
> > Flow files are plain text files and are delimited by \n(new line).
> > File delimiter is set to \n.
> > Everything else are default processor values.
> >
> > Previous versions of Kafka and NiFi don't have this problem. We have
> > another running Kafka instance to prove it.
> > We used PutKafka v0.3.0. We built it from source and renamed the
> > processor. It sends to Kafka 0.8.1 the same messages as PutKafka v0.4.0.
> > Same delimiter is used. All messages are sent and no problems. More
> > details on the configurations are posted below.
> >
> >
> >                  PutKafka 0.3.0              PutKafka 0.4.0
> >  Kafka Version   0.8.1                       0.8.2
> >  Errors          No error. All messages      Without block.on.buffer.full = true,
> >                  are sent to Kafka.          BufferExhaustedException occurs. Adding
> >                                              block.on.buffer.full = true as a dynamic
> >                                              property removes the exception, but
> >                                              sending records is really slow: 120GB
> >                                              worth of data gets stuck in the queue
> >                                              when running the flow for 2 hrs.
> >
> > How should we configure our PutKafka so that it can have the same
> > performance as the old PutKafka(0.3.0)
> >
> > Thanks,
> > Oscar
> >
> >
> >
>
>


Re: Filtering large CSV files

2016-04-05 Thread Dmitry Goldenberg
Hi Eric,

Thinking about exactly these use-cases, I filed the following JIRA ticket:
NIFI-1716. It asks for a
SplitCSV processor, and actually for a GetCSV ingress which would address
the issue of reading out of a large CSV treating it as a "data source".  I
was thinking of actually implementing both and committing them.
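
To make the splitting semantics concrete, here is a rough Python sketch of what a
SplitCSV-style chunker would do (hypothetical — `split_csv` is an invented helper,
not the NIFI-1716 implementation): split on parsed records, never on raw lines:

```python
import csv
import io

def split_csv(text, records_per_chunk=2):
    """Yield (header, records) chunks, honoring quoted fields so an
    embedded newline never splits a record (unlike raw line splitting)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    chunk = []
    for record in reader:
        chunk.append(record)
        if len(chunk) == records_per_chunk:
            yield header, chunk
            chunk = []
    if chunk:  # trailing partial chunk
        yield header, chunk
```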

NIFI-1280 is asking for a
way to filter the CSV columns.  I believe this is best achieved as the CSV
is getting parsed, in other words, on the GetCSV/SplitCSV, and not as a
separate step.

I'm not sure that SplitText is the best way to process CSV data to begin
with, because with a CSV, there's a chance that a given cell may spill over
into multiple lines. Such would be the case of embedded newlines within a
single, quoted cell. I don't think SplitText addresses that and that would
be one reason to implement GetCSV/SplitCSV using proper CSV parsing
semantics, the other reason being efficiency of reading.
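
As a concrete illustration of that hazard (a hypothetical Python sketch, not NiFi
code): a real CSV parser keeps a record with an embedded newline whole, and can
filter columns by index in the same pass:

```python
import csv
import io

# One quoted cell contains an embedded newline: naive '\n' splitting
# (as SplitText does) would break that record in two.
raw = 'id,comment,score\n1,"line one\nline two",9\n2,plain,7\n'

keep = [0, 2]  # retain only columns 0 and 2, dropping the free-text column
rows = [[record[i] for i in keep] for record in csv.reader(io.StringIO(raw))]

# Naive line splitting sees 4 lines; the parser sees 3 records.
print(rows)  # [['id', 'score'], ['1', '9'], ['2', '7']]
```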

As far as the limit on the capturing groups, that seems arbitrary. I think
that on GetCSV/SplitCSV, if you have a way to identify the filtered out
columns by their number (index) that should go a long way; perhaps a regex
is also a good option.  I know it may seem that filtering should be a
separate step in a given dataflow but from the point of view of efficiency,
I believe it belongs right in the GetCSV/SplitCSV processors as the CSV
records are being read and processed.

- Dmitry




On Tue, Apr 5, 2016 at 6:36 AM, Eric FALK  wrote:

> Dear all,
>
> I would require to filter large csv files in a data flow. By filtering I
> mean: scale down the file in terms of columns, and looking for a particular
> value to match a parameter. I looked into the example, of csv to JSON. I do
> have a couple of questions:
>
> -First, I use a SplitText processor to get each line of the file. It makes
> things slow, as it seems to generate a flow file for each line. Do I have
> to proceed this way, or is there an alternative? My csv files are really
> large and can have millions of lines.
>
> -In a second step I am extracting the values with the (.+),(.+),….,(.+)
> technique, before using a processor to check for a match, on ${csv.146} for
> instance. Now I have a problem: my csv has 233 fields, so I am getting the
> message: “ReGex is required to have between 1 and 40 capturing groups but
> has 233”. Again, is there another way to proceed, am I missing something?
>
> Best regards,
> Eric


Filtering large CSV files

2016-04-05 Thread Eric FALK
Dear all,

I would require to filter large csv files in a data flow. By filtering I mean: 
scale down the file in terms of columns, and looking for a particular value to 
match a parameter. I looked into the example, of csv to JSON. I do have a 
couple of questions:

-First, I use a SplitText processor to get each line of the file. It makes things 
slow, as it seems to generate a flow file for each line. Do I have to proceed 
this way, or is there an alternative? My csv files are really large and can 
have millions of lines.

-In a second step I am extracting the values with the (.+),(.+),….,(.+)  
technique, before using a processor to check for a match, on ${csv.146} for 
instance. Now I have a problem: my csv has 233 fields, so I am getting the 
message: “ReGex is required to have between 1 and 40 capturing groups but has 
233”. Again, is there another way to proceed, am I missing something?

Best regards,
Eric 

groovy need in the installation?

2016-04-05 Thread aankitakaur
What is the role of Groovy in NiFi?



--
View this message in context: 
http://apache-nifi-developer-list.39713.n7.nabble.com/groovy-need-in-the-installation-tp8787.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.


NiFi Installation: Subscription required

2016-04-05 Thread Aankita Kaur
I am unable to install NiFi in a Hortonworks 2.3.2 environment. Currently, I
have tried the https://github.com/abajwa-hw/ambari-nifi-service tutorial but have
not yet started its services. Also, since it is half-way installed, I
don't know how to remove its services first and start with a new installation in
the Hortonworks environment. Please help.
-- 
*Warm Regards*

*Dr. AANKITA KAUR*


Feature Request: Isolated Processors on ANY ONE node rather than on Primary node alone

2016-04-05 Thread manoj.seshan
For the purposes of symmetry of the NiFi Cluster, and so that the initial 
ingest of content is not limited to just one primary node in the NiFi cluster, 
would it not be beneficial  for the framework to have the ability to schedule 
an Isolated Processor on ANY ONE of available nodes in the NiFi Cluster?

Regards, Manoj

Manoj Seshan - Senior Architect
Platform Content Technology, Bangalore
[cid:image001.gif@01C95541.6801BF70]
Voice: +91-9686578756  +91-80-67492572