[GitHub] nifi pull request: NiFi-1481 Enhancement[ nifi.sh env]

2016-03-13 Thread trkurc
Github user trkurc commented on the pull request:

https://github.com/apache/nifi/pull/218#issuecomment-196112326
  
@apiri - I attached a patch to the ticket, which should be the last two 
commits off my branch https://github.com/trkurc/nifi/commits/NIFI-1481




[GitHub] nifi pull request: NiFi-1481 Enhancement[ nifi.sh env]

2016-03-13 Thread apiri
Github user apiri commented on the pull request:

https://github.com/apache/nifi/pull/218#issuecomment-196111289
  
Patch to apply on top of the PR is probably simplest, but I'm good with 
whatever is easiest for you. 




[GitHub] nifi pull request: NiFi-1481 Enhancement[ nifi.sh env]

2016-03-13 Thread trkurc
Github user trkurc commented on the pull request:

https://github.com/apache/nifi/pull/218#issuecomment-196104511
  
@apiri: how do you think it would be best to review the changes I made based on 
your reviews? Another PR? A patch on the ticket?




[GitHub] nifi pull request: NiFi-1481 Enhancement[ nifi.sh env]

2016-03-13 Thread apiri
Github user apiri commented on the pull request:

https://github.com/apache/nifi/pull/218#issuecomment-196102613
  
@trkurc I think it might be a fair concession to punt on both this and Windows. 
Maybe we just roll in the check, piggybacking off the already existing $cygwin 
variable, and figure out how we can improve it later?




[GitHub] nifi pull request: NiFi-1481 Enhancement[ nifi.sh env]

2016-03-13 Thread trkurc
Github user trkurc commented on the pull request:

https://github.com/apache/nifi/pull/218#issuecomment-196092752
  
@apiri - cygwin is proving to be a challenge, and not necessarily due to the 
changes in this patch. 

On my setup, the ':' separator on this line seems to break things (I suspect 
because my Java is a Windows JVM and expects ';' as the separator between 
classpath entries).

See:

https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-resources/src/main/resources/bin/nifi.sh#L192
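
For what it's worth, the classpath separator really is platform-specific: a 
Windows JVM (which is what Cygwin normally launches) expects ';' while Unix JVMs 
expect ':'. A tiny illustrative Java check (not part of the patch) shows which 
separator the running JVM uses:

    import java.io.File;

    public class PathSeparatorCheck {
        public static void main(String[] args) {
            // Prints ";" on a Windows JVM (including one launched from Cygwin) and ":" elsewhere.
            System.out.println("Classpath separator: " + File.pathSeparator);
            System.out.println("os.name: " + System.getProperty("os.name"));
        }
    }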




[GitHub] nifi pull request: NIFI-1620 Allow empty Content-Type in InvokeHTT...

2016-03-13 Thread joewitt
Github user joewitt commented on the pull request:

https://github.com/apache/nifi/pull/272#issuecomment-196089886
  
I was reviewing this earlier today and frankly had a similar concern to Adam's. 
I didn't reply because I hadn't really figured out what to think.  First, I agree 
that a service which rejects that header is arguably broken.  Second, as the 
patch stands right now, I am curious how it works when the value is an empty 
string, because there is a static call to MediaType which seems like it would 
have trouble (I still need to verify the logic there, though).

However, having said this, Pierre, can you clarify whether the intent is only for 
the case where there is no entity body, or also for when there is an entity body 
in the request?  If the idea is that this is only necessary when there is no 
entity body, we should tighten the code for that case; if it is for either 
scenario, then I think I'm of a similar mind to Adam here.
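
For reference, a minimal sketch against okhttp3 of the MediaType concern (this is 
not the actual InvokeHTTP code; the guard shown and the helper name are 
illustrative assumptions):

    import okhttp3.MediaType;
    import okhttp3.Request;
    import okhttp3.RequestBody;

    public class ContentTypeGuardSketch {
        // Builds a POST request when the configured Content-Type may be empty.
        static Request buildPost(String url, String contentType, byte[] payload) {
            MediaType mediaType = null;
            if (contentType != null && !contentType.trim().isEmpty()) {
                // MediaType.parse returns null for a value that is not a well-formed
                // media type, so an empty string passed straight through would yield
                // null here anyway.
                mediaType = MediaType.parse(contentType);
            }
            // With a null MediaType, okhttp typically sends the body without setting
            // a Content-Type header on the request.
            RequestBody body = RequestBody.create(mediaType, payload);
            return new Request.Builder().url(url).post(body).build();
        }
    }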




[GitHub] nifi pull request: NiFi-1481 Enhancement[ nifi.sh env]

2016-03-13 Thread trkurc
Github user trkurc commented on the pull request:

https://github.com/apache/nifi/pull/218#issuecomment-196080641
  
After an out-of-band discussion with @markap14, it seems that Windows PIDs 
might be challenging to get, so maybe we should leave the Windows env batch 
script out of 0.6.0. Any objections?




[GitHub] nifi pull request: NIFI-1620 Allow empty Content-Type in InvokeHTT...

2016-03-13 Thread taftster
Github user taftster commented on the pull request:

https://github.com/apache/nifi/pull/272#issuecomment-196079414
  
I'm not entirely sure if this is a good idea.  Any web service which 
_disallows_ a standard HTTP header is arguably broken.  Quoting RFC 2616:

>   Any HTTP/1.1 message containing an entity-body SHOULD include a
>   Content-Type header field defining the media type of that body. If
>   and only if the media type is not given by a Content-Type field, the
>   recipient MAY attempt to guess the media type via inspection of its
>   content and/or the name extension(s) of the URI used to identify the
>   resource. If the media type remains unknown, the recipient SHOULD
>   treat it as type "application/octet-stream".

Given the above, it seems hard to believe that a web service would reject a 
request with a Content-Type header.  The current behavior of InvokeHTTP is 
possibly the most consistent with the HTTP specification.

A custom processor designed specifically to interact with the remote service 
in question should be considered as an alternative to modifying InvokeHTTP.



Re: InvokeHTTP body

2016-03-13 Thread Adam Taft
I think it makes total sense that POST/PUT requests read from the flowfile
content.  Therefore, the problem should be fixed further up in the flow
design.  For example, try these solutions:

GenerateFlowFile -> ReplaceText -> InvokeHTTP   (or)
GetFile -> InvokeHTTP

The problem you're describing has more to do with generating static
flowfile content, which is a separate concern from how to transfer flowfile
content over the wire via http.

If the above solutions don't work for you, perhaps a modification of
GenerateFlowFile could be made which uses static content instead of random
content?

Hope this helps.

Adam


On Fri, Mar 11, 2016 at 6:56 AM, Pierre Villard  wrote:

> Hi,
>
> Would it make sense to add a property "body" allowing the user to manually
> set the body of the request for PUT/POST requests?
>
> At the moment, the body of the request seems to only be set from the
> content of incoming flow files. But it is possible to use this processor
> without an incoming relationship. It would be useful to be able to set the
> body manually.
>
> The behaviour would be: if there is an incoming relationship, the incoming
> flow file content is used regardless of the "body" property, and if there is
> no incoming relationship, the request body is based on the property value.
>
> What do you think?
>
> Pierre
>


[GitHub] nifi pull request: Nifi 1516 - AWS DynamoDB Get/Put/Delete Process...

2016-03-13 Thread apiri
Github user apiri commented on the pull request:

https://github.com/apache/nifi/pull/224#issuecomment-196043097
  
@mans2singh thanks for getting this updated, I will start looking it over 
tonight/tomorrow.




Re: Split Content (One-to-Many) early commit

2016-03-13 Thread Mark Payne
Devin,

We do realize that we have some work to do in order to make it possible for a
single Processor to buffer up hundreds of thousands or more FlowFiles.
The SplitText processor is very popular and suffers from this exact same problem.
We want to have a mechanism for swapping those FlowFiles out of the Java heap,
similar to what we do when we have millions of FlowFiles sitting in a queue.
There is a ticket here [1] to address this. However, it has turned out to be very
time consuming, and not quite as straightforward as we had hoped, so it has not
been finished up yet.

In the meantime, you can use the approach that you described, using two different
Process Sessions, by extending AbstractSessionFactoryProcessor instead of
AbstractProcessor. The downside to this approach, though, is that when NiFi is
restarted, you could potentially have a lot of data duplication.

As an example, let's imagine that you create a ProcessSession and use it to
create 10,000 FlowFiles, and then commit the session and create a new one. If you
have an incoming FlowFile that has 1 million rows in it, you may create 800,000
FlowFiles and send them out, and then NiFi gets restarted. In this case, you will
pick up the original FlowFile and begin processing it again, but you've already
sent out those 800,000 FlowFiles. Depending on your requirements, this may or may
not be acceptable.
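
As a rough illustration of that two-session idea, here is a minimal sketch (not a
drop-in processor: relationship registration, property descriptors, and most
error handling are omitted; the 10,000 batch size and the inline CSV-to-XML
conversion are placeholders; and it assumes the streaming session.read(FlowFile)
overload is available):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractSessionFactoryProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.ProcessSessionFactory;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;

    public class SplitCsvToXmlSketch extends AbstractSessionFactoryProcessor {

        static final Relationship REL_SPLITS = new Relationship.Builder().name("splits").build();
        static final int BATCH_SIZE = 10_000;

        @Override
        public void onTrigger(ProcessContext context, ProcessSessionFactory sessionFactory)
                throws ProcessException {
            final ProcessSession inputSession = sessionFactory.createSession();
            final FlowFile original = inputSession.get();
            if (original == null) {
                return;
            }

            ProcessSession outputSession = sessionFactory.createSession();
            final List<String> batch = new ArrayList<>(BATCH_SIZE);
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(inputSession.read(original), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    batch.add("<row>" + line + "</row>");   // placeholder CSV-to-XML conversion
                    if (batch.size() >= BATCH_SIZE) {
                        flush(outputSession, batch);        // transfer and commit this batch
                        outputSession = sessionFactory.createSession();
                    }
                }
                flush(outputSession, batch);                // remainder
            } catch (IOException e) {
                outputSession.rollback();
                inputSession.rollback();
                throw new ProcessException("Failed to split " + original, e);
            }

            // Only now is the original removed. If NiFi restarts before this commit, the
            // whole input is re-processed and already-emitted batches are duplicated
            // (the caveat described above).
            inputSession.remove(original);
            inputSession.commit();
        }

        private void flush(ProcessSession session, List<String> batch) {
            for (String xml : batch) {
                FlowFile split = session.create();
                split = session.write(split, out -> out.write(xml.getBytes(StandardCharsets.UTF_8)));
                session.transfer(split, REL_SPLITS);
            }
            batch.clear();
            session.commit();
        }
    }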

One option that you could use is simply to document that this behavior exists and
that SplitText should be used ahead of your Processor in order to split the
content into 10,000-line chunks. This would avoid the heap exhaustion.

Another possible solution, though it's not as pretty as I'd like: process up to
10,000 FlowFiles from an input FlowFile. Then, add an attribute to the input
FlowFile indicating your progress (for instance, an attribute named
"rows.converted") and call session.transfer(flowFile); this will transfer the
FlowFile back into its input queue. You can then commit the session. When you
call session.get() to get an input FlowFile again, you can check for that
attribute and skip that many rows. This way, you won't end up with data
duplication. The downside here is that you would end up reading and ignoring the
first N rows each time, which can be expensive. A more optimized approach would
be to wrap the InputStream in a ByteCountingInputStream, record the number of
bytes consumed as an attribute, and then for each subsequent iteration use
StreamUtils.skip() to skip the appropriate number of bytes.
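
A minimal sketch of that attribute-tracking variant, under the same caveats (a
plain AbstractProcessor so the framework commits after each onTrigger, the
"rows.converted" attribute from above, a placeholder row conversion, and the
streaming session.read(FlowFile) overload assumed; relationship registration and
validation are omitted):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;

    public class ResumableCsvToXmlSketch extends AbstractProcessor {

        static final Relationship REL_SPLITS = new Relationship.Builder().name("splits").build();
        static final Relationship REL_ORIGINAL = new Relationship.Builder().name("original").build();
        static final String ROWS_CONVERTED = "rows.converted";
        static final int BATCH_SIZE = 10_000;

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return;
            }

            final String progress = flowFile.getAttribute(ROWS_CONVERTED);
            final long alreadyConverted = progress == null ? 0L : Long.parseLong(progress);

            final List<String> batch = new ArrayList<>(BATCH_SIZE);
            boolean finished = false;
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(session.read(flowFile), StandardCharsets.UTF_8))) {
                // The downside noted above: each invocation re-reads and discards the first N rows.
                for (long i = 0; i < alreadyConverted && reader.readLine() != null; i++) {
                    // skip rows converted in earlier invocations
                }
                String line;
                while (batch.size() < BATCH_SIZE && (line = reader.readLine()) != null) {
                    batch.add("<row>" + line + "</row>");   // placeholder CSV-to-XML conversion
                }
                finished = batch.size() < BATCH_SIZE;        // hit end-of-file before filling the batch
            } catch (IOException e) {
                throw new ProcessException("Failed reading " + flowFile, e);
            }

            for (String xml : batch) {
                FlowFile split = session.create(flowFile);
                split = session.write(split, out -> out.write(xml.getBytes(StandardCharsets.UTF_8)));
                session.transfer(split, REL_SPLITS);
            }

            if (finished) {
                session.transfer(flowFile, REL_ORIGINAL);
            } else {
                // Record progress and push the input back onto its own queue; the framework
                // commit after onTrigger makes the emitted batch and the recorded progress
                // durable together.
                flowFile = session.putAttribute(flowFile, ROWS_CONVERTED,
                        String.valueOf(alreadyConverted + batch.size()));
                session.transfer(flowFile);
            }
        }
    }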

I know there's a lot of info here. Let me know if anything doesn't make sense.

I hope this helps!
-Mark


[1] https://issues.apache.org/jira/browse/NIFI-1008 



> On Mar 11, 2016, at 5:29 PM, Devin Fisher 
>  wrote:
> 
> I'm creating a processor that will read a customer CSV and will create a
> new flowfile, in the form of XML, for each line. The CSV file will be quite
> large (100s of thousands of lines). I would like to commit a reasonable
> amount from time to time so that the flowfiles can flow down to other processors.
> But looking at similar processors (SplitText and SplitXml), they save up all
> the created flowfiles and release them all at the end.  In some trials, I'm
> running out of memory doing that. But I can't commit the session early
> because I'm still reading the original CSV file.  Is there a workflow where
> I can read the incoming CSV flowfile but still release created flowfiles?
> I'm thinking of not using AbstractProcessor and instead
> using AbstractSessionFactoryProcessor and creating two different sessions, but
> is that advisable or possible?
> 
> Devin



[GitHub] nifi pull request: NIFI-627 incorporates mwmoser patch and some mi...

2016-03-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nifi/pull/274




[GitHub] nifi pull request: NIFI-627 removed flowfile penalization which co...

2016-03-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nifi/pull/268




Re:Re: Re: Multiple dataflow jobs management(lots of jobs)

2016-03-13 Thread 刘岩
Hi Thad,

Thank you very much for your advice. Kettle can do the job for sure, but the
metadata I was talking about is the metadata of the job descriptions used by
Kettle itself. The only option left for Kettle is multiple instances, but that
also means we would need to develop a master application to gather the metadata
from all the instances. Moreover, Kettle does not have a web-based GUI for
designing and testing jobs; that's why we want NiFi. But again, multiple
instances of NiFi also lead to an HA problem for the master node, so we turned
to Ambari metrics for that issue. Talend has a cloud server doing a similar
thing, but it runs on a public cloud, which is not acceptable to our client.
Kettle is a great ETL tool, but a web-based designer is really the key point
for the future.

Thank you very much

Yan Liu

Hortonworks Service Division

Richinfo, Shenzhen, China (PR)

14/03/2016


Re: Re: Multiple dataflow jobs management(lots of jobs)

2016-03-13 Thread Thad Guidry
Yan,

Pentaho Kettle (PDI) can also certainly handle your needs, but using 10K jobs
to accomplish this is not the proper way to set up Pentaho.  Also, using MySQL
to store the metadata is where you made a wrong choice.  PostgreSQL with data
silos on SSD drives would be a better choice, while properly doing async
config [1] and taking the other necessary steps for high write loads.  Don't
keep Pentaho's Table output commit level at its default of 10k rows when
you're processing millions of rows!  For Oracle 11g or PostgreSQL, where I
need 30-second time-slice windows for the metadata logging and where I
typically have less than 1k of data on average per row, I will typically
choose 200k rows or more for Pentaho's Table output commit option.

I would suggest you contact Pentaho for some ad hoc support or hire some
consultants to help you learn more, or set up properly for your use case.
For free, you can also just do a web search on "Pentaho best practices".
There's a lot to learn from industry experts who have already used these
tools and know their quirks.

[1]
http://www.postgresql.org/docs/9.5/interactive/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-ASYNC-BEHAVIOR


Thad
+ThadGuidry 

On Sat, Mar 12, 2016 at 11:00 AM, 刘岩  wrote:

> Hi Aldrin
>
> some additional information.
>
> it's a typical ETL offloading use case
>
> each extraction job should focus on 1 table and 1 table only.  Data will
> be written to HDFS; this is similar to database staging.
>
> The reason why we need to focus on 1 table for each job is that a
> database error or disconnection might occur during the extraction; if
> it's running as a script-like extraction job with expression language,
> then it's hard to re-run or skip that table or tables.
>
> Once the extraction is done, a trigger-like action will do the data
> cleansing.  This is similar to the ODS layer of data warehousing.
>
> If the data quality has passed the quality check, then the table will be
> marked as cleaned. Otherwise, it will return to the previous step and redo
> the data extraction, or send an alert/email to the system administrator.
>
> Once a certain number of tables are all cleaned and checked, it will
> call some Transforming processor to do the transforming, then push the
> data into a data warehouse (Hive in our case)
>
>
> Thank you very much
>
> Yan Liu
>
> Hortonworks Service Division
>
> Richinfo, Shenzhen, China (PR)
> 13/03/2016
>
> Original message
> *From:* "刘岩"
> *To:* users
> *Cc:* dev
> *Sent:* 2016-03-13 00:12:27
> *Subject:* Re: Re: Multiple dataflow jobs management(lots of jobs)
>
>
> Hi Aldrin
>
> Currently we need to extract 60K tables per day, and the time window is
> limited to 8 hours.  This means we need to run jobs concurrently, and we
> need a general description of what's going on with all those 60K job
> flows so we can take further actions.
>
> We have tried Kettle and Talend.  Talend is IDE-based, so not what we are
> looking for, and Kettle crashed because MySQL cannot handle Kettle's
> metadata with 10K jobs.
>
> So we want to use NiFi; this is really the product we are looking for,
> but the missing piece here is a dataflow jobs admin page, so we can have
> multiple NiFi instances running on different nodes but monitor the jobs
> in one page.  If it can integrate with the Ambari metrics API, then we
> can develop an Ambari View for NiFi jobs monitoring, just like the HDFS
> View and Hive View.
>
>
> Thank you very much
>
> Yan Liu
>
> Hortonworks Service Division
>
> Richinfo, Shenzhen, China (PR)
> 06/03/2016
>
>
> Original message
> *From:* Aldrin Piri
> *To:* users
> *Cc:* dev
> *Sent:* 2016-03-11 02:27:11
> *Subject:* Re: Mutiple dataflow jobs management(lots of jobs)
>
> Hi Yan,
>
> We can get more into details and particulars if needed, but have you
> experimented with expression language?  I could see a cron-driven approach
> covering your periodic runs that feeds some number of ExecuteSQL
> processors (perhaps one for each database you are communicating with),
> each handling a table.  This would certainly cut down on the need for 30k
> processors on a one-to-one basis.
>
> In terms of monitoring the dataflows, could you describe what else you are
> looking for beyond the graph view?  NiFi tries to provide context for the
> flow of data but is not trying to be a sole monitoring solution; we can
> give information on a processor basis, but we do not delve into specifics.
> There is a summary view for the overall flow where you can monitor stats
> about the components and connections in the system. We support
> interoperation with monitoring systems via push (ReportingTask) and pull
> (REST API [2]) semantics.
>
> Any other details beyond your list of how this all interoperates might
> shed some