Hi Aldrin,

Some additional information: it's a typical ETL offloading use case. Each extraction job should focus on one table and one table only. Data will be written to HDFS; this is similar to database staging. The reason we need to focus on one table per job is that a database error or disconnection may occur during the extraction; if it runs as one script-like extraction job with expression language, it is hard to re-run or skip just the affected table or tables.

Once the extraction is done, a trigger-like action will do the data cleansing. This is similar to the ODS layer of data warehousing. If the data passes the quality check, it will be marked as cleaned; otherwise, it will return to the previous step and redo the data extraction, or send an alert/email to the system administrator.

Once a certain number of tables are all cleaned and checked, a transforming processor will be called to do the transformation and push the data into a data warehouse (Hive in our case).

Thank you very much,
Yan Liu
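The extract → cleanse → quality-check → retry loop described above can be sketched as plain control flow. This is only an illustration of the per-table job logic, assuming hypothetical extract, cleanse, quality_check, and alert callables; it is not NiFi code.

```python
# Sketch of the per-table pipeline: extract one table, cleanse it,
# run a quality check, and retry the extraction on failure before
# alerting an administrator. All helper callables are hypothetical.

def run_table_job(table, extract, cleanse, quality_check, alert, max_retries=3):
    """Return True if `table` ends up cleaned and checked, else False."""
    for attempt in range(1, max_retries + 1):
        try:
            raw = extract(table)          # e.g. dump the table to HDFS staging
        except ConnectionError:
            continue                      # DB disconnection: redo this table only
        staged = cleanse(raw)             # ODS-style cleansing step
        if quality_check(staged):
            return True                   # mark this table as cleaned
    alert(table)                          # give up: notify the administrator
    return False
```

Because each invocation handles exactly one table, a failure can be retried or skipped without touching the other 60K jobs, which is the isolation property the message above asks for.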
Hortonworks Service Division
Richinfo, Shenzhen, China (PRC)
13/03/2016

---- Original Message ----
From: "刘岩" <[email protected]>
To: users <[email protected]>
Cc: dev <[email protected]>
Sent: 2016-03-13 00:12:27
Subject: Re: Re: Multiple dataflow jobs management (lots of jobs)

Hi Aldrin,

Currently we need to extract 60K tables per day, and the time window is limited to 8 hours. This means we need to run jobs concurrently, and we need a general view of what's going on with all those 60K job flows so we can take further action. We have tried Kettle and Talend: Talend is IDE-based, so not what we are looking for, and Kettle crashed because MySQL could not handle Kettle's metadata with 10K jobs.

So we want to use NiFi; this is really the product we are looking for, but the missing piece here is a dataflow jobs admin page, so that we can have multiple NiFi instances running on different nodes while monitoring the jobs on one page. If it can integrate with the Ambari Metrics API, then we can develop an Ambari View for NiFi job monitoring, just like the HDFS View and Hive View.

Thank you very much,
Yan Liu

Hortonworks Service Division
Richinfo, Shenzhen, China (PRC)
06/03/2016

---- Original Message ----
From: Aldrin Piri <[email protected]>
To: users <[email protected]>
Cc: dev <[email protected]>
Sent: 2016-03-11 02:27:11
Subject: Re: Multiple dataflow jobs management (lots of jobs)

Hi Yan,

We can get more into details and particulars if needed, but have you experimented with expression language? I could see a cron-driven approach which covers your periodic efforts and feeds some number of ExecuteSQL processors (perhaps one for each database you are communicating with), each having a table. This would certainly cut down on the need for the 30K processors on a one-to-one basis with a given processor.

In terms of monitoring the dataflows, could you describe what else you are searching for beyond the graph view?
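Aldrin's suggestion amounts to driving a single parameterized ExecuteSQL processor from table names carried as flow file attributes, with a statement like `SELECT * FROM ${db.table.name}`. Below is a rough Python simulation of that fan-out; the `${...}` syntax mirrors NiFi's expression language, but the attribute name and the substitution code are illustrative assumptions, not NiFi internals.

```python
import re

# One parameterized query replaces thousands of per-table processors.
# ${db.table.name} mimics a NiFi expression-language reference; the
# attribute name itself is an example, not a NiFi convention.
QUERY_TEMPLATE = "SELECT * FROM ${db.table.name}"

def render(template, attributes):
    """Substitute ${attr} references with flow-file attribute values."""
    return re.sub(r"\$\{([^}]+)\}",
                  lambda m: attributes[m.group(1)],
                  template)

def fan_out(tables):
    """Yield one concrete SQL statement per table, like a cron-driven
    source feeding a single parameterized ExecuteSQL processor."""
    for table in tables:
        yield render(QUERY_TEMPLATE, {"db.table.name": table})
```

With this pattern, 30K tables become 30K flow files through a handful of processors, rather than 30K processors on the canvas.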
NiFi tries to provide context for the flow of data but is not trying to be a sole monitoring solution; we can give information on a processor basis, but do not delve into specifics. There is a summary view for the overall flow where you can monitor stats about the components and connections in the system. We support interoperation with monitoring systems via push (ReportingTask [2]) and pull (REST API [3]) semantics.

Any other details beyond your list of how this all interoperates might shed some more light on what you are trying to accomplish. It seems like NiFi should be able to help with this. With some additional information we may be able to provide further guidance, or at least get some insights on use cases we could look to improve upon and extend NiFi to support.

Thanks!

[1] http://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
[2] http://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#reporting-tasks
[3] http://nifi.apache.org/docs/nifi-docs/rest-api/index.html

On Sat, Mar 5, 2016 at 9:25 PM, 刘岩 <[email protected]> wrote:

Hi All,

I'm trying to adopt NiFi for production but cannot find an admin console for monitoring the dataflows. The scenario is simple:

1. We gather data from Oracle databases to HDFS and then to Hive.
2. Residuals/incrementals are updated daily or monthly via NiFi.
3. Full dumps of some tables are executed daily or monthly via NiFi.

It is really simple; however, we have 7 Oracle databases with over 30K tables that need to implement the above scenario. This means I would have to drag the ExecuteSQL element about 30K times, and also arrange them in a nice-looking way on my little 21-inch screen.

Just wondering if there is a table-like, groupable, and searchable task control and monitoring feature for NiFi.

Thank you very much in advance,
Yan Liu

Hortonworks Service Division
Richinfo, Shenzhen, China (PRC)
06/03/2016
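The pull-style monitoring Aldrin mentions (REST API [3]) could be scripted roughly as below. This is a sketch only: the `/nifi-api/flow/status` endpoint and the `controllerStatus` field names match later NiFi releases and may differ by version, and anonymous HTTP access to the instance is assumed; check your instance's rest-api docs before relying on them.

```python
import json
from urllib.request import urlopen

# Pull-style monitoring sketch against the NiFi REST API.
# Endpoint path and field names are assumptions to verify against
# the rest-api documentation for your NiFi version.

def summarize_status(status_json):
    """Extract a few headline counters from a flow-status response."""
    cs = status_json["controllerStatus"]
    return {
        "active_threads": cs["activeThreadCount"],
        "queued": cs["queued"],           # e.g. "12 / 4.5 MB"
        "running": cs["runningCount"],
        "stopped": cs["stoppedCount"],
    }

def fetch_flow_status(base_url="http://localhost:8080"):
    """GET the instance-wide flow status (anonymous access assumed)."""
    with urlopen(base_url + "/nifi-api/flow/status") as resp:
        return summarize_status(json.load(resp))
```

Polling each NiFi instance this way and aggregating the results is one route to the single admin page requested above; a ReportingTask [2] pushing to Ambari Metrics would be the push-based alternative.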
