Re: Text and metadata extraction processor

2016-03-31 Thread Dmitry Goldenberg
Simon,

I believe we've moved past the 'mode' option and have now switched to
talking about how the include/exclude filters (for metadata and for content,
each keyed on filename or MIME type) would drive whether metadata, content,
or both get extracted.

For example, a user could configure the ExtractMediaAttributes processor to
extract metadata for all image files (but not content), extract content only
for plain text documents (but no metadata), or extract both metadata and
content for documents with the extension ".pqr", based on the filename.

Could you elaborate on your vision of how relationships could "drive" this
type of functionality?  Joe has already built some of the filtering into
the processor; I just suggested extending that further so that we get all
the bases covered.
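
To make that concrete, here is a rough sketch of the per-file decision I have
in mind, with exclusions trumping inclusions (this is only an illustration I
put together, not the actual processor code):

    // Hypothetical helper, not the actual processor code: decides whether to run
    // extraction (metadata or content) for one flow file, given optional
    // include/exclude regexes on the filename and on the MIME type.
    import java.util.regex.Pattern;

    public class ExtractionFilter {

        private final Pattern includeFilename;
        private final Pattern includeMimeType;
        private final Pattern excludeFilename;
        private final Pattern excludeMimeType;

        public ExtractionFilter(String incName, String incMime, String excName, String excMime) {
            includeFilename = compileOrNull(incName);
            includeMimeType = compileOrNull(incMime);
            excludeFilename = compileOrNull(excName);
            excludeMimeType = compileOrNull(excMime);
        }

        // Unset or ".*" filters short-circuit so no regex is evaluated needlessly.
        private static Pattern compileOrNull(String regex) {
            return (regex == null || regex.isEmpty() || ".*".equals(regex)) ? null : Pattern.compile(regex);
        }

        public boolean shouldExtract(String filename, String mimeType) {
            // Exclusions trump inclusions.
            if (excludeFilename != null && excludeFilename.matcher(filename).matches()) return false;
            if (excludeMimeType != null && excludeMimeType.matcher(mimeType).matches()) return false;
            // A missing include filter means "include everything".
            return (includeFilename == null || includeFilename.matcher(filename).matches())
                    && (includeMimeType == null || includeMimeType.matcher(mimeType).matches());
        }
    }

One instance of this for metadata and one for content would then tell Tika, per
file, whether to pull metadata, content, or both in a single pass.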

I'm not sure I followed your comment on the extracted content being
transferred into a new FlowFile.  My thoughts were that the extracted
content would be inserted into a new, dedicated field, called for example,
"text", on *the same* FlowFile.  I imagine that for a lot of use-cases,
especially data ingestion into a search engine, the extracted attributes
*and* the extracted text must travel together as part of the ingested
document, with the original flowfile-content most likely getting dropped on
the way into the index.

I guess an alternative could be to have an option to represent the
extraction results as a new document, and an option to drop the original,
and an option to copy the original's attributes onto the new doc. Seems
rather complex.  I like the "in-place" extraction.
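
For illustration, here is a minimal sketch of the single-pass, in-place idea
using plain Tika; the helper and the "text" field handling are just my own
example, not the processor's actual code:

    // Sketch only: one Tika pass that yields both the metadata and the body text,
    // which the processor could then put on the same FlowFile (metadata as
    // attributes, body text in the dedicated "text" field).
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.helpers.DefaultHandler;

    public class SinglePassExtraction {

        public static class Result {
            public final Metadata metadata;
            public final String text;
            Result(Metadata metadata, String text) { this.metadata = metadata; this.text = text; }
        }

        public static Result extract(InputStream in, boolean wantContent) throws Exception {
            Metadata metadata = new Metadata();
            // A DefaultHandler discards the body but still lets Tika populate the
            // metadata, so "metadata only" costs a single pass as well.
            ContentHandler handler = wantContent ? new BodyContentHandler(-1) : new DefaultHandler();
            new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
            return new Result(metadata, wantContent ? handler.toString() : "");
        }
    }

The processor would then copy the returned metadata onto the same FlowFile as
attributes and, when content is wanted, drop the extracted text into the
dedicated "text" field.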

Could you also elaborate on how a controller service would handle OCR?
When a document floats into ExtractMediaAttributes, assuming Tesseract is
installed properly, Tika will already automatically fire off OCR.  Unless
we turn that off and cause OCR to only be supported via this service.  I'm
tempted to say why don't we just let Tika do its job for all cases, OCR
included.  Caveat being that OCR is expensive and it would be nice to have
ways of ensuring it has enough resources and doesn't bog the flow down.
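
If we do let Tika keep doing the OCR, I imagine the knobs we discussed
(Tesseract location, a maximum file size, a timeout) being passed through
roughly like this; a sketch assuming Tika 1.x's TesseractOCRConfig, with
made-up values:

    // Sketch: passing OCR limits through to Tika via a ParseContext.
    // Only the knobs discussed in this thread are shown; the values are made up.
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.ocr.TesseractOCRConfig;

    public class OcrSettings {

        public static ParseContext ocrLimitedContext(String tesseractPath, int maxBytesToOcr) {
            TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
            ocrConfig.setTesseractPath(tesseractPath);    // where the Tesseract binary lives
            ocrConfig.setMaxFileSizeToOcr(maxBytesToOcr); // skip OCR for anything larger
            ocrConfig.setTimeout(120);                    // seconds, so OCR can't bog the flow down forever

            ParseContext context = new ParseContext();
            context.set(TesseractOCRConfig.class, ocrConfig);
            return context;
        }
    }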

For the PDF processor, I'm thinking, yes, PDFBox to break it up into pages
and then apply Tika page by page, then aggregate the output together, with
a configurable max of up to N pages per document to process (due to how
slow OCR is).  I already have a prototype of this going; I'll file a JIRA
ticket for this feature.
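
Roughly, the prototype amounts to something like the following; this is a
simplified sketch assuming PDFBox 2.x plus Tika, not the actual code:

    // Rough sketch of the page-by-page idea: split the PDF with PDFBox, run Tika
    // (with OCR configured via the ParseContext) on each page, stop after maxPages,
    // and concatenate the extracted text.
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.util.List;
    import org.apache.pdfbox.multipdf.Splitter;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class PagedPdfText {

        public static String extract(InputStream pdf, int maxPages, ParseContext ocrContext) throws Exception {
            StringBuilder text = new StringBuilder();
            try (PDDocument document = PDDocument.load(pdf)) {
                List<PDDocument> pages = new Splitter().split(document);
                int processed = 0;
                for (PDDocument page : pages) {
                    if (processed++ >= maxPages) { // cap the per-document work; OCR is slow
                        page.close();
                        continue;
                    }
                    ByteArrayOutputStream pageBytes = new ByteArrayOutputStream();
                    page.save(pageBytes);
                    page.close();

                    BodyContentHandler handler = new BodyContentHandler(-1);
                    new AutoDetectParser().parse(new ByteArrayInputStream(pageBytes.toByteArray()),
                            handler, new Metadata(), ocrContext);
                    text.append(handler.toString()).append('\n');
                }
            }
            return text.toString();
        }
    }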

- Dmitry



On Thu, Mar 31, 2016 at 8:43 PM, Simon Ball  wrote:

> What I’m suggesting is a single processor for both, but instead of using a
> mode property to determine which bits get extracted, you use the state of
> the relations on the processor to configure which options tika uses and
> using a single pass to actually parse metadata into attributes, and content
> into a new flow file transfer into the parsed relation.
>
> On the tesseract front, it may make sense to do this through a controller
> service.
>
> A PDF processor might be interesting. Are you thinking of something like
> PDFBox, or tika again?
>
> Simon
>
>
> > On 1 Apr 2016, at 01:30, Dmitry Goldenberg 
> wrote:
> >
> > Simon,
> >
> > Interesting commentary.  The issue that Joe and I have both looked at,
> with
> > the splitting of metadata and content extraction, is that if they're
> split
> > then the underlying Tika extraction has to process the file twice: once
> to
> > pull out the attributes and once to pull out the content.  Perhaps it may
> > be good to add ExtractMetadata and ExtractTextContent in addition to
> > ExtractMediaAttributes - ? Seems kind of an overkill but I may be wrong.
> >
> > It seems prudent to provide one wholesome, out-of-the-box extractor
> > processor with options to extract just metadata, just content, or both
> > metadata and content.
> >
> > I think what I'm hearing is that we need to allow for checking somewhere
> > for whether text/content has already been extracted by the time we get to
> > the ExtractMediaAttributes processor - ?  If that is the issue then I
> > believe the user would use RouteOnAttribute and if the content is already
> > filled in then they'd not route to ExtractMediaAttributes.
> >
> > As far as the OCR.  Tika internally supports OCR by directing image files
> > to Tesseract (if Tesseract is installed and configured properly).  We've
> > started talking about how this could be reconciled in the
> > ExtractMediaAttributes.
> >
> > I think that once we have the basic ExtractMediaAttributes, we could add
> > filters for what files to enable the OCR on, and we'd need to expose a
> few
> > config parameters specific to OCR, such as e.g. the location of the
> > Tesseract installation and the maximum file size on which to attempt the
> > OCR.  Perhaps there can also be a RunOCR processor which would be
> dedicated
> > to running OCR.  But since Tika already has OCR integrated we'd probably
> > want to take care of that in the ExtractMediaAttributes configuration.
> >
> > Additionally

Re: Text and metadata extraction processor

2016-03-31 Thread Simon Ball
What I’m suggesting is a single processor for both, but instead of using a mode
property to determine which bits get extracted, you use the state of the
relations on the processor to configure which options Tika uses, making a
single pass that parses metadata into attributes and content into a new
flow file transferred to the parsed relation.
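
Roughly what I have in mind, as a sketch; the relationship names and the class
itself are illustrative only, not an actual processor:

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;

    // Sketch of the relationship-driven idea (names are illustrative only).
    public class RelationshipDrivenTikaSketch extends AbstractProcessor {

        static final Relationship REL_ORIGINAL = new Relationship.Builder()
                .name("original")
                .description("Original flow file, with Tika metadata added as attributes")
                .build();

        static final Relationship REL_EXTRACTED = new Relationship.Builder()
                .name("extracted")
                .description("New flow file carrying the extracted text content")
                .build();

        @Override
        public Set<Relationship> getRelationships() {
            Set<Relationship> rels = new HashSet<>();
            rels.add(REL_ORIGINAL);
            rels.add(REL_EXTRACTED);
            return rels;
        }

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
            FlowFile original = session.get();
            if (original == null) {
                return;
            }

            // Only pay for content extraction if something is connected to "extracted".
            final boolean extractContent = context.hasConnection(REL_EXTRACTED);

            // ... single Tika pass goes here: metadata -> attributes on 'original';
            // if extractContent, write the body text to a new flow file and
            // transfer it to REL_EXTRACTED ...

            session.transfer(original, REL_ORIGINAL);
        }
    }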

On the tesseract front, it may make sense to do this through a controller 
service. 

A PDF processor might be interesting. Are you thinking of something like 
PDFBox, or tika again?

Simon


> On 1 Apr 2016, at 01:30, Dmitry Goldenberg  wrote:
> 
> Simon,
> 
> Interesting commentary.  The issue that Joe and I have both looked at, with
> the splitting of metadata and content extraction, is that if they're split
> then the underlying Tika extraction has to process the file twice: once to
> pull out the attributes and once to pull out the content.  Perhaps it may
> be good to add ExtractMetadata and ExtractTextContent in addition to
> ExtractMediaAttributes - ? Seems kind of an overkill but I may be wrong.
> 
> It seems prudent to provide one wholesome, out-of-the-box extractor
> processor with options to extract just metadata, just content, or both
> metadata and content.
> 
> I think what I'm hearing is that we need to allow for checking somewhere
> for whether text/content has already been extracted by the time we get to
> the ExtractMediaAttributes processor - ?  If that is the issue then I
> believe the user would use RouteOnAttribute and if the content is already
> filled in then they'd not route to ExtractMediaAttributes.
> 
> As far as the OCR.  Tika internally supports OCR by directing image files
> to Tesseract (if Tesseract is installed and configured properly).  We've
> started talking about how this could be reconciled in the
> ExtractMediaAttributes.
> 
> I think that once we have the basic ExtractMediaAttributes, we could add
> filters for what files to enable the OCR on, and we'd need to expose a few
> config parameters specific to OCR, such as e.g. the location of the
> Tesseract installation and the maximum file size on which to attempt the
> OCR.  Perhaps there can also be a RunOCR processor which would be dedicated
> to running OCR.  But since Tika already has OCR integrated we'd probably
> want to take care of that in the ExtractMediaAttributes configuration.
> 
> Additionally, I've proposed the idea of a ProcessPDF processor which would
> ascertain whether a PDF is 'text' or 'scanned'. If scanned, we would break
> it up into pages and run OCR on each page, then aggregate the extracted
> text.
> 
> - Dmitry
> 
> 
> 
> On Thu, Mar 31, 2016 at 3:19 PM, Simon Ball  wrote:
> 
>> Just a thought…
>> 
>> To keep consistent with other Nifi Parse patterns, would it make sense to
>> based the extraction of content on the presence of a relation. So your tika
>> processor would have an original relation which would have meta data
>> attached as attributed, and an extracted relation which would have the
>> metadata and the processed content (text from OCRed image for example).
>> That way you can just use context.hasConnection(relationship) to determine
>> whether to enable the tika content processing.
>> 
>> This seems more idiomatic than a mode flag.
>> 
>> Simon
>> 
>>> On 31 Mar 2016, at 19:48, Joe Skora  wrote:
>>> 
>>> Dmitry,
>>> 
>>> I think we're good.  I was confused because "XXX_METADATA MIMETYPE
>> FILTER"
>>> entries referred to some MIME type of the metadata, but you meant to use
>>> the file's MIME type to select what files have metadata extracted.
>>> 
>>> Sorry, about that, I think we are on the same page.
>>> 
>>> Joe
>>> 
>>> On Thu, Mar 31, 2016 at 11:40 AM, Dmitry Goldenberg <
>>> dgoldenb...@hexastax.com> wrote:
>>> 
 Hi Joe,
 
 I think if we have the filters in place then there's no need for the
>> 'mode'
 enum, as the filters themselves guide the processor in deciding whether
 metadata and/or content is extracted for a given input file.
 
 Agreed on the handling of archives as a separate processor (template,
>> seems
 like).
 
 I think it's easiest to do both metadata and/or content in one processor
 since it can tell Tika whether to extract metadata and/or content, in
>> one
 pass over the file bytes (as you pointed out).
 
 Agreed on the exclusions trumping inclusions; I think that makes sense.
 
>> We will only have a mimetype for the original flow file itself so I'm
 not sure about the metadata mimetype filter.
 
 I'm not sure where there might be an issue here. The metadata MIME type
 filter tells the processor for which MIME types to perform the metadata
 extraction.  For instance, extract metadata for images and videos, only.
 This could possibly be coupled with an exclusion filter for content that
 says, don't try to extract content from images and videos.
 
 I think with the six filters we get all the bases covered:
>

Re: Text and metadata extraction processor

2016-03-31 Thread Dmitry Goldenberg
Simon,

Interesting commentary.  The issue that Joe and I have both looked at, with
the splitting of metadata and content extraction, is that if they're split
then the underlying Tika extraction has to process the file twice: once to
pull out the attributes and once to pull out the content.  Perhaps it would
be good to add ExtractMetadata and ExtractTextContent in addition to
ExtractMediaAttributes? It seems like overkill, but I may be wrong.

It seems prudent to provide one wholesome, out-of-the-box extractor
processor with options to extract just metadata, just content, or both
metadata and content.

I think what I'm hearing is that we need to allow for checking somewhere
for whether text/content has already been extracted by the time we get to
the ExtractMediaAttributes processor?  If that is the issue, then I
believe the user would use RouteOnAttribute and if the content is already
filled in then they'd not route to ExtractMediaAttributes.

As for OCR: Tika internally supports OCR by directing image files
to Tesseract (if Tesseract is installed and configured properly).  We've
started talking about how this could be reconciled in the
ExtractMediaAttributes.

I think that once we have the basic ExtractMediaAttributes, we could add
filters for what files to enable the OCR on, and we'd need to expose a few
config parameters specific to OCR, such as e.g. the location of the
Tesseract installation and the maximum file size on which to attempt the
OCR.  Perhaps there can also be a RunOCR processor which would be dedicated
to running OCR.  But since Tika already has OCR integrated we'd probably
want to take care of that in the ExtractMediaAttributes configuration.

Additionally, I've proposed the idea of a ProcessPDF processor which would
ascertain whether a PDF is 'text' or 'scanned'. If scanned, we would break
it up into pages and run OCR on each page, then aggregate the extracted
text.

- Dmitry



On Thu, Mar 31, 2016 at 3:19 PM, Simon Ball  wrote:

> Just a thought…
>
> To keep consistent with other Nifi Parse patterns, would it make sense to
> based the extraction of content on the presence of a relation. So your tika
> processor would have an original relation which would have meta data
> attached as attributed, and an extracted relation which would have the
> metadata and the processed content (text from OCRed image for example).
> That way you can just use context.hasConnection(relationship) to determine
> whether to enable the tika content processing.
>
> This seems more idiomatic than a mode flag.
>
> Simon
>
> > On 31 Mar 2016, at 19:48, Joe Skora  wrote:
> >
> > Dmitry,
> >
> > I think we're good.  I was confused because "XXX_METADATA MIMETYPE
> FILTER"
> > entries referred to some MIME type of the metadata, but you meant to use
> > the file's MIME type to select what files have metadata extracted.
> >
> > Sorry, about that, I think we are on the same page.
> >
> > Joe
> >
> > On Thu, Mar 31, 2016 at 11:40 AM, Dmitry Goldenberg <
> > dgoldenb...@hexastax.com> wrote:
> >
> >> Hi Joe,
> >>
> >> I think if we have the filters in place then there's no need for the
> 'mode'
> >> enum, as the filters themselves guide the processor in deciding whether
> >> metadata and/or content is extracted for a given input file.
> >>
> >> Agreed on the handling of archives as a separate processor (template,
> seems
> >> like).
> >>
> >> I think it's easiest to do both metadata and/or content in one processor
> >> since it can tell Tika whether to extract metadata and/or content, in
> one
> >> pass over the file bytes (as you pointed out).
> >>
> >> Agreed on the exclusions trumping inclusions; I think that makes sense.
> >>
>  We will only have a mimetype for the original flow file itself so I'm
> >> not sure about the metadata mimetype filter.
> >>
> >> I'm not sure where there might be an issue here. The metadata MIME type
> >> filter tells the processor for which MIME types to perform the metadata
> >> extraction.  For instance, extract metadata for images and videos, only.
> >> This could possibly be coupled with an exclusion filter for content that
> >> says, don't try to extract content from images and videos.
> >>
> >> I think with the six filters we get all the bases covered:
> >>
> >>   1. include metadata? --
> >>  1. yes --
> >> 1. determine the inclusion of metadata by filename pattern
> >> 2. determine the inclusion of metadata by MIME type pattern
> >>  2. no --
> >> 1. determine the exclusion of metadata by filename pattern
> >> 2. determine the exclusion of metadata by MIME type pattern
> >>  2. include content? --
> >>  1. yes --
> >> 1. determine the inclusion of content by filename pattern
> >> 2. determine the inclusion of content by MIME type pattern
> >>  2. no --
> >> 1. determine the exclusion of content by filename pattern
> >> 2. determine the exclusion of content by MIME type pattern
> >>
>

[GitHub] nifi-minifi pull request: MINIFI-9 initial commit for boostrapping...

2016-03-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nifi-minifi/pull/4




Re: Zookeeper issues mentioned in a talk about storm / heron

2016-03-31 Thread Sean Busbey
HBase has also had issues with ZK at modest (~couple hundred node)
scale when using it to act as an intermediary for heartbeats and to do
work assignment.

On Thu, Mar 31, 2016 at 4:33 PM, Tony Kurc  wrote:
> My commentary that didn't accompany this - it appears Storm was using
> zookeeper in a similar way as the road we're heading down, and it was such
> a major bottleneck that they moved key value storage and heartbeating out
> into separate services, and then re-engineering (i.e. built Heron). Before
> we get too dependent on zookeeper, may be worth learning some lessons from
> the crew that built Heron or from a team that learned zookeeper lessons
> scale like accumulo.
>
> On Thu, Mar 24, 2016 at 6:22 PM, Tony Kurc  wrote:
>
>> I mentioned slides I saw at the meetup about zookeeper perils at scale in
>> storm, here are slides, i couldn't find a video after some limited
>> searching.
>> https://qconsf.com/system/files/presentation-slides/heron-qcon-2015.pdf
>>


Re: Zookeeper issues mentioned in a talk about storm / heron

2016-03-31 Thread Tony Kurc
My commentary that didn't accompany this: it appears Storm was using
ZooKeeper in a similar way to the road we're heading down, and it was such
a major bottleneck that they moved key-value storage and heartbeating out
into separate services and then re-engineered the whole thing (i.e. built
Heron). Before we get too dependent on ZooKeeper, it may be worth learning
some lessons from the crew that built Heron, or from a team that learned
ZooKeeper lessons at scale, like Accumulo.

On Thu, Mar 24, 2016 at 6:22 PM, Tony Kurc  wrote:

> I mentioned slides I saw at the meetup about zookeeper perils at scale in
> storm, here are slides, i couldn't find a video after some limited
> searching.
> https://qconsf.com/system/files/presentation-slides/heron-qcon-2015.pdf
>


[GitHub] nifi-minifi pull request: MINIFI-9 initial commit for boostrapping...

2016-03-31 Thread JPercivall
GitHub user JPercivall opened a pull request:

https://github.com/apache/nifi-minifi/pull/4

MINIFI-9 initial commit for boostrapping/init process

An initial commit for the minifi-bootstrap. This is a repurposed
nifi-bootstrap. It properly brings in the correct libs to start up but of
course relies on a MiNiFi.java class to start.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/JPercivall/nifi-minifi MINIFI-9

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi-minifi/pull/4.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4


commit d492288bf8a800093367fe635e878925c17b36c9
Author: Joseph Percivall 
Date:   2016-03-31T21:31:12Z

MINIFI-9 initial commit for boostrapping/init process






Re: Question about NiFi HA

2016-03-31 Thread Joe Witt
Chris,

Absolutely: 
https://cwiki.apache.org/confluence/display/NIFI/High+Availability+Processing

Thanks
Joe

On Thu, Mar 31, 2016 at 5:06 PM, McDermott, Chris Kevin (MSDU -
STaTS/StorefrontRemote)  wrote:
> Hey, guys and gals.  I’m pretty new to NiFi so its quite possible, if not 
> likely, that I am missing something.   That said, would someone please 
> comment on this observation. conclusion and and questions.
>
> My observation is that FlowFiles get queued in local queues on individual 
> nodes and those flow files and queues are not replicated to other nodes.  So 
> its my conclusion that if a node is lost, the files queued to that node are 
> also lost (or at least stuck there until if and when the node is recovered.)
>
> While I understand that there might be ways to work around this, by using 
> separate NiFi clusters and remote process groups, and duplicate the flow 
> processing, is there anything on the road map that would address this in NiFi 
> itself? Storing the queues and flow flies on a distributed file system like 
> HDFS, comes to mind, so that any “local” queue is distributed to more than 
> one node.
>
> Thanks in advance,
>
> Chris


Question about NiFi HA

2016-03-31 Thread McDermott, Chris Kevin (MSDU - STaTS/StorefrontRemote)
Hey, guys and gals.  I’m pretty new to NiFi, so it's quite possible, if not
likely, that I am missing something.  That said, would someone please comment
on this observation, conclusion, and the questions below.

My observation is that FlowFiles get queued in local queues on individual nodes,
and those flow files and queues are not replicated to other nodes.  So it's my
conclusion that if a node is lost, the files queued to that node are also lost
(or at least stuck there until the node is recovered).

While I understand that there might be ways to work around this, by using
separate NiFi clusters and remote process groups and duplicating the flow
processing, is there anything on the road map that would address this in NiFi
itself? Storing the queues and flow files on a distributed file system like
HDFS comes to mind, so that any “local” queue is distributed to more than one
node.

Thanks in advance,

Chris


[GitHub] nifi pull request: NIFI-1678

2016-03-31 Thread markap14
GitHub user markap14 opened a pull request:

https://github.com/apache/nifi/pull/317

NIFI-1678

Refactored NCM & Nodes so that nodes send heartbeats to ZooKeeper and NCM 
reads heartbeats from ZooKeeper.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/markap14/nifi NIFI-1678

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi/pull/317.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #317


commit 9bb8a9d8c09b92f05333203f9e4a70dae4383d8c
Author: Mark Payne 
Date:   2016-03-23T17:16:13Z

NIFI-483: Use ZooKeeper's Leader Election to determine Primary Node

commit 9b803c15c9d1573d6a7d9ae5009009fb328793d1
Author: Mark Payne 
Date:   2016-03-24T15:49:08Z

NIFI-1678: Started refactoring heartbeating mechanism, using a new package: 
org.apache.nifi.cluster.coordination






Re: [DISCUSS] git branching model

2016-03-31 Thread Matt Gilman
The majority consensus is to have master point to our 1.x baseline going
forward. Unless there are any strong objections I will set everything up on
Monday (4/4) morning.

- Create a 0.x branch for all future 0.x releases based on the current
state of master.
- Apply all 1.x commits from the temporary 1.x branch to master.
- Delete the temporary 1.x branch.
- Update the quickstart page and contribution guide to detail the
distinction between the 0.x and master branches.
- Send another email to @dev once this has been completed.

Reminder: Going forward once this has been completed all commits will need
to be made to both branches as appropriate.
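
For reference, this amounts to roughly the following git commands, assuming the
temporary branch is literally named 1.x and the remote is origin:

    # Sketch only -- exact steps depend on how the temporary 1.x branch was built.
    git checkout master
    git checkout -b 0.x              # new 0.x branch from the current state of master
    git push origin 0.x

    git checkout master
    git merge 1.x                    # bring the temporary 1.x commits onto master
    git push origin master

    git branch -d 1.x                # drop the temporary branch locally...
    git push origin --delete 1.x     # ...and on the remote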

Thanks!

Matt

On Tue, Mar 29, 2016 at 3:05 PM, Matt Gilman 
wrote:

> Matt,
>
> I agree that the PRs would need to be merged to both baselines at
> contribution time. If the contribution applies cleanly the reviewer could
> certainly handle the commit themselves. However, if additional code changes
> are required because the baselines have diverged, the contributor would
> probably need to submit another PR. This additional effort should only be
> necessary until we're able to perform the first 1.x release.
>
> Aldrin,
>
> I definitely understand your thoughts regarding (1) and (2). This is why I
> wanted to pose the options before just jumping into one approach vs the
> other. I personally prefer the GitHub style PR process. I realize this is
> more cumbersome but hopefully the number of conflicts should be small as
> folks are already starting to focus their efforts on the framework for 1.x.
>
> Matt
>
> On Tue, Mar 29, 2016 at 10:55 AM, Aldrin Piri 
> wrote:
>
>> I think I prefer option 2 considering, what may be the incorrect
>> assumption, that rebasing 1.x on 0.x / pushing into 1.x would be easier.
>> Based on outstanding PRs/Patches in conjunction with release cadence there
>> will be more 0.x releases planned. Until we reached the point where the
>> first 1.x release is in sight, I think (2) makes sense just from
>> minimizing
>> impedance where the majority of effort will occur (new/updated extensions)
>> and then switching to (1) when we are scheduling 1.x as next (exclusive of
>> any patch builds).  This seems to work out when I try to reason about it,
>> but admittedly, am coming at this heavily from my own anecdotal
>> perspective
>> given my flow of reviewing.
>>
>> Matt, excellent points to consider.
>>
>> Do not want to go too much on a tangent from the current conversation, but
>> I think we need to harness automation as much as possible.  Not sure
>> Travis
>> can do this or do so easily (short of two PRs) and this may arguably shift
>> things in favor of patches and the model that the other ASF projects
>> utilize with buildbot.  Getting as much done asynchronously for us is
>> obviously important but we also have to strive to avoid a contrib process
>> that is too cumbersome as well.
>>
>> On Tue, Mar 29, 2016 at 10:33 AM, Matt Burgess 
>> wrote:
>>
>> > I like option 1 as well.
>> >
>> > In the case where a fix is to be put into both branches, will the
>> developer
>> > be responsible for issuing 2 PRs / patches, one against each branch?
>> This
>> > would help in the case that the PR/patch against 0.x won't merge cleanly
>> > into master; however the reviewer(s) would need to make sure there were
>> no
>> > breaking changes as a result of the manual merge to master. An
>> alternative
>> > is that the reviewer(s) do the forward-port, which I don't think is a
>> good
>> > idea. However the reviewer would need to make sure the PR(s) are against
>> > the correct branch. For example, all current PRs would need to be
>> > "backported" to the new 0.x branch.
>> >
>> > Also, I would think the PRs/patches need to be merged at the same time
>> (or
>> > soon), to avoid regressions (i.e. a bug fix going into 0.x but getting
>> > forgotten/missed for 1.x).
>> >
>> > Thoughts? Thanks,
>> > Matt
>> >
>> > On Tue, Mar 29, 2016 at 10:26 AM, Joe Witt  wrote:
>> >
>> > > I too prefer option 1
>> > >
>> > > On Tue, Mar 29, 2016 at 8:21 AM, Brandon DeVries  wrote:
>> > > > I agree with Tony on  option 1.  I think it makes sense for master
>> to
>> > be
>> > > > the most "advanced" branch.  New features will then always be
>> applied
>> > to
>> > > > master, and optionally to other branches for older version support
>> as
>> > > > applicable / desired.
>> > > >
>> > > > On Tue, Mar 29, 2016 at 10:16 AM Tony Kurc 
>> wrote:
>> > > >
>> > > >> I like option 1
>> > > >> On Mar 29, 2016 10:03 AM, "Matt Gilman" 
>> > > wrote:
>> > > >>
>> > > >> > Hello,
>> > > >> >
>> > > >> > With NiFi 0.6.0 officially released and our support strategy
>> defined
>> > > [1],
>> > > >> > I'd like to revisit and propose some options for supporting both
>> a
>> > 1.x
>> > > >> > branch and 0.x branch concurrently. We need an official place
>> where
>> > > these
>> > > >> > efforts can be worked, contributed to, and collaborated with the
>> > > >> community.
>> > > >> > I've already cre

[GitHub] nifi pull request: NIFI-483: Use ZooKeeper Leader Election to Auto...

2016-03-31 Thread mcgilman
Github user mcgilman commented on the pull request:

https://github.com/apache/nifi/pull/301#issuecomment-204093602
  
Functionality looks good. Was able to stop nodes and have the primary node 
role automatically reassign to a new node. +1




Re: Text and metadata extraction processor

2016-03-31 Thread Simon Ball
Just a thought… 

To keep consistent with other NiFi parse patterns, would it make sense to base 
the extraction of content on the presence of a relation? So your Tika processor 
would have an original relation which would have metadata attached as 
attributes, and an extracted relation which would have the metadata and the 
processed content (text from an OCRed image, for example). That way you can just 
use context.hasConnection(relationship) to determine whether to enable the Tika 
content processing.

This seems more idiomatic than a mode flag. 

Simon

> On 31 Mar 2016, at 19:48, Joe Skora  wrote:
> 
> Dmitry,
> 
> I think we're good.  I was confused because "XXX_METADATA MIMETYPE FILTER"
> entries referred to some MIME type of the metadata, but you meant to use
> the file's MIME type to select what files have metadata extracted.
> 
> Sorry, about that, I think we are on the same page.
> 
> Joe
> 
> On Thu, Mar 31, 2016 at 11:40 AM, Dmitry Goldenberg <
> dgoldenb...@hexastax.com> wrote:
> 
>> Hi Joe,
>> 
>> I think if we have the filters in place then there's no need for the 'mode'
>> enum, as the filters themselves guide the processor in deciding whether
>> metadata and/or content is extracted for a given input file.
>> 
>> Agreed on the handling of archives as a separate processor (template, seems
>> like).
>> 
>> I think it's easiest to do both metadata and/or content in one processor
>> since it can tell Tika whether to extract metadata and/or content, in one
>> pass over the file bytes (as you pointed out).
>> 
>> Agreed on the exclusions trumping inclusions; I think that makes sense.
>> 
 We will only have a mimetype for the original flow file itself so I'm
>> not sure about the metadata mimetype filter.
>> 
>> I'm not sure where there might be an issue here. The metadata MIME type
>> filter tells the processor for which MIME types to perform the metadata
>> extraction.  For instance, extract metadata for images and videos, only.
>> This could possibly be coupled with an exclusion filter for content that
>> says, don't try to extract content from images and videos.
>> 
>> I think with the six filters we get all the bases covered:
>> 
>>   1. include metadata? --
>>  1. yes --
>> 1. determine the inclusion of metadata by filename pattern
>> 2. determine the inclusion of metadata by MIME type pattern
>>  2. no --
>> 1. determine the exclusion of metadata by filename pattern
>> 2. determine the exclusion of metadata by MIME type pattern
>>  2. include content? --
>>  1. yes --
>> 1. determine the inclusion of content by filename pattern
>> 2. determine the inclusion of content by MIME type pattern
>>  2. no --
>> 1. determine the exclusion of content by filename pattern
>> 2. determine the exclusion of content by MIME type pattern
>> 
>> Does this work?
>> 
>> Thanks,
>> - Dmitry
>> 
>> 
>> On Thu, Mar 31, 2016 at 9:27 AM, Joe Skora  wrote:
>> 
>>> Dmitry,
>>> 
>>> Looking at this and your prior email.
>>> 
>>> 
>>>   1. I can see "extract metadata only" being as popular as "extract
>>>   metadata and content".  It will all depend on the type of media, for
>>>   audio/video files adding the metadata to the flow file is enough but
>> for
>>>   Word, PDF, etc. files the content may be wanted as well.
>>>   2. After thinking about it, I agree on an enum for mode.
>>>   3. I think any handling of zips or archive files should be handled by
>>>   another processor, that keeps this processor cleaner and improves its
>>>   ability for re-use.
>>>   4. I like the addition of exclude filters but I'm not sure about
>> adding
>>>   content filters.  We will only have a mimetype for the original flow
>>> file
>>>   itself so I'm not sure about the metadata mimetype filter.  I think
>>> content
>>>   filtering may be best left for another downstream processor, but it
>>> might
>>>   be run faster if included here since the entire content will be
>> handled
>>>   during extraction.  If the content filters are implemented, for
>>> performance
>>>   they need to short circuit so that if the property is not set or is
>> set
>>> to
>>>   ".*" they don't evaluate the regex.
>>>   1. FILENAME_FILTER - selects flow files to process based on filename
>>>  matching regex. (exists)
>>>  2. MIMETYPE_FILTER - selects flow files to process based on
>> mimetype
>>>  matching regex. (exists)
>>>  3. FILENAME_EXCLUDE - excludes already selected flow files from
>>>  processing based on filename matching regex. (new)
>>>  4. MIMETYPE_EXCLUDE - excludes already selected flow  files from
>>>  processing based on mimetype matching regex. (new)
>>>  5. CONTENT_FILTER (optional) - selects flow files for output based
>> on
>>>  extracted content matching regex. (new)
>>>  6. CONTENT_EXCLUDE (optional) - excludes flow files from output
>> based
>>>  on extracted content matching regex. (new)
>>>   5. As indica

Re: Can't connect to Secure HBase cluster

2016-03-31 Thread Bryan Bende
From doing some Googling it seems like the problem is similar to the one
described here where Hive could no longer talk to HBase after installing
Phoenix:
https://community.hortonworks.com/questions/1652/how-can-i-query-hbase-from-hive.html

The solution in that scenario was to add a phoenix jar to Hive's classpath,
which makes me think we would somehow have to make a phoenix jar available
on NiFi's classpath for the HBase Client Service.

I don't know enough about Phoenix to say for sure, but I created this JIRA
to capture the issue:
https://issues.apache.org/jira/browse/NIFI-1712


On Thu, Mar 31, 2016 at 2:28 PM, Guillaume Pool  wrote:

> Hi,
>
>
>
> Yes, here it is
>
>
>
> <configuration>
>   <property><name>fs.defaultFS</name><value>hdfs://supergrpcluster</value></property>
>   <property><name>fs.trash.interval</name><value>360</value></property>
>   <property><name>ha.failover-controller.active-standby-elector.zk.op.retries</name><value>120</value></property>
>   <property><name>ha.zookeeper.quorum</name><value>sv-htndp2.hdp.supergrp.net:2181,sv-htndp1.hdp.supergrp.net:2181,sv-htndp3.hdp.supergrp.net:2181</value></property>
>   <property><name>hadoop.http.authentication.simple.anonymous.allowed</name><value>true</value></property>
>   <property><name>hadoop.proxyuser.admin.groups</name><value>*</value></property>
>   <property><name>hadoop.proxyuser.admin.hosts</name><value>*</value></property>
>   <property><name>hadoop.proxyuser.hcat.groups</name><value>users</value></property>
>   <property><name>hadoop.proxyuser.hcat.hosts</name><value>sv-htnmn2.hdp.supergrp.net</value></property>
>   <property><name>hadoop.proxyuser.hdfs.groups</name><value>*</value></property>
>   <property><name>hadoop.proxyuser.hdfs.hosts</name><value>*</value></property>
>   <property><name>hadoop.proxyuser.hive.groups</name><value>*</value></property>
>   <property><name>hadoop.proxyuser.hive.hosts</name><value>sv-htnmn2.hdp.supergrp.net</value></property>
>   <property><name>hadoop.proxyuser.HTTP.groups</name><value>users</value></property>
>   <property><name>hadoop.proxyuser.HTTP.hosts</name><value>sv-htnmn2.hdp.supergrp.net</value></property>
>   <property><name>hadoop.proxyuser.knox.groups</name><value>users</value></property>
>   <property><name>hadoop.proxyuser.knox.hosts</name><value>sv-htncmn.hdp.supergrp.net</value></property>
>   <property><name>hadoop.proxyuser.oozie.groups</name><value>*</value></property>
>   <property><name>hadoop.proxyuser.oozie.hosts</name><value>sv-htncmn.hdp.supergrp.net</value></property>
>   <property><name>hadoop.proxyuser.root.groups</name><value>*</value></property>
>   <property><name>hadoop.proxyuser.root.hosts</name><value>*</value></property>
>   <property><name>hadoop.proxyuser.yarn.groups</name><value>*</value></property>
>   <property><name>hadoop.proxyuser.yarn.hosts</name><value>sv-htnmn1.hdp.supergrp.net</value></property>
>   <property>
>     <name>hadoop.security.auth_to_local</name>
>     <value>RULE:[1:$1@$0](ambari...@hdp.supergrp.net)s/.*/ambari-qa/
> RULE:[1:$1@$0](hb...@hdp.supergrp.net)s/.*/hbase/
> RULE:[1:$1@$0](h...@hdp.supergrp.net)s/.*/hdfs/
> RULE:[1:$1@$0](sp...@hdp.supergrp.net)s/.*/spark/
> RULE:[1:$1@$0](.*@HDP.SUPERGRP.NET)s/@.*//
> RULE:[2:$1@$0](amshb...@hdp.supergrp.net)s/.*/ams/
> RULE:[2:$1@$0](amshbasemas...@hdp.supergrp.net)s/.*/ams/
> RULE:[2:$1@$0](amshbas...@hdp.supergrp.net)s/.*/ams/
> RULE:[2:$1@$0](am...@hdp.supergrp.net)s/.*/ams/
> RULE:[2:$1@$0](d...@hdp.supergrp.net)s/.*/hdfs/
> RULE:[2:$1@$0](hb...@hdp.supergrp.net)s/.*/hbase/
> RULE:[2:$1@$0](h...@hdp.supergrp.net)s/.*/hive/
> RULE:[2:$1@$0](j...@hdp.supergrp.net)s/.*/mapred/
> RULE:[2:$1@$0](j...@hdp.supergrp.net)s/.*/hdfs/
> RULE:[2:$1@$0](k...@hdp.supergrp.net)s/.*/knox/
> RULE:[2:$1@$0](n...@hdp.supergrp.net)s/.*/yarn/
> RULE:[2:$1@$0](n...@hdp.supergrp.net)s/.*/hdfs/
> RULE:[2:$1@$0](oo...@hdp.supergrp.net)s/.*/oozie/
> RULE:[2:$1@$0](r...@hdp.supergrp.net)s/.*/yarn/
> RULE:[2:$1@$0](y...@hdp.supergrp.net)s/.*/yarn/
> DEFAULT</value>
>   </property>
>   <property><name>hadoop.security.authentication</name><value>kerberos</value></property>
>   <property><name>hadoop.security.authorization</name><value>true</value></property>
>   <property><name>hadoop.security.key.provider.path</name><value></value></property>
>   <property><name>io.compression.codecs</name><value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value></property>
>   <property><name>io.file.buffer.size</name><value>131072</value></property>
>   <property><name>io.serializations</name><value>org.apache.hadoop.io.serializer.WritableSerialization</value></property>
>   <property><name>ipc.client.connect.max.retries</name><value>50</value></property>
>   <property><name>ipc.client.connection.maxidletime</name><value>3</value></property>
>   <property><name>ipc.client.idlethreshold</name><value>8000</value></property>
>   <property><name>ipc.server.tcpnodelay</name><value>true</value></property>
>   <property><name>mapreduce.jobtracker.webinterface.trusted</name><value>false</value></property>
>   <property><name>net.topology.script.file.name</name><value>/etc/hadoop/conf/topology_script.py</value></property>
> </configuration>
>
>
>
> Thanks
>
>
>
> *From: *Jeff Lord 
> *Sent: *Thursday, 31 March

Re: Text and metadata extraction processor

2016-03-31 Thread Joe Skora
Dmitry,

I think we're good.  I was confused because "XXX_METADATA MIMETYPE FILTER"
entries referred to some MIME type of the metadata, but you meant to use
the file's MIME type to select what files have metadata extracted.

Sorry about that; I think we are on the same page.

Joe

On Thu, Mar 31, 2016 at 11:40 AM, Dmitry Goldenberg <
dgoldenb...@hexastax.com> wrote:

> Hi Joe,
>
> I think if we have the filters in place then there's no need for the 'mode'
> enum, as the filters themselves guide the processor in deciding whether
> metadata and/or content is extracted for a given input file.
>
> Agreed on the handling of archives as a separate processor (template, seems
> like).
>
> I think it's easiest to do both metadata and/or content in one processor
> since it can tell Tika whether to extract metadata and/or content, in one
> pass over the file bytes (as you pointed out).
>
> Agreed on the exclusions trumping inclusions; I think that makes sense.
>
> >> We will only have a mimetype for the original flow file itself so I'm
> not sure about the metadata mimetype filter.
>
> I'm not sure where there might be an issue here. The metadata MIME type
> filter tells the processor for which MIME types to perform the metadata
> extraction.  For instance, extract metadata for images and videos, only.
> This could possibly be coupled with an exclusion filter for content that
> says, don't try to extract content from images and videos.
>
> I think with the six filters we get all the bases covered:
>
>1. include metadata? --
>   1. yes --
>  1. determine the inclusion of metadata by filename pattern
>  2. determine the inclusion of metadata by MIME type pattern
>   2. no --
>  1. determine the exclusion of metadata by filename pattern
>  2. determine the exclusion of metadata by MIME type pattern
>   2. include content? --
>   1. yes --
>  1. determine the inclusion of content by filename pattern
>  2. determine the inclusion of content by MIME type pattern
>   2. no --
>  1. determine the exclusion of content by filename pattern
>  2. determine the exclusion of content by MIME type pattern
>
> Does this work?
>
> Thanks,
> - Dmitry
>
>
> On Thu, Mar 31, 2016 at 9:27 AM, Joe Skora  wrote:
>
> > Dmitry,
> >
> > Looking at this and your prior email.
> >
> >
> >1. I can see "extract metadata only" being as popular as "extract
> >metadata and content".  It will all depend on the type of media, for
> >audio/video files adding the metadata to the flow file is enough but
> for
> >Word, PDF, etc. files the content may be wanted as well.
> >2. After thinking about it, I agree on an enum for mode.
> >3. I think any handling of zips or archive files should be handled by
> >another processor, that keeps this processor cleaner and improves its
> >ability for re-use.
> >4. I like the addition of exclude filters but I'm not sure about
> adding
> >content filters.  We will only have a mimetype for the original flow
> > file
> >itself so I'm not sure about the metadata mimetype filter.  I think
> > content
> >filtering may be best left for another downstream processor, but it
> > might
> >be run faster if included here since the entire content will be
> handled
> >during extraction.  If the content filters are implemented, for
> > performance
> >they need to short circuit so that if the property is not set or is
> set
> > to
> >".*" they don't evaluate the regex.
> >1. FILENAME_FILTER - selects flow files to process based on filename
> >   matching regex. (exists)
> >   2. MIMETYPE_FILTER - selects flow files to process based on
> mimetype
> >   matching regex. (exists)
> >   3. FILENAME_EXCLUDE - excludes already selected flow files from
> >   processing based on filename matching regex. (new)
> >   4. MIMETYPE_EXCLUDE - excludes already selected flow  files from
> >   processing based on mimetype matching regex. (new)
> >   5. CONTENT_FILTER (optional) - selects flow files for output based
> on
> >   extracted content matching regex. (new)
> >   6. CONTENT_EXCLUDE (optional) - excludes flow files from output
> based
> >   on extracted content matching regex. (new)
> >5. As indicated in the descriptions in #4, I don't think overlapping
> >filters are an error, instead excludes should take precedence over
> >includes.  Then I can include a domain (like A*) but exclude sub-sets
> > (like
> >AXYZ*).
> >
> > I'm sure there's something we missed, but I think that covers most of it.
> >
> > Regards,
> > Joe
> >
> >
> > On Wed, Mar 30, 2016 at 1:56 PM, Dmitry Goldenberg <
> > dgoldenb...@hexastax.com
> > > wrote:
> >
> > > Joe,
> > >
> > > Upon some thinking, I've started wondering whether all the cases can be
> > > covered by the following filters:
> > >
> > > INCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which in

[GitHub] nifi pull request: NIFI-1710 Resolve path name to nifi.sh in start...

2016-03-31 Thread busbey
Github user busbey commented on the pull request:

https://github.com/apache/nifi/pull/315#issuecomment-204066394
  
Please do not attempt to use python as a substitute for OS shell scripts. 
It's a path to ruin; the mechanics of properly managing things on Bash-having 
systems and on Windows are fundamentally different (until we can just switch to 
bash-only for Windows 10).




[GitHub] nifi pull request: NIFI-1710 Resolve path name to nifi.sh in start...

2016-03-31 Thread arpitgupta
Github user arpitgupta commented on the pull request:

https://github.com/apache/nifi/pull/315#issuecomment-204061845
  
What about moving to Python 2.4 for these scripts? That way we get a unified 
file for Linux and Windows.




RE: Can't connect to Secure HBase cluster

2016-03-31 Thread Guillaume Pool
Hi,

The HBase_Client controller service can’t be enabled, so I can't even start that 
component.

Here is my hbase-site.xml

<configuration>
  <property><name>dfs.domain.socket.path</name><value>/var/lib/hadoop-hdfs/dn_socket</value></property>
  <property><name>hbase.bucketcache.ioengine</name><value>offheap</value></property>
  <property><name>hbase.bucketcache.percentage.in.combinedcache</name><value></value></property>
  <property><name>hbase.bucketcache.size</name><value>2048</value></property>
  <property><name>hbase.bulkload.staging.dir</name><value>/apps/hbase/staging</value></property>
  <property><name>hbase.client.keyvalue.maxsize</name><value>1048576</value></property>
  <property><name>hbase.client.retries.number</name><value>35</value></property>
  <property><name>hbase.client.scanner.caching</name><value>100</value></property>
  <property><name>hbase.cluster.distributed</name><value>true</value></property>
  <property><name>hbase.coprocessor.master.classes</name><value>org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor</value></property>
  <property><name>hbase.coprocessor.region.classes</name><value>org.apache.hadoop.hbase.security.token.TokenProvider,org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor</value></property>
  <property><name>hbase.coprocessor.regionserver.classes</name><value>org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor</value></property>
  <property><name>hbase.defaults.for.version.skip</name><value>true</value></property>
  <property><name>hbase.hregion.majorcompaction</name><value>60480</value></property>
  <property><name>hbase.hregion.majorcompaction.jitter</name><value>0.50</value></property>
  <property><name>hbase.hregion.max.filesize</name><value>10737418240</value></property>
  <property><name>hbase.hregion.memstore.block.multiplier</name><value>4</value></property>
  <property><name>hbase.hregion.memstore.flush.size</name><value>134217728</value></property>
  <property><name>hbase.hregion.memstore.mslab.enabled</name><value>true</value></property>
  <property><name>hbase.hstore.blockingStoreFiles</name><value>10</value></property>
  <property><name>hbase.hstore.compaction.max</name><value>10</value></property>
  <property><name>hbase.hstore.compactionThreshold</name><value>3</value></property>
  <property><name>hbase.local.dir</name><value>${hbase.tmp.dir}/local</value></property>
  <property><name>hbase.master.info.bindAddress</name><value>0.0.0.0</value></property>
  <property><name>hbase.master.info.port</name><value>16010</value></property>
  <property><name>hbase.master.kerberos.principal</name><value>hbase/_h...@hdp.supergrp.net</value></property>
  <property><name>hbase.master.keytab.file</name><value>/etc/security/keytabs/hbase.service.keytab</value></property>
  <property><name>hbase.master.port</name><value>16000</value></property>
  <property><name>hbase.region.server.rpc.scheduler.factory.class</name><value>org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory</value></property>
  <property><name>hbase.regionserver.global.memstore.size</name><value>0.4</value></property>
  <property><name>hbase.regionserver.handler.count</name><value>30</value></property>
  <property><name>hbase.regionserver.info.port</name><value>16030</value></property>
  <property><name>hbase.regionserver.kerberos.principal</name><value>hbase/_h...@hdp.supergrp.net</value></property>
  <property><name>hbase.regionserver.keytab.file</name><value>/etc/security/keytabs/hbase.service.keytab</value></property>
  <property><name>hbase.regionserver.port</name><value>16020</value></property>
  <property><name>hbase.regionserver.wal.codec</name><value>org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec</value></property>
  <property><name>hbase.rootdir</name><value>hdfs://hdpcluster/apps/hbase/data</value></property>
  <property><name>hbase.rpc.controllerfactory.class</name><value>org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory</value></property>
  <property><name>hbase.rpc.protection</name><value>authentication</value></property>
  <property><name>hbase.rpc.timeout</name><value>9</value></property>
  <property><name>hbase.security.authentication</name><value>kerberos</value></property>
  <property><name>hbase.security.authorization</name><value>true</value></property>
  <property><name>hbase.superuser</name><value>hbase</value></property>
  <property><name>hbase.tmp.dir</name><value>/hadoop/hbase</value></property>
  <property><name>hbase.zookeeper.property.clientPort</name><value>2181</value></property>
  <property><name>hbase.zookeeper.quorum</name><value>sv-htndp1,sv-htndp2,sv-htndp3</value></property>
  <property><name>hbase.zookeeper.useMulti</name><value>true</value></property>
  <property><name>hfile.block.cache.size</name><value>0.40</value></property>
  <property><name>phoenix.functions.allowUserDefinedFunctions</name><value>true</value></property>
  <property><name>phoenix.query.timeoutMs</name><value>6</value></property>
  <property><name>phoenix.queryserver.kerberos.principal</name><value>hbase/_h...@hdp.supergrp.net</value></property>
  <property><name>phoenix.queryserver.keytab.file</name><value>/etc/security/keytabs/hbase.service.keytab</value></property>
  <property><name>zookeeper.session.timeout</name><value>9</value></property>
  <property><name>zookeeper.znode.parent</name><value>/hbase-secure</value></property>
</configuration>
I use GetHBase and PutHBase processors in the flow but not even getting a 
connection from the controller service.

Thanks


From: Bryan Bende
Sent: Thursday, 31 March 2016 05:55 PM
To: Guillaume Pool; 
dev@nifi.apache.org
Subject: Re: Can't connect to Secure HBase cluster

Ok so it is not the Kerberos authentication that is causing the problem.

Would you be able to share a template of your flow?
If you are not familiar with templates, they are described here [1]. You
can paste the XML of the template on a gist [2] as an easy way to share it.

If you can't share a template, then can you tell us if anything else is
going on in your flow, any other processors being used?

Thanks,

Bryan

[1] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#templates
[2] https://gist.github.com/

On Thu, Mar 31, 2016 at 10:59 AM, Guillaume Pool  wrote:

> Hi,
>
>
>
> Yes, I can c

[GitHub] nifi pull request: NIFI-1711 Client-side JS for proxy-friendly URL...

2016-03-31 Thread jvwing
GitHub user jvwing opened a pull request:

https://github.com/apache/nifi/pull/316

NIFI-1711 Client-side JS for proxy-friendly URLs

This is a possible fix for generating proxy-friendly URLs in the content 
viewer using client-side Javascript rather than calculating the URLs 
server-side using the proxy headers.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jvwing/nifi 
NIFI-1711-content-viewer-format-urls

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi/pull/316.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #316


commit cb1fec4ecf883732e3e80eddb0a0bae99d7f
Author: James Wing 
Date:   2016-03-31T17:58:02Z

NIFI-1711 Client-side JS for proxy-friendly URLs






Variable FlowFile Attributes Defined by FlowFile Content

2016-03-31 Thread dale.chang13
I see that the ExtractText processor evaluates regular expressions against the
FlowFile's content and adds the results as user-defined attributes. However, I
was wondering if there is a way to avoid "hard-coding" these values. I was
hoping for something along the lines of the attribute keys themselves being
defined in the FlowFile content.



--
View this message in context: 
http://apache-nifi-developer-list.39713.n7.nabble.com/Variable-FlowFile-Attributes-Defined-by-FlowFile-Content-tp8682.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.


[GitHub] nifi pull request: NIFI-1710 Resolve path name to nifi.sh in start...

2016-03-31 Thread jfrazee
Github user jfrazee commented on the pull request:

https://github.com/apache/nifi/pull/315#issuecomment-204017042
  
@arpitgupta Ok, yeah, older versions of coreutils don't have it. I was 
doing this with `readlink` but BSD readlink doesn't have the same options as 
GNU. Looks like it might be necessary to push this down into detectOS(), if 
people want this sort of change at all.




Re: Import Kafka messages into Titan

2016-03-31 Thread idioma
Thanks once again Matt, but I wonder whether we can make it even easier.

GetKafka -> Custom Processor that will use the GraphSON Reader lib
(https://github.com/tinkerpop/blueprints/wiki/GraphSON-Reader-and-Writer-Library)
-> Custom PutTitan Processor that will insert the graph into Titan. Does it
actually sound reasonable? What are your thoughts? 

So, I will have two processors, in a separation-of-concerns fashion: one to
read JSON into GraphSON, and one that will only point to my processor's Module
Path property (how do you do that?) and insert the data.

Thanks,

Ilaria



--
View this message in context: 
http://apache-nifi-developer-list.39713.n7.nabble.com/Import-Kafka-messages-into-Titan-tp8647p8680.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.


Re: Import Kafka messages into Titan

2016-03-31 Thread Bryan Bende
For #2,  you can use templates to move the flow (or parts of it) to another
instance.
A possible approach is to organize the flow into process groups and create
a template per process group, making it potentially easier to update parts
of the flow independently.

This project might be helpful to look at in terms of automating deploying a
template from one instance to another:
https://github.com/aperepel/nifi-api-deploy

For properties that are environment specific, if the property supports
expression language, you can specify them in bootstrap.conf as -D
properties for each of your NiFi instances, and in your processors you can
reference them with Expression Language.
For example, in each bootstrap.conf there could be -Dkafka.topic=mytopic
and then in a PutKafka processor set the topic to ${kafka.topic}. This will
let your template be the same for each environment.
Unfortunately at a quick glance it looks like GetKafka topic name does not
support EL, which should probably be fixed to allow this.
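
For example (the kafka.topic property name and the java.arg index below are
made up for illustration), each environment's bootstrap.conf could carry its
own value:

    # bootstrap.conf on the dev instance (any unused java.arg index works)
    java.arg.15=-Dkafka.topic=orders-dev

    # bootstrap.conf on the prod instance
    java.arg.15=-Dkafka.topic=orders-prod

and the topic property in the shared template would simply be ${kafka.topic}.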

In the future there is a plan to have a variable registry exposed through
the UI so that you wouldn't have to edit the bootstrap file to define these
properties.


On Thu, Mar 31, 2016 at 11:58 AM, Matt Burgess  wrote:

> I'll let someone else have a go at question 2 :)
>
> If you're using ExecuteScript with Groovy, you don't need
> EvaluateJsonPath, Groovy has a JSONSlurper that works nicely (examples on
> my blog).
>
> To put directly into Titan you don't need to convert the format, instead
> you'll want Gremlin (part of Apache Tinkerpop), point your processor's
> Module Path property at a folder containing the Gremlin JARs, then you can
> create the vertices and edges using the approach in the Titan documentation.
>
> This would make an excellent blog post, perhaps I'll give this a try
> myself but please feel welcome to share anything you learn along the way!
> If I get some spare time I'd like to write a PutGraph processor that does
> pretty much what we've outlined here.
>
> Regards,
> Matt
>
> Sent from my iPhone
>
> > On Mar 31, 2016, at 10:15 AM, idioma  wrote:
> >
> > Matt, thank you for this this is brilliant. So, as it is I am thinking
> that I
> > would like to use the following:
> >
> > GetKafka -> EvaluateJsonPath -> ExecuteScript+Groovy Script
> >
> > My questions are two:
> >
> > 1) How do I import the Titan-compliant file into Titan? I guess I can
> modify
> > the script and load it into it.
> > 2) my second quest is more naive and proves the fact my background is
> more
> > on Apache Camel with very little knowledge of NiFi. In a versionControl
> > environment, how do you push a process flow created with NiFi that mostly
> > involves standard components? Do you write customized version where you
> set
> > of kafka properties, for example?
> >
> > Thanks
> >
> >
> >
> > --
> > View this message in context:
> http://apache-nifi-developer-list.39713.n7.nabble.com/Import-Kafka-messages-into-Titan-tp8647p8667.html
> > Sent from the Apache NiFi Developer List mailing list archive at
> Nabble.com.
>


Re: [DISCUSS] create apache nifi 0.6.1

2016-03-31 Thread Jennifer Barnabee
+1

On Thu, Mar 31, 2016 at 10:29 AM, Joe Witt  wrote:

> Team,
>
> I propose that we do an Apache NiFi 0.6.1 release.  There are a few
> important findings we've made and put JIRAs/PRs in for over the past
> couple of days.  The most concerning is that I believe we have a
> potential data loss issue introduced in PutKafka.  The general logic
> of that processor is far better now but the handling of stream
> demarcation is broken.  That is now fixed from NIFI-1701.
>
> I am happy to RM this if folks are in agreement.
>
> Thanks
> Joe
>


Re: [DISCUSS] create apache nifi 0.6.1

2016-03-31 Thread Tony Kurc
thanks aldrin.

On Thu, Mar 31, 2016 at 11:44 AM, Aldrin Piri  wrote:

> Updated the downloads page and pushed out changes with the language you
> provided above.
>
> On Thu, Mar 31, 2016 at 11:27 AM, Aldrin Piri 
> wrote:
>
> > Will take care of it.
> >
> > On Thu, Mar 31, 2016 at 11:20 AM, Joe Witt  wrote:
> >
> >> We should also update the downloads page to ensure folks are aware of
> >> this issue.
> >>
> >> I recommend phrasing such as:
> >> "Before downloading this version please see the release notes [link].
> >> There is a regression with PutKafka that can cause data loss when
> >> sending to Kafka.  We will have an 0.6.1 release very soon to address
> >> that finding."
> >>
> >> I can't do it myself right now but if someone else can that would be
> >> really helpful.
> >>
> >> Thanks
> >> Joe
> >>
> >> On Thu, Mar 31, 2016 at 10:57 AM, Oleg Zhurakousky
> >>  wrote:
> >> > +1
> >> >> On Mar 31, 2016, at 10:54 AM, Joe Witt  wrote:
> >> >>
> >> >> I have updated the release notes to reflect that we will produce an
> >> >> 0.6.1 to correct the PutKafka problem
> >> >>
> >> >> https://cwiki.apache.org/confluence/display/NIFI/Release+Notes
> >> >>
> >> >> I will start the RM work on this assuming lazy consensus though
> >> >> clearly there is a good sign of consensus already
> >> >>
> >> >> On Thu, Mar 31, 2016 at 10:53 AM, Tony Kurc 
> wrote:
> >> >>> +1
> >> >>>
> >> >>> On Thu, Mar 31, 2016 at 10:44 AM, Mark Payne 
> >> wrote:
> >> >>>
> >>  +1 - definitely agree that it warrants a patch release.
> >> 
> >> > On Mar 31, 2016, at 10:29 AM, Joe Witt 
> wrote:
> >> >
> >> > Team,
> >> >
> >> > I propose that we do an Apache NiFi 0.6.1 release.  There are a
> few
> >> > important findings we've made and put JIRAs/PRs in for over the
> past
> >> > couple of days.  The most concerning is that I believe we have a
> >> > potential data loss issue introduced in PutKafka.  The general
> logic
> >> > of that processor is far better now but the handling of stream
> >> > demarcation is broken.  That is now fixed from NIFI-1701.
> >> >
> >> > I am happy to RM this if folks are in agreement.
> >> >
> >> > Thanks
> >> > Joe
> >> 
> >> 
> >> >>
> >> >
> >>
> >
> >
>


[GitHub] nifi pull request: NIFI-1710 Resolve path name to nifi.sh in start...

2016-03-31 Thread arpitgupta
Github user arpitgupta commented on the pull request:

https://github.com/apache/nifi/pull/315#issuecomment-204001765
  
I can't find packages for CentOS 6 that will install this from the OS repos, 
EPEL repos, etc. I have not checked other OSes. I upgraded coreutils and still 
did not get this command.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: Import Kafka messages into Titan

2016-03-31 Thread Matt Burgess
I'll let someone else have a go at question 2 :)

If you're using ExecuteScript with Groovy, you don't need EvaluateJsonPath; 
Groovy has a JsonSlurper that works nicely (examples on my blog).

To put data directly into Titan you don't need to convert the format; instead 
you'll want Gremlin (part of Apache Tinkerpop). Point your processor's Module 
Path property at a folder containing the Gremlin JARs, and you can then create 
the vertices and edges using the approach in the Titan documentation.
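
As a very rough, untested sketch of what that script could look like (the 
attribute name below is made up, and the actual Gremlin/Titan calls are left 
as a comment since they depend entirely on your graph schema):

import org.apache.nifi.processor.io.InputStreamCallback
import groovy.json.JsonSlurper

def flowFile = session.get()
if (flowFile == null) return

def json = null
session.read(flowFile, { inputStream ->
    json = new JsonSlurper().parse(inputStream)
} as InputStreamCallback)

// 'json' is now a Map (or List) built from the FlowFile content.
// With the Gremlin JARs on the Module Path you would open the graph here and
// add vertices/edges per the Titan documentation for your schema.

flowFile = session.putAttribute(flowFile, 'record.count',
        String.valueOf(json instanceof List ? json.size() : 1))
session.transfer(flowFile, REL_SUCCESS)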

This would make an excellent blog post, perhaps I'll give this a try myself but 
please feel welcome to share anything you learn along the way! If I get some 
spare time I'd like to write a PutGraph processor that does pretty much what 
we've outlined here.

Regards,
Matt

Sent from my iPhone

> On Mar 31, 2016, at 10:15 AM, idioma  wrote:
> 
> Matt, thank you for this, it is brilliant. So, as it is I am thinking that I
> would like to use the following:
> 
> GetKafka -> EvaluateJsonPath -> ExecuteScript+Groovy Script
> 
> I have two questions:
> 
> 1) How do I import the Titan-compliant file into Titan? I guess I can modify
> the script and load it into it.
> 2) My second question is more naive and reflects the fact that my background
> is more in Apache Camel, with very little knowledge of NiFi. In a
> version-control environment, how do you push a process flow created with
> NiFi that mostly involves standard components? Do you write a customized
> version where you set the Kafka properties, for example?
> 
> Thanks
> 
> 
> 
> --
> View this message in context: 
> http://apache-nifi-developer-list.39713.n7.nabble.com/Import-Kafka-messages-into-Titan-tp8647p8667.html
> Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.


Re: Can't connect to Secure HBase cluster

2016-03-31 Thread Bryan Bende
Ok so it is not the Kerberos authentication that is causing the problem.

Would you be able to share a template of your flow?
If you are not familiar with templates, they are described here [1]. You
can paste the XML of the template on a gist [2] as an easy way to share it.

If you can't share a template, then can you tell us if anything else is
going on in your flow, any other processors being used?

Thanks,

Bryan

[1] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#templates
[2] https://gist.github.com/

On Thu, Mar 31, 2016 at 10:59 AM, Guillaume Pool  wrote:

> Hi,
>
>
>
> Yes, I can connect using that user.
>
>
>
> Had to test it on HBase master as HBase not installed on NiFi server.
>
>
>
> Regards
>
> Guillaume
>
>
>
> Sent from Mail  for
> Windows 10
>
>
>
> *From: *Bryan Bende 
> *Sent: *Thursday, 31 March 2016 03:12 PM
> *To: *us...@nifi.apache.org
> *Subject: *Re: Can't connect to Secure HBase cluster
>
>
> Hello,
>
> In order to narrow down the problem, can you connect to the Hbase shell
> from the command line using the same keytab and principal?
>
> kinit -kt /app/env/nifi.keytab  n...@hdp.supergrp.net
> hbase shell
>
> Then scan a table or some operation. If that all works, then we need to
> find out why you are getting UnsupportedOperationException:
> org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory.
>
> Would you be able to share a template of your flow with us?
>
> Thanks,
>
> Bryan
>
> On Thu, Mar 31, 2016 at 7:27 AM, Guillaume Pool  wrote:
>
> Hi,
>
>
>
> I am trying to make a connection to a secured cluster that has phoenix
> installed.
>
>
>
> I am running HDP 2.3.2 and NiFi 0.6.0
>
>
>
> Getting the following error on trying to enable HBase_1_1_2_ClientService
>
>
>
> 2016-03-31 13:24:23,916 INFO [StandardProcessScheduler Thread-5]
> o.a.nifi.hbase.HBase_1_1_2_ClientService
> HBase_1_1_2_ClientService[id=e7e9b2ed-d336-34be-acb4-6c8b60c735c2] HBase
> Security Enabled, logging in as principal n...@hdp.supergrp.net with
> keytab /app/env/nifi.keytab
>
> 2016-03-31 13:24:23,984 WARN [StandardProcessScheduler Thread-5]
> org.apache.hadoop.util.NativeCodeLoader Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> 2016-03-31 13:24:24,101 INFO [StandardProcessScheduler Thread-5]
> o.a.nifi.hbase.HBase_1_1_2_ClientService
> HBase_1_1_2_ClientService[id=e7e9b2ed-d336-34be-acb4-6c8b60c735c2]
> Successfully logged in as principal n...@hdp.supergrp.net with keytab
> /app/env/nifi.keytab
>
> 2016-03-31 13:24:24,177 ERROR [StandardProcessScheduler Thread-5]
> o.a.n.c.s.StandardControllerServiceNode
> HBase_1_1_2_ClientService[id=e7e9b2ed-d336-34be-acb4-6c8b60c735c2] Failed
> to invoke @OnEnabled method due to java.io.IOException:
> java.lang.reflect.InvocationTargetException
>
> 2016-03-31 13:24:24,182 ERROR [StandardProcessScheduler Thread-5]
> o.a.n.c.s.StandardControllerServiceNode
>
> java.io.IOException: java.lang.reflect.InvocationTargetException
>
> at
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
> ~[hbase-client-1.1.2.jar:1.1.2]
>
> at
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:218)
> ~[hbase-client-1.1.2.jar:1.1.2]
>
> at
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:119)
> ~[hbase-client-1.1.2.jar:1.1.2]
>
> at
> org.apache.nifi.hbase.HBase_1_1_2_ClientService$1.run(HBase_1_1_2_ClientService.java:215)
> ~[nifi-hbase_1_1_2-client-service-0.6.0.jar:0.6.0]
>
> at
> org.apache.nifi.hbase.HBase_1_1_2_ClientService$1.run(HBase_1_1_2_ClientService.java:212)
> ~[nifi-hbase_1_1_2-client-service-0.6.0.jar:0.6.0]
>
> at java.security.AccessController.doPrivileged(Native Method)
> ~[na:1.8.0_71]
>
> at javax.security.auth.Subject.doAs(Subject.java:422)
> ~[na:1.8.0_71]
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
> ~[hadoop-common-2.6.2.jar:na]
>
> at
> org.apache.nifi.hbase.HBase_1_1_2_ClientService.createConnection(HBase_1_1_2_ClientService.java:212)
> ~[nifi-hbase_1_1_2-client-service-0.6.0.jar:0.6.0]
>
> at
> org.apache.nifi.hbase.HBase_1_1_2_ClientService.onEnabled(HBase_1_1_2_ClientService.java:161)
> ~[nifi-hbase_1_1_2-client-service-0.6.0.jar:0.6.0]
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> ~[na:1.8.0_71]
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> ~[na:1.8.0_71]
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> ~[na:1.8.0_71]
>
> at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_71]
>
> at
> org.apache.nifi.util.ReflectionUtils.invokeMethodsWithAnnotations(ReflectionUtils.java:137)
> ~[na:na]
>
> at
> org.apache.nifi

Re: [DISCUSS] create apache nifi 0.6.1

2016-03-31 Thread Aldrin Piri
Updated the downloads page and pushed out changes with the language you
provided above.

On Thu, Mar 31, 2016 at 11:27 AM, Aldrin Piri  wrote:

> Will take care of it.
>
> On Thu, Mar 31, 2016 at 11:20 AM, Joe Witt  wrote:
>
>> We should also update the downloads page to ensure folks are aware of
>> this issue.
>>
>> I recommend phrasing such as:
>> "Before downloading this version please see the release notes [link].
>> There is a regression with PutKafka that can cause data loss when
>> sending to Kafka.  We will have an 0.6.1 release very soon to address
>> that finding."
>>
>> I can't do it myself right now but if someone else can that would be
>> really helpful.
>>
>> Thanks
>> Joe
>>
>> On Thu, Mar 31, 2016 at 10:57 AM, Oleg Zhurakousky
>>  wrote:
>> > +1
>> >> On Mar 31, 2016, at 10:54 AM, Joe Witt  wrote:
>> >>
>> >> I have updated the release notes to reflect that we will produce an
>> >> 0.6.1 to correct the PutKafka problem
>> >>
>> >> https://cwiki.apache.org/confluence/display/NIFI/Release+Notes
>> >>
>> >> I will start the RM work on this assuming lazy consensus though
>> >> clearly there is a good sign of consensus already
>> >>
>> >> On Thu, Mar 31, 2016 at 10:53 AM, Tony Kurc  wrote:
>> >>> +1
>> >>>
>> >>> On Thu, Mar 31, 2016 at 10:44 AM, Mark Payne 
>> wrote:
>> >>>
>>  +1 - definitely agree that it warrants a patch release.
>> 
>> > On Mar 31, 2016, at 10:29 AM, Joe Witt  wrote:
>> >
>> > Team,
>> >
>> > I propose that we do an Apache NiFi 0.6.1 release.  There are a few
>> > important findings we've made and put JIRAs/PRs in for over the past
>> > couple of days.  The most concerning is that I believe we have a
>> > potential data loss issue introduced in PutKafka.  The general logic
>> > of that processor is far better now but the handling of stream
>> > demarcation is broken.  That is now fixed from NIFI-1701.
>> >
>> > I am happy to RM this if folks are in agreement.
>> >
>> > Thanks
>> > Joe
>> 
>> 
>> >>
>> >
>>
>
>


Re: Text and metadata extraction processor

2016-03-31 Thread Dmitry Goldenberg
Hi Joe,

I think if we have the filters in place then there's no need for the 'mode'
enum, as the filters themselves guide the processor in deciding whether
metadata and/or content is extracted for a given input file.

Agreed on the handling of archives as a separate processor (template, seems
like).

I think it's easiest to do both metadata and/or content in one processor
since it can tell Tika whether to extract metadata and/or content, in one
pass over the file bytes (as you pointed out).
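
Just to make the one-pass point concrete, here is a minimal Tika sketch 
(Groovy, illustrative only, error handling omitted, and the file path is a 
placeholder) showing that a single parse yields both the metadata and the text:

import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.AutoDetectParser
import org.apache.tika.parser.ParseContext
import org.apache.tika.sax.BodyContentHandler

// placeholder input; in the processor this would be the FlowFile content stream
new File('/tmp/example.pdf').withInputStream { stream ->
    def metadata = new Metadata()
    def handler = new BodyContentHandler(-1)   // -1 disables the default write limit
    new AutoDetectParser().parse(stream, handler, metadata, new ParseContext())

    // one pass over the bytes gives us both pieces
    def attributes = [:]
    metadata.names().each { name -> attributes[name] = metadata.get(name) }
    def extractedText = handler.toString()
    println "extracted ${attributes.size()} metadata fields and ${extractedText.length()} chars of text"
}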

Agreed on the exclusions trumping inclusions; I think that makes sense.

>> We will only have a mimetype for the original flow file itself so I'm
not sure about the metadata mimetype filter.

I'm not sure where there might be an issue here. The metadata MIME type
filter tells the processor for which MIME types to perform the metadata
extraction.  For instance, extract metadata for images and videos, only.
This could possibly be coupled with an exclusion filter for content that
says, don't try to extract content from images and videos.

I think with the six filters we get all the bases covered:

   1. include metadata? --
      1. yes --
         1. determine the inclusion of metadata by filename pattern
         2. determine the inclusion of metadata by MIME type pattern
      2. no --
         1. determine the exclusion of metadata by filename pattern
         2. determine the exclusion of metadata by MIME type pattern
   2. include content? --
      1. yes --
         1. determine the inclusion of content by filename pattern
         2. determine the inclusion of content by MIME type pattern
      2. no --
         1. determine the exclusion of content by filename pattern
         2. determine the exclusion of content by MIME type pattern
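
(Purely to illustrate the intent, not proposed code, the per-file decision 
could reduce to something like the Groovy sketch below; the helper and the 
sample patterns are made up, and it bakes in the "excludes trump includes" and 
short-circuit-on-".*" behaviors Joe mentioned.)

// Illustrative sketch only -- property names and sample values are made up.
boolean selected(String value, String includePattern, String excludePattern) {
    if (excludePattern && value ==~ excludePattern) return false   // excludes win
    if (!includePattern || includePattern == '.*') return true     // skip the regex entirely
    return value ==~ includePattern
}

def filename = 'report.pqr'
def mimeType = 'application/pdf'

def extractMetadata = selected(filename, /.*\.pqr/, '') && selected(mimeType, '.*', '')
def extractContent  = selected(filename, '.*', '') && selected(mimeType, '.*', /image\/.*|video\/.*/)

println "metadata=${extractMetadata}, content=${extractContent}"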

Does this work?

Thanks,
- Dmitry


On Thu, Mar 31, 2016 at 9:27 AM, Joe Skora  wrote:

> Dmitry,
>
> Looking at this and your prior email.
>
>
>1. I can see "extract metadata only" being as popular as "extract
>metadata and content".  It will all depend on the type of media, for
>audio/video files adding the metadata to the flow file is enough but for
>Word, PDF, etc. files the content may be wanted as well.
>2. After thinking about it, I agree on an enum for mode.
>3. I think any handling of zips or archive files should be handled by
>another processor, that keeps this processor cleaner and improves its
>ability for re-use.
>4. I like the addition of exclude filters but I'm not sure about adding
>content filters.  We will only have a mimetype for the original flow
> file
>itself so I'm not sure about the metadata mimetype filter.  I think
> content
>filtering may be best left for another downstream processor, but it
> might
>be run faster if included here since the entire content will be handled
>during extraction.  If the content filters are implemented, for
> performance
>they need to short circuit so that if the property is not set or is set
> to
>".*" they don't evaluate the regex.
>1. FILENAME_FILTER - selects flow files to process based on filename
>   matching regex. (exists)
>   2. MIMETYPE_FILTER - selects flow files to process based on mimetype
>   matching regex. (exists)
>   3. FILENAME_EXCLUDE - excludes already selected flow files from
>   processing based on filename matching regex. (new)
>   4. MIMETYPE_EXCLUDE - excludes already selected flow  files from
>   processing based on mimetype matching regex. (new)
>   5. CONTENT_FILTER (optional) - selects flow files for output based on
>   extracted content matching regex. (new)
>   6. CONTENT_EXCLUDE (optional) - excludes flow files from output based
>   on extracted content matching regex. (new)
>5. As indicated in the descriptions in #4, I don't think overlapping
>filters are an error, instead excludes should take precedence over
>includes.  Then I can include a domain (like A*) but exclude sub-sets
> (like
>AXYZ*).
>
> I'm sure there's something we missed, but I think that covers most of it.
>
> Regards,
> Joe
>
>
> On Wed, Mar 30, 2016 at 1:56 PM, Dmitry Goldenberg <
> dgoldenb...@hexastax.com
> > wrote:
>
> > Joe,
> >
> > Upon some thinking, I've started wondering whether all the cases can be
> > covered by the following filters:
> >
> > INCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input
> > files get their content extracted, by file name
> > INCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input
> > files get their metadata extracted, by file name
> > INCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input
> > files get their content extracted, by MIME type
> > INCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input
> > files get their metadata extracted, by MIME type
> >
> > EXCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input
> > files do NOT get their content extracted, by file name
> > EXCLUDE

[GitHub] nifi pull request: NIFI-1710 Resolve path name to nifi.sh in start...

2016-03-31 Thread jfrazee
GitHub user jfrazee opened a pull request:

https://github.com/apache/nifi/pull/315

NIFI-1710 Resolve path name to nifi.sh in start script

This adds `realpath` to nifi.sh to resolve the location of nifi.sh when it's 
symlinked (instead of installed using the `install` command). Recent versions 
of Linux and BSD make `realpath` available.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jfrazee/nifi NIFI-1710

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi/pull/315.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #315


commit dbad7f41d6177f38f3736e539e77bb05433c2773
Author: Joey Frazee 
Date:   2016-03-31T15:33:29Z

Added realpath to nifi.sh to resolve its location when symlinked




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: [DISCUSS] create apache nifi 0.6.1

2016-03-31 Thread Aldrin Piri
Will take care of it.

On Thu, Mar 31, 2016 at 11:20 AM, Joe Witt  wrote:

> We should also update the downloads page to ensure folks are aware of
> this issue.
>
> I recommend phrasing such as:
> "Before downloading this version please see the release notes [link].
> There is a regression with PutKafka that can cause data loss when
> sending to Kafka.  We will have an 0.6.1 release very soon to address
> that finding."
>
> I can't do it myself right now but if someone else can that would be
> really helpful.
>
> Thanks
> Joe
>
> On Thu, Mar 31, 2016 at 10:57 AM, Oleg Zhurakousky
>  wrote:
> > +1
> >> On Mar 31, 2016, at 10:54 AM, Joe Witt  wrote:
> >>
> >> I have updated the release notes to reflect that we will produce an
> >> 0.6.1 to correct the PutKafka problem
> >>
> >> https://cwiki.apache.org/confluence/display/NIFI/Release+Notes
> >>
> >> I will start the RM work on this assuming lazy consensus though
> >> clearly there is a good sign of consensus already
> >>
> >> On Thu, Mar 31, 2016 at 10:53 AM, Tony Kurc  wrote:
> >>> +1
> >>>
> >>> On Thu, Mar 31, 2016 at 10:44 AM, Mark Payne 
> wrote:
> >>>
>  +1 - definitely agree that it warrants a patch release.
> 
> > On Mar 31, 2016, at 10:29 AM, Joe Witt  wrote:
> >
> > Team,
> >
> > I propose that we do an Apache NiFi 0.6.1 release.  There are a few
> > important findings we've made and put JIRAs/PRs in for over the past
> > couple of days.  The most concerning is that I believe we have a
> > potential data loss issue introduced in PutKafka.  The general logic
> > of that processor is far better now but the handling of stream
> > demarcation is broken.  That is now fixed from NIFI-1701.
> >
> > I am happy to RM this if folks are in agreement.
> >
> > Thanks
> > Joe
> 
> 
> >>
> >
>


[GitHub] nifi-minifi pull request: MINIFI-10 init commit of minifi-assembly

2016-03-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nifi-minifi/pull/3


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: Import Kafka messages into Titan

2016-03-31 Thread idioma
Matt, thank you for this, it is brilliant. So, as it is I am thinking that I
would like to use the following:

GetKafka -> EvaluateJsonPath -> ExecuteScript+Groovy Script

I have two questions:

1) How do I import the Titan-compliant file into Titan? I guess I can modify
the script and load it into it.
2) My second question is more naive and reflects the fact that my background
is more in Apache Camel, with very little knowledge of NiFi. In a
version-control environment, how do you push a process flow created with NiFi
that mostly involves standard components? Do you write a customized version
where you set the Kafka properties, for example?

Thanks



--
View this message in context: 
http://apache-nifi-developer-list.39713.n7.nabble.com/Import-Kafka-messages-into-Titan-tp8647p8667.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.


Re: [DISCUSS] create apache nifi 0.6.1

2016-03-31 Thread Joe Witt
We should also update the downloads page to ensure folks are aware of
this issue.

I recommend phrasing such as:
"Before downloading this version please see the release notes [link].
There is a regression with PutKafka that can cause data loss when
sending to Kafka.  We will have an 0.6.1 release very soon to address
that finding."

I can't do it myself right now but if someone else can that would be
really helpful.

Thanks
Joe

On Thu, Mar 31, 2016 at 10:57 AM, Oleg Zhurakousky
 wrote:
> +1
>> On Mar 31, 2016, at 10:54 AM, Joe Witt  wrote:
>>
>> I have updated the release notes to reflect that we will produce an
>> 0.6.1 to correct the PutKafka problem
>>
>> https://cwiki.apache.org/confluence/display/NIFI/Release+Notes
>>
>> I will start the RM work on this assuming lazy consensus though
>> clearly there is a good sign of consensus already
>>
>> On Thu, Mar 31, 2016 at 10:53 AM, Tony Kurc  wrote:
>>> +1
>>>
>>> On Thu, Mar 31, 2016 at 10:44 AM, Mark Payne  wrote:
>>>
 +1 - definitely agree that it warrants a patch release.

> On Mar 31, 2016, at 10:29 AM, Joe Witt  wrote:
>
> Team,
>
> I propose that we do an Apache NiFi 0.6.1 release.  There are a few
> important findings we've made and put JIRAs/PRs in for over the past
> couple of days.  The most concerning is that I believe we have a
> potential data loss issue introduced in PutKafka.  The general logic
> of that processor is far better now but the handling of stream
> demarcation is broken.  That is now fixed from NIFI-1701.
>
> I am happy to RM this if folks are in agreement.
>
> Thanks
> Joe


>>
>


Re: [DISCUSS] create apache nifi 0.6.1

2016-03-31 Thread Joe Witt
I have updated the release notes to reflect that we will produce an
0.6.1 to correct the PutKafka problem

https://cwiki.apache.org/confluence/display/NIFI/Release+Notes

I will start the RM work on this assuming lazy consensus though
clearly there is a good sign of consensus already

On Thu, Mar 31, 2016 at 10:53 AM, Tony Kurc  wrote:
> +1
>
> On Thu, Mar 31, 2016 at 10:44 AM, Mark Payne  wrote:
>
>> +1 - definitely agree that it warrants a patch release.
>>
>> > On Mar 31, 2016, at 10:29 AM, Joe Witt  wrote:
>> >
>> > Team,
>> >
>> > I propose that we do an Apache NiFi 0.6.1 release.  There are a few
>> > important findings we've made and put JIRAs/PRs in for over the past
>> > couple of days.  The most concerning is that I believe we have a
>> > potential data loss issue introduced in PutKafka.  The general logic
>> > of that processor is far better now but the handling of stream
>> > demarcation is broken.  That is now fixed from NIFI-1701.
>> >
>> > I am happy to RM this if folks are in agreement.
>> >
>> > Thanks
>> > Joe
>>
>>


Re: [DISCUSS] create apache nifi 0.6.1

2016-03-31 Thread Oleg Zhurakousky
+1
> On Mar 31, 2016, at 10:54 AM, Joe Witt  wrote:
> 
> I have updated the release notes to reflect that we will produce an
> 0.6.1 to correct the PutKafka problem
> 
> https://cwiki.apache.org/confluence/display/NIFI/Release+Notes
> 
> I will start the RM work on this assuming lazy consensus though
> clearly there is a good sign of consensus already
> 
> On Thu, Mar 31, 2016 at 10:53 AM, Tony Kurc  wrote:
>> +1
>> 
>> On Thu, Mar 31, 2016 at 10:44 AM, Mark Payne  wrote:
>> 
>>> +1 - definitely agree that it warrants a patch release.
>>> 
 On Mar 31, 2016, at 10:29 AM, Joe Witt  wrote:
 
 Team,
 
 I propose that we do an Apache NiFi 0.6.1 release.  There are a few
 important findings we've made and put JIRAs/PRs in for over the past
 couple of days.  The most concerning is that I believe we have a
 potential data loss issue introduced in PutKafka.  The general logic
 of that processor is far better now but the handling of stream
 demarcation is broken.  That is now fixed from NIFI-1701.
 
 I am happy to RM this if folks are in agreement.
 
 Thanks
 Joe
>>> 
>>> 
> 



Re: Dynamic URLs using InvokeHttp from an array

2016-03-31 Thread Adam Taft
Yeah, these solutions won't work for thousands of iterations.  Andy's
suggestion for using ExecuteScript starts to sound very compelling,
especially if you are algorithmically generating your term values.
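
If you do go the ExecuteScript route, a rough Groovy sketch along these lines 
(untested; the term list is hard-coded purely for illustration) would fan out 
one flowfile per term, each carrying the value in a "foo" attribute for 
InvokeHTTP to reference:

import org.apache.nifi.processor.io.OutputStreamCallback
import java.nio.charset.StandardCharsets

// hard-coded for illustration; in practice the terms could be read from a file
// or generated algorithmically
def terms = ['alpha', 'beta', 'gamma']

terms.each { term ->
    def child = session.create()
    child = session.putAttribute(child, 'foo', term)
    child = session.write(child, { out ->
        out.write(term.getBytes(StandardCharsets.UTF_8))
    } as OutputStreamCallback)
    session.transfer(child, REL_SUCCESS)
}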

Another thought for you.  Uwe Geercken was experimenting with a processor
which could read in a CSV file and output a flowfile attribute for every
cell in the CSV data.  Something like this might work for you.

Basically you'd have a single column CSV file with all your terms.  For
every line in the file, a new flowfile would be produced.  Each "column"
from each line would be stored as a flowfile attribute.  You'd end up with
a new flowfile for every term, with a flowfile attribute containing that
term.

Here's a link to his work:

https://github.com/uwegeercken/nifi_processors

Here's an archive from the mailing list discussion:

http://mail-archives.apache.org/mod_mbox/nifi-dev/201603.mbox/%3Ctrinity-4e63574c-9f19-459f-b048-ca40667e964c-1458542998682@3capp-webde-bs02%3E

Something like this might be worth considering as well.

On Thu, Mar 31, 2016 at 9:10 AM, kkang  wrote:

> Thanks, but unfortunately I have thousands of iterations that must occur so
> this would probably be too tedious; however, it is a technique that may
> come
> in handy with smaller looped scenarios.  I am still looking at the
> solutions
> that Andy sent earlier.
>
>
>
> --
> View this message in context:
> http://apache-nifi-developer-list.39713.n7.nabble.com/Dynamic-URLs-using-InvokeHttp-from-an-array-tp8638p8658.html
> Sent from the Apache NiFi Developer List mailing list archive at
> Nabble.com.
>


Re: [DISCUSS] create apache nifi 0.6.1

2016-03-31 Thread Tony Kurc
+1

On Thu, Mar 31, 2016 at 10:44 AM, Mark Payne  wrote:

> +1 - definitely agree that it warrants a patch release.
>
> > On Mar 31, 2016, at 10:29 AM, Joe Witt  wrote:
> >
> > Team,
> >
> > I propose that we do an Apache NiFi 0.6.1 release.  There are a few
> > important findings we've made and put JIRAs/PRs in for over the past
> > couple of days.  The most concerning is that I believe we have a
> > potential data loss issue introduced in PutKafka.  The general logic
> > of that processor is far better now but the handling of stream
> > demarcation is broken.  That is now fixed from NIFI-1701.
> >
> > I am happy to RM this if folks are in agreement.
> >
> > Thanks
> > Joe
>
>


Re: [DISCUSS] create apache nifi 0.6.1

2016-03-31 Thread Mark Payne
+1 - definitely agree that it warrants a patch release.

> On Mar 31, 2016, at 10:29 AM, Joe Witt  wrote:
> 
> Team,
> 
> I propose that we do an Apache NiFi 0.6.1 release.  There are a few
> important findings we've made and put JIRAs/PRs in for over the past
> couple of days.  The most concerning is that I believe we have a
> potential data loss issue introduced in PutKafka.  The general logic
> of that processor is far better now but the handling of stream
> demarcation is broken.  That is now fixed from NIFI-1701.
> 
> I am happy to RM this if folks are in agreement.
> 
> Thanks
> Joe



Re: [DISCUSS] create apache nifi 0.6.1

2016-03-31 Thread Matt Burgess
Sounds good to me.

> On Mar 31, 2016, at 10:29 AM, Joe Witt  wrote:
> 
> Team,
> 
> I propose that we do an Apache NiFi 0.6.1 release.  There are a few
> important findings we've made and put JIRAs/PRs in for over the past
> couple of days.  The most concerning is that I believe we have a
> potential data loss issue introduced in PutKafka.  The general logic
> of that processor is far better now but the handling of stream
> demarcation is broken.  That is now fixed from NIFI-1701.
> 
> I am happy to RM this if folks are in agreement.
> 
> Thanks
> Joe


[DISCUSS] create apache nifi 0.6.1

2016-03-31 Thread Joe Witt
Team,

I propose that we do an Apache NiFi 0.6.1 release.  There are a few
important findings we've made and put JIRAs/PRs in for over the past
couple of days.  The most concerning is that I believe we have a
potential data loss issue introduced in PutKafka.  The general logic
of that processor is far better now but the handling of stream
demarcation is broken.  That is now fixed from NIFI-1701.

I am happy to RM this if folks are in agreement.

Thanks
Joe


Re: Dynamic URLs using InvokeHttp from an array

2016-03-31 Thread Adam Taft
OK, one more "out of the box" idea to consider.

UpdateAttribute also has a mode which "clones" the flowfile if multiple
rules are matched.  Here's the specific quote from the UpdateAttribute
documentation:

"If the FlowFile policy is set to "use clone", and multiple rules match,
then a copy of the incoming FlowFile is created, such that the number of
outgoing FlowFiles is equal to the number of rules that match. In other
words, if two rules (A and B) both match, then there will be two outgoing
FlowFiles, one for Rule A and one for Rule B. This can be useful in
situations where you want to add an attribute to use as a flag for routing
later. In this example, there will be two copies of the file available, one
to route for the A path, and one to route for the B path"

If you used the Advanced UI, you might be able to create rules which always
match, but alter the value of the $foo parameter to your liking.  If the
"use clone" option was set, it would create a new flowfile for every rule
matched.  Thus if your array had 10 values, you'd have 10 rules, each one
would set $foo to a different value.  Out from UpdateAttribute, you'd end
up with 10 flowfiles that could be sent to InvokeHTTP.

That might be a fun way to solve this.  :)


On Thu, Mar 31, 2016 at 9:55 AM, Adam Taft  wrote:

> One (possibly bad) idea would be to try and loop your flow around the
> UpdateAttribute processor using RouteOnAttribute.  UpdateAttribute has an
> "advanced" mode which would let you do logic something like:
>
> if $foo == "" then set $foo = "step 1";
> if $foo == "step 1" then set $foo = "step 2";
> if $foo == "step 2" then set $foo = "step 3";
> ...
> if $foo == "step n" then set $foo = "finished";
>
> The next part would be RouteOnAttribute, which would read the value of
> $foo and if set to "finished" break the loop.  Otherwise it would pass to
> InvokeHTTP and then back to UpdateAttribute.  The setup for this would be
> tedious, but I think it would technically work.
>
> Just putting this out there for brainstorming purposes.
>
>
>
>
> On Wed, Mar 30, 2016 at 6:25 PM, kkang  wrote:
>
>> I have been able to figure out how to GenerateFlowFile -> UpdateAttribute
>> ->
>> InvokeHttp to dynamically send a URL (example:
>> https://somedomain.com?parameterx=${foo}); however, I need to do this N
>> number of times and replace ${foo} with a known set of values.  Is there a
>> way to call InvokeHttp multiple times and use the next value for ${foo}
>> automatically?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-nifi-developer-list.39713.n7.nabble.com/Dynamic-URLs-using-InvokeHttp-from-an-array-tp8638.html
>> Sent from the Apache NiFi Developer List mailing list archive at
>> Nabble.com.
>>
>
>


Re: Dynamic URLs using InvokeHttp from an array

2016-03-31 Thread kkang
Thanks, but unfortunately I have thousands of iterations that must occur so
this would probably be too tedious; however, it is a technique that may come
in handy with smaller looped scenarios.  I am still looking at the solutions
that Andy sent earlier.



--
View this message in context: 
http://apache-nifi-developer-list.39713.n7.nabble.com/Dynamic-URLs-using-InvokeHttp-from-an-array-tp8638p8658.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.


Re: Dynamic URLs using InvokeHttp from an array

2016-03-31 Thread Adam Taft
One (possibly bad) idea would be to try and loop your flow around the
UpdateAttribute processor using RouteOnAttribute.  UpdateAttribute has an
"advanced" mode which would let you do logic something like:

if $foo == "" then set $foo = "step 1";
if $foo == "step 1" then set $foo = "step 2";
if $foo == "step 2" then set $foo = "step 3";
...
if $foo == "step n" then set $foo = "finished";

The next part would be RouteOnAttribute, which would read the value of $foo
and if set to "finished" break the loop.  Otherwise it would pass to
InvokeHTTP and then back to UpdateAttribute.  The setup for this would be
tedious, but I think it would technically work.

Just putting this out there for brainstorming purposes.




On Wed, Mar 30, 2016 at 6:25 PM, kkang  wrote:

> I have been able to figure out how to GenerateFlowFile -> UpdateAttribute
> ->
> InvokeHttp to dynamically send a URL (example:
> https://somedomain.com?parameterx=${foo}); however, I need to do this N
> number of times and replace ${foo} with a known set of values.  Is there a
> way to call InvokeHttp multiple times and use the next value for ${foo}
> automatically?
>
>
>
> --
> View this message in context:
> http://apache-nifi-developer-list.39713.n7.nabble.com/Dynamic-URLs-using-InvokeHttp-from-an-array-tp8638.html
> Sent from the Apache NiFi Developer List mailing list archive at
> Nabble.com.
>


[GitHub] nifi-minifi pull request: MINIFI-10 init commit of minifi-assembly

2016-03-31 Thread JPercivall
Github user JPercivall commented on a diff in the pull request:

https://github.com/apache/nifi-minifi/pull/3#discussion_r58056139
  
--- Diff: minifi-assembly/LICENSE ---
@@ -0,0 +1,1152 @@
+
+ Apache License
+   Version 2.0, January 2004
+http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+  "License" shall mean the terms and conditions for use, reproduction,
+  and distribution as defined by Sections 1 through 9 of this document.
+
+  "Licensor" shall mean the copyright owner or entity authorized by
+  the copyright owner that is granting the License.
+
+  "Legal Entity" shall mean the union of the acting entity and all
+  other entities that control, are controlled by, or are under common
+  control with that entity. For the purposes of this definition,
+  "control" means (i) the power, direct or indirect, to cause the
+  direction or management of such entity, whether by contract or
+  otherwise, or (ii) ownership of fifty percent (50%) or more of the
+  outstanding shares, or (iii) beneficial ownership of such entity.
+
+  "You" (or "Your") shall mean an individual or Legal Entity
+  exercising permissions granted by this License.
+
+  "Source" form shall mean the preferred form for making modifications,
+  including but not limited to software source code, documentation
+  source, and configuration files.
+
+  "Object" form shall mean any form resulting from mechanical
+  transformation or translation of a Source form, including but
+  not limited to compiled object code, generated documentation,
+  and conversions to other media types.
+
+  "Work" shall mean the work of authorship, whether in Source or
+  Object form, made available under the License, as indicated by a
+  copyright notice that is included in or attached to the work
+  (an example is provided in the Appendix below).
+
+  "Derivative Works" shall mean any work, whether in Source or Object
+  form, that is based on (or derived from) the Work and for which the
+  editorial revisions, annotations, elaborations, or other 
modifications
+  represent, as a whole, an original work of authorship. For the 
purposes
+  of this License, Derivative Works shall not include works that remain
+  separable from, or merely link (or bind by name) to the interfaces 
of,
+  the Work and Derivative Works thereof.
+
+  "Contribution" shall mean any work of authorship, including
+  the original version of the Work and any modifications or additions
+  to that Work or Derivative Works thereof, that is intentionally
+  submitted to Licensor for inclusion in the Work by the copyright 
owner
+  or by an individual or Legal Entity authorized to submit on behalf of
+  the copyright owner. For the purposes of this definition, "submitted"
+  means any form of electronic, verbal, or written communication sent
+  to the Licensor or its representatives, including but not limited to
+  communication on electronic mailing lists, source code control 
systems,
+  and issue tracking systems that are managed by, or on behalf of, the
+  Licensor for the purpose of discussing and improving the Work, but
+  excluding communication that is conspicuously marked or otherwise
+  designated in writing by the copyright owner as "Not a Contribution."
+
+  "Contributor" shall mean Licensor and any individual or Legal Entity
+  on behalf of whom a Contribution has been received by Licensor and
+  subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+  this License, each Contributor hereby grants to You a perpetual,
+  worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+  copyright license to reproduce, prepare Derivative Works of,
+  publicly display, publicly perform, sublicense, and distribute the
+  Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+  this License, each Contributor hereby grants to You a perpetual,
+  worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+  (except as stated in this section) patent license to make, have made,
+  use, offer to sell, sell, import, and otherwise transfer the Work,
+  where such license applies only to those patent claims licensable
+  by such Contributor that are neces

Re: Splitting Incoming FlowFile, Output Multiple FlowFiles

2016-03-31 Thread Bryan Bende
Hello,

SplitText and SplitContent should be producing individual FlowFiles. Are
you seeing something different?

For SplitText you would set "Line Split Count" to 1 in order to get a
FlowFile for each line of the incoming CSV.

If you are working with extremely large files, it is generally recommended to do a
two-phase split where the first SplitText might have something like "Line
Split Count" set to 10,000-20,000 and then a second SplitText with "Line
Split Count" set to 1.

-Bryan


On Thu, Mar 31, 2016 at 8:35 AM, dale.chang13 
wrote:

> My specific use-case calls for ingesting a CSV table with many rows and
> then
> storing individual rows into HBase and Solr. Additionally, I would like to
> avoid developing custom processors, but it seems like the SplitText and
> SplitContent Processors do not return individual flowfiles, each with their
> own attributes.
>
> However, I was wondering what the best plan of attack would be for taking
> an incoming FlowFile and sending out multiple FlowFiles through the
> ProcessSession. Should I create multiple ProcessSession instances, or call
> session.transfer within a loop?
>
>
>
> --
> View this message in context:
> http://apache-nifi-developer-list.39713.n7.nabble.com/Splitting-Incoming-FlowFile-Output-Multiple-FlowFiles-tp8653.html
> Sent from the Apache NiFi Developer List mailing list archive at
> Nabble.com.
>


[GitHub] nifi-minifi pull request: MINIFI-10 init commit of minifi-assembly

2016-03-31 Thread apiri
Github user apiri commented on a diff in the pull request:

https://github.com/apache/nifi-minifi/pull/3#discussion_r58054950
  
--- Diff: minifi-assembly/LICENSE ---
@@ -0,0 +1,1152 @@
+
+ Apache License
+   Version 2.0, January 2004
+http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+  "License" shall mean the terms and conditions for use, reproduction,
+  and distribution as defined by Sections 1 through 9 of this document.
+
+  "Licensor" shall mean the copyright owner or entity authorized by
+  the copyright owner that is granting the License.
+
+  "Legal Entity" shall mean the union of the acting entity and all
+  other entities that control, are controlled by, or are under common
+  control with that entity. For the purposes of this definition,
+  "control" means (i) the power, direct or indirect, to cause the
+  direction or management of such entity, whether by contract or
+  otherwise, or (ii) ownership of fifty percent (50%) or more of the
+  outstanding shares, or (iii) beneficial ownership of such entity.
+
+  "You" (or "Your") shall mean an individual or Legal Entity
+  exercising permissions granted by this License.
+
+  "Source" form shall mean the preferred form for making modifications,
+  including but not limited to software source code, documentation
+  source, and configuration files.
+
+  "Object" form shall mean any form resulting from mechanical
+  transformation or translation of a Source form, including but
+  not limited to compiled object code, generated documentation,
+  and conversions to other media types.
+
+  "Work" shall mean the work of authorship, whether in Source or
+  Object form, made available under the License, as indicated by a
+  copyright notice that is included in or attached to the work
+  (an example is provided in the Appendix below).
+
+  "Derivative Works" shall mean any work, whether in Source or Object
+  form, that is based on (or derived from) the Work and for which the
+  editorial revisions, annotations, elaborations, or other 
modifications
+  represent, as a whole, an original work of authorship. For the 
purposes
+  of this License, Derivative Works shall not include works that remain
+  separable from, or merely link (or bind by name) to the interfaces 
of,
+  the Work and Derivative Works thereof.
+
+  "Contribution" shall mean any work of authorship, including
+  the original version of the Work and any modifications or additions
+  to that Work or Derivative Works thereof, that is intentionally
+  submitted to Licensor for inclusion in the Work by the copyright 
owner
+  or by an individual or Legal Entity authorized to submit on behalf of
+  the copyright owner. For the purposes of this definition, "submitted"
+  means any form of electronic, verbal, or written communication sent
+  to the Licensor or its representatives, including but not limited to
+  communication on electronic mailing lists, source code control 
systems,
+  and issue tracking systems that are managed by, or on behalf of, the
+  Licensor for the purpose of discussing and improving the Work, but
+  excluding communication that is conspicuously marked or otherwise
+  designated in writing by the copyright owner as "Not a Contribution."
+
+  "Contributor" shall mean Licensor and any individual or Legal Entity
+  on behalf of whom a Contribution has been received by Licensor and
+  subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+  this License, each Contributor hereby grants to You a perpetual,
+  worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+  copyright license to reproduce, prepare Derivative Works of,
+  publicly display, publicly perform, sublicense, and distribute the
+  Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+  this License, each Contributor hereby grants to You a perpetual,
+  worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+  (except as stated in this section) patent license to make, have made,
+  use, offer to sell, sell, import, and otherwise transfer the Work,
+  where such license applies only to those patent claims licensable
+  by such Contributor that are necessaril

Splitting Incoming FlowFile, Output Multiple FlowFiles

2016-03-31 Thread dale.chang13
My specific use-case calls for ingesting a CSV table with many rows and then
storing individual rows into HBase and Solr. Additionally, I would like to
avoid developing custom processors, but it seems like the SplitText and
SplitContent Processors do not return individual flowfiles, each with their
own attributes.

However, I was wondering what the best plan of attack would be for taking an
incoming FlowFile and sending out multiple FlowFiles through the
ProcessSession. Should I create multiple ProcessSession instances, or call
session.transfer within a loop?



--
View this message in context: 
http://apache-nifi-developer-list.39713.n7.nabble.com/Splitting-Incoming-FlowFile-Output-Multiple-FlowFiles-tp8653.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.


Re: Text and metadata extraction processor

2016-03-31 Thread Joe Skora
Dmitry,

Looking at this and your prior email.


   1. I can see "extract metadata only" being as popular as "extract
      metadata and content".  It will all depend on the type of media, for
      audio/video files adding the metadata to the flow file is enough but for
      Word, PDF, etc. files the content may be wanted as well.
   2. After thinking about it, I agree on an enum for mode.
   3. I think any handling of zips or archive files should be handled by
      another processor, that keeps this processor cleaner and improves its
      ability for re-use.
   4. I like the addition of exclude filters but I'm not sure about adding
      content filters.  We will only have a mimetype for the original flow file
      itself so I'm not sure about the metadata mimetype filter.  I think content
      filtering may be best left for another downstream processor, but it might
      be run faster if included here since the entire content will be handled
      during extraction.  If the content filters are implemented, for performance
      they need to short circuit so that if the property is not set or is set to
      ".*" they don't evaluate the regex.
      1. FILENAME_FILTER - selects flow files to process based on filename
         matching regex. (exists)
      2. MIMETYPE_FILTER - selects flow files to process based on mimetype
         matching regex. (exists)
      3. FILENAME_EXCLUDE - excludes already selected flow files from
         processing based on filename matching regex. (new)
      4. MIMETYPE_EXCLUDE - excludes already selected flow files from
         processing based on mimetype matching regex. (new)
      5. CONTENT_FILTER (optional) - selects flow files for output based on
         extracted content matching regex. (new)
      6. CONTENT_EXCLUDE (optional) - excludes flow files from output based
         on extracted content matching regex. (new)
   5. As indicated in the descriptions in #4, I don't think overlapping
      filters are an error, instead excludes should take precedence over
      includes.  Then I can include a domain (like A*) but exclude sub-sets (like
      AXYZ*).

I'm sure there's something we missed, but I think that covers most of it.

Regards,
Joe


On Wed, Mar 30, 2016 at 1:56 PM, Dmitry Goldenberg  wrote:

> Joe,
>
> Upon some thinking, I've started wondering whether all the cases can be
> covered by the following filters:
>
> INCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input
> files get their content extracted, by file name
> INCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input
> files get their metadata extracted, by file name
> INCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input
> files get their content extracted, by MIME type
> INCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input
> files get their metadata extracted, by MIME type
>
> EXCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input
> files do NOT get their content extracted, by file name
> EXCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input
> files do NOT get their metadata extracted, by file name
> EXCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input
> files do NOT get their content extracted, by MIME type
> EXCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input
> files do NOT get their metadata extracted, by MIME type
>
> I believe this gets all the bases covered. At processor init time, we can
> analyze the inclusions vs. exclusions; any overlap would cause a
> configuration error.
>
> Let me know what you think, thanks.
> - Dmitry
>
> On Wed, Mar 30, 2016 at 10:41 AM, Dmitry Goldenberg <
> dgoldenb...@hexastax.com> wrote:
>
> > Hi Joe,
> >
> > I follow your reasoning on the semantics of "media".  One might argue
> that
> > media files are a case of "document" or that a document is a case of
> > "media".
> >
> > I'm not proposing filters for the mode of processing, I'm proposing a
> > flag/enum with 3 values:
> >
> > A) extract metadata only;
> > B) extract content only and place it into the flowfile content;
> > C) extract both metadata and content.
> >
> > I think the default should be C, to extract both.  At least in my
> > experience most flows I've dealt with were interested in extracting both.
> >
> > I don't see how this mode would benefit from being expression driven - ?
> >
> > I think we can add this enum mode and have the basic use case covered.
> >
> > Additionally, further down the line, I was thinking we could ponder the
> > following (these have been essential in search engine ingestion):
> >
> >1. Extraction from compressed files/archives. How would UnpackContent
> >work with ExtractMediaAttributes? Use-case being, we've got a zip
> file as
> >input and want to crack it open and unravel it recursively; it may
> have
> >other, nested zips inside, along with other documents. One way to
> handle
> >this is to treat the whole archive as one document and merge all
> attributes
> >   

Re: Import Kafka messages into Titan

2016-03-31 Thread Matt Burgess
Idioma,

There is not yet a JSON-to-JSON translator, although there is a Jira case
to add it (https://issues.apache.org/jira/browse/NIFI-361). However, you have
a handful of options here:

1) If you don't have a specific output format in mind, use Simon's approach
to generate a JSON file with the desired attributes.
2) Get the appropriate JSON elements into attributes using
EvaluateJsonPath, then ReplaceText for the new format
3) Use EvaluateJsonPath as above, then use Uwe's template processor (
https://github.com/uwegeercken/nifi_processors)
4) Use ExecuteScript and write a script in Groovy, Javascript, Jython,
JRuby, or Lua to do the custom translation
5) Write your own processor (and please feel welcome to share it with the
community!)

If you go with Option 4, I have some Groovy code that builds a
Tinkerpop3-compliant GraphSON document from NiFi provenance events, the
builder pattern might be useful to you if you need GraphSON for the Titan
backend. The code is available as a Gist (
https://gist.github.com/mattyb149/a43a5c12c39701c4a2feeed71a57c66c), and I
have other examples of JSON-to-JSON conversion on my blog (
funnifi.blogspot.com).
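
To give a flavor of option 4, here is an untested Groovy sketch of a simple 
JSON-to-JSON reshape inside ExecuteScript (the field names are made up; adjust 
the mapping to whatever shape Titan/Elasticsearch expects):

import groovy.json.JsonOutput
import groovy.json.JsonSlurper
import org.apache.nifi.processor.io.StreamCallback
import java.nio.charset.StandardCharsets

def flowFile = session.get()
if (flowFile == null) return

flowFile = session.write(flowFile, { inputStream, outputStream ->
    def msg = new JsonSlurper().parse(inputStream)
    // made-up mapping from the incoming Kafka message to the desired shape
    def out = [
        vertexLabel: msg.type,
        properties : [name: msg.name, timestamp: msg.ts]
    ]
    outputStream.write(JsonOutput.toJson(out).getBytes(StandardCharsets.UTF_8))
} as StreamCallback)

session.transfer(flowFile, REL_SUCCESS)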

Regards,
Matt

On Thu, Mar 31, 2016 at 7:20 AM, idioma  wrote:

> Hi,
> I am very new to NiFi and I have the following flow:
>
> Consume Messages from Kafka based on a particular topic (JSON format)
> ->Transform JSON format into some Titan-compliant format -> put them into
> Titan/ElasticSearch on AWS
>
> I have done some research and I believe I can use the standard processors
> GetKafka and PutElasticSearch for the two "extremes" of the process flow.
> Can you confirm this? Would I need to write my own processor? (I am working
> with Java) I feel that I would need to write a Java processor for the
> actual
> transformation from JSON format into a graph one. Is that correct? Any
> suitable resource/project that can be useful to get me going with this?
>
> Thanks,
>
> I.
>
>
>
> --
> View this message in context:
> http://apache-nifi-developer-list.39713.n7.nabble.com/Import-Kafka-messages-into-Titan-tp8647.html
> Sent from the Apache NiFi Developer List mailing list archive at
> Nabble.com.
>


On Thu, Mar 31, 2016 at 8:31 AM, Simon Ball  wrote:

> You don’t necessarily need a custom processor for this. To convert the
> JSON to key values for a graph for example, you can use EvaluateJsonPath on
> your incoming messages from Kafka, this will pull out the pieces you need,
> then use AttributesToJson to select these attributes back into JSON to push
> to the PutElastic processor.
>
> Simon
>
> > On 31 Mar 2016, at 12:20, idioma  wrote:
> >
> > Hi,
> > I am very new to NiFi and I have the following flow:
> >
> > Consume Messages from Kafka based on a particular topic (JSON format)
> > ->Transform JSON format into some Titan-compliant format -> put them into
> > Titan/ElasticSearch on AWS
> >
> > I have done some research and I believe I can use the standard processors
> > GetKafka and PutElasticSearch for the two "extremes" of the process flow.
> > Can you confirm this? Would I need to write my own processor? (I am
> working
> > with Java) I feel that I would need to write a Java processor for the
> actual
> > transformation from JSON format into a graph one. Is that correct? Any
> > suitable resource/project that can be useful to get me going with this?
> >
> > Thanks,
> >
> > I.
> >
> >
> >
> > --
> > View this message in context:
> http://apache-nifi-developer-list.39713.n7.nabble.com/Import-Kafka-messages-into-Titan-tp8647.html
> > Sent from the Apache NiFi Developer List mailing list archive at
> Nabble.com.
> >
>
>


Re: Data Provenance missing information

2016-03-31 Thread Matthew Clarke
Ofir,
Can you share a little more information about your setup?  What version
of NiFi are you working with?  Is it a standalone instance of NiFi or a
cluster?  Are you seeing any WARN or ERROR log messages related to
provenance in either your nifi-app.log (NCM and Node if cluster)?  Are you
doing a targeted provenance query via the search capability? Would you be
willing to share a template of this process group?

Thanks,
Matt

On Thu, Mar 31, 2016 at 3:58 AM, Ofir  wrote:

> Hello.
>
> I'm experiencing a tough-to-debug problem with nifi.
>
> I can see the data provenance-related information from yesterday, but all
> the new information seems to disappear.
> Duplication the Processor-group seem to solve the problem temporarily, but
> I
> hope someone will have a better insight on the subject.
>
> Thanks, Ofir.
>
>
>
> --
> View this message in context:
> http://apache-nifi-developer-list.39713.n7.nabble.com/Data-Provenance-missing-information-tp8643.html
> Sent from the Apache NiFi Developer List mailing list archive at
> Nabble.com.
>


Re: Import Kafka messages into Titan

2016-03-31 Thread idioma
Simon,
thanks for this. This sounds very reasonable. I have a very naive question on
top of my initial one now, I am afraid. If I end up using 4 standard processors
(GetKafka -> EvaluateJsonPath -> AttributesToJson -> PutElastic) from a Java
application, how do I bundle them? 

Thanks indeed!



--
View this message in context: 
http://apache-nifi-developer-list.39713.n7.nabble.com/Import-Kafka-messages-into-Titan-tp8647p8649.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.


Re: Import Kafka messages into Titan

2016-03-31 Thread Simon Ball
You don’t necessarily need a custom processor for this. To convert the JSON to 
key values for a graph, for example, you can use EvaluateJsonPath on your 
incoming messages from Kafka; this will pull out the pieces you need. Then use 
AttributesToJson to assemble these attributes back into JSON to push to the 
PutElastic processor. 

Simon

> On 31 Mar 2016, at 12:20, idioma  wrote:
> 
> Hi,
> I am very new to NiFi and I have the following flow:
> 
> Consume Messages from Kafka based on a particular topic (JSON format)
> ->Transform JSON format into some Titan-compliant format -> put them into
> Titan/ElasticSearch on AWS
> 
> I have done some research and I believe I can use the standard processors
> GetKafka and PutElasticSearch for the two "extremes" of the process flow.
> Can you confirm this? Would I need to write my own processor? (I am working
> with Java) I feel that I would need to write a Java processor for the actual
> transformation from JSON format into a graph one. Is that correct? Any
> suitable resource/project that can be useful to get me going with this? 
> 
> Thanks,
> 
> I.
> 
> 
> 
> --
> View this message in context: 
> http://apache-nifi-developer-list.39713.n7.nabble.com/Import-Kafka-messages-into-Titan-tp8647.html
> Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.
> 



Import Kafka messages into Titan

2016-03-31 Thread idioma
Hi,
I am very new to NiFi and I have the following flow:

Consume Messages from Kafka based on a particular topic (JSON format)
->Transform JSON format into some Titan-compliant format -> put them into
Titan/ElasticSearch on AWS

I have done some research and I believe I can use the standard processors
GetKafka and PutElasticSearch for the two "extremes" of the process flow.
Can you confirm this? Would I need to write my own processor? (I am working
with Java) I feel that I would need to write a Java processor for the actual
transformation from JSON format into a graph one. Is that correct? Any
suitable resource/project that can be useful to get me going with this? 

Thanks,

I.



--
View this message in context: 
http://apache-nifi-developer-list.39713.n7.nabble.com/Import-Kafka-messages-into-Titan-tp8647.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.


Re: How to only take new rows using ExecuteSQL processor?

2016-03-31 Thread Simon Ball
Hi Paul,

In the scenario where you need complex joins and incremental loads, the best
bet is probably to create a view in your database with the query required. The
QueryDatabaseTable processor can operate against this view as long as the view
has a suitable ‘id’ column in it.
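
As a concrete (and purely hypothetical) illustration of that workaround, the view could be created ahead of time with plain JDBC; the connection details, table names, and columns below are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateIncrementalView {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "nifi", "secret");
             Statement stmt = conn.createStatement()) {

            // The view hides the join; its "id" column gives QueryDatabaseTable
            // a monotonically increasing value to track between runs.
            stmt.execute(
                "CREATE VIEW address_export AS "
                + "SELECT a.id AS id, a.street, a.city, p.name AS person_name "
                + "FROM addresses a JOIN persons p ON p.id = a.person_id");
        }
    }
}

QueryDatabaseTable would then be pointed at address_export as its table name, with id as the maximum-value column (property names from memory, so worth checking against the processor docs).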

That would provide a workaround for the lack of custom query support at the moment.

I also note an excellent point on your ticket about limiting the return batch
size. You can achieve something like this with the query max time setting, but
it would certainly be a good addition (and maybe deserves its own ticket).

Simon

> On 31 Mar 2016, at 10:15, Paul Bormans  wrote:
> 
> Hi Joe,
> 
> Thank you for the tip: great!!!
> 
> I guess NiFi is still too new, because I did some extensive searching on
> this subject and QueryDatabaseTable was not mentioned.
> 
> This processor does exactly what I expect/need!
> 
> One shortcoming... maybe I should enter a ticket for this. Usually
> extraction of data from an RDBMS involves complex queries with joins, and
> these are not supported as far as I can see. We could also extend the
> processor so that a configuration option allows specifying the full query,
> which I believe is much more flexible than enumerating columns from a
> specific table.
> 
> Paul
> 
> 
> 
> On Wed, Mar 30, 2016 at 5:19 PM, Joe Witt  wrote:
> 
>> Paul,
>> 
>> In Apache NiFi 0.6.0, if you're looking for a change-capture type
>> mechanism to source from relational databases, take a look at
>> QueryDatabaseTable [1].
>> 
>> That processor is new, and any feedback and/or contribs for it would be
>> awesome.
>> 
>> ExecuteSQL does have some time driven use cases to capture snapshots
>> and such but you're right that it doesn't sound like a good fit for
>> your case.
>> 
>> [1]
>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.QueryDatabaseTable/index.html
>> 
>> On Wed, Mar 30, 2016 at 9:11 AM, Paul Bormans  wrote:
>>> I'm evaluating Apache NiFi as a data ingestion tool to load data from an
>>> RDBMS into S3. A first test shows odd behavior where the same rows are
>>> written to the FlowFile over and over again, while I expected that only
>>> new rows are written.
>>> 
>>> In fact I was missing configuration options to specify what column could
>>> be used to query only for new rows.
>>> 
>>> Taking a look at the processor implementation makes me believe that the
>>> only option is to define a query including OFFSET n LIMIT m where "n" is
>>> dynamically set based upon previous onTrigger calls; would this even be
>>> possible?
>>> 
>>> Some setup info:
>>> nifi: 0.6.0
>>> backend: postgresql
>>> driver: postgresql-9.4.1208.jre6.jar
>>> query: select * from addresses
>>> 
>>> More generally, I don't see a use-case where the current ExecuteSQL
>>> processor fits (without an input FlowFile). Can someone explain?
>>> 
>>> Paul
>> 



Re: How to only take new rows using ExecuteSQL processor?

2016-03-31 Thread Paul Bormans
Issue submitted: https://issues.apache.org/jira/browse/NIFI-1706

On Thu, Mar 31, 2016 at 11:15 AM, Paul Bormans  wrote:

> Hi Joe,
>
> Thank you for the tip: great!!!
>
> I guess NiFi is still too new, because I did some extensive searching on
> this subject and QueryDatabaseTable was not mentioned.
>
> This processor does exactly what I expect/need!
>
> One shortcoming... maybe I should enter a ticket for this. Usually
> extraction of data from an RDBMS involves complex queries with joins, and
> these are not supported as far as I can see. We could also extend the
> processor so that a configuration option allows specifying the full query,
> which I believe is much more flexible than enumerating columns from a
> specific table.
>
> Paul
>
>
>
> On Wed, Mar 30, 2016 at 5:19 PM, Joe Witt  wrote:
>
>> Paul,
>>
>> In Apache NiFi 0.6.0, if you're looking for a change-capture type
>> mechanism to source from relational databases, take a look at
>> QueryDatabaseTable [1].
>>
>> That processor is new, and any feedback and/or contribs for it would be
>> awesome.
>>
>> ExecuteSQL does have some time driven use cases to capture snapshots
>> and such but you're right that it doesn't sound like a good fit for
>> your case.
>>
>> [1]
>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.QueryDatabaseTable/index.html
>>
>> On Wed, Mar 30, 2016 at 9:11 AM, Paul Bormans  wrote:
>> > I'm evaluating Apache NiFi as a data ingestion tool to load data from an
>> > RDBMS into S3. A first test shows odd behavior where the same rows are
>> > written to the FlowFile over and over again, while I expected that only
>> > new rows are written.
>> >
>> > In fact I was missing configuration options to specify what column could
>> > be used to query only for new rows.
>> >
>> > Taking a look at the processor implementation makes me believe that the
>> > only option is to define a query including OFFSET n LIMIT m where "n" is
>> > dynamically set based upon previous onTrigger calls; would this even be
>> > possible?
>> >
>> > Some setup info:
>> > nifi: 0.6.0
>> > backend: postgresql
>> > driver: postgresql-9.4.1208.jre6.jar
>> > query: select * from addresses
>> >
>> > More generally, I don't see a use-case where the current ExecuteSQL
>> > processor fits (without an input FlowFile). Can someone explain?
>> >
>> > Paul
>>
>
>


Re: How to only take new rows using ExecuteSQL processor?

2016-03-31 Thread Paul Bormans
Hi Joe,

Thank you for the tip: great!!!

I guess NiFi is still too new, because I did some extensive searching on
this subject and QueryDatabaseTable was not mentioned.

This processor does exactly what I expect/need!

One shortcoming... maybe I should enter a ticket for this. Usually
extraction of data from an RDBMS involves complex queries with joins, and
these are not supported as far as I can see. We could also extend the
processor so that a configuration option allows specifying the full query,
which I believe is much more flexible than enumerating columns from a
specific table.
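
To make that concrete, here is a minimal JDBC sketch of the incremental-load pattern this thread is about: remember the highest id already seen and only fetch newer rows. The table and column names are loosely based on my setup info above, and this is only an illustration of the idea, not how QueryDatabaseTable is actually implemented:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class IncrementalFetchSketch {
    public static void main(String[] args) throws Exception {
        long lastSeenId = 0L; // in NiFi, state like this would be kept by the processor

        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "nifi", "secret");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, street, city FROM addresses WHERE id > ? ORDER BY id")) {

            ps.setLong(1, lastSeenId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    lastSeenId = rs.getLong("id");
                    System.out.println(rs.getLong("id") + " " + rs.getString("street"));
                }
            }
            // On the next run, only rows with id greater than lastSeenId are
            // returned, which is exactly the behavior missing from a plain
            // "select * from addresses" in ExecuteSQL.
        }
    }
}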

Paul



On Wed, Mar 30, 2016 at 5:19 PM, Joe Witt  wrote:

> Paul,
>
> In Apache NiFi 0.6.0, if you're looking for a change-capture type
> mechanism to source from relational databases, take a look at
> QueryDatabaseTable [1].
>
> That processor is new, and any feedback and/or contribs for it would be
> awesome.
>
> ExecuteSQL does have some time driven use cases to capture snapshots
> and such but you're right that it doesn't sound like a good fit for
> your case.
>
> [1]
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.QueryDatabaseTable/index.html
>
> On Wed, Mar 30, 2016 at 9:11 AM, Paul Bormans  wrote:
> > I'm evaluating Apache NiFi as a data ingestion tool to load data from an
> > RDBMS into S3. A first test shows odd behavior where the same rows are
> > written to the FlowFile over and over again, while I expected that only
> > new rows are written.
> >
> > In fact I was missing configuration options to specify what column could
> > be used to query only for new rows.
> >
> > Taking a look at the processor implementation makes me believe that the
> > only option is to define a query including OFFSET n LIMIT m where "n" is
> > dynamically set based upon previous onTrigger calls; would this even be
> > possible?
> >
> > Some setup info:
> > nifi: 0.6.0
> > backend: postgresql
> > driver: postgresql-9.4.1208.jre6.jar
> > query: select * from addresses
> >
> > More generally, I don't see a use-case where the current ExecuteSQL
> > processor fits (without an input FlowFile). Can someone explain?
> >
> > Paul
>


Data Provenance missing information

2016-03-31 Thread Ofir
Hello.

I'm experiencing a tough-to-debug problem with NiFi.

I can see the data provenance-related information from yesterday, but all
the new information seems to disappear.
Duplicating the Process Group seems to solve the problem temporarily, but I
hope someone will have a better insight on the subject.

Thanks, Ofir.



--
View this message in context: 
http://apache-nifi-developer-list.39713.n7.nabble.com/Data-Provenance-missing-information-tp8643.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.


[GitHub] nifi pull request: NIFI-1701 fixed StreamScanner, added more tests

2016-03-31 Thread joewitt
Github user joewitt commented on a diff in the pull request:

https://github.com/apache/nifi/pull/314#discussion_r58008694
  
--- Diff: nifi-nar-bundles/nifi-kafka-bundle/nifi-kafka-processors/src/main/java/org/apache/nifi/processors/kafka/StreamScanner.java ---
@@ -55,29 +56,53 @@
      */
     boolean hasNext() {
         this.data = null;
-        if (!this.eos) {
+        int j = 0;
+        boolean moreData = true;
+        byte b;
+        while (this.data == null) {
+            this.expandBufferIfNecessary();
             try {
-                boolean keepReading = true;
-                while (keepReading) {
-                    byte b = (byte) this.is.read();
-                    if (b > -1) {
-                        baos.write(b);
-                        if (buffer.addAndCompare(b)) {
-                            this.data = Arrays.copyOfRange(baos.getUnderlyingBuffer(), 0, baos.size() - delimiter.length);
-                            keepReading = false;
-                        }
-                    } else {
-                        this.data = baos.toByteArray();
-                        keepReading = false;
-                        this.eos = true;
+                b = (byte) this.is.read();
--- End diff --

is.read() returns an int, and -1 represents end of stream. Once that has been
verified, the value can be treated as a byte in the range of 0 to 255. Will
submit a patch containing a test that proves it is broken and will make the
fix.
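
For anyone following along, a minimal sketch of the pattern being described (generic names, not taken from the NiFi code): keep the result of read() as an int, check for -1 first, and only then narrow it to a byte.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadLoopSketch {
    public static void main(String[] args) throws IOException {
        // 0xFF would be mistaken for end-of-stream if the result of read()
        // were cast to a byte before the -1 check.
        InputStream is = new ByteArrayInputStream(new byte[] { 0x01, (byte) 0xFF, 0x02 });

        int read;
        // read() returns an int in the range 0..255, or -1 at end of stream.
        while ((read = is.read()) != -1) {
            byte b = (byte) read; // safe to narrow only after the -1 check
            System.out.printf("byte value: 0x%02X%n", b & 0xFF);
        }
    }
}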


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---