[GitHub] nifi pull request: NIFI-1701 fixed StreamScanner, added more tests

2016-03-30 Thread olegz
GitHub user olegz opened a pull request:

https://github.com/apache/nifi/pull/314

NIFI-1701 fixed StreamScanner, added more tests



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/olegz/nifi NIFI-1701

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi/pull/314.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #314


commit bab1881abe462a022c98e53d21ee6c2cce91f395
Author: Oleg Zhurakousky 
Date:   2016-03-31T04:59:26Z

NIFI-1701 fixed StreamScanner, added more tests




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: Dynamic URLs using InvokeHttp from an array

2016-03-30 Thread Andy LoPresto
This sounds like a good candidate for the `ExecuteScript` processor. Matt 
Burgess has written some good tutorials on using that here [1] [2]. You could 
also write a custom processor that extends `InvokeHTTP` and uses the new state 
management features [3] to keep a counter value, an iteration limit, and the 
known values.
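As a plain-Python illustration of the counter/known-values idea (not actual ExecuteScript or processor code — inside NiFi this logic would live in a script body or a custom processor using the state manager; the URL template and values are placeholders from the question):

```python
# Sketch of the counter + known-values approach: substitute each known
# value for ${foo} in turn, stopping at an iteration limit.  The "state"
# dict stands in for NiFi's state-management store.
from urllib.parse import quote

def next_url(template, values, state, limit):
    """Return the next URL and advance the stored counter, or None when done."""
    i = state.get("counter", 0)
    if i >= limit or i >= len(values):
        return None
    state["counter"] = i + 1
    return template.replace("${foo}", quote(values[i]))

state = {}
urls = []
while True:
    url = next_url("https://somedomain.com?parameterx=${foo}",
                   ["a", "b", "c"], state, limit=10)
    if url is None:
        break
    urls.append(url)
# urls now holds one URL per known value, in order
```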

[1] http://funnifi.blogspot.com/2016/02/writing-reusable-scripted-processors-in.html
[2] http://funnifi.blogspot.com/2016/03/executescript-json-to-json-revisited_14.html
[3] https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#state_management


Andy LoPresto
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Mar 30, 2016, at 3:25 PM, kkang  wrote:
> 
> I have been able to figure out how to GenerateFlowFile -> UpdateAttribute ->
> InvokeHttp to dynamically send a URL (example:
> https://somedomain.com?parameterx=${foo}); however, I need to do this N
> number of times and replace ${foo} with a known set of values.  Is there a
> way to call InvokeHttp multiple times and use the next value for ${foo}
> automatically?
> 
> 
> 
> --
> View this message in context: 
> http://apache-nifi-developer-list.39713.n7.nabble.com/Dynamic-URLs-using-InvokeHttp-from-an-array-tp8638.html
> Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.





Dynamic URLs using InvokeHttp from an array

2016-03-30 Thread kkang
I have been able to figure out how to GenerateFlowFile -> UpdateAttribute ->
InvokeHttp to dynamically send a URL (example:
https://somedomain.com?parameterx=${foo}); however, I need to do this N
number of times and replace ${foo} with a known set of values.  Is there a
way to call InvokeHttp multiple times and use the next value for ${foo}
automatically?



--
View this message in context: 
http://apache-nifi-developer-list.39713.n7.nabble.com/Dynamic-URLs-using-InvokeHttp-from-an-array-tp8638.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.


[GitHub] nifi-minifi pull request: MINIFI-10 init commit of minifi-assembly

2016-03-30 Thread JPercivall
GitHub user JPercivall opened a pull request:

https://github.com/apache/nifi-minifi/pull/3

MINIFI-10 init commit of minifi-assembly

I created a minifi-assembly that compiles and creates the initial LICENSE, 
NOTICE, bin and conf directories in a distribution. When dependencies get added 
to the bootstrap, I believe they will also need to be added to minifi-assembly.

I re-used many of the same scripts, pom.xml and dependencies.xml from NiFi 
(reconfigured for MiNiFi), but I assume that since MiNiFi is a subproject of 
NiFi we don't need to provide a separate notice for work taken directly from 
it. I did add NiFi to the NOTICE for the assembly.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/JPercivall/nifi-minifi MINIFI-10

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi-minifi/pull/3.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3


commit 3e49921f4703aa5a94d30746204228f3b76572ee
Author: Joseph Percivall 
Date:   2016-03-30T21:17:03Z

MINIFI-10 init commit of minifi-assembly






[GitHub] nifi pull request: NIFI-1686 - NiFi is unable to populate over 1/4...

2016-03-30 Thread steveyh25
Github user steveyh25 commented on the pull request:

https://github.com/apache/nifi/pull/305#issuecomment-203657226
  
@olegz done :)




[GitHub] nifi pull request: NIFI-1698 Improving customValidate in AbstractH...

2016-03-30 Thread bbende
GitHub user bbende opened a pull request:

https://github.com/apache/nifi/pull/313

NIFI-1698 Improving customValidate in AbstractHadoopProcessor and HBa…

…seClient service to not reload Configuration unless it changed

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/bbende/nifi NIFI-1698

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi/pull/313.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #313


commit 1859ed8660219650a54fbcf5504809302c7c6ac9
Author: Bryan Bende 
Date:   2016-03-30T19:56:22Z

NIFI-1698 Improving customValidate in AbstractHadoopProcessor and 
HBaseClient service to not reload Configuration unless it changed






[GitHub] nifi-minifi pull request: MINIFI-8 Travis CI support

2016-03-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nifi-minifi/pull/2




[GitHub] nifi-minifi pull request: MINIFI-8 Travis CI support

2016-03-30 Thread apiri
GitHub user apiri opened a pull request:

https://github.com/apache/nifi-minifi/pull/2

MINIFI-8 Travis CI support

 Adding a .travis.yml to provide GitHub integration with Travis CI.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apiri/nifi-minifi MINIFI-8

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi-minifi/pull/2.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2


commit 6e7a147c1c6524d3429232ddf594ed4d66caa623
Author: Aldrin Piri 
Date:   2016-03-30T19:01:38Z

MINIFI-8 Adding a .travis.yml to provide GitHub integration with Travis CI.






[GitHub] nifi-minifi pull request: MIFNI-2 Created a Util class to transfor...

2016-03-30 Thread JPercivall
GitHub user JPercivall opened a pull request:

https://github.com/apache/nifi-minifi/pull/1

MIFNI-2 Created a Util class to transform prospective config.yml into…

… flow.xml and nifi.properties

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/JPercivall/nifi-minifi MINIFI-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi-minifi/pull/1.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1


commit 9e102dfc096ea76b71ef4d79fdfd13a16123178d
Author: Joseph Percivall 
Date:   2016-03-30T18:10:18Z

MIFNI-2 Created a Util class to transform prospective config.yml into 
flow.xml and nifi.properties






[GitHub] nifi pull request: NIFI-1686 - NiFi is unable to populate over 1/4...

2016-03-30 Thread olegz
Github user olegz commented on the pull request:

https://github.com/apache/nifi/pull/305#issuecomment-203553783
  
@steveyh25 @petmit Guys, thank you so much for your work and collaboration. 
I've reviewed it and am giving it +1. @steveyh25, please squash your commits so 
we can give it one more look and merge.




Re: Text and metadata extraction processor

2016-03-30 Thread Dmitry Goldenberg
Joe,

Upon some thinking, I've started wondering whether all the cases can be
covered by the following filters:

INCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input
files get their content extracted, by file name
INCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input
files get their metadata extracted, by file name
INCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input
files get their content extracted, by MIME type
INCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input
files get their metadata extracted, by MIME type

EXCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input
files do NOT get their content extracted, by file name
EXCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input
files do NOT get their metadata extracted, by file name
EXCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input
files do NOT get their content extracted, by MIME type
EXCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input
files do NOT get their metadata extracted, by MIME type

I believe this gets all the bases covered. At processor init time, we can
analyze the inclusions vs. exclusions; any overlap would cause a
configuration error.
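A minimal sketch of how those include/exclude filters might be evaluated, with a naive literal-overlap check at "init" time (hypothetical helpers, not existing NiFi code — real overlap analysis between arbitrary regex patterns is harder than a set intersection):

```python
import re

def check_config(includes, excludes):
    """Cheap stand-in for init-time overlap analysis: reject any pattern
    that appears verbatim in both the include and exclude lists."""
    overlap = set(includes) & set(excludes)
    if overlap:
        raise ValueError("patterns both included and excluded: %s" % sorted(overlap))

def should_extract(value, includes, excludes):
    """Extract only if the value (filename or MIME type) matches an
    include pattern (or no includes are given) and no exclude pattern."""
    included = not includes or any(re.fullmatch(p, value) for p in includes)
    excluded = any(re.fullmatch(p, value) for p in excludes)
    return included and not excluded

# Disjoint include/exclude lists pass the init-time check.
check_config(includes=[r".*\.pdf", r".*\.docx"], excludes=[r".*\.png"])
```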

Let me know what you think, thanks.
- Dmitry

On Wed, Mar 30, 2016 at 10:41 AM, Dmitry Goldenberg <
dgoldenb...@hexastax.com> wrote:

> Hi Joe,
>
> I follow your reasoning on the semantics of "media".  One might argue that
> media files are a case of "document" or that a document is a case of
> "media".
>
> I'm not proposing filters for the mode of processing, I'm proposing a
> flag/enum with 3 values:
>
> A) extract metadata only;
> B) extract content only and place it into the flowfile content;
> C) extract both metadata and content.
>
> I think the default should be C, to extract both.  At least in my
> experience most flows I've dealt with were interested in extracting both.
>
> I don't see how this mode would benefit from being expression driven - ?
>
> I think we can add this enum mode and have the basic use case covered.
>
> Additionally, further down the line, I was thinking we could ponder the
> following (these have been essential in search engine ingestion):
>
>1. Extraction from compressed files/archives. How would UnpackContent
>work with ExtractMediaAttributes? Use-case being, we've got a zip file as
>input and want to crack it open and unravel it recursively; it may have
>other, nested zips inside, along with other documents. One way to handle
>this is to treat the whole archive as one document and merge all attributes
>into one FlowFile.  The other way would be to treat each archive entry as
>its own flow file and keep a pointer back at the parent archive.  Yet
>another case is when the user might want to only extract the 'leaf' entries
>and discard any parent container archives.
>
>2. Attachments and embeddings. Users may want to treat any attached or
>embedded files as separate flowfiles with perhaps pointers back to the
>parent files. This definitely warrants a filter. Oftentimes Office
>documents have 'media' embeddings which are often not of interest,
>especially for the case of ingesting into a search engine.
>
>3. PDF. For PDF's, we can do OCR. This is important for the
>'image'/scanned PDF's for which Tika won't extract text.
>
> I'd like to understand how much of this is already supported in NiFi and
> if not I'd volunteer/collaborate to implement some of this.
>
> - Dmitry
>
>
> On Wed, Mar 30, 2016 at 9:03 AM, Joe Skora  wrote:
>
>> Dmitry,
>>
>> Are you proposing separate filters that determine the mode of processing,
>> metadata/content/metadataAndContent?  I was thinking of one selection
>> filter and a static mode switch at the processor instance level, to make
>> configuration more obvious such that one instance of the processor will
>> handle a known set of files regardless of the processing mode.
>>
>> I was thinking it would be useful for the mode switch to support
>> expression
>> language, but I'm not sure about that since the selection filters will
>> control what files get processed and it would be harder to configure if
>> the
>> output flow file could vary between source format and extracted text.  So,
>> while it might be easy to do, and occasionally useful, I think in normal
>> use I'd never have a varying mode but would more likely have multiple
>> processor instances with some routing or selection going on further
>> upstream.
>>
>> I wrestled with the naming issue too.  I went with
>> "ExtractMediaAttributes"
>> over "ExtractDocumentAttributes" because it seemed to represent the
>> broader
>> context better.  In reality, media files are documents and documents are
>> media files, but in the end it's all just semantics.
>>
>> I don't think I would change the NAR bundle name, because I think
>> "nifi-media-nar" establishes it as a place to collect this and other media
>> related processors in the future.

[GitHub] nifi pull request: NIFI-1697 Ensuring FlowController appropriately...

2016-03-30 Thread mcgilman
Github user mcgilman commented on a diff in the pull request:

https://github.com/apache/nifi/pull/312#discussion_r57917391
  
--- Diff: 
nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/FlowController.java
 ---
@@ -2539,6 +2539,12 @@ private ProcessorStatus getProcessorStatus(final 
RepositoryStatusReport report,
 return status;
 }
 
+private boolean isValid(ProcessorNode procNode) {
+try (final NarCloseable narCloseable = 
NarCloseable.withNarLoader()) {
+return procNode.isValid();
--- End diff --

Like we do here [1].

[1] 
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/StandardProcessorNode.java#L956




[GitHub] nifi pull request: NIFI-1697 Ensuring FlowController appropriately...

2016-03-30 Thread mcgilman
Github user mcgilman commented on a diff in the pull request:

https://github.com/apache/nifi/pull/312#discussion_r57917189
  
--- Diff: 
nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/FlowController.java
 ---
@@ -2539,6 +2539,12 @@ private ProcessorStatus getProcessorStatus(final 
RepositoryStatusReport report,
 return status;
 }
 
+private boolean isValid(ProcessorNode procNode) {
+try (final NarCloseable narCloseable = 
NarCloseable.withNarLoader()) {
+return procNode.isValid();
--- End diff --

I believe we want to wrap the call to validate() on the Processor inside 
procNode.isValid(), rather than the entire isValid() method [1].

[1] 
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/StandardProcessorNode.java#L911




[GitHub] nifi pull request: NIFI-1697 Ensuring FlowController appropriately...

2016-03-30 Thread bbende
GitHub user bbende opened a pull request:

https://github.com/apache/nifi/pull/312

NIFI-1697 Ensuring FlowController appropriately wraps code with NarClosable

From debugging this issue, it was noticed that the problem only occurred 
while a PutHDFS processor was enabled (running/stopped); if it was disabled, 
the problem went away. This led to realizing that when the processor is 
running or stopped, validation is called without being wrapped with 
NarCloseable, so validation does not use the same classpath that the 
component uses when executing.

This PR ensures that the FlowController wraps validation with NarCloseable, 
and also wraps the call to OnPrimaryNodeStateChanged, which needed it as well.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/bbende/nifi NIFI-1697

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nifi/pull/312.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #312


commit 605c6c88b53076034560d5faf3399c7e2b3e98fd
Author: Bryan Bende 
Date:   2016-03-30T15:46:52Z

NIFI-1697 Ensuring FlowController appropriately wraps code with NarCloseable






[GitHub] nifi pull request: NIFI-483: Use ZooKeeper Leader Election to Auto...

2016-03-30 Thread mcgilman
Github user mcgilman commented on the pull request:

https://github.com/apache/nifi/pull/301#issuecomment-203486387
  
@markap14 I am continually seeing test failures, both locally and in Travis, 
for this PR. See the Travis link above.






Re: How to only take new rows using ExecuteSQL processor?

2016-03-30 Thread Joe Witt
Paul,

In Apache NiFi 0.6.0, if you're looking for a change-capture mechanism
to source data from relational databases, take a look at
QueryDatabaseTable [1].

That processor is new, and any feedback and/or contributions would be awesome.

ExecuteSQL does have some time-driven use cases, such as capturing
snapshots, but you're right that it doesn't sound like a good fit for
your case.

[1] 
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.QueryDatabaseTable/index.html
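The maximum-value-column idea behind QueryDatabaseTable can be sketched with sqlite3 standing in for the real database (illustrative only — the processor tracks this state internally via its maximum-value column configuration; table and column names here come from Paul's example):

```python
import sqlite3

# Remember the largest id seen so far and fetch only newer rows on each
# trigger -- the essence of a maximum-value-column incremental query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (id INTEGER PRIMARY KEY, street TEXT)")
conn.executemany("INSERT INTO addresses VALUES (?, ?)",
                 [(1, "First St"), (2, "Second St")])

state = {"max_id": 0}

def fetch_new(conn, state):
    rows = conn.execute(
        "SELECT id, street FROM addresses WHERE id > ? ORDER BY id",
        (state["max_id"],)).fetchall()
    if rows:
        state["max_id"] = rows[-1][0]
    return rows

first = fetch_new(conn, state)   # initial run: all existing rows
conn.execute("INSERT INTO addresses VALUES (3, 'Third St')")
second = fetch_new(conn, state)  # next run: only the newly inserted row
```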

On Wed, Mar 30, 2016 at 9:11 AM, Paul Bormans  wrote:
> I'm evaluating Apache NiFi as a data ingestion tool to load data from an
> RDBMS into S3. A first test shows odd behavior where the same rows are
> written to the flowfile over and over again, while I expected that only new
> rows would be written.
>
> In fact, I was missing configuration options to specify which column could
> be used to query only for new rows.
>
> Taking a look at the processor implementation makes me believe that the
> only option is to define a query including OFFSET n LIMIT m where "n" is
> dynamically set based upon previous onTriggers; would this even be possible?
>
> Some setup info:
> nifi: 0.6.0
> backend: postgresql
> driver: postgresql-9.4.1208.jre6.jar
> query: select * from addresses
>
> More generally, I don't see a use case where the current ExecuteSQL
> processor fits as a source processor (without an input flowfile). Can
> someone explain?
>
> Paul


How to only take new rows using ExecuteSQL processor?

2016-03-30 Thread Paul Bormans
I'm evaluating Apache NiFi as a data ingestion tool to load data from an
RDBMS into S3. A first test shows odd behavior where the same rows are
written to the flowfile over and over again, while I expected that only new
rows would be written.

In fact, I was missing configuration options to specify which column could be
used to query only for new rows.

Taking a look at the processor implementation makes me believe that the
only option is to define a query including OFFSET n LIMIT m where "n" is
dynamically set based upon previous onTriggers; would this even be possible?

Some setup info:
nifi: 0.6.0
backend: postgresql
driver: postgresql-9.4.1208.jre6.jar
query: select * from addresses

More generally, I don't see a use case where the current ExecuteSQL
processor fits as a source processor (without an input flowfile). Can someone
explain?

Paul


Re: Text and metadata extraction processor

2016-03-30 Thread Dmitry Goldenberg
Hi Joe,

I follow your reasoning on the semantics of "media".  One might argue that
media files are a case of "document" or that a document is a case of
"media".

I'm not proposing filters for the mode of processing, I'm proposing a
flag/enum with 3 values:

A) extract metadata only;
B) extract content only and place it into the flowfile content;
C) extract both metadata and content.

I think the default should be C, to extract both.  At least in my
experience most flows I've dealt with were interested in extracting both.

I don't see how this mode would benefit from being expression driven - ?

I think we can add this enum mode and have the basic use case covered.
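As a rough sketch of the proposed three-way mode (hypothetical names and property, not an existing NiFi API), the mode would simply decide whether metadata, content, or both are emitted for a document:

```python
from enum import Enum

class ExtractionMode(Enum):
    METADATA_ONLY = "metadataOnly"
    CONTENT_ONLY = "contentOnly"
    METADATA_AND_CONTENT = "metadataAndContent"

def extract(document, mode=ExtractionMode.METADATA_AND_CONTENT):
    """Return the requested parts of a pre-parsed document; the default
    (option C above) extracts both metadata and content."""
    result = {}
    if mode is not ExtractionMode.CONTENT_ONLY:
        result["metadata"] = document["metadata"]
    if mode is not ExtractionMode.METADATA_ONLY:
        result["content"] = document["content"]
    return result

doc = {"metadata": {"author": "x"}, "content": "body text"}
both = extract(doc)                                    # default: C
meta = extract(doc, ExtractionMode.METADATA_ONLY)      # A
body = extract(doc, ExtractionMode.CONTENT_ONLY)       # B
```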

Additionally, further down the line, I was thinking we could ponder the
following (these have been essential in search engine ingestion):

   1. Extraction from compressed files/archives. How would UnpackContent
   work with ExtractMediaAttributes? Use-case being, we've got a zip file as
   input and want to crack it open and unravel it recursively; it may have
   other, nested zips inside, along with other documents. One way to handle
   this is to treat the whole archive as one document and merge all attributes
   into one FlowFile.  The other way would be to treat each archive entry as
   its own flow file and keep a pointer back at the parent archive.  Yet
   another case is when the user might want to only extract the 'leaf' entries
   and discard any parent container archives.

   2. Attachments and embeddings. Users may want to treat any attached or
   embedded files as separate flowfiles with perhaps pointers back to the
   parent files. This definitely warrants a filter. Oftentimes Office
   documents have 'media' embeddings which are often not of interest,
   especially for the case of ingesting into a search engine.

   3. PDF. For PDF's, we can do OCR. This is important for the
   'image'/scanned PDF's for which Tika won't extract text.
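The "each archive entry as its own flow file" option in item 1 can be sketched in plain Python (illustrative only, outside NiFi — how this would integrate with UnpackContent is the open design question):

```python
import io
import zipfile

def unravel(data, parent="input.zip"):
    """Recurse into nested zips, emitting one entry per leaf file with a
    pointer back to the chain of parent archives it came from."""
    entries = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            payload = zf.read(name)
            if name.endswith(".zip"):
                # Nested archive: recurse, extending the parent pointer.
                entries.extend(unravel(payload, parent=parent + "/" + name))
            else:
                entries.append({"name": name, "parent": parent,
                                "size": len(payload)})
    return entries
```

The other two options in item 1 (merge everything into one FlowFile, or keep only leaf entries) fall out of the same traversal by changing what is accumulated.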

I'd like to understand how much of this is already supported in NiFi and if
not I'd volunteer/collaborate to implement some of this.

- Dmitry


On Wed, Mar 30, 2016 at 9:03 AM, Joe Skora  wrote:

> Dmitry,
>
> Are you proposing separate filters that determine the mode of processing,
> metadata/content/metadataAndContent?  I was thinking of one selection
> filter and a static mode switch at the processor instance level, to make
> configuration more obvious such that one instance of the processor will
> handle a known set of files regardless of the processing mode.
>
> I was thinking it would be useful for the mode switch to support expression
> language, but I'm not sure about that since the selection filters will
> control what files get processed and it would be harder to configure if the
> output flow file could vary between source format and extracted text.  So,
> while it might be easy to do, and occasionally useful, I think in normal
> use I'd never have a varying mode but would more likely have multiple
> processor instances with some routing or selection going on further
> upstream.
>
> I wrestled with the naming issue too.  I went with "ExtractMediaAttributes"
> over "ExtractDocumentAttributes" because it seemed to represent the broader
> context better.  In reality, media files are documents and documents are
> media files, but in the end it's all just semantics.
>
> I don't think I would change the NAR bundle name, because I think
> "nifi-media-nar" establishes it as a place to collect this and other media
> related processors in the future.
>
> Regards,
> Joe
>
> On Tue, Mar 29, 2016 at 3:09 PM, Dmitry Goldenberg <
> dgoldenb...@hexastax.com
> > wrote:
>
> > Hi Joe,
> >
> > Thanks for all the details.
> >
> > I wanted to propose that I do some of this work so as to go through the
> > full cycle of developing a processor and committing it.
> >
> > Once your changes are merged, I could extend your 'ExtractMediaMetadata'
> > processor to handle the content, in addition to the metadata.
> >
> > We could keep the FILENAME_FILTER and MIMETYPE_FILTER but add a mode
> with 3
> > values: metadataOnly, contentOnly, metadataAndContent.
> >
> > One thing that looks to be a design issue right now is, your changes and
> > the 'nomenclature' seem media-oriented ("nifi-media-nar" etc.)
> >
> > Would it make sense to have a generic processor
> > ExtractDocumentMetadataAndContent?  Are there enough specifics in the
> > image/video processing stuff to warrant that to be a separate layer;
> > perhaps a subclass of ExtractDocumentMetadataAndContent ?  Might it make
> > sense to rename nifi-media-nar into nifi-text-extract-nar ?
> >
> > Thanks,
> > - Dmitry
> >
> >
> >
> > On Tue, Mar 29, 2016 at 2:36 PM, Joe Skora  wrote:
> >
> > > Dmitry,
> > >
> > > Yeah, I agree, Tika is pretty impressive.  The original ticket, NIFI-615,
> > > wanted extraction of metadata from WAV files, but as I got into it I found
> > > Tika, so for the same effort it supports the 1,000+ file formats Tika
> > > understands.

Re: Unable to Copy data from local NIFI to Cluster in AWS

2016-03-30 Thread ambaricloud
Thank you Matt. My configuration is working; now I am able to send data from
my laptop (source NiFi) to the AWS NiFi cluster.
Satya
amabriCloud



--
View this message in context: 
http://apache-nifi-developer-list.39713.n7.nabble.com/Unable-Copy-data-from-local-NIFI-to-Cluster-in-AWS-tp8099p8623.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.


Re: Text and metadata extraction processor

2016-03-30 Thread Joe Skora
Dmitry,

Are you proposing separate filters that determine the mode of processing,
metadata/content/metadataAndContent?  I was thinking of one selection
filter and a static mode switch at the processor instance level, to make
configuration more obvious such that one instance of the processor will
handle a known set of files regardless of the processing mode.

I was thinking it would be useful for the mode switch to support expression
language, but I'm not sure about that since the selection filters will
control what files get processed and it would be harder to configure if the
output flow file could vary between source format and extracted text.  So,
while it might be easy to do, and occasionally useful, I think in normal
use I'd never have a varying mode but would more likely have multiple
processor instances with some routing or selection going on further
upstream.

I wrestled with the naming issue too.  I went with "ExtractMediaAttributes"
over "ExtractDocumentAttributes" because it seemed to represent the broader
context better.  In reality, media files are documents and documents are
media files, but in the end it's all just semantics.

I don't think I would change the NAR bundle name, because I think
"nifi-media-nar" establishes it as a place to collect this and other media
related processors in the future.

Regards,
Joe

On Tue, Mar 29, 2016 at 3:09 PM, Dmitry Goldenberg  wrote:

> Hi Joe,
>
> Thanks for all the details.
>
> I wanted to propose that I do some of this work so as to go through the
> full cycle of developing a processor and committing it.
>
> Once your changes are merged, I could extend your 'ExtractMediaMetadata'
> processor to handle the content, in addition to the metadata.
>
> We could keep the FILENAME_FILTER and MIMETYPE_FILTER but add a mode with 3
> values: metadataOnly, contentOnly, metadataAndContent.
>
> One thing that looks to be a design issue right now is, your changes and
> the 'nomenclature' seem media-oriented ("nifi-media-nar" etc.)
>
> Would it make sense to have a generic processor
> ExtractDocumentMetadataAndContent?  Are there enough specifics in the
> image/video processing stuff to warrant that to be a separate layer;
> perhaps a subclass of ExtractDocumentMetadataAndContent ?  Might it make
> sense to rename nifi-media-nar into nifi-text-extract-nar ?
>
> Thanks,
> - Dmitry
>
>
>
> On Tue, Mar 29, 2016 at 2:36 PM, Joe Skora  wrote:
>
> > Dmitry,
> >
> > Yeah, I agree, Tika is pretty impressive.  The original ticket, NIFI-615,
> > wanted extraction of metadata from WAV files, but as I got into it I found
> > Tika, so for the same effort it supports the 1,000+ file formats Tika
> > understands.  That new
> > processor is called "ExtractMediaMetadata"; you can pull PR-252
> > from GitHub if you want to give it a try before it's merged.
> >
> > Extracting content for those 1,000+ formats would be a valuable addition.
> > I see two possible approaches, 1) create a new "ExtractMediaContent"
> > processor that would put the document content in a new flow file, and 2)
> > extend the new "ExtractMediaMetadata" processor so it can extract
> metadata,
> > content, or both.  One combined processor makes sense if it can provide a
> > performance gain, otherwise two complementary processors may make usage
> > easier.
> >
> > I'm glad to help if you want to take a cut at the processor yourself, or
> I
> > can take a crack at it myself if you'd prefer.
> >
> > Don't hesitate to ask questions or share comments and feedback regarding
> > the ExtractMediaMetadata processor or the addition of content handling.
> >
> > Regards,
> > Joe Skora
> >
> > On Thu, Mar 24, 2016 at 11:40 AM, Dmitry Goldenberg <
> > dgoldenb...@hexastax.com> wrote:
> >
> > > Thanks, Joe!
> > >
> > > Hi Joe S. - I'm definitely up for discussing and contributing.
> > >
> > > While building search-related ingestion systems, I've seen metadata and
> > > text extraction being done all the time; it's always there and always
> has
> > > to be done for building search indexes.  Beyond that, OCR-related
> > > capabilities are often requested, and the advantage of Tika is that it
> > > supports OCR out of the box.
> > >
> > > - Dmitry
> > >
> > > On Thu, Mar 24, 2016 at 11:36 AM, Joe Witt  wrote:
> > >
> > > > Dmitry,
> > > >
> > > > Another community member (Joe Skora) has a PR outstanding for
> > > > extracting metadata from media files using Tika.  Perhaps it makes
> > > > sense to broaden that to in general extract what Tika can find.  Joe
> -
> > > > perhaps you can discuss your ideas with Dmitry and see if broadening
> > > > is a good idea or if rather domain specific ones make more sense.
> > > >
> > > > This concept of extracting metadata from documents/text files, etc..
> > > > using something like Tika is certainly useful as that then can drive
> > > > nice automated routing decisions.
> > > >
> > > > T