[
https://issues.apache.org/jira/browse/NIFI-4727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311588#comment-16311588
]
ASF GitHub Bot commented on NIFI-4727:
--------------------------------------
GitHub user alopresto opened a pull request:
https://github.com/apache/nifi/pull/2371
NIFI-4727 Add CountText processor
This new processor computes basic text metrics on incoming flowfile content
(line count, non-empty line count, word count, and character count). Each
metric can be enabled independently, and the word count can be configured
either to treat symbols as part of a word or to split on symbol boundaries.
Performance is reasonable (a document with ~370k lines / words takes roughly
80 - 90 ms on a commodity laptop), and the processor handles the text in a
streaming manner so that the full content is never loaded onto the heap.
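
For readers who want a feel for the streaming approach, here is a minimal, standalone sketch. It is not the code in this PR: the class `TextCountSketch`, the `Counts` holder, the `splitOnSymbols` flag, and the delimiter set `_ . -` are all hypothetical names and choices made for illustration, and the character count here deliberately excludes line terminators, which may differ from the processor's behavior.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the CountText implementation.
final class TextCountSketch {

    // Simple holder for the four metrics (hypothetical type).
    static final class Counts {
        long lines;
        long nonEmptyLines;
        long words;
        long characters;
    }

    // Words separated by whitespace only.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    // Assumed symbol set: whitespace plus underscore, period, and hyphen.
    private static final Pattern WHITESPACE_OR_SYMBOLS = Pattern.compile("[\\s_.\\-]+");

    // Reads one line at a time, so memory use is bounded by the longest line
    // rather than by the size of the whole document.
    static Counts count(Reader source, boolean splitOnSymbols) throws IOException {
        Counts counts = new Counts();
        Pattern delimiter = splitOnSymbols ? WHITESPACE_OR_SYMBOLS : WHITESPACE;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                counts.lines++;
                // Simplification: line terminators are not counted as characters.
                counts.characters += line.length();
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue;
                }
                counts.nonEmptyLines++;
                for (String token : delimiter.split(trimmed)) {
                    // A leading delimiter can produce an empty first token; skip it.
                    if (!token.isEmpty()) {
                        counts.words++;
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        String text = "first line\n\nsplit-words-on-symbols here\n";
        Counts plain = count(new StringReader(text), false);
        Counts split = count(new StringReader(text), true);
        System.out.println("lines=" + plain.lines
                + " nonEmpty=" + plain.nonEmptyLines
                + " chars=" + plain.characters
                + " words=" + plain.words
                + " wordsSplitOnSymbols=" + split.words);
    }
}
```

In this sketch the only difference between the two word-count modes is the delimiter pattern, which keeps the per-line loop identical for both.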
A sample flow.xml.gz is [posted
here](https://gist.github.com/alopresto/86eb04437c079cf2a48e2aeadf2df7c0) to
allow for local testing (relying on a locally downloaded file or an HTTP GET to
a GitHub hosted file).
Thank you for submitting a contribution to Apache NiFi.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
### For all changes:
- [x] Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
- [x] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number
you are trying to resolve? Pay particular attention to the hyphen "-" character.
- [x] Has your PR been rebased against the latest commit within the target
branch (typically master)?
- [ ] Is your initial contribution a single, squashed commit?
### For code changes:
- [x] Have you ensured that the full suite of tests is executed via mvn
-Pcontrib-check clean install at the root nifi folder?
- [x] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the LICENSE file, including the main
LICENSE file under nifi-assembly?
- [ ] If applicable, have you updated the NOTICE file, including the main
NOTICE file found under nifi-assembly?
- [x] If adding new Properties, have you added .displayName in addition to
.name (programmatic access) for each of the new properties?
### For documentation related changes:
- [x] Have you ensured that format looks appropriate for the output in
which it is rendered?
### Note:
Please ensure that once the PR is submitted, you check travis-ci for build
issues and submit an update to your PR as soon as possible.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/alopresto/nifi NIFI-4727
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/nifi/pull/2371.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2371
----
commit 3bcc023d7141c5027ddef7738c86f9dde364850b
Author: Andy LoPresto <alopresto.apache@...>
Date: 2018-01-02T19:47:33Z
NIFI-4727 Added CountText processor and unit test (character count not yet
tested).
commit 00847974996fc0e825fd9ee788a61fce7485b95c
Author: Andy LoPresto <alopresto.apache@...>
Date: 2018-01-03T20:11:21Z
NIFI-4727 Fixed character count.
Improved metrics log message.
Added unit tests.
commit e9d48c9905d043ff0f3993f6a33f993c0a6465e6
Author: Andy LoPresto <alopresto@...>
Date: 2018-01-04T13:37:30Z
NIFI-4727 Fixed symbol regex.
Added unit tests.
commit 66bd335c7423712191f200a5b7dbc236e9587fec
Author: Andy LoPresto <alopresto@...>
Date: 2018-01-04T15:41:51Z
NIFI-4727 Fixed checkstyle issues.
Fixed RAT issue with test resource.
Reset metrics to 0 on new run (was maintaining running count before).
Added unit tests.
----
> Create text count processor
> ---------------------------
>
> Key: NIFI-4727
> URL: https://issues.apache.org/jira/browse/NIFI-4727
> Project: Apache NiFi
> Issue Type: New Feature
> Components: Extensions
> Affects Versions: 1.4.0
> Reporter: Andy LoPresto
> Assignee: Andy LoPresto
> Labels: processor, text
>
> A frequent community request is to count lines, words, and characters in
> arbitrary text. A {{CountTextProcessor}} would provide this functionality
> natively and with solid performance, rather than abusing the {{SplitText}} or
> {{ExecuteScript}} processors.
> It should provide the following functionality (simultaneously, given options):
> * Line count
> * Non-empty line count
> * Word count
> * Character count
> The flowfile content should remain unchanged, and each of the above (if
> indicated) should be added as an attribute.