Re: Maintain character encoding in workflow

Joe Witt Thu, 30 Apr 2015 11:37:28 -0700

Adam,

If you believe that a process in the flow is manipulating the
characters, you can use the built-in provenance, archive, and data
viewer functions.  We still need to document how to set these up, but
for now, if you configure nifi.properties as follows and restart,
you'll have what you need.  This all assumes you're on the latest
develop branch codebase:

Set the following properties to the following values (these are just examples):

nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
nifi.content.claim.max.appendable.size=10 MB
nifi.content.claim.max.flow.files=100
nifi.content.repository.directory.pub1=/your/path/for/content
nifi.content.repository.archive.max.retention.period=3 hours
nifi.content.repository.archive.max.usage.percentage=30%
nifi.content.repository.archive.enabled=true
nifi.content.repository.always.sync=false
nifi.content.viewer.url=/nifi-content-viewer/

nifi.provenance.repository.directory.prov1=/your/path/for/prov
nifi.provenance.repository.max.storage.time=24 hours
nifi.provenance.repository.max.storage.size=1 GB
nifi.provenance.repository.rollover.time=30 secs
nifi.provenance.repository.rollover.size=100 MB
nifi.provenance.repository.query.threads=6
nifi.provenance.repository.compress.on.rollover=true
nifi.provenance.repository.always.sync=false
nifi.provenance.repository.journal.count=16
# Comma-separated list of fields. Fields that are not indexed will not be searchable. Valid fields are:
# EventType, FlowFileUUID, Filename, TransitURI, ProcessorID, AlternateIdentifierURI, ContentType, Relationship, Details
nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID
# FlowFile Attributes that should be indexed and made searchable
nifi.provenance.repository.indexed.attributes=
# Large values for the shard size will result in more Java heap usage when searching the Provenance Repository
# but should provide better performance
nifi.provenance.repository.index.shard.size=500 MB

The settings that differ from the defaults here are:
nifi.content.viewer.url=/nifi-content-viewer/
nifi.content.repository.archive.max.retention.period=3 hours
nifi.content.repository.archive.max.usage.percentage=30%
nifi.content.repository.archive.enabled=true
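
If it helps, here is a rough sketch for checking that those four settings actually made it into your nifi.properties before you restart.  It is a standalone script, not anything that ships with NiFi; the function names are made up for illustration, and the expected values are just the example values from this message, so adjust them to whatever you chose.

```python
# Hypothetical helper, not part of NiFi. The expected values are the
# example values from this message; change them to match your setup.
EXPECTED = {
    "nifi.content.viewer.url": "/nifi-content-viewer/",
    "nifi.content.repository.archive.max.retention.period": "3 hours",
    "nifi.content.repository.archive.max.usage.percentage": "30%",
    "nifi.content.repository.archive.enabled": "true",
}


def parse_properties(text):
    # Parse simple key=value lines, skipping blank lines and # comments.
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def missing_settings(text, expected=EXPECTED):
    # Return {key: (expected_value, actual_value)} for every mismatch.
    # actual_value is None when the key is absent from the file.
    props = parse_properties(text)
    return {
        key: (want, props.get(key))
        for key, want in expected.items()
        if props.get(key) != want
    }
```

You would read the file yourself and pass its text in, for example `missing_settings(open("conf/nifi.properties").read())` (the path is an assumption about your install layout).  An empty result means all four settings match.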

What this does is tell NiFi to hang onto the content until it
actually has to delete it from disk.  You can then look at the
provenance trail of any object and 'view content' in our built-in
content viewer, which lets you review the content step by step as it
goes through the flow.
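
If eyeballing the content at each step gets tedious, a small script can tell you whether a given copy of the content holds literal \uXXXX escapes or real UTF-8 characters.  This is only a sketch under the assumption that you have saved the content from a provenance event to disk and that it is UTF-8; the function names are made up for illustration and nothing here is a NiFi API.

```python
import re

# Matches a literal backslash, the letter u, and four hex digits, as in
# the "\u0432\u0430..." sample from the original question.
ESCAPE_BYTES = re.compile(rb"\\u[0-9a-fA-F]{4}")
ESCAPE_TEXT = re.compile(r"\\u([0-9a-fA-F]{4})")


def count_unicode_escapes(content: bytes) -> int:
    # Number of literal backslash-u escape sequences in the raw bytes.
    return len(ESCAPE_BYTES.findall(content))


def decode_escapes(content: bytes) -> str:
    # Turn each escape back into its character so the two forms can be
    # compared. Assumes UTF-8 input; surrogate pairs for characters
    # outside the Basic Multilingual Plane are not recombined.
    text = content.decode("utf-8")
    return ESCAPE_TEXT.sub(lambda m: chr(int(m.group(1), 16)), text)
```

Run `count_unicode_escapes` on the content before and after each processor: the first step where the count goes from zero to non-zero is where the conversion happens.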

We must make a nice blog out of this with screenshots.  It is a really
powerful feature.

If that doesn't get you the info you need please let us know.

Thanks
Joe

On Thu, Apr 30, 2015 at 2:20 PM, Adam Estrada <[email protected]> wrote:
> All,
>
> I am coming across an issue where my unicode characters are being converted
> to their unicode point representations (as javascript escapes) like this
> "\u0432\u0430\u0436\u043d\u0435\u0435". This is happening with Twitter data
> that is collected using the Twitter processor. How can I debug my workflow
> to figure out where the characters are being converted?
>
> Thanks,
> Adam
