[ https://issues.apache.org/jira/browse/NIFI-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867781#comment-17867781 ]
Joe Witt commented on NIFI-12709:
---------------------------------
This is actually touching on an increasingly key problem and opportunity for us
in the Apache NiFi community.
We are already excellent at capturing both content and context/metadata, and
doing so lets users build very powerful flows, including those which fuel
replication use cases. As we're finding, the same capabilities and key needs
are also valuable for Generative AI ingest.
To that end, getting more standardized around which fields we capture, and
making sure key components write these attributes and others use them, will
make it easier to build powerful flows.
We have a lot of this already, but it might not be well organized at this point.
Things like:
- filename
- full url at which the file was found
- file size
- mime type/content type
- creator
- groups that have access (and whether r/w/x)
- users that have access (and whether r/w/x)
These are the sorts of things described here:
https://docs.oracle.com/javase/8/docs/api/java/nio/file/attribute/PosixFileAttributes.html
Except we don't want to tie ourselves to any single mechanism.
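
To make the idea concrete, here is a rough sketch (not anything that exists
in NiFi today) of reading those fields for a local file and exposing them as
a plain string map, so consumers never see the POSIX types directly. The
class name, method name, and attribute keys below are illustrative
assumptions, not an agreed standard. POSIX has no "creator" field as such, so
the sketch falls back to the owner as the closest equivalent.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.PosixFileAttributes;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: the class name and attribute keys are assumptions for illustration.
class FileMetadataSketch {

    // Collects file metadata into a mechanism-neutral map of string attributes.
    static Map<String, String> describe(Path file) throws IOException {
        Map<String, String> attrs = new LinkedHashMap<>();
        BasicFileAttributes basic = Files.readAttributes(file, BasicFileAttributes.class);

        Path name = file.getFileName(); // null for a filesystem root
        if (name != null) {
            attrs.put("filename", name.toString());
        }
        attrs.put("source.url", file.toUri().toString());
        attrs.put("file.size", Long.toString(basic.size()));
        attrs.put("file.lastModifiedTime", basic.lastModifiedTime().toString());

        // Content type detection is platform dependent and may return null.
        String mimeType = Files.probeContentType(file);
        if (mimeType != null) {
            attrs.put("mime.type", mimeType);
        }

        // Only read POSIX details where the filesystem supports them.
        if (file.getFileSystem().supportedFileAttributeViews().contains("posix")) {
            PosixFileAttributes posix = Files.readAttributes(file, PosixFileAttributes.class);
            attrs.put("file.owner", posix.owner().getName());
            attrs.put("file.group", posix.group().getName());
            // Rendered in the familiar "rwxr-x---" form.
            attrs.put("file.permissions", PosixFilePermissions.toString(posix.permissions()));
        }
        return attrs;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : ".");
        describe(file).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The point of returning a plain map is that an S3 object, an SFTP listing, or
an archive entry could fill in the same keys from its own metadata.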
I don't want to complicate what was mentioned in this JIRA. I'm just saying
there is a key opportunity forming here, and we'll want to build a
bigger-picture, standard answer.
UnpackContent is a unique case because its output isn't the source data but
rather a derived result of pulling individual items out of it. Even so, we do
need/want to carry these attributes into each resulting flowfile.
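
As a rough illustration of carrying entry metadata forward, the sketch below
walks a zip stream with the JDK's java.util.zip classes and builds one
attribute map per entry, which is roughly what each resulting flowfile would
need. This is not UnpackContent's actual implementation; the class name,
method name, and attribute keys are assumptions. Fields a given archive does
not record (creation time, for example) are simply omitted.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.attribute.FileTime;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: names and attribute keys are assumptions for illustration.
class ZipEntryAttributesSketch {

    // Returns one attribute map per non-directory entry in the zip stream.
    static List<Map<String, String>> entryAttributes(InputStream zipStream) throws IOException {
        List<Map<String, String>> result = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                Map<String, String> attrs = new LinkedHashMap<>();
                attrs.put("filename", entry.getName());

                // Last modified time is normally available for zip entries.
                FileTime modified = entry.getLastModifiedTime();
                if (modified != null) {
                    attrs.put("file.lastModifiedTime", modified.toInstant().toString());
                }
                // Creation and access times come from optional extra fields,
                // so they are null unless the archive recorded them.
                FileTime created = entry.getCreationTime();
                if (created != null) {
                    attrs.put("file.creationTime", created.toInstant().toString());
                }
                FileTime accessed = entry.getLastAccessTime();
                if (accessed != null) {
                    attrs.put("file.lastAccessTime", accessed.toInstant().toString());
                }
                result.add(attrs);
            }
        }
        return result;
    }
}
```

Owner and permission details, where present, live in archive-specific extra
fields that the JDK does not parse, so capturing them would need either a
library that understands those fields or custom parsing of
ZipEntry.getExtra().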
> UnpackContent should save attributes from the zip entries as flowfile
> attributes where possible
> -----------------------------------------------------------------------------------------------
>
> Key: NIFI-12709
> URL: https://issues.apache.org/jira/browse/NIFI-12709
> Project: Apache NiFi
> Issue Type: Improvement
> Reporter: Joe Witt
> Priority: Major
>
> This came up in an email from Jan 31st to the users list titled
> 'ExecuteStreamCommand failing to unzip incoming flowfiles'.
> The issue is that UnpackContent doesn't capture much useful metadata. The user
> wants the last modified date, which is easily available, but also creator,
> creation time, and owner, which are less obviously available, at least not
> consistently. But there is a concept of extra fields we can extract metadata
> from. We have those same fields available from Tar files, so it is natural
> users would also want these. Given their names aren't standard, though, I see
> why Tar is the only one we currently say we support pulling those for. If we
> at least captured the metadata, then flow builders could use it in their flows
> as they wish, whereas right now we don't expose that information.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)