Joe,

After giving this some more thought, I've started wondering whether all of the cases can be covered by the following filters:
INCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input files get their content extracted, by file name
INCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input files get their metadata extracted, by file name
INCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input files get their content extracted, by MIME type
INCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input files get their metadata extracted, by MIME type
EXCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input files do NOT get their content extracted, by file name
EXCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input files do NOT get their metadata extracted, by file name
EXCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input files do NOT get their content extracted, by MIME type
EXCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input files do NOT get their metadata extracted, by MIME type

I believe this covers all the bases. At processor init time, we can analyze the inclusions vs. the exclusions; any overlap would be reported as a configuration error. (I've pasted a rough sketch of what that check might look like at the bottom of this message, below the quoted thread.)

Let me know what you think, thanks.

- Dmitry

On Wed, Mar 30, 2016 at 10:41 AM, Dmitry Goldenberg <[email protected]> wrote:

> Hi Joe,
>
> I follow your reasoning on the semantics of "media". One might argue that media files are a case of "document", or that a document is a case of "media".
>
> I'm not proposing filters for the mode of processing; I'm proposing a flag/enum with 3 values:
>
> A) extract metadata only;
> B) extract content only and place it into the flowfile content;
> C) extract both metadata and content.
>
> I think the default should be C, to extract both. At least in my experience, most flows I've dealt with were interested in extracting both.
>
> I don't see how this mode would benefit from being expression driven - ?
>
> I think we can add this enum mode and have the basic use case covered.
>
> Additionally, further down the line, I was thinking we could ponder the following (these have been essential in search engine ingestion):
>
> 1. Extraction from compressed files/archives. How would UnpackContent work with ExtractMediaAttributes? The use case being: we've got a zip file as input and want to crack it open and unravel it recursively; it may have other, nested zips inside, along with other documents. One way to handle this is to treat the whole archive as one document and merge all attributes into one FlowFile. Another way would be to treat each archive entry as its own flow file and keep a pointer back to the parent archive. Yet another case is when the user might want to extract only the 'leaf' entries and discard any parent container archives.
>
> 2. Attachments and embeddings. Users may want to treat any attached or embedded files as separate flowfiles, perhaps with pointers back to the parent files. This definitely warrants a filter. Oftentimes Office documents have 'media' embeddings which are not of interest, especially for the case of ingesting into a search engine.
>
> 3. PDF. For PDFs, we can do OCR. This is important for 'image'/scanned PDFs for which Tika won't extract text.
>
> I'd like to understand how much of this is already supported in NiFi, and if not, I'd volunteer/collaborate to implement some of this.
>
> - Dmitry
>
> On Wed, Mar 30, 2016 at 9:03 AM, Joe Skora <[email protected]> wrote:
>
>> Dmitry,
>>
>> Are you proposing separate filters that determine the mode of processing (metadata / content / metadataAndContent)? I was thinking of one set of selection filters and a static mode switch at the processor instance level, to make configuration more obvious, such that one instance of the processor will handle a known set of files regardless of the processing mode.
>>
>> I was thinking it would be useful for the mode switch to support expression language, but I'm not sure about that, since the selection filters will control which files get processed, and it would be harder to configure if the output flow file could vary between source format and extracted text. So, while it might be easy to do, and occasionally useful, I think in normal use I'd never have a varying mode but would more likely have multiple processor instances with some routing or selection going on further upstream.
>>
>> I wrestled with the naming issue too. I went with "ExtractMediaAttributes" over "ExtractDocumentAttributes" because it seemed to represent the broader context better. In reality, media files are documents and documents are media files, but in the end it's all just semantics.
>>
>> I don't think I would change the NAR bundle name, because I think "nifi-media-nar" establishes it as a place to collect this and other media-related processors in the future.
>>
>> Regards,
>> Joe
>>
>> On Tue, Mar 29, 2016 at 3:09 PM, Dmitry Goldenberg <[email protected]> wrote:
>>
>>> Hi Joe,
>>>
>>> Thanks for all the details.
>>>
>>> I wanted to propose that I do some of this work so as to go through the full cycle of developing a processor and committing it.
>>>
>>> Once your changes are merged, I could extend your 'ExtractMediaMetadata' processor to handle the content, in addition to the metadata.
>>>
>>> We could keep the FILENAME_FILTER and MIMETYPE_FILTER but add a mode with 3 values: metadataOnly, contentOnly, metadataAndContent.
>>>
>>> One thing that looks to be a design issue right now is that your changes and the 'nomenclature' seem media-oriented ("nifi-media-nar", etc.).
>>>
>>> Would it make sense to have a generic processor, ExtractDocumentMetadataAndContent? Are there enough specifics in the image/video processing to warrant a separate layer, perhaps a subclass of ExtractDocumentMetadataAndContent? Might it make sense to rename nifi-media-nar to nifi-text-extract-nar?
>>>
>>> Thanks,
>>> - Dmitry
>>>
>>> On Tue, Mar 29, 2016 at 2:36 PM, Joe Skora <[email protected]> wrote:
>>>
>>>> Dmitry,
>>>>
>>>> Yeah, I agree, Tika is pretty impressive. The original ticket, NIFI-615 <https://issues.apache.org/jira/browse/NIFI-615>, wanted extraction of metadata from WAV files, but as I got into it I found Tika, so for the same effort it supports the 1,000+ file formats Tika understands. The new processor is called "ExtractMediaMetadata"; you can pull PR-252 <https://github.com/apache/nifi/pull/252> from GitHub if you want to give it a try before it's merged.
>>>>
>>>> Extracting content for those 1,000+ formats would be a valuable addition.
>>>> I see two possible approaches: 1) create a new "ExtractMediaContent" processor that would put the document content in a new flow file, or 2) extend the new "ExtractMediaMetadata" processor so it can extract metadata, content, or both. One combined processor makes sense if it can provide a performance gain; otherwise, two complementary processors may make usage easier.
>>>>
>>>> I'm glad to help if you want to take a cut at the processor yourself, or I can take a crack at it myself if you'd prefer.
>>>>
>>>> Don't hesitate to ask questions or share comments and feedback regarding the ExtractMediaMetadata processor or the addition of content handling.
>>>>
>>>> Regards,
>>>> Joe Skora
>>>>
>>>> On Thu, Mar 24, 2016 at 11:40 AM, Dmitry Goldenberg <[email protected]> wrote:
>>>>
>>>>> Thanks, Joe!
>>>>>
>>>>> Hi Joe S. - I'm definitely up for discussing and contributing.
>>>>>
>>>>> While building search-related ingestion systems, I've seen metadata and text extraction being done all the time; it's always there and always has to be done for building search indexes. Beyond that, OCR-related capabilities are often requested, and the advantage of Tika is that it supports OCR out of the box.
>>>>>
>>>>> - Dmitry
>>>>>
>>>>> On Thu, Mar 24, 2016 at 11:36 AM, Joe Witt <[email protected]> wrote:
>>>>>
>>>>>> Dmitry,
>>>>>>
>>>>>> Another community member (Joe Skora) has a PR outstanding for extracting metadata from media files using Tika. Perhaps it makes sense to broaden that to extract, in general, whatever Tika can find. Joe - perhaps you can discuss your ideas with Dmitry and see whether broadening is a good idea or whether domain-specific processors make more sense.
>>>>>>
>>>>>> This concept of extracting metadata from documents/text files, etc., using something like Tika is certainly useful, as it can then drive nice automated routing decisions.
>>>>>>
>>>>>> Thanks,
>>>>>> Joe
>>>>>>
>>>>>> On Thu, Mar 24, 2016 at 9:28 AM, Dmitry Goldenberg <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I see that the ExtractText processor extracts text using regex.
>>>>>>>
>>>>>>> What about a processor that extracts text and metadata from incoming files? That doesn't seem to exist - but perhaps I didn't quite look in the right spots.
>>>>>>>
>>>>>>> If that doesn't exist, I'd like to implement and commit it, using Apache Tika. There may also be a couple of related processors to go with it.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> - Dmitry
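
P.S. To make the init-time overlap check I mentioned at the top a bit more concrete, here is a minimal sketch of how it might look in a processor's customValidate(). Everything in it is an assumption on my part for illustration: the property names, the comma-separated pattern format, the class name, and the splitPatterns() helper are hypothetical and not part of the existing ExtractMediaMetadata PR; only the content/filename pair is shown, and the metadata and MIME-type pairs would follow the same shape.

// Minimal sketch, assumptions only: property names, the comma-separated
// pattern format, and splitPatterns() are hypothetical, not the actual
// ExtractMediaMetadata API. Only the content/filename pair is shown.
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.nifi.components.PropertyDescriptor;
import org.apache.nifi.components.ValidationContext;
import org.apache.nifi.components.ValidationResult;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.util.StandardValidators;

public abstract class ExtractMediaSketch extends AbstractProcessor {

    // Hypothetical properties: comma-separated lists of file name regex patterns.
    static final PropertyDescriptor INCLUDE_CONTENT_FILENAME_FILTER = new PropertyDescriptor.Builder()
            .name("INCLUDE_CONTENT_FILENAME_FILTER")
            .description("Patterns for which input files get their content extracted, by file name.")
            .required(false)
            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
            .build();

    static final PropertyDescriptor EXCLUDE_CONTENT_FILENAME_FILTER = new PropertyDescriptor.Builder()
            .name("EXCLUDE_CONTENT_FILENAME_FILTER")
            .description("Patterns for which input files do NOT get their content extracted, by file name.")
            .required(false)
            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
            .build();

    @Override
    protected Collection<ValidationResult> customValidate(final ValidationContext context) {
        final Collection<ValidationResult> results = new ArrayList<>();

        // Conservative overlap check: the same literal pattern configured on both
        // the include and the exclude side is reported as a configuration error.
        final Set<String> overlap = splitPatterns(context.getProperty(INCLUDE_CONTENT_FILENAME_FILTER).getValue());
        overlap.retainAll(splitPatterns(context.getProperty(EXCLUDE_CONTENT_FILENAME_FILTER).getValue()));

        if (!overlap.isEmpty()) {
            results.add(new ValidationResult.Builder()
                    .subject("Content filename filters")
                    .valid(false)
                    .explanation("pattern(s) " + overlap + " appear in both the include and the exclude filter")
                    .build());
        }
        return results;
    }

    // Splits a comma-separated property value into trimmed, non-empty patterns.
    private static Set<String> splitPatterns(final String value) {
        final Set<String> patterns = new LinkedHashSet<>();
        if (value != null) {
            for (final String pattern : value.split(",")) {
                if (!pattern.trim().isEmpty()) {
                    patterns.add(pattern.trim());
                }
            }
        }
        return patterns;
    }
}

Note the overlap rule here is deliberately conservative: it only flags a literal pattern that appears on both the include and exclude side. Deciding whether two arbitrary regexes can match the same file name is a much harder problem, so anything smarter than this would probably have to be best-effort.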
