What I'm suggesting is a single processor for both, but instead of using a mode property to determine which bits get extracted, you use the state of the relationships on the processor to configure which options Tika uses, and use a single pass to parse metadata into attributes and content into a new flow file transferred to the parsed relationship.
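A rough sketch of what that might look like, using the context.hasConnection() check mentioned further down the thread; the relationship names and surrounding processor are illustrative only, not taken from the PR:

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;

    // Hypothetical relationships for a combined Tika processor.
    public static final Relationship REL_ORIGINAL = new Relationship.Builder()
            .name("original")
            .description("Incoming flow file, with Tika metadata added as attributes")
            .build();

    public static final Relationship REL_PARSED = new Relationship.Builder()
            .name("parsed")
            .description("New flow file whose content is the extracted text")
            .build();

    public void onTrigger(ProcessContext context, ProcessSession session) {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // Content extraction is enabled by connecting the relationship,
        // not by a mode property.
        final boolean extractContent = context.hasConnection(REL_PARSED);
        // ... single Tika pass: always collect metadata; write text to a
        // child flow file only when extractContent is true ...
    }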
On the Tesseract front, it may make sense to do this through a controller service. A PDF processor might be interesting. Are you thinking of something like PDFBox, or Tika again?

Simon

> On 1 Apr 2016, at 01:30, Dmitry Goldenberg <[email protected]> wrote:
>
> Simon,
>
> Interesting commentary. The issue that Joe and I have both looked at, with the splitting of metadata and content extraction, is that if they're split then the underlying Tika extraction has to process the file twice: once to pull out the attributes and once to pull out the content. Perhaps it may be good to add ExtractMetadata and ExtractTextContent in addition to ExtractMediaAttributes? Seems kind of an overkill, but I may be wrong.
>
> It seems prudent to provide one wholesome, out-of-the-box extractor processor with options to extract just metadata, just content, or both metadata and content.
>
> I think what I'm hearing is that we need to allow for checking somewhere whether text/content has already been extracted by the time we get to the ExtractMediaAttributes processor. If that is the issue, then I believe the user would use RouteOnAttribute, and if the content is already filled in they'd not route to ExtractMediaAttributes.
>
> As far as OCR goes: Tika internally supports OCR by directing image files to Tesseract (if Tesseract is installed and configured properly). We've started talking about how this could be reconciled in ExtractMediaAttributes.
>
> I think that once we have the basic ExtractMediaAttributes, we could add filters for what files to enable OCR on, and we'd need to expose a few config parameters specific to OCR, such as the location of the Tesseract installation and the maximum file size on which to attempt the OCR. Perhaps there can also be a RunOCR processor which would be dedicated to running OCR. But since Tika already has OCR integrated, we'd probably want to take care of that in the ExtractMediaAttributes configuration.
>
> Additionally, I've proposed the idea of a ProcessPDF processor which would ascertain whether a PDF is 'text' or 'scanned'. If scanned, we would break it up into pages and run OCR on each page, then aggregate the extracted text.
>
> - Dmitry
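A sketch of how those OCR knobs might surface on the processor and reach Tika, assuming Tika 1.7+'s TesseractOCRConfig; the property names and defaults are illustrative, not an agreed design:

    import org.apache.nifi.components.PropertyDescriptor;
    import org.apache.nifi.processor.util.StandardValidators;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.ocr.TesseractOCRConfig;

    // Hypothetical OCR-related properties.
    public static final PropertyDescriptor TESSERACT_PATH = new PropertyDescriptor.Builder()
            .name("Tesseract Install Path")
            .description("Directory of the Tesseract installation Tika's OCR parser should use")
            .required(false)
            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
            .build();

    public static final PropertyDescriptor OCR_MAX_FILE_SIZE = new PropertyDescriptor.Builder()
            .name("OCR Maximum File Size")
            .description("Files larger than this are never sent to OCR")
            .required(false)
            .defaultValue("10 MB")
            .addValidator(StandardValidators.DATA_SIZE_VALIDATOR)
            .build();

    // Tika side: hand the configured path to the parser through the ParseContext
    // so AutoDetectParser routes image content to Tesseract.
    private ParseContext buildParseContext(String tesseractPath) {
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        ocrConfig.setTesseractPath(tesseractPath);
        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, ocrConfig);
        return parseContext;
    }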
> On Thu, Mar 31, 2016 at 3:19 PM, Simon Ball <[email protected]> wrote:
>
>> Just a thought…
>>
>> To keep consistent with other NiFi parse patterns, would it make sense to base the extraction of content on the presence of a relationship? So your Tika processor would have an original relation which would have metadata attached as attributes, and an extracted relation which would have the metadata and the processed content (text from an OCRed image, for example). That way you can just use context.hasConnection(relationship) to determine whether to enable the Tika content processing.
>>
>> This seems more idiomatic than a mode flag.
>>
>> Simon
>>
>>> On 31 Mar 2016, at 19:48, Joe Skora <[email protected]> wrote:
>>>
>>> Dmitry,
>>>
>>> I think we're good. I was confused because "XXX_METADATA_MIMETYPE_FILTER" entries referred to some MIME type of the metadata, but you meant to use the file's MIME type to select what files have metadata extracted.
>>>
>>> Sorry about that, I think we are on the same page.
>>>
>>> Joe
>>>
>>> On Thu, Mar 31, 2016 at 11:40 AM, Dmitry Goldenberg <[email protected]> wrote:
>>>
>>>> Hi Joe,
>>>>
>>>> I think if we have the filters in place then there's no need for the 'mode' enum, as the filters themselves guide the processor in deciding whether metadata and/or content is extracted for a given input file.
>>>>
>>>> Agreed on the handling of archives as a separate processor (template, seems like).
>>>>
>>>> I think it's easiest to do metadata and/or content in one processor, since it can tell Tika whether to extract metadata and/or content in one pass over the file bytes (as you pointed out).
>>>>
>>>> Agreed on the exclusions trumping inclusions; I think that makes sense.
>>>>
>>>>> We will only have a mimetype for the original flow file itself so I'm not sure about the metadata mimetype filter.
>>>>
>>>> I'm not sure where there might be an issue here. The metadata MIME type filter tells the processor for which MIME types to perform the metadata extraction. For instance, extract metadata for images and videos only. This could possibly be coupled with an exclusion filter for content that says, don't try to extract content from images and videos.
>>>>
>>>> I think with these filters we get all the bases covered:
>>>>
>>>> 1. include metadata?
>>>>    1. yes:
>>>>       1. determine the inclusion of metadata by filename pattern
>>>>       2. determine the inclusion of metadata by MIME type pattern
>>>>    2. no:
>>>>       1. determine the exclusion of metadata by filename pattern
>>>>       2. determine the exclusion of metadata by MIME type pattern
>>>> 2. include content?
>>>>    1. yes:
>>>>       1. determine the inclusion of content by filename pattern
>>>>       2. determine the inclusion of content by MIME type pattern
>>>>    2. no:
>>>>       1. determine the exclusion of content by filename pattern
>>>>       2. determine the exclusion of content by MIME type pattern
>>>>
>>>> Does this work?
>>>>
>>>> Thanks,
>>>> - Dmitry
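For illustration, the include/exclude evaluation Dmitry describes could reduce to something like the following, with exclusions taking precedence as agreed above; all field names are hypothetical:

    import java.util.regex.Pattern;

    // Exclusions trump inclusions; an unset include pattern means "include all".
    private static boolean passes(String value, Pattern include, Pattern exclude) {
        if (exclude != null && exclude.matcher(value).matches()) {
            return false;
        }
        return include == null || include.matcher(value).matches();
    }

    // For a given flow file, metadata extraction is on only if both the
    // filename and the MIME type pass their respective filters.
    boolean extractMetadata =
            passes(filename, includeMetadataByName, excludeMetadataByName)
         && passes(mimeType, includeMetadataByMime, excludeMetadataByMime);

    boolean extractContent =
            passes(filename, includeContentByName, excludeContentByName)
         && passes(mimeType, includeContentByMime, excludeContentByMime);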
>>>> On Thu, Mar 31, 2016 at 9:27 AM, Joe Skora <[email protected]> wrote:
>>>>
>>>>> Dmitry,
>>>>>
>>>>> Looking at this and your prior email.
>>>>>
>>>>> 1. I can see "extract metadata only" being as popular as "extract metadata and content". It will all depend on the type of media; for audio/video files adding the metadata to the flow file is enough, but for Word, PDF, etc. files the content may be wanted as well.
>>>>> 2. After thinking about it, I agree on an enum for mode.
>>>>> 3. I think any handling of zips or archive files should be handled by another processor; that keeps this processor cleaner and improves its ability for re-use.
>>>>> 4. I like the addition of exclude filters, but I'm not sure about adding content filters. We will only have a mimetype for the original flow file itself, so I'm not sure about the metadata mimetype filter. I think content filtering may be best left for another downstream processor, but it might run faster if included here since the entire content will be handled during extraction. If the content filters are implemented, for performance they need to short-circuit so that if the property is not set or is set to ".*" they don't evaluate the regex.
>>>>>    1. FILENAME_FILTER - selects flow files to process based on filename matching regex. (exists)
>>>>>    2. MIMETYPE_FILTER - selects flow files to process based on mimetype matching regex. (exists)
>>>>>    3. FILENAME_EXCLUDE - excludes already selected flow files from processing based on filename matching regex. (new)
>>>>>    4. MIMETYPE_EXCLUDE - excludes already selected flow files from processing based on mimetype matching regex. (new)
>>>>>    5. CONTENT_FILTER (optional) - selects flow files for output based on extracted content matching regex. (new)
>>>>>    6. CONTENT_EXCLUDE (optional) - excludes flow files from output based on extracted content matching regex. (new)
>>>>> 5. As indicated in the descriptions in #4, I don't think overlapping filters are an error; instead, excludes should take precedence over includes. Then I can include a domain (like A*) but exclude sub-sets (like AXYZ*).
>>>>>
>>>>> I'm sure there's something we missed, but I think that covers most of it.
>>>>>
>>>>> Regards,
>>>>> Joe
>>>>>
>>>>> On Wed, Mar 30, 2016 at 1:56 PM, Dmitry Goldenberg <[email protected]> wrote:
>>>>>
>>>>>> Joe,
>>>>>>
>>>>>> Upon some thinking, I've started wondering whether all the cases can be covered by the following filters:
>>>>>>
>>>>>> INCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input files get their content extracted, by file name
>>>>>> INCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input files get their metadata extracted, by file name
>>>>>> INCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input files get their content extracted, by MIME type
>>>>>> INCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input files get their metadata extracted, by MIME type
>>>>>>
>>>>>> EXCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input files do NOT get their content extracted, by file name
>>>>>> EXCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input files do NOT get their metadata extracted, by file name
>>>>>> EXCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input files do NOT get their content extracted, by MIME type
>>>>>> EXCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input files do NOT get their metadata extracted, by MIME type
>>>>>>
>>>>>> I believe this gets all the bases covered. At processor init time, we can analyze the inclusions vs. exclusions; any overlap would cause a configuration error.
>>>>>>
>>>>>> Let me know what you think, thanks.
>>>>>> - Dmitry
>>>>>>
>>>>>> On Wed, Mar 30, 2016 at 10:41 AM, Dmitry Goldenberg <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Joe,
>>>>>>>
>>>>>>> I follow your reasoning on the semantics of "media". One might argue that media files are a case of "document" or that a document is a case of "media".
>>>>>>>
>>>>>>> I'm not proposing filters for the mode of processing; I'm proposing a flag/enum with 3 values:
>>>>>>>
>>>>>>> A) extract metadata only;
>>>>>>> B) extract content only and place it into the flowfile content;
>>>>>>> C) extract both metadata and content.
>>>>>>>
>>>>>>> I think the default should be C, to extract both. At least in my experience, most flows I've dealt with were interested in extracting both.
>>>>>>>
>>>>>>> I don't see how this mode would benefit from being expression driven.
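A sketch of that three-value mode as a NiFi property, borrowing the metadataOnly/contentOnly/metadataAndContent names proposed further down the thread; the property name itself is hypothetical:

    import org.apache.nifi.components.AllowableValue;
    import org.apache.nifi.components.PropertyDescriptor;

    public static final AllowableValue MODE_METADATA = new AllowableValue("metadataOnly");
    public static final AllowableValue MODE_CONTENT = new AllowableValue("contentOnly");
    public static final AllowableValue MODE_BOTH = new AllowableValue("metadataAndContent");

    // Hypothetical mode property; defaults to extracting both, per the discussion.
    public static final PropertyDescriptor EXTRACTION_MODE = new PropertyDescriptor.Builder()
            .name("Extraction Mode")
            .description("Whether to extract metadata, content, or both")
            .required(true)
            .allowableValues(MODE_METADATA, MODE_CONTENT, MODE_BOTH)
            .defaultValue(MODE_BOTH.getValue())
            .build();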
>>>>>>> I think we can add this enum mode and have the basic use case covered.
>>>>>>>
>>>>>>> Additionally, further down the line, I was thinking we could ponder the following (these have been essential in search engine ingestion):
>>>>>>>
>>>>>>> 1. Extraction from compressed files/archives. How would UnpackContent work with ExtractMediaAttributes? The use case being: we've got a zip file as input and want to crack it open and unravel it recursively; it may have other, nested zips inside, along with other documents. One way to handle this is to treat the whole archive as one document and merge all attributes into one FlowFile. The other way would be to treat each archive entry as its own flow file and keep a pointer back at the parent archive. Yet another case is when the user might want to only extract the 'leaf' entries and discard any parent container archives.
>>>>>>>
>>>>>>> 2. Attachments and embeddings. Users may want to treat any attached or embedded files as separate flowfiles, with perhaps pointers back to the parent files. This definitely warrants a filter. Oftentimes Office documents have 'media' embeddings which are often not of interest, especially for the case of ingesting into a search engine.
>>>>>>>
>>>>>>> 3. PDF. For PDFs, we can do OCR. This is important for the 'image'/scanned PDFs for which Tika won't extract text.
>>>>>>>
>>>>>>> I'd like to understand how much of this is already supported in NiFi, and if not, I'd volunteer/collaborate to implement some of this.
>>>>>>>
>>>>>>> - Dmitry
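A rough sketch of the 'text' vs. 'scanned' check behind the ProcessPDF idea in point 3 above, assuming PDFBox 2.x; the near-empty-text heuristic and threshold are illustrative only:

    import java.io.File;
    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    // Heuristic: a 'scanned' PDF has no usable text layer, so an (almost)
    // empty extraction result suggests the pages should be routed to OCR.
    public static boolean isScannedPdf(File pdf) throws IOException {
        try (PDDocument doc = PDDocument.load(pdf)) {
            String text = new PDFTextStripper().getText(doc);
            return text.trim().length() < 16; // arbitrary near-empty threshold
        }
    }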
>>>>>>> On Wed, Mar 30, 2016 at 9:03 AM, Joe Skora <[email protected]> wrote:
>>>>>>>
>>>>>>>> Dmitry,
>>>>>>>>
>>>>>>>> Are you proposing separate filters that determine the mode of processing, metadata/content/metadataAndContent? I was thinking of one set of selection filters and a static mode switch at the processor instance level, to make configuration more obvious, such that one instance of the processor will handle a known set of files regardless of the processing mode.
>>>>>>>>
>>>>>>>> I was thinking it would be useful for the mode switch to support expression language, but I'm not sure about that, since the selection filters will control what files get processed and it would be harder to configure if the output flow file could vary between source format and extracted text. So, while it might be easy to do, and occasionally useful, I think in normal use I'd never have a varying mode but would more likely have multiple processor instances with some routing or selection going on further upstream.
>>>>>>>>
>>>>>>>> I wrestled with the naming issue too. I went with "ExtractMediaAttributes" over "ExtractDocumentAttributes" because it seemed to represent the broader context better. In reality, media files are documents and documents are media files, but in the end it's all just semantics.
>>>>>>>>
>>>>>>>> I don't think I would change the NAR bundle name, because I think "nifi-media-nar" establishes it as a place to collect this and other media-related processors in the future.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Joe
>>>>>>>>
>>>>>>>> On Tue, Mar 29, 2016 at 3:09 PM, Dmitry Goldenberg <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Joe,
>>>>>>>>>
>>>>>>>>> Thanks for all the details.
>>>>>>>>>
>>>>>>>>> I wanted to propose that I do some of this work, so as to go through the full cycle of developing a processor and committing it.
>>>>>>>>>
>>>>>>>>> Once your changes are merged, I could extend your 'ExtractMediaMetadata' processor to handle the content, in addition to the metadata.
>>>>>>>>>
>>>>>>>>> We could keep the FILENAME_FILTER and MIMETYPE_FILTER but add a mode with 3 values: metadataOnly, contentOnly, metadataAndContent.
>>>>>>>>>
>>>>>>>>> One thing that looks to be a design issue right now is that your changes and the 'nomenclature' seem media-oriented ("nifi-media-nar" etc.).
>>>>>>>>>
>>>>>>>>> Would it make sense to have a generic processor ExtractDocumentMetadataAndContent? Are there enough specifics in the image/video processing stuff to warrant that being a separate layer, perhaps a subclass of ExtractDocumentMetadataAndContent? Might it make sense to rename nifi-media-nar into nifi-text-extract-nar?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> - Dmitry
>>>>>>>>>
>>>>>>>>> On Tue, Mar 29, 2016 at 2:36 PM, Joe Skora <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Dmitry,
>>>>>>>>>>
>>>>>>>>>> Yeah, I agree, Tika is pretty impressive. The original ticket, NIFI-615 <https://issues.apache.org/jira/browse/NIFI-615>, wanted extraction of metadata from WAV files, but as I got into it I found Tika, so for the same effort it supports the 1,000+ file formats Tika understands. That new processor is called "ExtractMediaMetadata"; you can pull PR-252 <https://github.com/apache/nifi/pull/252> from GitHub if you want to give it a try before it's merged.
>>>>>>>>>>
>>>>>>>>>> Extracting content for those 1,000+ formats would be a valuable addition. I see two possible approaches: 1) create a new "ExtractMediaContent" processor that would put the document content in a new flow file, and 2) extend the new "ExtractMediaMetadata" processor so it can extract metadata, content, or both. One combined processor makes sense if it can provide a performance gain; otherwise two complementary processors may make usage easier.
>>>>>>>>>>
>>>>>>>>>> I'm glad to help if you want to take a cut at the processor yourself, or I can take a crack at it myself if you'd prefer.
>>>>>>>>>>
>>>>>>>>>> Don't hesitate to ask questions or share comments and feedback regarding the ExtractMediaMetadata processor or the addition of content handling.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Joe Skora
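A minimal sketch of the combined, single-pass approach inside a processor's onTrigger, assuming Tika's AutoDetectParser; the attribute prefix, relationship handling, and error handling are simplified and illustrative:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    // One Tika pass yields both the metadata and the text, so the file's
    // bytes are only read once.
    final Metadata metadata = new Metadata();
    final BodyContentHandler handler = new BodyContentHandler(-1); // no size limit
    session.read(flowFile, in -> {
        try {
            new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
        } catch (Exception e) {
            throw new IOException(e);
        }
    });

    // Metadata becomes attributes on the original flow file.
    final Map<String, String> attrs = new HashMap<>();
    for (String name : metadata.names()) {
        attrs.put("media." + name, metadata.get(name));
    }
    flowFile = session.putAllAttributes(flowFile, attrs);

    // Extracted text becomes the content of a child flow file.
    FlowFile textFile = session.create(flowFile);
    textFile = session.write(textFile,
            out -> out.write(handler.toString().getBytes(StandardCharsets.UTF_8)));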
>>>>>>>>>> On Thu, Mar 24, 2016 at 11:40 AM, Dmitry Goldenberg <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks, Joe!
>>>>>>>>>>>
>>>>>>>>>>> Hi Joe S. - I'm definitely up for discussing and contributing.
>>>>>>>>>>>
>>>>>>>>>>> While building search-related ingestion systems, I've seen metadata and text extraction being done all the time; it's always there and always has to be done for building search indexes. Beyond that, OCR-related capabilities are often requested, and the advantage of Tika is that it supports OCR out of the box.
>>>>>>>>>>>
>>>>>>>>>>> - Dmitry
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 24, 2016 at 11:36 AM, Joe Witt <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Dmitry,
>>>>>>>>>>>>
>>>>>>>>>>>> Another community member (Joe Skora) has a PR outstanding for extracting metadata from media files using Tika. Perhaps it makes sense to broaden that to extract, in general, whatever Tika can find. Joe - perhaps you can discuss your ideas with Dmitry and see if broadening is a good idea or if domain-specific ones make more sense.
>>>>>>>>>>>>
>>>>>>>>>>>> This concept of extracting metadata from documents/text files, etc., using something like Tika is certainly useful, as that can then drive nice automated routing decisions.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Joe
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 24, 2016 at 9:28 AM, Dmitry Goldenberg <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see that the ExtractText processor extracts text using regex.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What about a processor that extracts text and metadata from incoming files? That doesn't seem to exist - but perhaps I didn't quite look in the right spots.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If that doesn't exist, I'd like to implement and commit it, using Apache Tika. There may also be a couple of related processors to that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> - Dmitry
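For reference, a standalone sketch of the bare Tika call such a processor would wrap; a single parse yields both the metadata and the text (the class name is illustrative):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaExtractDemo {
        public static void main(String[] args) throws Exception {
            Metadata metadata = new Metadata();
            BodyContentHandler handler = new BodyContentHandler(); // default 100k char limit
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
            }
            for (String name : metadata.names()) {
                System.out.println(name + " = " + metadata.get(name));
            }
            System.out.println("---- extracted text ----");
            System.out.println(handler.toString());
        }
    }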
