Got it. What's the typical JIRA ticket triage process like within NiFi? I'm curious how consensus is built around designs, how tickets get assigned, and how it's decided what goes into a release.
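
Separately, to make sure we mean the same thing by the three-value "mode" setting discussed in the thread below, here is a rough sketch in plain Python (not NiFi's Java API). Everything in it is an assumption for illustration: the names ExtractionMode, parse_once and extract are made up, parse_once is only a stand-in for a single Tika pass, and putting the extracted text into a "text" attribute is just the "in-place" idea from earlier in the thread.

```python
from enum import Enum


class ExtractionMode(Enum):
    # The three values proposed in the thread (names are hypothetical).
    METADATA_ONLY = "extractMetadataOnly"
    CONTENT_ONLY = "extractContentOnly"
    METADATA_AND_CONTENT = "extractMetadataAndContent"


def parse_once(raw):
    """Stand-in for one parser pass over the bytes; returns (metadata, text)."""
    text = raw.decode("utf-8", errors="replace")
    return {"length": str(len(raw))}, text


def extract(attributes, raw, mode):
    """Return a new attribute dict; the mode decides what gets copied in."""
    metadata, text = parse_once(raw)  # single pass, regardless of mode
    result = dict(attributes)
    if mode in (ExtractionMode.METADATA_ONLY, ExtractionMode.METADATA_AND_CONTENT):
        result.update(metadata)
    if mode in (ExtractionMode.CONTENT_ONLY, ExtractionMode.METADATA_AND_CONTENT):
        result["text"] = text
    return result
```

The point of the sketch is only that one pass can serve all three modes, which is the argument for a single processor with a mode over two or three separate ones.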

On Fri, Apr 1, 2016 at 10:33 AM, Mark Payne <[email protected]> wrote:

> As far as I know, the processors haven't made it into any release yet. If that is the case, then we could just remove those properties altogether and it's easy.
>
> If they have already been released, then we would need to ensure that the processor is invalid on startup (it doesn't accept those as dynamic properties) and then we update the migration guide to explain how to obtain the same behavior.
>
> But either way, we can definitely remove the properties if it's determined that there is not a good enough reason to keep them in.
>
> -Mark
>
> > On Apr 1, 2016, at 10:10 AM, Dmitry Goldenberg <[email protected]> wrote:
> >
> > Hi Mark,
> >
> > That is a good point. It also has crossed my mind. AFAIK, ExtractMediaAttributes already has a couple of similar filters on it; Joe S., please correct me if I'm wrong. I merely suggested that we extend these filters.
> >
> > I'd have to agree with your points, Mark, that it's cleaner to keep the conditionals separate, on RouteOnAttribute and the like.
> >
> > If that is the consensus then I believe we're back to the idea of a "mode" configuration on ExtractMediaAttributes, with 3 values: a) extractMetadataOnly, b) extractContentOnly, c) extractMetadataAndContent. As an alternative we have also considered rolling 3 separate processors: ExtractMetadata, ExtractContent, and ExtractMetadataAndContent. Given that ExtractMediaAttributes already exists, I think it may be easiest to roll with the new "mode" config parameter.
> >
> > One question then is also, what to do with the filters that are already on ExtractMediaAttributes - ? Should they still be there?
> >
> > BTW, I've filed the following JIRA tickets related to the topics we've been discussing:
> >
> > Extract metadata and text - NIFI-1717 <https://issues.apache.org/jira/browse/NIFI-1717>
> > PerformOCR - NIFI-1718 <https://issues.apache.org/jira/browse/NIFI-1718>
> > ProcessPDF - NIFI-1719 <https://issues.apache.org/jira/browse/NIFI-1719>
> >
> > I'll propagate more info into those as we discuss things more.
> >
> > Mark, could you take a look at NIFI-1716 <https://issues.apache.org/jira/browse/NIFI-1716>? This is a separate topic so we could create a separate discussion thread for the CSV splitter.
> >
> > Thanks,
> > - Dmitry
> >
> > On Fri, Apr 1, 2016 at 9:06 AM, Mark Payne <[email protected]> wrote:
> >
> >> Dmitry,
> >>
> >> I would be a bit concerned about providing options for filters that include and exclude certain things. I believe that if you send a FlowFile to the Processor, then the Processor should do its thing. If you want to filter out which FlowFiles have their content extracted, for example, I would suggest using a Processor like RouteOnAttribute to ensure that only the appropriate FlowFiles are processed by the ExtractMediaMetadata processor.
> >>
> >> This allows the metadata extraction processor to focus purely on extracting metadata and doesn't have to deal with all of the logic of filtering things out. The logic for filtering things out is almost guaranteed to grow much more complex as people start to use this more and more. NiFi already provides several route-based processors to allow for a great deal of flexibility with this type of logic (RouteOnAttribute, RouteOnContent, ScanAttribute, ScanContent, etc.).
> >>
> >> Thanks
> >> -Mark
> >>
> >>> On Apr 1, 2016, at 12:55 AM, Dmitry Goldenberg <[email protected]> wrote:
> >>>
> >>> Simon,
> >>>
> >>> I believe we've moved on past the 'mode' option and have now switched to talking about how the include/exclude filters, for metadata and content on the one hand, and filename or MIME type based on the other hand, would drive whether meta, content, or both would get extracted.
> >>>
> >>> For example, a user could configure the ExtractMediaAttributes processor to extract metadata for all image files (but not content), extract content only for plain text documents (but no metadata), or both meta and content for documents with an extension ".pqr", based on the filename.
> >>>
> >>> Could you elaborate on your vision of how relationships could "drive" this type of functionality? Joe has already built some of the filtering into the processor; I just suggested to extend that further, and we get all the bases covered.
> >>>
> >>> I'm not sure I followed your comment on the extracted content being transferred into a new FlowFile. My thoughts were that the extracted content would be inserted into a new, dedicated field, called for example, "text", on *the same* FlowFile. I imagine that for a lot of use-cases, especially data ingestion into a search engine, the extracted attributes *and* the extracted text must travel together as part of the ingested document, with the original flowfile-content most likely getting dropped on the way into the index.
> >>>
> >>> I guess an alternative could be to have an option to represent the extraction results as a new document, and an option to drop the original, and an option to copy the original's attributes onto the new doc. Seems rather complex. I like the "in-place" extraction.
> >>>
> >>> Could you also elaborate on how a controller service would handle OCR? When a document floats into ExtractMediaAttributes, assuming Tesseract is installed properly, Tika will already automatically fire off OCR. Unless we turn that off and cause OCR to only be supported via this service. I'm tempted to say why don't we just let Tika do its job for all cases, OCR included. Caveat being that OCR is expensive and it would be nice to have ways of ensuring it has enough resources and doesn't bog the flow down.
> >>>
> >>> For the PDF processor, I'm thinking, yes, PDFBox to break it up into pages and then apply Tika page by page, then aggregate the output together, with a configurable max of up to N pages per document to process (due to how slow OCR is). I already have a prototype of this going, I'll file a JIRA ticket for this feature.
> >>>
> >>> - Dmitry
> >>>
> >>> On Thu, Mar 31, 2016 at 8:43 PM, Simon Ball <[email protected]> wrote:
> >>>
> >>>> What I’m suggesting is a single processor for both, but instead of using a mode property to determine which bits get extracted, you use the state of the relations on the processor to configure which options Tika uses, and use a single pass to parse metadata into attributes and content into a new flow file transferred to the parsed relation.
> >>>>
> >>>> On the Tesseract front, it may make sense to do this through a controller service.
> >>>>
> >>>> A PDF processor might be interesting. Are you thinking of something like PDFBox, or Tika again?
> >>>>
> >>>> Simon
> >>>>
> >>>>> On 1 Apr 2016, at 01:30, Dmitry Goldenberg <[email protected]> wrote:
> >>>>>
> >>>>> Simon,
> >>>>>
> >>>>> Interesting commentary.
> >>>>> The issue that Joe and I have both looked at, with the splitting of metadata and content extraction, is that if they're split then the underlying Tika extraction has to process the file twice: once to pull out the attributes and once to pull out the content. Perhaps it may be good to add ExtractMetadata and ExtractTextContent in addition to ExtractMediaAttributes - ? Seems kind of an overkill but I may be wrong.
> >>>>>
> >>>>> It seems prudent to provide one wholesome, out-of-the-box extractor processor with options to extract just metadata, just content, or both metadata and content.
> >>>>>
> >>>>> I think what I'm hearing is that we need to allow for checking somewhere for whether text/content has already been extracted by the time we get to the ExtractMediaAttributes processor - ? If that is the issue then I believe the user would use RouteOnAttribute and if the content is already filled in then they'd not route to ExtractMediaAttributes.
> >>>>>
> >>>>> As far as the OCR. Tika internally supports OCR by directing image files to Tesseract (if Tesseract is installed and configured properly). We've started talking about how this could be reconciled in the ExtractMediaAttributes.
> >>>>>
> >>>>> I think that once we have the basic ExtractMediaAttributes, we could add filters for what files to enable the OCR on, and we'd need to expose a few config parameters specific to OCR, such as e.g. the location of the Tesseract installation and the maximum file size on which to attempt the OCR. Perhaps there can also be a RunOCR processor which would be dedicated to running OCR. But since Tika already has OCR integrated we'd probably want to take care of that in the ExtractMediaAttributes configuration.
> >>>>>
> >>>>> Additionally, I've proposed the idea of a ProcessPDF processor which would ascertain whether a PDF is 'text' or 'scanned'. If scanned, we would break it up into pages and run OCR on each page, then aggregate the extracted text.
> >>>>>
> >>>>> - Dmitry
> >>>>>
> >>>>> On Thu, Mar 31, 2016 at 3:19 PM, Simon Ball <[email protected]> wrote:
> >>>>>
> >>>>>> Just a thought…
> >>>>>>
> >>>>>> To keep consistent with other NiFi Parse patterns, would it make sense to base the extraction of content on the presence of a relation? So your Tika processor would have an original relation which would have metadata attached as attributes, and an extracted relation which would have the metadata and the processed content (text from OCRed image for example). That way you can just use context.hasConnection(relationship) to determine whether to enable the Tika content processing.
> >>>>>>
> >>>>>> This seems more idiomatic than a mode flag.
> >>>>>>
> >>>>>> Simon
> >>>>>>
> >>>>>>> On 31 Mar 2016, at 19:48, Joe Skora <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Dmitry,
> >>>>>>>
> >>>>>>> I think we're good. I was confused because I thought the "XXX_METADATA MIMETYPE FILTER" entries referred to some MIME type of the metadata, but you meant to use the file's MIME type to select what files have metadata extracted.
> >>>>>>>
> >>>>>>> Sorry about that, I think we are on the same page.
> >>>>>>>
> >>>>>>> Joe
> >>>>>>>
> >>>>>>> On Thu, Mar 31, 2016 at 11:40 AM, Dmitry Goldenberg <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hi Joe,
> >>>>>>>>
> >>>>>>>> I think if we have the filters in place then there's no need for the 'mode' enum, as the filters themselves guide the processor in deciding whether metadata and/or content is extracted for a given input file.
> >>>>>>>>
> >>>>>>>> Agreed on the handling of archives as a separate processor (template, seems like).
> >>>>>>>>
> >>>>>>>> I think it's easiest to do both metadata and/or content in one processor since it can tell Tika whether to extract metadata and/or content, in one pass over the file bytes (as you pointed out).
> >>>>>>>>
> >>>>>>>> Agreed on the exclusions trumping inclusions; I think that makes sense.
> >>>>>>>>
> >>>>>>>> >> We will only have a mimetype for the original flow file itself so I'm not sure about the metadata mimetype filter.
> >>>>>>>>
> >>>>>>>> I'm not sure where there might be an issue here. The metadata MIME type filter tells the processor for which MIME types to perform the metadata extraction. For instance, extract metadata for images and videos, only. This could possibly be coupled with an exclusion filter for content that says, don't try to extract content from images and videos.
> >>>>>>>>
> >>>>>>>> I think with the six filters we get all the bases covered:
> >>>>>>>>
> >>>>>>>> 1. include metadata? --
> >>>>>>>>    1. yes --
> >>>>>>>>       1. determine the inclusion of metadata by filename pattern
> >>>>>>>>       2. determine the inclusion of metadata by MIME type pattern
> >>>>>>>>    2. no --
> >>>>>>>>       1. determine the exclusion of metadata by filename pattern
> >>>>>>>>       2. determine the exclusion of metadata by MIME type pattern
> >>>>>>>> 2. include content? --
> >>>>>>>>    1. yes --
> >>>>>>>>       1. determine the inclusion of content by filename pattern
> >>>>>>>>       2. determine the inclusion of content by MIME type pattern
> >>>>>>>>    2. no --
> >>>>>>>>       1. determine the exclusion of content by filename pattern
> >>>>>>>>       2. determine the exclusion of content by MIME type pattern
> >>>>>>>>
> >>>>>>>> Does this work?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> - Dmitry
> >>>>>>>>
> >>>>>>>> On Thu, Mar 31, 2016 at 9:27 AM, Joe Skora <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Dmitry,
> >>>>>>>>>
> >>>>>>>>> Looking at this and your prior email.
> >>>>>>>>>
> >>>>>>>>> 1. I can see "extract metadata only" being as popular as "extract metadata and content". It will all depend on the type of media, for audio/video files adding the metadata to the flow file is enough but for Word, PDF, etc. files the content may be wanted as well.
> >>>>>>>>> 2. After thinking about it, I agree on an enum for mode.
> >>>>>>>>> 3. I think any handling of zips or archive files should be handled by another processor, that keeps this processor cleaner and improves its ability for re-use.
> >>>>>>>>> 4. I like the addition of exclude filters but I'm not sure about adding content filters. We will only have a mimetype for the original flow file itself so I'm not sure about the metadata mimetype filter. I think content filtering may be best left for another downstream processor, but it might be run faster if included here since the entire content will be handled during extraction.
> >>>>>>>>>    If the content filters are implemented, for performance they need to short circuit so that if the property is not set or is set to ".*" they don't evaluate the regex.
> >>>>>>>>>    1. FILENAME_FILTER - selects flow files to process based on filename matching regex. (exists)
> >>>>>>>>>    2. MIMETYPE_FILTER - selects flow files to process based on mimetype matching regex. (exists)
> >>>>>>>>>    3. FILENAME_EXCLUDE - excludes already selected flow files from processing based on filename matching regex. (new)
> >>>>>>>>>    4. MIMETYPE_EXCLUDE - excludes already selected flow files from processing based on mimetype matching regex. (new)
> >>>>>>>>>    5. CONTENT_FILTER (optional) - selects flow files for output based on extracted content matching regex. (new)
> >>>>>>>>>    6. CONTENT_EXCLUDE (optional) - excludes flow files from output based on extracted content matching regex. (new)
> >>>>>>>>> 5. As indicated in the descriptions in #4, I don't think overlapping filters are an error, instead excludes should take precedence over includes. Then I can include a domain (like A*) but exclude sub-sets (like AXYZ*).
> >>>>>>>>>
> >>>>>>>>> I'm sure there's something we missed, but I think that covers most of it.
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Joe
> >>>>>>>>>
> >>>>>>>>> On Wed, Mar 30, 2016 at 1:56 PM, Dmitry Goldenberg <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Joe,
> >>>>>>>>>>
> >>>>>>>>>> Upon some thinking, I've started wondering whether all the cases can be covered by the following filters:
> >>>>>>>>>>
> >>>>>>>>>> INCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input files get their content extracted, by file name
> >>>>>>>>>> INCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input files get their metadata extracted, by file name
> >>>>>>>>>> INCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input files get their content extracted, by MIME type
> >>>>>>>>>> INCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input files get their metadata extracted, by MIME type
> >>>>>>>>>>
> >>>>>>>>>> EXCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input files do NOT get their content extracted, by file name
> >>>>>>>>>> EXCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input files do NOT get their metadata extracted, by file name
> >>>>>>>>>> EXCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input files do NOT get their content extracted, by MIME type
> >>>>>>>>>> EXCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input files do NOT get their metadata extracted, by MIME type
> >>>>>>>>>>
> >>>>>>>>>> I believe this gets all the bases covered. At processor init time, we can analyze the inclusions vs. exclusions; any overlap would cause a configuration error.
> >>>>>>>>>>
> >>>>>>>>>> Let me know what you think, thanks.
> >>>>>>>>>> - Dmitry
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Mar 30, 2016 at 10:41 AM, Dmitry Goldenberg <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Joe,
> >>>>>>>>>>>
> >>>>>>>>>>> I follow your reasoning on the semantics of "media". One might argue that media files are a case of "document" or that a document is a case of "media".
> >>>>>>>>>>>
> >>>>>>>>>>> I'm not proposing filters for the mode of processing, I'm proposing a flag/enum with 3 values:
> >>>>>>>>>>>
> >>>>>>>>>>> A) extract metadata only;
> >>>>>>>>>>> B) extract content only and place it into the flowfile content;
> >>>>>>>>>>> C) extract both metadata and content.
> >>>>>>>>>>>
> >>>>>>>>>>> I think the default should be C, to extract both. At least in my experience most flows I've dealt with were interested in extracting both.
> >>>>>>>>>>>
> >>>>>>>>>>> I don't see how this mode would benefit from being expression driven - ?
> >>>>>>>>>>>
> >>>>>>>>>>> I think we can add this enum mode and have the basic use case covered.
> >>>>>>>>>>>
> >>>>>>>>>>> Additionally, further down the line, I was thinking we could ponder the following (these have been essential in search engine ingestion):
> >>>>>>>>>>>
> >>>>>>>>>>> 1. Extraction from compressed files/archives. How would UnpackContent work with ExtractMediaAttributes? Use-case being, we've got a zip file as input and want to crack it open and unravel it recursively; it may have other, nested zips inside, along with other documents. One way to handle this is to treat the whole archive as one document and merge all attributes into one FlowFile.
> >>>>>>>>>>>    The other way would be to treat each archive entry as its own flow file and keep a pointer back at the parent archive. Yet another case is when the user might want to only extract the 'leaf' entries and discard any parent container archives.
> >>>>>>>>>>>
> >>>>>>>>>>> 2. Attachments and embeddings. Users may want to treat any attached or embedded files as separate flowfiles with perhaps pointers back to the parent files. This definitely warrants a filter. Oftentimes Office documents have 'media' embeddings which are often not of interest, especially for the case of ingesting into a search engine.
> >>>>>>>>>>>
> >>>>>>>>>>> 3. PDF. For PDF's, we can do OCR. This is important for the 'image'/scanned PDF's for which Tika won't extract text.
> >>>>>>>>>>>
> >>>>>>>>>>> I'd like to understand how much of this is already supported in NiFi and if not I'd volunteer/collaborate to implement some of this.
> >>>>>>>>>>>
> >>>>>>>>>>> - Dmitry
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Mar 30, 2016 at 9:03 AM, Joe Skora <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Dmitry,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Are you proposing separate filters that determine the mode of processing, metadata/content/metadataAndContent? I was thinking of one set of selection filters and a static mode switch at the processor instance level, to make configuration more obvious such that one instance of the processor will handle a known set of files regardless of the processing mode.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I was thinking it would be useful for the mode switch to support expression language, but I'm not sure about that since the selection filters will control what files get processed and it would be harder to configure if the output flow file could vary between source format and extracted text. So, while it might be easy to do, and occasionally useful, I think in normal use I'd never have a varying mode but would more likely have multiple processor instances with some routing or selection going on further upstream.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I wrestled with the naming issue too. I went with "ExtractMediaAttributes" over "ExtractDocumentAttributes" because it seemed to represent the broader context better. In reality, media files are documents and documents are media files, but in the end it's all just semantics.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I don't think I would change the NAR bundle name, because I think "nifi-media-nar" establishes it as a place to collect this and other media related processors in the future.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regards,
> >>>>>>>>>>>> Joe
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Mar 29, 2016 at 3:09 PM, Dmitry Goldenberg <[email protected]> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Joe,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for all the details.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I wanted to propose that I do some of this work so as to go through the full cycle of developing a processor and committing it.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Once your changes are merged, I could extend your 'ExtractMediaMetadata' processor to handle the content, in addition to the metadata.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> We could keep the FILENAME_FILTER and MIMETYPE_FILTER but add a mode with 3 values: metadataOnly, contentOnly, metadataAndContent.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> One thing that looks to be a design issue right now is, your changes and the 'nomenclature' seem media-oriented ("nifi-media-nar" etc.)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Would it make sense to have a generic processor ExtractDocumentMetadataAndContent? Are there enough specifics in the image/video processing stuff to warrant that to be a separate layer; perhaps a subclass of ExtractDocumentMetadataAndContent? Might it make sense to rename nifi-media-nar into nifi-text-extract-nar?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> - Dmitry
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, Mar 29, 2016 at 2:36 PM, Joe Skora <[email protected]> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Dmitry,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Yeah, I agree, Tika is pretty impressive. The original ticket, NIFI-615 <https://issues.apache.org/jira/browse/NIFI-615>, wanted extraction of metadata from WAV files, but as I got into it I found Tika, so for the same effort it supports the 1,000+ file formats Tika understands.
> >>>>>>>>>>>>>> That new processor is called "ExtractMediaMetadata"; you can pull PR-252 <https://github.com/apache/nifi/pull/252> from GitHub if you want to give it a try before it's merged.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Extracting content for those 1,000+ formats would be a valuable addition. I see two possible approaches, 1) create a new "ExtractMediaContent" processor that would put the document content in a new flow file, and 2) extend the new "ExtractMediaMetadata" processor so it can extract metadata, content, or both. One combined processor makes sense if it can provide a performance gain, otherwise two complementary processors may make usage easier.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm glad to help if you want to take a cut at the processor yourself, or I can take a crack at it myself if you'd prefer.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Don't hesitate to ask questions or share comments and feedback regarding the ExtractMediaMetadata processor or the addition of content handling.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>> Joe Skora
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, Mar 24, 2016 at 11:40 AM, Dmitry Goldenberg <[email protected]> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks, Joe!
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi Joe S. - I'm definitely up for discussing and contributing.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> While building search-related ingestion systems, I've seen metadata and text extraction being done all the time; it's always there and always has to be done for building search indexes. Beyond that, OCR-related capabilities are often requested, and the advantage of Tika is that it supports OCR out of the box.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> - Dmitry
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Thu, Mar 24, 2016 at 11:36 AM, Joe Witt <[email protected]> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Dmitry,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Another community member (Joe Skora) has a PR outstanding for extracting metadata from media files using Tika. Perhaps it makes sense to broaden that to in general extract what Tika can find. Joe - perhaps you can discuss your ideas with Dmitry and see if broadening is a good idea or if rather domain specific ones make more sense.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This concept of extracting metadata from documents/text files, etc. using something like Tika is certainly useful as that then can drive nice automated routing decisions.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>> Joe
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, Mar 24, 2016 at 9:28 AM, Dmitry Goldenberg <[email protected]> wrote:
> >>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I see that the ExtractText processor extracts text using regex.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> What about a processor that extracts text and metadata from incoming files? That doesn't seem to exist - but perhaps I didn't quite look in the right spots.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> If that doesn't exist I'd like to implement and commit it, using Apache Tika. There may also be a couple of related processors to that.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thoughts?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>> - Dmitry
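
P.S. In case the include/exclude filters do stay on the processor, the precedence rule from the thread above (excludes win over includes, and an unset or ".*" include filter short-circuits without evaluating a regex) could be sketched roughly as follows. This is plain Python, not NiFi code; the function name is_selected and its parameters are hypothetical, and full-string matching is my assumption about how the patterns would be applied.

```python
import re


def is_selected(value, include=None, exclude=None):
    """Decide whether a filename or MIME type passes the filters."""
    # Include: unset or ".*" selects everything, so skip the regex entirely.
    if include not in (None, ".*") and re.fullmatch(include, value) is None:
        return False
    # Exclude: unset excludes nothing; otherwise an exclude match wins.
    if exclude is not None and re.fullmatch(exclude, value) is not None:
        return False
    return True
```

With include "A.*" and exclude "AXYZ.*", a value such as "ABC" is selected and "AXYZ1" is not, which is the "include a domain but exclude sub-sets" case mentioned in the thread.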
