What I'm suggesting is a single processor for both, but instead of using a mode property to determine which bits get extracted, you use the state of the relationships on the processor to configure which options Tika uses, and use a single pass to parse metadata into attributes and content into a new flow file transferred to the parsed relationship.
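A rough sketch of what that might look like, using the context.hasConnection() check mentioned further down the thread; the relationship names and surrounding processor are illustrative only, not taken from the PR:

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;

    // Hypothetical relationships for a combined Tika processor.
    public static final Relationship REL_ORIGINAL = new Relationship.Builder()
            .name("original")
            .description("Incoming flow file, with Tika metadata added as attributes")
            .build();

    public static final Relationship REL_PARSED = new Relationship.Builder()
            .name("parsed")
            .description("New flow file whose content is the extracted text")
            .build();

    public void onTrigger(ProcessContext context, ProcessSession session) {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // Content extraction is enabled by connecting the relationship,
        // not by a mode property.
        final boolean extractContent = context.hasConnection(REL_PARSED);
        // ... single Tika pass: always collect metadata; write text to a
        // child flow file only when extractContent is true ...
    }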
On the Tesseract front, it may make sense to do this through a controller service. A PDF processor might be interesting. Are you thinking of something like PDFBox, or Tika again?

Simon

> On 1 Apr 2016, at 01:30, Dmitry Goldenberg <[email protected]> wrote:
>
> Simon,
>
> Interesting commentary. The issue that Joe and I have both looked at, with the splitting of metadata and content extraction, is that if they're split then the underlying Tika extraction has to process the file twice: once to pull out the attributes and once to pull out the content. Perhaps it may be good to add ExtractMetadata and ExtractTextContent in addition to ExtractMediaAttributes? Seems kind of an overkill, but I may be wrong.
>
> It seems prudent to provide one wholesome, out-of-the-box extractor processor with options to extract just metadata, just content, or both metadata and content.
>
> I think what I'm hearing is that we need to allow for checking somewhere whether text/content has already been extracted by the time we get to the ExtractMediaAttributes processor. If that is the issue, then I believe the user would use RouteOnAttribute, and if the content is already filled in they'd not route to ExtractMediaAttributes.
>
> As far as OCR goes: Tika internally supports OCR by directing image files to Tesseract (if Tesseract is installed and configured properly). We've started talking about how this could be reconciled in ExtractMediaAttributes.
>
> I think that once we have the basic ExtractMediaAttributes, we could add filters for what files to enable OCR on, and we'd need to expose a few config parameters specific to OCR, such as the location of the Tesseract installation and the maximum file size on which to attempt the OCR. Perhaps there can also be a RunOCR processor which would be dedicated to running OCR. But since Tika already has OCR integrated, we'd probably want to take care of that in the ExtractMediaAttributes configuration.
>
> Additionally, I've proposed the idea of a ProcessPDF processor which would ascertain whether a PDF is 'text' or 'scanned'. If scanned, we would break it up into pages and run OCR on each page, then aggregate the extracted text.
>
> - Dmitry
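A sketch of how those OCR knobs might surface on the processor and reach Tika, assuming Tika 1.7+'s TesseractOCRConfig; the property names and defaults are illustrative, not an agreed design:

    import org.apache.nifi.components.PropertyDescriptor;
    import org.apache.nifi.processor.util.StandardValidators;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.ocr.TesseractOCRConfig;

    // Hypothetical OCR-related properties.
    public static final PropertyDescriptor TESSERACT_PATH = new PropertyDescriptor.Builder()
            .name("Tesseract Install Path")
            .description("Directory of the Tesseract installation Tika's OCR parser should use")
            .required(false)
            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
            .build();

    public static final PropertyDescriptor OCR_MAX_FILE_SIZE = new PropertyDescriptor.Builder()
            .name("OCR Maximum File Size")
            .description("Files larger than this are never sent to OCR")
            .required(false)
            .defaultValue("10 MB")
            .addValidator(StandardValidators.DATA_SIZE_VALIDATOR)
            .build();

    // Tika side: hand the configured path to the parser through the ParseContext
    // so AutoDetectParser routes image content to Tesseract.
    private ParseContext buildParseContext(String tesseractPath) {
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        ocrConfig.setTesseractPath(tesseractPath);
        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, ocrConfig);
        return parseContext;
    }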
> On Thu, Mar 31, 2016 at 3:19 PM, Simon Ball <[email protected]> wrote:
>
>> Just a thought…
>>
>> To keep consistent with other NiFi parse patterns, would it make sense to base the extraction of content on the presence of a relationship? So your Tika processor would have an original relation which would have metadata attached as attributes, and an extracted relation which would have the metadata and the processed content (text from an OCRed image, for example). That way you can just use context.hasConnection(relationship) to determine whether to enable the Tika content processing.
>>
>> This seems more idiomatic than a mode flag.
>>
>> Simon
>>
>>> On 31 Mar 2016, at 19:48, Joe Skora <[email protected]> wrote:
>>>
>>> Dmitry,
>>>
>>> I think we're good. I was confused because "XXX_METADATA_MIMETYPE_FILTER" entries referred to some MIME type of the metadata, but you meant to use the file's MIME type to select what files have metadata extracted.
>>>
>>> Sorry about that, I think we are on the same page.
>>>
>>> Joe
>>>
>>> On Thu, Mar 31, 2016 at 11:40 AM, Dmitry Goldenberg <[email protected]> wrote:
>>>
>>>> Hi Joe,
>>>>
>>>> I think if we have the filters in place then there's no need for the 'mode' enum, as the filters themselves guide the processor in deciding whether metadata and/or content is extracted for a given input file.
>>>>
>>>> Agreed on the handling of archives as a separate processor (template, seems like).
>>>>
>>>> I think it's easiest to do metadata and/or content in one processor, since it can tell Tika whether to extract metadata and/or content in one pass over the file bytes (as you pointed out).
>>>>
>>>> Agreed on the exclusions trumping inclusions; I think that makes sense.
>>>>
>>>>> We will only have a mimetype for the original flow file itself so I'm not sure about the metadata mimetype filter.
>>>>
>>>> I'm not sure where there might be an issue here. The metadata MIME type filter tells the processor for which MIME types to perform the metadata extraction. For instance, extract metadata for images and videos only. This could possibly be coupled with an exclusion filter for content that says, don't try to extract content from images and videos.
>>>>
>>>> I think with these filters we get all the bases covered:
>>>>
>>>> 1. include metadata?
>>>>    1. yes:
>>>>       1. determine the inclusion of metadata by filename pattern
>>>>       2. determine the inclusion of metadata by MIME type pattern
>>>>    2. no:
>>>>       1. determine the exclusion of metadata by filename pattern
>>>>       2. determine the exclusion of metadata by MIME type pattern
>>>> 2. include content?
>>>>    1. yes:
>>>>       1. determine the inclusion of content by filename pattern
>>>>       2. determine the inclusion of content by MIME type pattern
>>>>    2. no:
>>>>       1. determine the exclusion of content by filename pattern
>>>>       2. determine the exclusion of content by MIME type pattern
>>>>
>>>> Does this work?
>>>>
>>>> Thanks,
>>>> - Dmitry
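For illustration, the include/exclude evaluation Dmitry describes could reduce to something like the following, with exclusions taking precedence as agreed above; all field names are hypothetical:

    import java.util.regex.Pattern;

    // Exclusions trump inclusions; an unset include pattern means "include all".
    private static boolean passes(String value, Pattern include, Pattern exclude) {
        if (exclude != null && exclude.matcher(value).matches()) {
            return false;
        }
        return include == null || include.matcher(value).matches();
    }

    // For a given flow file, metadata extraction is on only if both the
    // filename and the MIME type pass their respective filters.
    boolean extractMetadata =
            passes(filename, includeMetadataByName, excludeMetadataByName)
         && passes(mimeType, includeMetadataByMime, excludeMetadataByMime);

    boolean extractContent =
            passes(filename, includeContentByName, excludeContentByName)
         && passes(mimeType, includeContentByMime, excludeContentByMime);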
>>>> On Thu, Mar 31, 2016 at 9:27 AM, Joe Skora <[email protected]> wrote:
>>>>
>>>>> Dmitry,
>>>>>
>>>>> Looking at this and your prior email.
>>>>>
>>>>> 1. I can see "extract metadata only" being as popular as "extract metadata and content". It will all depend on the type of media; for audio/video files adding the metadata to the flow file is enough, but for Word, PDF, etc. files the content may be wanted as well.
>>>>> 2. After thinking about it, I agree on an enum for mode.
>>>>> 3. I think any handling of zips or archive files should be handled by another processor; that keeps this processor cleaner and improves its ability for re-use.
>>>>> 4. I like the addition of exclude filters, but I'm not sure about adding content filters. We will only have a mimetype for the original flow file itself, so I'm not sure about the metadata mimetype filter. I think content filtering may be best left for another downstream processor, but it might run faster if included here since the entire content will be handled during extraction. If the content filters are implemented, for performance they need to short-circuit so that if the property is not set or is set to ".*" they don't evaluate the regex.
>>>>>    1. FILENAME_FILTER - selects flow files to process based on filename matching regex. (exists)
>>>>>    2. MIMETYPE_FILTER - selects flow files to process based on mimetype matching regex. (exists)
>>>>>    3. FILENAME_EXCLUDE - excludes already selected flow files from processing based on filename matching regex. (new)
>>>>>    4. MIMETYPE_EXCLUDE - excludes already selected flow files from processing based on mimetype matching regex. (new)
>>>>>    5. CONTENT_FILTER (optional) - selects flow files for output based on extracted content matching regex. (new)
>>>>>    6. CONTENT_EXCLUDE (optional) - excludes flow files from output based on extracted content matching regex. (new)
>>>>> 5. As indicated in the descriptions in #4, I don't think overlapping filters are an error; instead, excludes should take precedence over includes. Then I can include a domain (like A*) but exclude sub-sets (like AXYZ*).
>>>>>
>>>>> I'm sure there's something we missed, but I think that covers most of it.
>>>>>
>>>>> Regards,
>>>>> Joe
>>>>>
>>>>> On Wed, Mar 30, 2016 at 1:56 PM, Dmitry Goldenberg <[email protected]> wrote:
>>>>>
>>>>>> Joe,
>>>>>>
>>>>>> Upon some thinking, I've started wondering whether all the cases can be covered by the following filters:
>>>>>>
>>>>>> INCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input files get their content extracted, by file name
>>>>>> INCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input files get their metadata extracted, by file name
>>>>>> INCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input files get their content extracted, by MIME type
>>>>>> INCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input files get their metadata extracted, by MIME type
>>>>>>
>>>>>> EXCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input files do NOT get their content extracted, by file name
>>>>>> EXCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input files do NOT get their metadata extracted, by file name
>>>>>> EXCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input files do NOT get their content extracted, by MIME type
>>>>>> EXCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input files do NOT get their metadata extracted, by MIME type
>>>>>>
>>>>>> I believe this gets all the bases covered. At processor init time, we can analyze the inclusions vs. exclusions; any overlap would cause a configuration error.
>>>>>>
>>>>>> Let me know what you think, thanks.
>>>>>> - Dmitry
>>>>>>
>>>>>> On Wed, Mar 30, 2016 at 10:41 AM, Dmitry Goldenberg <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Joe,
>>>>>>>
>>>>>>> I follow your reasoning on the semantics of "media". One might argue that media files are a case of "document" or that a document is a case of "media".
>>>>>>>
>>>>>>> I'm not proposing filters for the mode of processing; I'm proposing a flag/enum with 3 values:
>>>>>>>
>>>>>>> A) extract metadata only;
>>>>>>> B) extract content only and place it into the flowfile content;
>>>>>>> C) extract both metadata and content.
>>>>>>>
>>>>>>> I think the default should be C, to extract both. At least in my experience, most flows I've dealt with were interested in extracting both.
>>>>>>>
>>>>>>> I don't see how this mode would benefit from being expression driven.
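A sketch of that three-value mode as a NiFi property, borrowing the metadataOnly/contentOnly/metadataAndContent names proposed further down the thread; the property name itself is hypothetical:

    import org.apache.nifi.components.AllowableValue;
    import org.apache.nifi.components.PropertyDescriptor;

    public static final AllowableValue MODE_METADATA = new AllowableValue("metadataOnly");
    public static final AllowableValue MODE_CONTENT = new AllowableValue("contentOnly");
    public static final AllowableValue MODE_BOTH = new AllowableValue("metadataAndContent");

    // Hypothetical mode property; defaults to extracting both, per the discussion.
    public static final PropertyDescriptor EXTRACTION_MODE = new PropertyDescriptor.Builder()
            .name("Extraction Mode")
            .description("Whether to extract metadata, content, or both")
            .required(true)
            .allowableValues(MODE_METADATA, MODE_CONTENT, MODE_BOTH)
            .defaultValue(MODE_BOTH.getValue())
            .build();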
>>>>>>> I think we can add this enum mode and have the basic use case covered.
>>>>>>>
>>>>>>> Additionally, further down the line, I was thinking we could ponder the following (these have been essential in search engine ingestion):
>>>>>>>
>>>>>>> 1. Extraction from compressed files/archives. How would UnpackContent work with ExtractMediaAttributes? The use case being: we've got a zip file as input and want to crack it open and unravel it recursively; it may have other, nested zips inside, along with other documents. One way to handle this is to treat the whole archive as one document and merge all attributes into one FlowFile. The other way would be to treat each archive entry as its own flow file and keep a pointer back at the parent archive. Yet another case is when the user might want to only extract the 'leaf' entries and discard any parent container archives.
>>>>>>>
>>>>>>> 2. Attachments and embeddings. Users may want to treat any attached or embedded files as separate flowfiles, with perhaps pointers back to the parent files. This definitely warrants a filter. Oftentimes Office documents have 'media' embeddings which are often not of interest, especially for the case of ingesting into a search engine.
>>>>>>>
>>>>>>> 3. PDF. For PDFs, we can do OCR. This is important for the 'image'/scanned PDFs for which Tika won't extract text.
>>>>>>>
>>>>>>> I'd like to understand how much of this is already supported in NiFi, and if not, I'd volunteer/collaborate to implement some of this.
>>>>>>>
>>>>>>> - Dmitry
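A rough sketch of the 'text' vs. 'scanned' check behind the ProcessPDF idea in point 3 above, assuming PDFBox 2.x; the near-empty-text heuristic and threshold are illustrative only:

    import java.io.File;
    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    // Heuristic: a 'scanned' PDF has no usable text layer, so an (almost)
    // empty extraction result suggests the pages should be routed to OCR.
    public static boolean isScannedPdf(File pdf) throws IOException {
        try (PDDocument doc = PDDocument.load(pdf)) {
            String text = new PDFTextStripper().getText(doc);
            return text.trim().length() < 16; // arbitrary near-empty threshold
        }
    }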
>>>>>>> On Wed, Mar 30, 2016 at 9:03 AM, Joe Skora <[email protected]> wrote:
>>>>>>>
>>>>>>>> Dmitry,
>>>>>>>>
>>>>>>>> Are you proposing separate filters that determine the mode of processing, metadata/content/metadataAndContent? I was thinking of one set of selection filters and a static mode switch at the processor instance level, to make configuration more obvious, such that one instance of the processor will handle a known set of files regardless of the processing mode.
>>>>>>>>
>>>>>>>> I was thinking it would be useful for the mode switch to support expression language, but I'm not sure about that, since the selection filters will control what files get processed and it would be harder to configure if the output flow file could vary between source format and extracted text. So, while it might be easy to do, and occasionally useful, I think in normal use I'd never have a varying mode but would more likely have multiple processor instances with some routing or selection going on further upstream.
>>>>>>>>
>>>>>>>> I wrestled with the naming issue too. I went with "ExtractMediaAttributes" over "ExtractDocumentAttributes" because it seemed to represent the broader context better. In reality, media files are documents and documents are media files, but in the end it's all just semantics.
>>>>>>>>
>>>>>>>> I don't think I would change the NAR bundle name, because I think "nifi-media-nar" establishes it as a place to collect this and other media-related processors in the future.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Joe
>>>>>>>>
>>>>>>>> On Tue, Mar 29, 2016 at 3:09 PM, Dmitry Goldenberg <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Joe,
>>>>>>>>>
>>>>>>>>> Thanks for all the details.
>>>>>>>>>
>>>>>>>>> I wanted to propose that I do some of this work, so as to go through the full cycle of developing a processor and committing it.
>>>>>>>>>
>>>>>>>>> Once your changes are merged, I could extend your 'ExtractMediaMetadata' processor to handle the content, in addition to the metadata.
>>>>>>>>>
>>>>>>>>> We could keep the FILENAME_FILTER and MIMETYPE_FILTER but add a mode with 3 values: metadataOnly, contentOnly, metadataAndContent.
>>>>>>>>>
>>>>>>>>> One thing that looks to be a design issue right now is that your changes and the 'nomenclature' seem media-oriented ("nifi-media-nar" etc.).
>>>>>>>>>
>>>>>>>>> Would it make sense to have a generic processor ExtractDocumentMetadataAndContent? Are there enough specifics in the image/video processing stuff to warrant that being a separate layer, perhaps a subclass of ExtractDocumentMetadataAndContent? Might it make sense to rename nifi-media-nar into nifi-text-extract-nar?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> - Dmitry
>>>>>>>>>
>>>>>>>>> On Tue, Mar 29, 2016 at 2:36 PM, Joe Skora <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Dmitry,
>>>>>>>>>>
>>>>>>>>>> Yeah, I agree, Tika is pretty impressive. The original ticket, NIFI-615 <https://issues.apache.org/jira/browse/NIFI-615>, wanted extraction of metadata from WAV files, but as I got into it I found Tika, so for the same effort it supports the 1,000+ file formats Tika understands. That new processor is called "ExtractMediaMetadata"; you can pull PR-252 <https://github.com/apache/nifi/pull/252> from GitHub if you want to give it a try before it's merged.
>>>>>>>>>>
>>>>>>>>>> Extracting content for those 1,000+ formats would be a valuable addition. I see two possible approaches: 1) create a new "ExtractMediaContent" processor that would put the document content in a new flow file, and 2) extend the new "ExtractMediaMetadata" processor so it can extract metadata, content, or both. One combined processor makes sense if it can provide a performance gain; otherwise two complementary processors may make usage easier.
>>>>>>>>>>
>>>>>>>>>> I'm glad to help if you want to take a cut at the processor yourself, or I can take a crack at it myself if you'd prefer.
>>>>>>>>>>
>>>>>>>>>> Don't hesitate to ask questions or share comments and feedback regarding the ExtractMediaMetadata processor or the addition of content handling.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Joe Skora
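A minimal sketch of the combined, single-pass approach inside a processor's onTrigger, assuming Tika's AutoDetectParser; the attribute prefix, relationship handling, and error handling are simplified and illustrative:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    // One Tika pass yields both the metadata and the text, so the file's
    // bytes are only read once.
    final Metadata metadata = new Metadata();
    final BodyContentHandler handler = new BodyContentHandler(-1); // no size limit
    session.read(flowFile, in -> {
        try {
            new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
        } catch (Exception e) {
            throw new IOException(e);
        }
    });

    // Metadata becomes attributes on the original flow file.
    final Map<String, String> attrs = new HashMap<>();
    for (String name : metadata.names()) {
        attrs.put("media." + name, metadata.get(name));
    }
    flowFile = session.putAllAttributes(flowFile, attrs);

    // Extracted text becomes the content of a child flow file.
    FlowFile textFile = session.create(flowFile);
    textFile = session.write(textFile,
            out -> out.write(handler.toString().getBytes(StandardCharsets.UTF_8)));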
>>>>>>>>>> On Thu, Mar 24, 2016 at 11:40 AM, Dmitry Goldenberg <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks, Joe!
>>>>>>>>>>>
>>>>>>>>>>> Hi Joe S. - I'm definitely up for discussing and contributing.
>>>>>>>>>>>
>>>>>>>>>>> While building search-related ingestion systems, I've seen metadata and text extraction being done all the time; it's always there and always has to be done for building search indexes. Beyond that, OCR-related capabilities are often requested, and the advantage of Tika is that it supports OCR out of the box.
>>>>>>>>>>>
>>>>>>>>>>> - Dmitry
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 24, 2016 at 11:36 AM, Joe Witt <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Dmitry,
>>>>>>>>>>>>
>>>>>>>>>>>> Another community member (Joe Skora) has a PR outstanding for extracting metadata from media files using Tika. Perhaps it makes sense to broaden that to extract, in general, whatever Tika can find. Joe - perhaps you can discuss your ideas with Dmitry and see if broadening is a good idea or if domain-specific ones make more sense.
>>>>>>>>>>>>
>>>>>>>>>>>> This concept of extracting metadata from documents/text files, etc., using something like Tika is certainly useful, as that can then drive nice automated routing decisions.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Joe
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 24, 2016 at 9:28 AM, Dmitry Goldenberg <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see that the ExtractText processor extracts text using regex.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What about a processor that extracts text and metadata from incoming files? That doesn't seem to exist - but perhaps I didn't quite look in the right spots.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If that doesn't exist, I'd like to implement and commit it, using Apache Tika. There may also be a couple of related processors to that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> - Dmitry
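For reference, a standalone sketch of the bare Tika call such a processor would wrap; a single parse yields both the metadata and the text (the class name is illustrative):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaExtractDemo {
        public static void main(String[] args) throws Exception {
            Metadata metadata = new Metadata();
            BodyContentHandler handler = new BodyContentHandler(); // default 100k char limit
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
            }
            for (String name : metadata.names()) {
                System.out.println(name + " = " + metadata.get(name));
            }
            System.out.println("---- extracted text ----");
            System.out.println(handler.toString());
        }
    }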
