[jira] [Commented] (TIKA-694) On extraction, get properties AND / OR content extraction

Tim Allison (Jira) Mon, 23 Mar 2020 10:20:48 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064963#comment-17064963
 ]


Tim Allison commented on TIKA-694:
----------------------------------

One of the challenges is that some parsers need to parse the whole file 
before they have all of the metadata.  In general, we try to extract the 
metadata, or at least add it, as early as possible: as soon as we hit a body 
element, no more metadata can be written to the xhtml, although any metadata 
found later will still be added to the metadata object.

In short, it is hard.
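
To illustrate the ordering problem, here is a minimal toy model in plain Java. 
It is not Tika code: the class LateMetadataSketch and its methods are 
hypothetical stand-ins, assuming a writer that flushes the head the first time 
body text arrives. Metadata found after that point still reaches the metadata 
object but can no longer appear in the xhtml.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model (not Tika code) of a streaming XHTML writer plus a metadata object. */
class LateMetadataSketch {
    private final Map<String, String> metadata = new LinkedHashMap<>();
    private final StringBuilder xhtml = new StringBuilder();
    private boolean bodyStarted = false;

    /** Metadata always lands in the metadata object, whenever it is discovered. */
    void addMetadata(String name, String value) {
        metadata.put(name, value);
    }

    /** The first body text flushes the head, which can only hold what is known so far. */
    void characters(String text) {
        if (!bodyStarted) {
            xhtml.append("<head>");
            for (Map.Entry<String, String> e : metadata.entrySet()) {
                // No escaping here; a real writer would escape attribute values.
                xhtml.append("<meta name=\"").append(e.getKey())
                     .append("\" content=\"").append(e.getValue()).append("\"/>");
            }
            xhtml.append("</head><body>");
            bodyStarted = true;
        }
        xhtml.append("<p>").append(text).append("</p>");
    }

    String xhtml() {
        return xhtml.toString();
    }

    Map<String, String> metadata() {
        return metadata;
    }

    public static void main(String[] args) {
        LateMetadataSketch doc = new LateMetadataSketch();
        doc.addMetadata("title", "Report");    // known before any content
        doc.characters("First paragraph");     // the head is flushed here
        doc.addMetadata("author", "Unknown");  // discovered late, e.g. at end of file
        System.out.println(doc.xhtml());       // head was written before "author" arrived
        System.out.println(doc.metadata());    // the metadata object has both entries
    }
}
```

A caller that only wants properties would therefore read them from the 
metadata object after the parse finishes, not from the head of the xhtml.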

> On extraction, get properties AND / OR content extraction
> ---------------------------------------------------------
>
>                 Key: TIKA-694
>                 URL: https://issues.apache.org/jira/browse/TIKA-694
>             Project: Tika
>          Issue Type: Wish
>          Components: parser
>    Affects Versions: 1.0
>         Environment: All OS
>            Reporter: Etienne Jouvin
>            Priority: Minor
>         Attachments: Tika-1.0.zip
>
>
> I use TIKA to extract properties, and only properties, from Office files.
> The parser goes through the document content, which is not necessary and 
> slows down the process.
> It would be nice to have the choice of extracting only the properties.
> What I did was the following:
> I extended AutoDetectParser to override the parse method.
> Then, in the ParseContext instance, I put a boolean flag set to true to 
> request properties-only extraction.
> For example, for Office files, I extended the OfficeParser class. In the 
> parse method, I check the flag and, if it is true, skip all of the content 
> extraction.
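
The flag-based approach described in the issue can be sketched roughly as 
follows. This is a self-contained toy model, not the reporter's attachment and 
not Tika code: Context only imitates the set/get-by-class shape of 
ParseContext, and PropertiesOnly, Result and parse are hypothetical names. It 
also uses a marker type where the reporter used a boolean flag.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model (not Tika code) of a "properties only" flag carried in a parse context. */
class PropertiesOnlySketch {

    /** Hypothetical marker type; its presence in the context requests properties only. */
    static final class PropertiesOnly {}

    /** Minimal stand-in for a typed context keyed by class. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            return key.cast(entries.get(key)); // null when the key was never set
        }
    }

    /** Parse result: properties are always filled, text only when requested. */
    static final class Result {
        final Map<String, String> properties = new HashMap<>();
        final List<String> text = new ArrayList<>();
    }

    /** Hypothetical parser: copies the properties, then skips the body if the flag is set. */
    static Result parse(Map<String, String> docProperties,
                        List<String> docParagraphs,
                        Context context) {
        Result result = new Result();
        result.properties.putAll(docProperties);
        if (context.get(PropertiesOnly.class) != null) {
            return result; // flag present: skip the content walk entirely
        }
        result.text.addAll(docParagraphs);
        return result;
    }

    public static void main(String[] args) {
        Context context = new Context();
        context.set(PropertiesOnly.class, new PropertiesOnly());
        Result r = parse(Map.of("title", "Report"), List.of("body text"), context);
        System.out.println(r.properties + " " + r.text);
    }
}
```

As the comment above points out, this only saves work for formats where the 
properties are available without reading the whole file.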



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
