[
https://issues.apache.org/jira/browse/TIKA-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176702#comment-13176702
]
Jukka Zitting commented on TIKA-830:
------------------------------------
The problem here is the basic assumption that the Tika facade class makes about
how the configured parser will use the instance passed in the ParseContext.
By default (and before we added the constructor that allows a custom parser to
be given) the Tika facade will construct and use an AutoDetectParser based on
all the available and/or configured format-specific parsers. Format-specific
parsers that support embedded documents expect the ParseContext to contain a
parser instance that they can delegate parsing tasks to, so to support parsing
of embedded documents the Tika facade passes the configured parser instance
through the ParseContext.
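
To illustrate the delegation pattern described above, here is a minimal, self-contained sketch. It does not use Tika's actual API: SimpleParser, SimpleContext, ContainerParser and the "outer[embedded]" document syntax are hypothetical stand-ins invented for this example. The only point is to show why the facade puts the configured parser into the context before parsing.

```java
import java.util.HashMap;
import java.util.Map;

final class DelegationSketch {

    /** Hypothetical stand-in for a parser; not Tika's Parser interface. */
    interface SimpleParser {
        String parse(String document, SimpleContext context);
    }

    /** Hypothetical stand-in for ParseContext: a map keyed by class. */
    static final class SimpleContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            return key.cast(entries.get(key));
        }
    }

    /** Parses made-up "outer[embedded]" documents, delegating the embedded part. */
    static final class ContainerParser implements SimpleParser {
        @Override
        public String parse(String document, SimpleContext context) {
            int open = document.indexOf('[');
            int close = document.lastIndexOf(']');
            if (open < 0 || close < open) {
                return document; // no embedded document present
            }
            String outer = document.substring(0, open);
            String embedded = document.substring(open + 1, close);
            // Look up the parser to delegate to, as a format-specific parser would.
            SimpleParser delegate = context.get(SimpleParser.class);
            if (delegate == null) {
                return outer; // nothing to delegate to, embedded content is skipped
            }
            return outer + " " + delegate.parse(embedded, context);
        }
    }

    /** Mimics the facade: pass the configured parser through the context. */
    static String parseToString(SimpleParser configured, String document) {
        SimpleContext context = new SimpleContext();
        context.set(SimpleParser.class, configured);
        return configured.parse(document, context);
    }

    public static void main(String[] args) {
        System.out.println(parseToString(new ContainerParser(), "a[b[c]]"));
    }
}
```

In this sketch, parsing with an empty context silently drops the embedded part, which is the behaviour the facade avoids by always registering the configured parser.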
The ForkParser on the other hand assumes that anything in the ParseContext is
serializable so that it can be sent to the forked JVM process for use from
there. Passing a ForkParser instance to the forked JVM through the ParseContext
in this way could easily trigger a recursion of new JVM forks being created,
which is why the ForkParser is by design not serializable.
I agree with Nick that the resulting error message could certainly be better,
but I don't think it's a good idea to change the basic design of either
ForkParser or the Tika facade class in this respect.
If we want the Tika facade class to support forked parsing, I think it would be
better to add a separate flag for that to explicitly make the facade class
create and use a ForkParser instance based on the configured normal Parser
instance. However, the ForkParser is a pretty complex tool that practically
always needs custom configuration (java command, memory limits, class loader,
etc.), which is why I don't think we should expose it through the Tika facade
that's mostly designed for simpler use cases.
PS. Instead of the instanceof check we now have in ForkParser (thanks for that,
BTW!), it might be a better idea to check for errors from trying to serialize
the ParseContext. That'll capture a much wider range of cases where a
ForkParser instance or some other non-serializable resource is being passed to
a forked JVM.
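
A rough sketch of what such a serialization check could look like, using only standard java.io classes. This is not the actual ForkParser code: ContextSerializationCheck, requireSerializable and FakeForkParser are hypothetical names, and a plain HashMap stands in for the ParseContext.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.util.HashMap;

final class ContextSerializationCheck {

    /** Hypothetical stand-in for a ForkParser: deliberately not Serializable. */
    static final class FakeForkParser { }

    /**
     * Tries to serialize the given context object into a throwaway buffer and
     * throws an IllegalArgumentException with a descriptive message on failure.
     */
    static void requireSerializable(Object context) {
        try (ObjectOutputStream out =
                new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(context);
        } catch (NotSerializableException e) {
            // The exception message is the name of the offending class.
            throw new IllegalArgumentException(
                    "Context holds a non-serializable object of type "
                    + e.getMessage() + "; it cannot be sent to a forked JVM", e);
        } catch (IOException e) {
            throw new IllegalArgumentException(
                    "Unable to serialize the context", e);
        }
    }

    public static void main(String[] args) {
        // A plain map stands in for the ParseContext in this sketch.
        HashMap<String, Object> context = new HashMap<>();
        context.put("parser", new FakeForkParser());
        try {
            requireSerializable(context);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Unlike an instanceof test, this approach does not need to know in advance which types are problematic, since any non-serializable value anywhere in the object graph triggers the same error path.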
> Tika.parseToString() causes ForkParser to try to serialize itself
> -----------------------------------------------------------------
>
> Key: TIKA-830
> URL: https://issues.apache.org/jira/browse/TIKA-830
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.0
> Reporter: Jerome Lacoste
> Priority: Blocker
> Attachments:
> 0005-TIKA-830-Tike.parseToString-caused-ForkParser-to-try.patch,
> 0006-TIKA-830-refactor-tests-for-clarity.patch
>
>