[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851727#comment-17851727 ]
Tim Allison commented on TIKA-4243:
-----------------------------------
I spent a bit of time trying to serialize ParseContext, and I now
remember/newly appreciate what a challenge that is.
For one, everything has to be serializable, as we knew, with Jackson
annotations or other Jackson-based methods.
I suspect that someone who really understands Jackson could do a good job of
this. I know the basics, but I am not a Jackson expert.
There are two main challenges: inheritance and embedded objects (as opposed to
parameterizable primitives).
Inheritance is complicated with Jackson. If we want to support, for example,
{{parseContext.set(Parser.class, new EmptyParser())}}, we have to store the
base class name as the key and the instantiated class as the value. I think I
figured out how to do this with Jackson, but it is _messy_ (reference:
https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios).
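To make the key/value pairing concrete, here is a rough sketch of the data that
would have to round-trip. It deliberately does not use Jackson, and it only
handles classes with a no-arg constructor. Everything in it is a hypothetical
stand-in: MiniContext plays the role of ParseContext, and the nested Parser and
EmptyParser are placeholders, not Tika's classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ContextRoundTrip {

    // Hypothetical stand-ins for Tika's Parser and EmptyParser.
    public interface Parser {}

    public static class EmptyParser implements Parser {
        public EmptyParser() {}
    }

    // Hypothetical stand-in for ParseContext: base class name -> instance.
    public static class MiniContext {
        private final Map<String, Object> entries = new LinkedHashMap<>();

        public <T> void set(Class<T> key, T value) {
            entries.put(key.getName(), value);
        }

        public <T> T get(Class<T> key) {
            return key.cast(entries.get(key.getName()));
        }
    }

    // "Serialize": keep the base class name as the key and the concrete
    // class name as the value. Instance state is ignored in this sketch.
    public static Map<String, String> toClassNames(MiniContext context) {
        Map<String, String> names = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : context.entries.entrySet()) {
            names.put(e.getKey(), e.getValue().getClass().getName());
        }
        return names;
    }

    // "Deserialize": load both classes, instantiate the concrete one via its
    // no-arg constructor, and check that it really is a subtype of the key.
    public static MiniContext fromClassNames(Map<String, String> names) {
        MiniContext context = new MiniContext();
        try {
            for (Map.Entry<String, String> e : names.entrySet()) {
                Class<?> base = Class.forName(e.getKey());
                Object instance =
                        Class.forName(e.getValue()).getDeclaredConstructor().newInstance();
                if (!base.isInstance(instance)) {
                    throw new IllegalArgumentException(
                            e.getValue() + " is not a " + e.getKey());
                }
                context.entries.put(base.getName(), instance);
            }
        } catch (ReflectiveOperationException ex) {
            throw new IllegalStateException("Could not rebuild context", ex);
        }
        return context;
    }
}
```

The messy part with Jackson is getting the same effect while also serializing
each instance's own fields, which this sketch skips entirely.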
We'd want to deal with embedded objects for the obvious use cases of the
CompoundDetectors, etc. where we want to specify a list of detectors. And, we
want to be able to cover the cases of setting an object as a parameter -- for
example, setting some of the slightly more complex classes in the
PDFParserConfig.
I'm wondering if it would be simpler to back off to a Map<String, String[]>
of properties, where we identify the config class and then instantiate it for
the ParseContext from those properties. We're currently
doing something like this in tika-server where we have custom serialization
classes for each config we support (PDFParserConfig and
TesseractOCRParserConfig based on the http-headers). We'd want to extend this
to handle inheritance.
Something along these lines in json:
{code:json}
{
  "settings": {
    "PDFParserConfig.class": {
      "ocrDPI": 300,
      "sortByPosition": true
    }
  }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and
create the config class with the map of its values:
PDFParserConfig pdfParserConfig = new PDFParserConfig(Map<String, String[]>)
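As a rough idea of how small that bit of code might be, here is a sketch that
populates a config object from a Map<String, String[]> by calling matching
setters reflectively, rather than through a dedicated constructor. DemoPdfConfig
is a hypothetical stand-in for PDFParserConfig with just the two settings from
the JSON above, and fromProperties is an invented helper, not an existing Tika
method. Only int, boolean, String and String[] setter parameters are handled.

```java
import java.lang.reflect.Method;
import java.util.Map;

public class PropertyConfigSketch {

    // Hypothetical stand-in for PDFParserConfig.
    public static class DemoPdfConfig {
        private int ocrDPI;
        private boolean sortByPosition;

        public DemoPdfConfig() {}

        public int getOcrDPI() { return ocrDPI; }
        public void setOcrDPI(int ocrDPI) { this.ocrDPI = ocrDPI; }

        public boolean isSortByPosition() { return sortByPosition; }
        public void setSortByPosition(boolean sortByPosition) {
            this.sortByPosition = sortByPosition;
        }
    }

    // Instantiate the config class and apply each property through the
    // public single-argument setter named after it ("ocrDPI" -> setOcrDPI).
    public static <T> T fromProperties(Class<T> type, Map<String, String[]> props) {
        try {
            T config = type.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, String[]> e : props.entrySet()) {
                String key = e.getKey();
                String setterName =
                        "set" + Character.toUpperCase(key.charAt(0)) + key.substring(1);
                Method setter = findSetter(type, setterName);
                Object value = convert(setter.getParameterTypes()[0], e.getValue());
                setter.invoke(config, value);
            }
            return config;
        } catch (ReflectiveOperationException ex) {
            throw new IllegalArgumentException("Cannot build " + type.getName(), ex);
        }
    }

    private static Method findSetter(Class<?> type, String name)
            throws NoSuchMethodException {
        for (Method m : type.getMethods()) {
            if (m.getName().equals(name) && m.getParameterCount() == 1) {
                return m;
            }
        }
        throw new NoSuchMethodException(type.getName() + "." + name);
    }

    // Convert the raw string values to the setter's parameter type.
    // Single-valued types use the first array element.
    private static Object convert(Class<?> target, String[] values) {
        if (target == String[].class) {
            return values;
        }
        String first = values[0];
        if (target == int.class || target == Integer.class) {
            return Integer.valueOf(first);
        }
        if (target == boolean.class || target == Boolean.class) {
            return Boolean.valueOf(first);
        }
        if (target == String.class) {
            return first;
        }
        throw new IllegalArgumentException(
                "Unsupported parameter type: " + target.getName());
    }
}
```

Anything beyond primitives and strings (nested objects, lists of detectors)
would need more conversion rules, which is where this starts to turn into a
homegrown serialization framework.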
*What I don't like about this is that we're back in the game of creating our
own serialization framework.* :(
> tika configuration overhaul
> ---------------------------
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
> Issue Type: New Feature
> Components: config
> Affects Versions: 3.0.0
> Reporter: Nicholas DiPiazza
> Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed
> Configuration schema.
> In 3.x can we remove the old way of doing configs and replace with Json
> Schema?
> Json Schema can be converted to Pojos using a maven plugin
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs.
> This can allow for the legacy tika-config XML to be read and converted to the
> new pojos easily using an XML mapper so that users don't have to use JSON
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax,
> and replace with the Pojo model serialized from the xml/json/yaml.