[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853241#comment-17853241 ]
Tim Allison commented on TIKA-4243:
-----------------------------------
This is what the JSON currently looks like:
{code:json}
{
  "emitter": "fse",
  "fetchKey": "testPDFTwoTextBoxes.pdf",
  "fetcher": "fsf",
  "id": "myId",
  "onParseException": "emit",
  "parseContext": {
    "org.apache.tika.parser.pdf.PDFParserConfig": {
      "_class": "org.apache.tika.parser.pdf.PDFParserConfig",
      "accessChecker": {
        "_class": "org.apache.tika.parser.pdf.AccessChecker"
      },
      "averageCharTolerance": 0.3,
      "catchIntermediateIOExceptions": true,
      "detectAngles": false,
      "dropThreshold": 2.5,
      "enableAutoSpace": true,
      "extractAcroFormContent": true,
      "extractActions": false,
      "extractAnnotationText": true,
      "extractBookmarksText": true,
      "extractFontNames": false,
      "extractIncrementalUpdateInfo": false,
      "extractInlineImages": false,
      "extractMarkedContent": false,
      "extractUniqueInlineImagesOnly": true,
      "ifXFAExtractOnlyXFA": false,
      "imageGraphicsEngineFactory": {
        "_class": "org.apache.tika.parser.pdf.image.ImageGraphicsEngineFactory"
      },
      "imageStrategy": "NONE",
      "maxIncrementalUpdates": 10,
      "maxMainMemoryBytes": 536870912,
      "ocrDPI": 300,
      "ocrImageFormatName": "png",
      "ocrImageQuality": 1.0,
      "ocrImageType": "GRAY",
      "ocrRenderingStrategy": "ALL",
      "ocrStrategy": "AUTO",
      "parseIncrementalUpdates": false,
      "renderer": null,
      "setKCMS": false,
      "sortByPosition": true,
      "spacingTolerance": 0.5,
      "suppressDuplicateOverlappingText": false,
      "throwOnEncryptedPayload": false
    }
  }
}
{code}
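For anyone consuming this JSON outside of Tika, here is a rough sketch of how the parseContext section might be read. It is plain Python with only the standard library, not Tika code; the helper name split_parse_context is made up for illustration, and the request is a trimmed copy of the one above. It assumes that each parseContext entry is keyed by a class name and carries a "_class" field naming the type to build, which is what the sample suggests.

```python
import json

# Trimmed copy of the request above; only a few PDFParserConfig fields are kept.
REQUEST = """
{
  "emitter": "fse",
  "fetchKey": "testPDFTwoTextBoxes.pdf",
  "fetcher": "fsf",
  "id": "myId",
  "onParseException": "emit",
  "parseContext": {
    "org.apache.tika.parser.pdf.PDFParserConfig": {
      "_class": "org.apache.tika.parser.pdf.PDFParserConfig",
      "ocrDPI": 300,
      "ocrStrategy": "AUTO",
      "sortByPosition": true
    }
  }
}
"""


def split_parse_context(request):
    """Hypothetical helper: map each parseContext class name to its settings."""
    entries = {}
    for key, body in request.get("parseContext", {}).items():
        settings = dict(body)
        # Assumption: "_class" names the concrete type; fall back to the key.
        class_name = settings.pop("_class", key)
        entries[class_name] = settings
    return entries


request = json.loads(REQUEST)
configs = split_parse_context(request)
for class_name, settings in configs.items():
    print(class_name, settings)
```

Separating the "_class" marker from the remaining fields keeps the type information apart from the values a typed config object would actually hold.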
> tika configuration overhaul
> ---------------------------
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
> Issue Type: New Feature
> Components: config
> Affects Versions: 3.0.0
> Reporter: Nicholas DiPiazza
> Priority: Major
> Fix For: 3.0.0
>
>
> In 3.0.0, a typed configuration schema would greatly help when working
> with Tika.
> In 3.x, can we remove the old way of doing configs and replace it with JSON
> Schema?
> JSON Schema can be converted to POJOs using a Maven plugin:
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java POJO model that we can use for the configs.
> The legacy tika-config XML could then be read and converted to the new
> POJOs with an XML mapper, so that users do not have to move to JSON
> configurations until they want to.
> When complete, configurations can be set as XML, JSON or YAML:
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all uses of the Tika config annotations that rely on the old syntax
> with the POJO model deserialized from the XML/JSON/YAML.
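
The "one typed model, several file formats" idea in the quoted description can be sketched roughly as follows. This is a Python illustration using only the standard library, not the proposed Java implementation. The PdfConfig class, the from_legacy_xml helper, and the <params>/<param name="..."> layout are all assumptions made for the example; the field names are borrowed from the JSON above.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass


@dataclass
class PdfConfig:
    """Hypothetical typed config; field names borrowed from the JSON above."""
    sortByPosition: bool = False
    ocrDPI: int = 300
    ocrStrategy: str = "AUTO"


# Assumed legacy-style XML fragment, for illustration only.
LEGACY_XML = """
<params>
  <param name="sortByPosition">true</param>
  <param name="ocrDPI">150</param>
</params>
"""


def from_legacy_xml(xml_text):
    """Read the XML parameters into the typed model, rejecting unknown names."""
    config = PdfConfig()
    for param in ET.fromstring(xml_text.strip()).iter("param"):
        name = param.get("name", "")
        if not hasattr(config, name):
            raise ValueError(f"unknown parameter: {name!r}")
        text = (param.text or "").strip()
        current = getattr(config, name)
        # Coerce the text to the type of the field's default value.
        if isinstance(current, bool):
            value = text.lower() == "true"
        else:
            value = type(current)(text)
        setattr(config, name, value)
    return config


config = from_legacy_xml(LEGACY_XML)
# The same typed object can then be written out as JSON.
as_json = json.dumps(asdict(config), sort_keys=True)
print(as_json)
```

Because every format is read into the same typed object, validation such as rejecting unknown parameter names lives in one place rather than once per file format.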