Peter, I did not intend to cause pain. It felt like I spent numerous hours trying to help you debug what you were seeing and explaining the current configuration methods. I was unsuccessful in communicating to you that what you were seeing was "expected."
Rather than spend more time trying, unsuccessfully, to explain how configuration worked, I thought it better to simplify configuration and make "updating" via the configs in the ParseContext possible (rather than overwriting). If you remember, that was the part that you asked for numerous times and/or expressed surprise around.

In short, the PDFParserConfig and the TesseractOCRConfig, when sent in via the ParseContext, will update the settings from the baseline as set in the initial tika-config.

Unit tests that demonstrate this new behavior are here:

https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java#L1072

https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-ocr-module/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java#L111

Cheers,

Tim

On Tue, Feb 9, 2021 at 4:21 PM Peter Kronenberg <[email protected]> wrote:
>
> You're killing me here! I just finished an implementation that relies on
> this.
> I never figured out how to set properties at runtime if I use tika-config.
>
> Can you please provide an example of setting properties with tika-config and
> then optionally changing them at runtime? How does the TesseractOCRConfig
> and PDFParser objects get initialized if not from the corresponding
> .properties file?
>
> -----Original Message-----
> From: Tim Allison (Jira) <[email protected]>
> Sent: Tuesday, February 9, 2021 4:10 PM
> To: [email protected]
> Subject: {EXTERNAL}[jira] [Commented] (TIKA-3297) Simplify parser
> configuration in 2.x
>
> [
> https://issues.apache.org/jira/browse/TIKA-3297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282043#comment-17282043
> ]
>
> Tim Allison commented on TIKA-3297:
> -----------------------------------
>
> I got rid of the .properties for tesseract. Users can no longer set the
> tesseract path, tess data or imagemagick via the TesseractOCRConfig. These
> _must_ be set via a tika-config.xml. If there is a use case for setting
> these at parse time, let me know.
>
> Now, when a user sends in a TesseractOCRConfig at parse time, that config
> remembers what fields the user set. The TesseractOCRParser will now clone
> the default internal config and update only those fields that the user has
> manipulated and sent in via the ParseContext. In short, this will now
> "update" the baseline set via the tika-config.xml. It will not overwrite
> what was set in the tika-config.xml file.
>
> If this looks good, I'll do the same to the PDFParser.
>
> > Simplify parser configuration in 2.x
> > ------------------------------------
> >
> > Key: TIKA-3297
> > URL: https://issues.apache.org/jira/browse/TIKA-3297
> > Project: Tika
> > Issue Type: Task
> > Reporter: Tim Allison
> > Priority: Major
> >
> > We currently have .properties files and tika-config.xml and runtime
> > configuration. We should simplify to tika-config.xml.
> > From a security perspective, I'm thinking we should also allow executable
> > paths to be set only via tika-config.xml...not programmatically via a
> > TesseractConfig.
>
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)
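
To make the "update, not overwrite" behavior described in the quoted Jira comment concrete, here is a minimal, self-contained Java sketch of the pattern: a config object remembers which fields the caller set, and a merge step clones the baseline and applies only those fields. This is not Tika's code. The class `OcrConfig`, its two fields, and the `merge` method are hypothetical stand-ins chosen for illustration; in Tika the baseline would come from tika-config.xml and the parse-time config would be sent in via the ParseContext.

```java
import java.util.HashSet;
import java.util.Set;

public class ConfigUpdateSketch {

    /** Hypothetical stand-in for a parse-time config; not a Tika class. */
    static class OcrConfig {
        private String language = "eng";
        private int timeoutSeconds = 120;
        // Names of the fields the caller explicitly set through a setter.
        private final Set<String> userSet = new HashSet<>();

        void setLanguage(String language) {
            this.language = language;
            userSet.add("language");
        }

        void setTimeoutSeconds(int timeoutSeconds) {
            this.timeoutSeconds = timeoutSeconds;
            userSet.add("timeoutSeconds");
        }

        String getLanguage() {
            return language;
        }

        int getTimeoutSeconds() {
            return timeoutSeconds;
        }

        /** Copies the baseline, then applies only the fields set on the update. */
        static OcrConfig merge(OcrConfig baseline, OcrConfig update) {
            OcrConfig merged = new OcrConfig();
            merged.language = baseline.language;
            merged.timeoutSeconds = baseline.timeoutSeconds;
            if (update.userSet.contains("language")) {
                merged.language = update.language;
            }
            if (update.userSet.contains("timeoutSeconds")) {
                merged.timeoutSeconds = update.timeoutSeconds;
            }
            return merged;
        }
    }

    public static void main(String[] args) {
        // Baseline, standing in for values loaded from the initial config file.
        OcrConfig baseline = new OcrConfig();
        baseline.setLanguage("fra");
        baseline.setTimeoutSeconds(300);

        // Parse-time config: the caller touches only the language.
        OcrConfig perParse = new OcrConfig();
        perParse.setLanguage("deu");

        OcrConfig effective = OcrConfig.merge(baseline, perParse);
        // prints "deu 300": language is updated, the baseline timeout survives
        System.out.println(effective.getLanguage() + " " + effective.getTimeoutSeconds());
    }
}
```

With plain overwriting, the per-parse object's untouched default timeout (120 in this sketch) would have replaced the baseline's 300. Tracking which fields were set is what lets the baseline value survive.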
