[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-06-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852876#comment-17852876
 ] 

Tim Allison edited comment on TIKA-4243 at 6/6/24 5:39 PM:
---

I think our joint recent PR on TIKA-4252 accomplishes the goals of this ticket. 
There's more work, but I think we can close this out.

If we do want to head down the jsonschema route later, let's open a new ticket?


was (Author: talli...@mitre.org):
I think our joint recent PR on TIKA-4252 accomplishes the goals of this ticket. 
There's more work, but I think we can close this out.

If we do want to head down the jsonschema root later, let's open a new ticket?

> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-06-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852804#comment-17852804
 ] 

Tim Allison edited comment on TIKA-4243 at 6/6/24 2:11 PM:
---

Current status on TIKA-4243 branch -- works up through and including tika-app

Still need:
* better job of handling lists and maps as parameters and types.
* test tika-server pipes/ and async/ endpoints
* more unit tests in new serialization stuff

Ongoing needs:
* modify config objects so that they work with the serialization methods



was (Author: talli...@mitre.org):
Current status on TIKA-4243 -- works up through and including tika-app

Still need:
* better job of handling lists and maps as parameters and types.
* test tika-server pipes/ and async/ endpoints
* more unit tests in new serialization stuff

Ongoing needs:
* modify config objects so that they work with the serialization methods




[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-06-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851727#comment-17851727
 ] 

Tim Allison edited comment on TIKA-4243 at 6/3/24 5:10 PM:
---

I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson-based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, but I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{parseContext.set(Parser.class, new EmptyParser())}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).
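
To make the key/instance idea concrete, here is a minimal sketch that uses only the JDK (no Jackson). {{ContextSketch}}, {{SketchParser}} and {{SketchEmptyParser}} are hypothetical stand-ins, not Tika classes, and the sketch assumes every concrete class has a no-arg constructor. It records the base class name as the key and the concrete class name as the value, then re-instantiates the concrete class by reflection:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT Tika classes.
interface SketchParser {
    String name();
}

class SketchEmptyParser implements SketchParser {
    public String name() {
        return "empty";
    }
}

class ContextSketch {
    // Base type used as the lookup key -> concrete instance registered for it.
    private final Map<Class<?>, Object> entries = new LinkedHashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key) {
        return key.cast(entries.get(key));
    }

    // "Serialize": base class name -> concrete class name.
    Map<String, String> toClassNames() {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<Class<?>, Object> e : entries.entrySet()) {
            out.put(e.getKey().getName(), e.getValue().getClass().getName());
        }
        return out;
    }

    // "Deserialize": rebuild each concrete class through its no-arg constructor
    // (an assumption of this sketch) and register it under the base type.
    static ContextSketch fromClassNames(Map<String, String> names) {
        ContextSketch ctx = new ContextSketch();
        try {
            for (Map.Entry<String, String> e : names.entrySet()) {
                Class<?> base = Class.forName(e.getKey());
                Object impl = Class.forName(e.getValue())
                        .getDeclaredConstructor().newInstance();
                // cast() fails fast if the concrete class does not match the key.
                ctx.entries.put(base, base.cast(impl));
            }
        } catch (ReflectiveOperationException ex) {
            throw new IllegalStateException("could not rebuild context", ex);
        }
        return ctx;
    }
}
```

This covers only the class-name round trip; any state held by the instances would still need its own serialization, which is where the Jackson subtype handling comes in.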

We'd want to deal with embedded objects for the obvious use cases of the 
CompositeDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to back off to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server, where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig) based on the HTTP headers. We'd want to extend this 
to handle inheritance and embedded objects...

Something along these lines in json:
{code:json}
{
  "settings": {
    "org.apache.tika.parser.pdf.PDFParserConfig.class": {
      "ocrDPI": 300,
      "sortByPosition": true,
      "org.apache.tika.parser.pdf.image.ImageGraphicsEngineFactory.class": {
        "_class": "com.tika.custom.OurCompanysFactory",
        "speed": "blazing",
        "dpi": 1000
      }
    },
    "org.apache.tika.parser.Parser": {
      "_class": "org.apache.tika.parser.EmptyParser"
    }
  }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)
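
One way that small bit of code might look, as a sketch only. {{PdfConfigSketch}} and {{ConfigBinder}} are hypothetical names, not Tika classes (nothing in this thread confirms a Map constructor on the real PDFParserConfig), the default values are arbitrary placeholders, and the binder assumes non-empty property names, one public single-argument setter per property, and values that already have the right type:

```java
import java.lang.reflect.Method;
import java.util.Map;

// Hypothetical config class; property names are borrowed from the JSON above.
class PdfConfigSketch {
    private int ocrDPI = 0;               // placeholder default, not Tika's
    private boolean sortByPosition = false;

    public void setOcrDPI(int ocrDPI) {
        this.ocrDPI = ocrDPI;
    }

    public void setSortByPosition(boolean sortByPosition) {
        this.sortByPosition = sortByPosition;
    }

    public int getOcrDPI() {
        return ocrDPI;
    }

    public boolean isSortByPosition() {
        return sortByPosition;
    }
}

// Hypothetical helper: creates a config object and applies a map of properties.
class ConfigBinder {
    static <T> T bind(Class<T> type, Map<String, Object> props) {
        try {
            T target = type.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, Object> e : props.entrySet()) {
                String key = e.getKey();
                // "ocrDPI" -> "setOcrDPI"
                String setter = "set" + Character.toUpperCase(key.charAt(0))
                        + key.substring(1);
                findSetter(type, setter).invoke(target, e.getValue());
            }
            return target;
        } catch (ReflectiveOperationException ex) {
            throw new IllegalArgumentException(
                    "cannot bind properties to " + type.getName(), ex);
        }
    }

    // Returns the first public one-argument method with the given name.
    private static Method findSetter(Class<?> type, String name)
            throws NoSuchMethodException {
        for (Method m : type.getMethods()) {
            if (m.getName().equals(name) && m.getParameterCount() == 1) {
                return m;
            }
        }
        throw new NoSuchMethodException(name);
    }
}
```

Usage would be along the lines of {{ConfigBinder.bind(PdfConfigSketch.class, settingsForThatClass)}}. Nested objects and the {{_class}} key from the JSON are not handled here; they would need a recursive step on top of this.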

 

*What I don't like about this is that we're back in the game of creating our 
own serialization framework. :(*


was (Author: talli...@mitre.org):
I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, and I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{{}parseContext.set(Parser.class, new EmptyParser()){}}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the 
CompositeDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to backoff to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig) based on the http-headers. We'd want to extend this 
to handle inheritance and embedded objects...

Something along these lines in json:
{code:json}
{
"settings" : {
   "org.apache.tika.parser.pdf.PDFParserConfig": { 
"ocrDPI":300,
"sortByPosition": true,
   },
   "org.apache.tika.parser.Parser": {
 "_class":"org.apache.tika.parser.EmptyParser"
   }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

 

*What I don't like about this is that we're back in the game of creating our 
own serialization framework. :

[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-06-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851727#comment-17851727
 ] 

Tim Allison edited comment on TIKA-4243 at 6/3/24 5:02 PM:
---

I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, and I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{parseContext.set(Parser.class, new EmptyParser())}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the 
CompositeDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to backoff to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig) based on the http-headers. We'd want to extend this 
to handle inheritance and embedded objects...

Something along these lines in json:
{code:json}
{
  "settings": {
    "org.apache.tika.parser.pdf.PDFParserConfig": {
      "ocrDPI": 300,
      "sortByPosition": true
    },
    "org.apache.tika.parser.Parser": {
      "_class": "org.apache.tika.parser.EmptyParser"
    }
  }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

 

*What I don't like about this is that we're back in the game of creating our 
own serialization framework. :(*


was (Author: talli...@mitre.org):
I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, and I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{{}parseContext.set(Parser.class, new EmptyParser()){}}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the 
CompositeDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to backoff to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig) based on the http-headers. We'd want to extend this 
to handle inheritance and embedded objects...

Something along these lines in json:
{code:json}
{
"settings" : {
   "org.apache.tika.parser.pdf.PDFParserConfig": { 
"ocrDPI":300,
"sortByPosition": true,
   },
   { "org.apache.tika.parser.Parser": {
 "_class":"org.apache.tika.parser.EmptyParser"
   }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

 

*What I don't like about this is that we're back in the game of creating our 
own serialization framework. :(*


[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-06-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851727#comment-17851727
 ] 

Tim Allison edited comment on TIKA-4243 at 6/3/24 5:02 PM:
---

I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, and I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{parseContext.set(Parser.class, new EmptyParser())}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the 
CompositeDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to backoff to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig) based on the http-headers. We'd want to extend this 
to handle inheritance and embedded objects...

Something along these lines in json:
{code:json}
{
  "settings": {
    "org.apache.tika.parser.pdf.PDFParserConfig": {
      "ocrDPI": 300,
      "sortByPosition": true
    },
    "org.apache.tika.parser.Parser": {
      "_class": "org.apache.tika.parser.EmptyParser"
    }
  }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

 

*What I don't like about this is that we're back in the game of creating our 
own serialization framework. :(*


was (Author: talli...@mitre.org):
I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, and I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{{}parseContext.set(Parser.class, new EmptyParser()){}}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the 
CompoundDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to backoff to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig based on the http-headers). We'd want to extend this 
to handle inheritance.

Something along these lines in json:
{code:json}
{
"settings" : {
   "PDFParserConfig.class": { 
"ocrDPI":300,
"sortByPosition": true,
   }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

 

*What I don't like about this is that we're back in the game of creating our 
own serialization framework. :(*


[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-06-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851727#comment-17851727
 ] 

Tim Allison edited comment on TIKA-4243 at 6/3/24 4:45 PM:
---

I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, and I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{parseContext.set(Parser.class, new EmptyParser())}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the 
CompoundDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to backoff to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig based on the http-headers). We'd want to extend this 
to handle inheritance.

Something along these lines in json:
{code:json}
{
  "settings": {
    "PDFParserConfig.class": {
      "ocrDPI": 300,
      "sortByPosition": true
    }
  }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

 

*What I don't like about this is that we're back in the game of creating our 
own serialization framework. :(*


was (Author: talli...@mitre.org):
I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, and I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{{}parseContext.set(Parser.class, new EmptyParser()){}}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the 
CompoundDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to backoff to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig based on the http-headers). We'd want to extend this 
to handle inheritance.

Something along these lines in json:
{code:json}
{
"settings" : {
   "PDFParserConfig.class": { 
"ocrDPI":300,
"sortByPosition": true,
   }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

* What I don't like about this is that we're back in the game of creating our 
own serialization framework. :( *


[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-06-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851727#comment-17851727
 ] 

Tim Allison edited comment on TIKA-4243 at 6/3/24 4:45 PM:
---

I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, and I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{parseContext.set(Parser.class, new EmptyParser())}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the 
CompoundDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to backoff to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig based on the http-headers). We'd want to extend this 
to handle inheritance.

Something along these lines in json:
{code:json}
{
  "settings": {
    "PDFParserConfig.class": {
      "ocrDPI": 300,
      "sortByPosition": true
    }
  }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

* What I don't like about this is that we're back in the game of creating our 
own serialization framework. :( *


was (Author: talli...@mitre.org):
I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, and I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{parseContext.set(Parser.class, new EmptyParser())}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios).

We'd want to deal with embedded objects for the obvious use cases of the 
CompoundDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter -- for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig. 

I'm wondering if it would be simpler to backoff to a Map 
properties kind of thing where we identify the config class and then 
instantiate it for the ParseContext with the "properties". We're currently 
doing something like this in tika-server where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig based on the http-headers). We'd want to extend this 
to handle inheritance.

Something along these lines in json: 

{code:json}
{
"settings" : {
   "PDFParserConfig.class": { 
"ocrDPI":300,
"sortByPosition": true,
   }
}
{code}

Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

*What I don't like about this is that we're back in the game of creating our 
own serialization framework. :(
*



[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-05-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849103#comment-17849103
 ] 

Tim Allison edited comment on TIKA-4243 at 5/24/24 1:00 PM:


Proposed basic roadmap:

* Add ParseContext to fetchers and emitters (and pipesReporter?)
* Serialize ParseContext as is...
* Allow for serialization of current XConfigs, e.g. PDFParserConfig, etc.
* Add creation of parsers with, e.g., new PDFParser(ParseContext context).
* Wire config stuff into tika-server, tika-pipes, tika-app
* Merge tika-grpc-server with new config options

This would require serialization support for the classes that users want to 
be able to configure.

This would allow us to get rid of all of our custom serialization stuff for 
Tika 4.x.



was (Author: talli...@mitre.org):
Proposed basic roadmap:

Serialize ParseContext as is...
Allow for serialization of current XConfigs, eg. PDFParserConfig, etc.
Add creation of parsers with e.g. new PDFParser(ParseContext context).
Wire config stuff into tika-server, tika-pipes, tika-app
Merge tika-grpc-server with new config options

This would require serialization of classes that users want to be able to 
configure + serialization.

This would allow us to get rid of all of our custom serialization stuff for 
Tika 4.x.




[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-05-01 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842622#comment-17842622
 ] 

Nicholas DiPiazza edited comment on TIKA-4243 at 5/1/24 12:34 PM:
--

Kinda seems like it might belong in a new tika-config module


was (Author: ndipiazza):
Kinda seems like it might belong in tika-config module 



[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-04-29 Thread Nicholas DiPiazza (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842158#comment-17842158
 ] 

Nicholas DiPiazza edited comment on TIKA-4243 at 4/29/24 8:56 PM:
--

This seems like a major feature, so I would recommend having it go with 
the Tika 3.0.0 release.

That makes sense if Tika 2.0.0 stays compatible.


was (Author: ndipiazza):
this seems like a major feature thing so i would recommend with tika 3.x



[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-04-26 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841242#comment-17841242
 ] 

Tim Allison edited comment on TIKA-4243 at 4/26/24 1:32 PM:


I really, really want to clean up our configuration, and moving to JSON makes 
sense. 

I agree we need to support the legacy config of 2.x in 3.x.

Is there a reason not to use plain old Jackson databind? What does 
jsonschema2pojo buy us?

Will this new capability live in tika-serialization?

It will be great to convert these config objects to Records in Java 17, er Tika 
4.x?
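
For illustration only, here is a minimal sketch of what a config object could 
look like as a Java record. The record name, its fields, and the fromMap helper 
are all hypothetical and are not taken from Tika; the map stands in for whatever 
a JSON or XML mapper would hand back.

```java
import java.util.Map;

// Hypothetical config type: the name and fields are assumptions for this sketch,
// not part of any Tika API.
record ParserLimitsSketch(String parserName, int maxStringLength, boolean enabled) {

    // Compact constructor: validate once, when the config is created.
    ParserLimitsSketch {
        if (parserName == null || parserName.isBlank()) {
            throw new IllegalArgumentException("parserName is required");
        }
        if (maxStringLength < 0) {
            throw new IllegalArgumentException("maxStringLength must be >= 0");
        }
    }

    // Hypothetical helper: builds the record from already-parsed key/value pairs,
    // falling back to defaults for keys that are absent.
    static ParserLimitsSketch fromMap(Map<String, String> values) {
        return new ParserLimitsSketch(
                values.get("parserName"),
                Integer.parseInt(values.getOrDefault("maxStringLength", "0")),
                Boolean.parseBoolean(values.getOrDefault("enabled", "true")));
    }
}

class RecordConfigDemo {
    public static void main(String[] args) {
        ParserLimitsSketch config = ParserLimitsSketch.fromMap(
                Map.of("parserName", "pdf", "maxStringLength", "1000"));
        System.out.println(config);
    }
}
```

A record gives immutability, equals/hashCode, and a readable toString for free, 
which is most of what a typed config object needs; whether a mapper populates it 
directly is a separate question.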

Would this allow us to get rid of our, ahem, baroque config processing code and 
still read 2.x configs?  I admit responsibility for the baroque config stuff, 
and I would really appreciate the opportunity to get rid of it asap... as long 
as we have backwards compatibility.

Thank you [~ndipiazza]!


was (Author: talli...@mitre.org):
I really, really want to clean up our configuration, and moving to JSON makes 
sense. 

I agree we need to support the legacy config of 2.x in 3.x.

Is there a reason not to use plain old Jackson databind? What does 
jsonschema2pojo buy us?

Will this new capability live in tika-serialization?

It will be great to convert these config objects to Records in Java 17, er Tika 
4.x?

Thank you [~ndipiazza]!
