[
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187117#comment-15187117
]
Tim Allison edited comment on TIKA-1508 at 3/9/16 1:56 PM:
-----------------------------------------------------------
[~thammegowda], this looks really good. I merged it on a local branch and made
minimal modifications to the PDFParser to make this work...and it did...very
straightforwardly.
Recommendations:
1) Let's not use ParseContext as the vehicle for param passing; we will have
collisions between different parsers if anyone uses {{configure()}} outside of
the normal course of events. It is simpler to use Map<String,String>. Or, if we
do use the ParseContext, we should specify which parser the params are for,
e.g. {{context.set(PDFParser.class, Map<String,String> params)}}. I do like
the dual use of configure with ParseContext to achieve Nick's recommendation
elegantly.
2) We need to add a {{Map<String,String> getParams()}} to the {{Configurable}}
interface so that when we serialize the config to XML, we can remember what the
params were. We should also add that to the TikaConfigSerializer.
3) It would be great to add parameter checking to the {{AbstractParser}} or
somewhere similar. I think a configurable (parser? or all configurables?)
should have to register its valid configuration keys at initialization, and
then we can check the validity of the keys passed in during {{configure()}}
once in the base class so that each extending parser isn't required to do this
on its own.
4) Should we subclass TikaException as TikaParameterConfigException? I don't
feel strongly about this one.
5) We'll need to add {{@Override configure()}} to pass on the configuration
information to the wrapped parser in parser wrappers: ParserDecorator,
DelegatingParser, ParserPostProcessor...any others? Or, do we need to set the
parameters in the wrapped parser before wrapping?
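To make recommendations 2, 3 and 5 concrete, here is a rough sketch of how they might fit together. Everything in it is an assumption for illustration only: {{ConfigurableSketch}}, {{ConfigurableBase}}, {{PdfParserSketch}} and {{WrapperSketch}} are hypothetical names, not existing Tika classes, the exception is a standalone stand-in that would presumably extend TikaException in the real code, and the two parameter keys are just examples.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Stand-in for recommendation 4; in Tika this would presumably extend TikaException.
class TikaParameterConfigException extends Exception {
    TikaParameterConfigException(String msg) { super(msg); }
}

// Hypothetical shape of the Configurable interface, including getParams() (rec. 2).
interface ConfigurableSketch {
    void configure(Map<String, String> params) throws TikaParameterConfigException;
    Map<String, String> getParams();
}

// Base class that checks keys once so extending parsers don't have to (rec. 3).
abstract class ConfigurableBase implements ConfigurableSketch {
    private final Set<String> validKeys = new LinkedHashSet<>();
    private final Map<String, String> params = new LinkedHashMap<>();

    // Subclasses register their valid keys at initialization.
    protected ConfigurableBase(String... keys) {
        Collections.addAll(validKeys, keys);
    }

    @Override
    public final void configure(Map<String, String> incoming)
            throws TikaParameterConfigException {
        // Validate every key before storing anything.
        for (String key : incoming.keySet()) {
            if (!validKeys.contains(key)) {
                throw new TikaParameterConfigException(
                        "Unknown parameter '" + key + "' for " + getClass().getSimpleName());
            }
        }
        params.putAll(incoming);
    }

    @Override
    public final Map<String, String> getParams() {
        // Read-only view, e.g. for serializing the config back to XML.
        return Collections.unmodifiableMap(params);
    }
}

// Illustrative parser; the key names are examples only.
class PdfParserSketch extends ConfigurableBase {
    PdfParserSketch() { super("sortByPosition", "extractAnnotationText"); }
}

// A wrapper that passes configuration through to the wrapped instance (rec. 5).
class WrapperSketch implements ConfigurableSketch {
    private final ConfigurableSketch wrapped;

    WrapperSketch(ConfigurableSketch wrapped) { this.wrapped = wrapped; }

    @Override
    public void configure(Map<String, String> params) throws TikaParameterConfigException {
        wrapped.configure(params);
    }

    @Override
    public Map<String, String> getParams() { return wrapped.getParams(); }
}
```

With this layout an unknown key fails fast in one place, and the wrapper question in 5 reduces to forwarding two methods. Whether the real wrappers should forward or be configured before wrapping is still open.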
Questions for the broader dev community:
A) Are we ok with Map<String,String> parameters? Or should we follow, say,
Solr's syntax for type checking?
{noformat}
<int name="pageWidth">10</int>
{noformat}
B) We could use reflection to get around each parser having to add its own
configuration code. We could create a static configurator that has a
{{configure(Configurable configurable, Map<String, String> params)}} method.
That isn't quite right, because we'd have to know the type for each param (see
above), but something along those lines. Too complex?
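For B, one possible way around the type problem is to infer each param's type from a matching setter on the target, so the map can stay Map<String,String>. The sketch below is only that: {{ReflectiveConfigurator}} and {{PageConfigSketch}} are hypothetical names, the "setXxx" naming convention is an assumption, and it takes a plain Object rather than a Configurable to stay self-contained. It only handles int, boolean and String.

```java
import java.lang.reflect.Method;
import java.util.Map;

// Hypothetical static configurator; not an existing Tika class.
final class ReflectiveConfigurator {
    private ReflectiveConfigurator() {}

    // For each param "foo", find a public one-argument setFoo(...) and call it
    // with the string value converted to the setter's parameter type.
    static void configure(Object target, Map<String, String> params) {
        for (Map.Entry<String, String> e : params.entrySet()) {
            String key = e.getKey();
            String setter = "set" + Character.toUpperCase(key.charAt(0)) + key.substring(1);
            Method m = findSetter(target.getClass(), setter);
            if (m == null) {
                throw new IllegalArgumentException("No setter for parameter: " + key);
            }
            try {
                m.invoke(target, convert(e.getValue(), m.getParameterTypes()[0]));
            } catch (ReflectiveOperationException ex) {
                throw new IllegalStateException("Could not set " + key, ex);
            }
        }
    }

    // Returns the first public method with the given name and exactly one parameter.
    private static Method findSetter(Class<?> type, String name) {
        for (Method m : type.getMethods()) {
            if (m.getName().equals(name) && m.getParameterCount() == 1) {
                return m;
            }
        }
        return null;
    }

    // Minimal string-to-type conversion; a real version would need more types.
    private static Object convert(String raw, Class<?> type) {
        if (type == int.class || type == Integer.class) return Integer.valueOf(raw);
        if (type == boolean.class || type == Boolean.class) return Boolean.valueOf(raw);
        if (type == String.class) return raw;
        throw new IllegalArgumentException("Unsupported parameter type: " + type.getName());
    }
}

// Illustrative config object with a typed setter.
class PageConfigSketch {
    private int pageWidth;

    public void setPageWidth(int pageWidth) { this.pageWidth = pageWidth; }

    public int getPageWidth() { return pageWidth; }
}
```

Calling {{ReflectiveConfigurator.configure(cfg, params)}} with a map containing pageWidth=10 would end up invoking {{setPageWidth(10)}}. The trade-off is that type errors only surface at configure time, and overloaded setters would need a rule for which one wins, so the typed XML syntax in A may still be the cleaner option.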
> Add uniformity to parser parameter configuration
> ------------------------------------------------
>
> Key: TIKA-1508
> URL: https://issues.apache.org/jira/browse/TIKA-1508
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Fix For: 1.13
>
>
> We can currently configure parsers by the following means:
> 1) programmatically by direct calls to the parsers or their config objects
> 2) sending in a config object through the ParseContext
> 3) modifying .properties files for specific parsers (e.g. PDFParser)
> Rather than scattering the landscape with .properties files for each parser,
> it would be great if we could specify parser parameters in the main config
> file, something along the lines of this:
> {noformat}
> <parser class="org.apache.tika.parser.audio.AudioParser">
> <params>
> <int name="someparam1">2</int>
> <str name="someOtherParam2">something or other</str>
> </params>
> <mime>audio/basic</mime>
> <mime>audio/x-aiff</mime>
> <mime>audio/x-wav</mime>
> </parser>
> {noformat}