[ 
https://issues.apache.org/jira/browse/TIKA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187117#comment-15187117
 ] 

Tim Allison edited comment on TIKA-1508 at 3/9/16 1:56 PM:
-----------------------------------------------------------

[~thammegowda], this looks really good. I merged it on a local branch and made 
minimal modifications to the PDFParser to make this work...and it did...very 
straightforwardly.

Recommendations:
1) Let's not use ParseContext as the vehicle for param passing. We will have 
collisions between different parsers if anyone uses {{configure()}} outside of 
the normal course of events, and it is simpler to use Map<String,String>.  Or, 
if we do use the ParseContext, we should specify which parser the params are 
for, e.g. {{context.set(PDFParser.class, Map<String,String> params)}}.  I do 
like the dual use of configure with ParseContext to achieve Nick's 
recommendation elegantly.


2) We need to add a {{Map<String,String> getParams()}} to the {{Configurable}} 
interface so that when we serialize the config to XML, we can remember what the 
params were.  We should also add that to the TikaConfigSerializer.

3) It would be great to add parameter checking to the {{AbstractParser}} or 
somewhere else.  I think a configurable (parser? or all configurables?) should 
have to register its valid configuration keys at initialization, and then we 
can check the validity of the keys passed in during {{configure()}} once in the 
base class so that each extending parser isn't required to do this on its own.

4) Should we subclass TikaException as TikaParameterConfigException?  I don't 
feel strongly about this one.

5) We'll need to add {{@Override configure()}} to pass on the configuration 
information to the wrapped parser in parser wrappers: ParserDecorator, 
DelegatingParser, ParserPostProcessor...any others?  Or, do we need to set the 
parameters in the wrapped parser before wrapping?
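
A rough sketch of how recommendations 2 through 5 might fit together. This is 
not code from the patch: the shape of {{Configurable}}, the 
{{AbstractConfigurable}} base class, {{ExamplePdfParser}}, 
{{ConfigurableDecorator}} and the parameter names are all assumptions made up 
for illustration, and the exception here extends plain {{Exception}} only so 
the sketch compiles without Tika on the classpath.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Stand-in for the exception proposed in (4); in Tika it would extend TikaException.
class TikaParameterConfigException extends Exception {
    TikaParameterConfigException(String msg) {
        super(msg);
    }
}

// Assumed shape of the Configurable interface, with getParams() added as in (2).
interface Configurable {
    void configure(Map<String, String> params) throws TikaParameterConfigException;

    Map<String, String> getParams();
}

// Hypothetical base class: valid keys are registered once at construction (3),
// so extending parsers do not repeat the check.
abstract class AbstractConfigurable implements Configurable {
    private final Set<String> validKeys;
    private final Map<String, String> params = new LinkedHashMap<>();

    protected AbstractConfigurable(String... validKeys) {
        this.validKeys = new HashSet<>(Arrays.asList(validKeys));
    }

    @Override
    public void configure(Map<String, String> newParams) throws TikaParameterConfigException {
        // Check every key before storing anything, so a bad key leaves state untouched.
        for (String key : newParams.keySet()) {
            if (!validKeys.contains(key)) {
                throw new TikaParameterConfigException(
                        "Unknown parameter '" + key + "' for " + getClass().getName());
            }
        }
        params.putAll(newParams);
    }

    @Override
    public Map<String, String> getParams() {
        // Read-only view, e.g. for serializing the config back to XML.
        return Collections.unmodifiableMap(params);
    }
}

// Illustrative parser; the key names are examples only.
class ExamplePdfParser extends AbstractConfigurable {
    ExamplePdfParser() {
        super("sortByPosition", "extractAnnotationText");
    }
}

// Hypothetical wrapper showing (5): configuration is passed on to the wrapped instance.
class ConfigurableDecorator implements Configurable {
    private final Configurable wrapped;

    ConfigurableDecorator(Configurable wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public void configure(Map<String, String> params) throws TikaParameterConfigException {
        wrapped.configure(params);
    }

    @Override
    public Map<String, String> getParams() {
        return wrapped.getParams();
    }
}
```

With this layout, a wrapper never needs to know the valid keys itself; it only 
forwards, and the wrapped parser's base class does the rejecting.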

Questions for the broader dev community:

A) Are we ok with Map<String,String> parameters? Or should we follow, say, 
Solr's syntax for type checking?
{noformat}
<int name="pageWidth">10</int>
{noformat}

B) We could use reflection to get around each parser having to add its own 
configuration code.  We could create a static configurator that has a 
{{configure(Configurable configurable, Map<String, String> params)}} method.  
That isn't quite right, because we'd have to know the type for each param (see 
above), but something along those lines.  Too complex?
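
To make (B) concrete, here is a minimal sketch of a reflection-based 
configurator. It assumes, purely for illustration, that each parameter name 
maps to a JavaBean-style setter on the target, and it reads the type from the 
setter's declared argument, which is one possible way around not knowing the 
type of each param. {{ReflectiveConfigurator}} and {{ExampleConfig}} are 
hypothetical names, and the target is typed as {{Object}} so the sketch stands 
alone.

```java
import java.lang.reflect.Method;
import java.util.Map;

// Hypothetical static configurator: maps each param to a setter via reflection.
final class ReflectiveConfigurator {
    private ReflectiveConfigurator() {
    }

    static void configure(Object target, Map<String, String> params) {
        for (Map.Entry<String, String> entry : params.entrySet()) {
            String name = entry.getKey();
            if (name.isEmpty()) {
                throw new IllegalArgumentException("Empty parameter name");
            }
            // "pageWidth" -> "setPageWidth"
            String setterName = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
            Method setter = null;
            for (Method m : target.getClass().getMethods()) {
                if (m.getName().equals(setterName) && m.getParameterCount() == 1) {
                    setter = m;
                    break;
                }
            }
            if (setter == null) {
                throw new IllegalArgumentException("No setter for parameter: " + name);
            }
            try {
                // The setter's argument type tells us how to convert the string.
                setter.invoke(target, convert(entry.getValue(), setter.getParameterTypes()[0]));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Could not set parameter: " + name, e);
            }
        }
    }

    private static Object convert(String raw, Class<?> type) {
        if (type == String.class) {
            return raw;
        }
        if (type == int.class || type == Integer.class) {
            return Integer.valueOf(raw);
        }
        if (type == long.class || type == Long.class) {
            return Long.valueOf(raw);
        }
        if (type == double.class || type == Double.class) {
            return Double.valueOf(raw);
        }
        if (type == boolean.class || type == Boolean.class) {
            return Boolean.valueOf(raw);
        }
        throw new IllegalArgumentException("Unsupported parameter type: " + type.getName());
    }
}

// Illustrative target; the property names are examples only.
class ExampleConfig {
    private int pageWidth;
    private boolean sortByPosition;

    public void setPageWidth(int pageWidth) {
        this.pageWidth = pageWidth;
    }

    public int getPageWidth() {
        return pageWidth;
    }

    public void setSortByPosition(boolean sortByPosition) {
        this.sortByPosition = sortByPosition;
    }

    public boolean isSortByPosition() {
        return sortByPosition;
    }
}
```

The trade-off is that a malformed value only fails when {{configure()}} runs 
(e.g. a {{NumberFormatException}} for a non-numeric int), whereas typed XML 
elements would let us reject it while reading the config.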


> Add uniformity to parser parameter configuration
> ------------------------------------------------
>
>                 Key: TIKA-1508
>                 URL: https://issues.apache.org/jira/browse/TIKA-1508
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>             Fix For: 1.13
>
>
> We can currently configure parsers by the following means:
> 1) programmatically by direct calls to the parsers or their config objects
> 2) sending in a config object through the ParseContext
> 3) modifying .properties files for specific parsers (e.g. PDFParser)
> Rather than scattering the landscape with .properties files for each parser, 
> it would be great if we could specify parser parameters in the main config 
> file, something along the lines of this:
> {noformat}
>     <parser class="org.apache.tika.parser.audio.AudioParser">
>       <params>
>         <int name="someparam1">2</int>
>         <str name="someOtherParam2">something or other</str>
>       </params>
>       <mime>audio/basic</mime>
>       <mime>audio/x-aiff</mime>
>       <mime>audio/x-wav</mime>
>     </parser>
> {noformat}
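
For reference, a minimal sketch of how typed {{<params>}} like the ones in the 
description above could be read with the JDK's DOM parser. {{ParamsReader}} is 
a hypothetical helper, not part of Tika, and it only handles the two element 
types shown in the example ({{int}} and {{str}}).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper: reads the first <params> block of a <parser> element.
final class ParamsReader {
    private ParamsReader() {
    }

    static Map<String, Object> readParams(String parserXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(parserXml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse parser config", e);
        }
        Map<String, Object> params = new LinkedHashMap<>();
        NodeList paramsNodes = doc.getElementsByTagName("params");
        if (paramsNodes.getLength() == 0) {
            return params; // no <params> block: nothing to configure
        }
        NodeList children = paramsNodes.item(0).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace and comments
            }
            Element el = (Element) node;
            String name = el.getAttribute("name");
            String text = el.getTextContent().trim();
            // The element name carries the type, as in the example config.
            switch (el.getTagName()) {
                case "int":
                    params.put(name, Integer.valueOf(text));
                    break;
                case "str":
                    params.put(name, text);
                    break;
                default:
                    throw new IllegalArgumentException(
                            "Unsupported param type: " + el.getTagName());
            }
        }
        return params;
    }
}
```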



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
