I am +1 for this. Option #2 sounds like a slick way to handle it, and it would
remain backward compatible with tika-python, which is of strong interest to me.

Cheers,
Chris




On 9/28/17, 1:35 PM, "Giuseppe Totaro" <[email protected]> wrote:

    Hi folks,
    
    If I am not mistaken, you currently cannot configure a specific
    ContentHandler when using tika-server. That is, you can configure your own
    parser [0], but you cannot control which ContentHandler the parser uses to
    extract text and metadata (e.g., you cannot use
    PhoneExtractingContentHandler, StandardsExtractingContentHandler, etc.).
    If that is correct, it would be nice to enable the use of specific
    ContentHandlers within tika-server, and I would like to discuss how to
    solve this in a general way.
    
    I propose two solutions:
    
       1. augment the TikaConfig class so that a specific ContentHandler can be
       used in tika-config.xml;
       2. select the ContentHandler to use for parsing through an HTTP header,
       for example:
       curl -T filename.pdf http://localhost:9998/meta \
         --header "X-Content-Handler: PhoneExtractingContentHandler"
       This would also require changes to the TikaResource.java class.
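
For illustration only, here is a rough client-side sketch of what option 2 might look like from Python, using just the standard library. It assumes the X-Content-Handler header proposed above; that header is hypothetical and nothing here implies tika-server honors it today. The helper name build_meta_request is also made up for this example.

```python
import urllib.request

# Hypothetical header name taken from the proposal in this thread;
# tika-server is not assumed to honor it today.
CONTENT_HANDLER_HEADER = "X-Content-Handler"


def build_meta_request(document: bytes, handler_name: str,
                       server: str = "http://localhost:9998") -> urllib.request.Request:
    """Build a PUT to /meta that names the desired ContentHandler in a header."""
    return urllib.request.Request(
        url=server + "/meta",
        data=document,
        headers={CONTENT_HANDLER_HEADER: handler_name},
        method="PUT",  # curl -T uploads with PUT as well
    )


# Placeholder bytes stand in for the contents of filename.pdf.
request = build_meta_request(b"%PDF-1.4 ...", "PhoneExtractingContentHandler")
# Sending it would be: urllib.request.urlopen(request).read()
```

Because the choice travels in an optional header, a client that never sets it would presumably keep getting today's behavior, which is the backward-compatibility property mentioned above.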
    
    I look forward to your feedback. I strongly believe this would benefit
    every user who runs Tika as a service through tika-server and needs to
    extract content and metadata such as phone numbers, standard references,
    etc.
    
    Thanks a lot,
    Giuseppe
    

