Nicholas DiPiazza created TIKA-3221:
---------------------------------------

             Summary: /rmeta/text endpoint - allow a "max parse time" parameter 
where after exceeded, return bytes/metadata managed to get up to that point
                 Key: TIKA-3221
                 URL: https://issues.apache.org/jira/browse/TIKA-3221
             Project: Tika
          Issue Type: Bug
            Reporter: Nicholas DiPiazza


This might be either A) impossible or B) a boatload of work, but...

Change the

{code}
/rmeta/text
{code}

endpoint to accept a "max parse time" parameter. Once that time is exceeded, 
return whatever bytes/metadata were extracted up to that point.

Motivation:

I have a massive number of documents that I need to run through Apache Tika 
server.

Prior to switching to Tika server, I used a project I created myself that 
forked Tika VMs and sent work to them directly over sockets.

This was OK but super complicated, so I chose to switch to the Tika Jetty server 
for simplicity's sake.

Works great for the most part. But one feature I had before was that I could 
say "If I don't get a result within MAX_PARSE_TIMEOUT_MS, then stop parsing at 
that moment and return the bytes we managed to get up to that point."

This is because, with the massive number of documents I need to parse, I cannot 
afford to have any parse hang longer than a certain amount of time. But 
conversely, if I set a hard timeout of 20 seconds, then every document that 
exceeds it comes back with *no* content at all, which leaves massive gaps.

With the /rmeta/text endpoint, we recently added the ability to send a writeLimit, 
so that parsing stops once that number of bytes has been reached.

I'm hoping we can do the same for parse time. Perhaps when checking the byte 
size, we could periodically check the elapsed time and quit the parser in the same way.
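
To make the idea concrete, here is a rough sketch of what I mean. It is not 
Tika's actual code: the class name TimeLimitedTextHandler, the exception 
LimitReachedException and the constructor parameters are all made up for 
illustration, and it only uses plain JDK SAX classes. The handler checks a 
deadline in the same place it checks the write limit, and throws to stop the 
parse while keeping whatever text was already collected.

```java
import java.util.function.LongSupplier;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical sketch (not Tika's implementation): collects text from SAX
 * events and aborts once a character limit or a time budget is exceeded.
 */
class TimeLimitedTextHandler extends DefaultHandler {

    /** Thrown to stop the parse; the caller catches it and keeps the partial text. */
    public static class LimitReachedException extends SAXException {
        public LimitReachedException(String message) {
            super(message);
        }
    }

    private final StringBuilder text = new StringBuilder();
    private final int writeLimit;      // assumed: max characters to keep
    private final long deadlineNanos;  // assumed: absolute deadline on the supplied clock
    private final LongSupplier clock;  // e.g. System::nanoTime; injectable for testing

    TimeLimitedTextHandler(int writeLimit, long maxParseTimeMs, LongSupplier nanoClock) {
        this.writeLimit = writeLimit;
        this.clock = nanoClock;
        this.deadlineNanos = nanoClock.getAsLong() + maxParseTimeMs * 1_000_000L;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Time check: runs every time the parser hands us text.
        if (clock.getAsLong() - deadlineNanos > 0) {
            throw new LimitReachedException("max parse time exceeded");
        }
        // Write-limit check: keep only what still fits, then stop.
        int remaining = writeLimit - text.length();
        if (length > remaining) {
            text.append(ch, start, remaining);
            throw new LimitReachedException("write limit reached");
        }
        text.append(ch, start, length);
    }

    /** Text collected so far, including after an aborted parse. */
    String getText() {
        return text.toString();
    }
}
```

The caller would catch LimitReachedException around the parse call and return 
getText() plus whatever metadata is available. One known weakness of this 
approach: the time check only runs when the parser emits text, so a parser 
that is stuck without producing any events would never hit it and would still 
need some external timeout.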



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
