Josh Burchard created TIKA-3643:
-----------------------------------

             Summary: writeLimit for bytes in addition to characters
                 Key: TIKA-3643
                 URL: https://issues.apache.org/jira/browse/TIKA-3643
             Project: Tika
          Issue Type: Improvement
          Components: core
    Affects Versions: 2.2.1
            Reporter: Josh Burchard


[~jmssiera] filed the enhancement request TIKA-3325, in which he originally 
asked for the write limit to be specified as a number of bytes.  That issue 
is marked as Resolved, but writeLimit counts characters rather than bytes.

I have a use case where the consumer side (an indexer) has a setting for the 
maximum number of bytes to index.  When I use the writeLimit header with Tika 
to extract text from a document that mixes ASCII and multi-byte characters, I 
can't get back exactly 6MB worth of text, because I don't know a priori which 
characters the file contains and so can't translate a byte budget into a 
character count.

My ask here is for a new control, maybe "writeLimitBytes", that caps the 
output at a number of bytes and breaks after the last complete character that 
fits.  The returned text would therefore be <= writeLimitBytes in size, but 
usually close to that value.
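
To illustrate the behaviour I have in mind, here is a minimal sketch in plain 
Java.  It is not Tika code and does not use any Tika API: the class and method 
names (ByteLimitSketch, truncateToUtf8Bytes) are hypothetical, and it assumes 
the consumer measures the text as UTF-8.

```java
public final class ByteLimitSketch {
    private ByteLimitSketch() {}

    /**
     * Returns the longest prefix of text whose UTF-8 encoding fits in
     * maxBytes, without ever splitting a code point.
     */
    public static String truncateToUtf8Bytes(String text, long maxBytes) {
        long used = 0;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int size = utf8Length(cp);
            if (used + size > maxBytes) {
                break; // the next character would exceed the byte budget
            }
            used += size;
            i += Character.charCount(cp); // 2 chars for a surrogate pair
        }
        return text.substring(0, i);
    }

    /**
     * Number of bytes UTF-8 needs for one code point. An unpaired surrogate
     * is counted as 3 bytes, which can only over-estimate the encoded size,
     * so the result still stays within the limit.
     */
    static int utf8Length(int cp) {
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }
}
```

With a limit of 2 bytes, "h\u00e9llo" (an "e" with an acute accent, 2 bytes in 
UTF-8) would come back as "h": the result is 1 byte rather than 2, because the 
next character does not fit as a whole.  A real implementation inside Tika 
would presumably count bytes as the text is written rather than truncating a 
finished String.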



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
