[ https://issues.apache.org/jira/browse/TIKA-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Burchard updated TIKA-3643:
--------------------------------
Description:
[~jmssiera] wrote up the enhancement request TIKA-3325, in which he originally
requested that the write limit be specified as a number of bytes. That issue is
marked as Resolved, but writeLimit counts characters rather than bytes.
I have a use case where the consumer side (an indexer) has a control for the
maximum number of bytes to index. When I use the writeLimit header with Tika
to extract text from a document that mixes ASCII and multi-byte characters, I
can't get back exactly, for example, 6 MB worth of text, because I don't know
a priori which characters will be in the file.
My ask here is for a new control, maybe "writeLimitBytes", under which the
returned text is cut off after the last complete character that fits. The
returned text would then be <= writeLimitBytes in size while staying as close
to that value as possible.
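To illustrate the requested behaviour, here is a minimal sketch of
byte-limited truncation that never splits a character. It is not Tika code and
does not reflect any existing Tika API: the class name ByteLimitSketch and the
method truncateToUtf8Bytes are hypothetical, and it assumes the consumer
measures size in UTF-8 bytes, which the request above does not specify.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Tika.
final class ByteLimitSketch {

    /**
     * Returns the longest prefix of text whose UTF-8 encoding is at most
     * maxBytes long, without cutting a code point (or surrogate pair) in half.
     * Assumption: the byte budget is measured in UTF-8.
     */
    static String truncateToUtf8Bytes(String text, int maxBytes) {
        int usedBytes = 0;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            // UTF-8 length of this code point: 1 to 4 bytes.
            int cpBytes = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (usedBytes + cpBytes > maxBytes) {
                break; // the next character would exceed the budget
            }
            usedBytes += cpBytes;
            i += Character.charCount(cp); // advance 1 or 2 Java chars
        }
        return text.substring(0, i);
    }

    public static void main(String[] args) {
        // "naive cafe" with two accented letters, a space, then three CJK characters
        String text = "na\u00efve caf\u00e9 \u65e5\u672c\u8a9e";
        String out = truncateToUtf8Bytes(text, 12);
        int size = out.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(out.length() + " chars, " + size + " bytes");
    }
}
```

With a budget that falls in the middle of a multi-byte character, the result
is simply shorter than the limit, which matches the "<= writeLimitBytes"
behaviour asked for above.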
was:
[~jmssiera] wrote up the enhancement request TIKA-3325 where he originally
requested that the number of bytes be passed as the write limit. I see that
issue was marked as Resolved, but writeLimit is number of chars instead of
number of bytes.
I have a use-case where the consumer side (an indexer) has a control for the
maximum number of bytes to index. When I'm using the writeLimit header with
Tika and I'm extracting text from a document with mixed ASCII and multi-byte
characters I can't get back exactly 6MB worth of text because I don't know
a-priori what chars will be in the file.
My ask here is for a new control, maybe "writeLimitBytes" where the number of
characters returned breaks on the last coherent character. Therefore the
returned text would be <= writeLimitBytes but would more or less be close to
that value.
> writeLimit for bytes in addition to characters
> ----------------------------------------------
>
> Key: TIKA-3643
> URL: https://issues.apache.org/jira/browse/TIKA-3643
> Project: Tika
> Issue Type: Improvement
> Components: core
> Affects Versions: 2.2.1
> Reporter: Josh Burchard
> Priority: Major