[
https://issues.apache.org/jira/browse/TIKA-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310638#comment-14310638
]
Stefano Fornari commented on TIKA-1436:
---------------------------------------
Oops, I did not notice this needed some background. As per the mentioned thread
on the mailing list, which I am quoting below for your convenience, I believe
there was consensus that the current pattern is not the best and is
difficult to understand. However, I am not sure what you mean about the many
unrelated changes to methods/variables. I quickly had a look at the patch and I
could not find any. Can you please point them out?
Thanks in advance,
> On #2, I expected the code you presented would not work. And in fact the
> pattern is quite odd, isn't it? What is the reason for throwing the
> exception if limiting the text read is a legal use case? (I am asking just
> to understand the background).
Yes, the pattern is a bit awkward and generally shouldn't be
recommended as it uses an exception to control the flow of the
program. However, in this case we considered it worth doing as the
alternative would have been far more complicated.
Basically we wanted to avoid having to modify each parser
implementation (even those implemented outside Tika...) to keep track
of how much content has already been extracted and instead do that
just once in the WriteOutContentHandler class. However, the only way
for the WriteOutContentHandler to signal that parsing should be
stopped is by throwing a SAXException, which is what we're doing here.
By catching the exception and inspecting it with isWriteLimitReached()
the client can determine whether this is what happened.
BR,
Jukka Zitting
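
For readers who want to see the pattern described above in code, here is a
minimal, self-contained sketch. It does not use Tika itself: LimitedHandler,
LimitReachedException, parse() and extract() are hypothetical stand-ins made up
for illustration, and only SAXException and DefaultHandler come from the JDK.
The assumption is that the handler alone counts characters and throws once the
limit is hit, while the parser stays unaware of any limit.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WriteLimitSketch {

    /** Hypothetical marker exception; not Tika's actual class. */
    static class LimitReachedException extends SAXException {
        LimitReachedException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for WriteOutContentHandler: the limit is tracked here only. */
    static class LimitedHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();
        private final int limit;

        LimitedHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int room = limit - out.length();
            if (length > room) {
                // Keep what still fits, then signal the parser to stop.
                out.append(ch, start, room);
                throw new LimitReachedException("write limit of " + limit + " reached");
            }
            out.append(ch, start, length);
        }

        /** Lets the client tell "limit reached" apart from a real failure. */
        boolean isWriteLimitReached(Throwable t) {
            return t instanceof LimitReachedException;
        }

        @Override
        public String toString() {
            return out.toString();
        }
    }

    /** Hypothetical parser: emits text chunks and knows nothing about limits. */
    static void parse(String[] chunks, DefaultHandler handler) throws SAXException {
        for (String chunk : chunks) {
            handler.characters(chunk.toCharArray(), 0, chunk.length());
        }
    }

    /** Client side: catch the exception and inspect it. */
    public static String extract(String[] chunks, int limit) throws SAXException {
        LimitedHandler handler = new LimitedHandler(limit);
        try {
            parse(chunks, handler);
        } catch (SAXException e) {
            if (!handler.isWriteLimitReached(e)) {
                throw e; // a genuine parsing error, pass it on
            }
            // Limit reached: expected here, the truncated text is kept.
        }
        return handler.toString();
    }

    public static void main(String[] args) throws SAXException {
        // prints "Hello, wo"
        System.out.println(extract(new String[] {"Hello, ", "world"}, 9));
    }
}
```

The point of the sketch is that parse() needs no changes to support a limit;
the cost is that the client has to treat one particular exception as a normal
outcome.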
> improvement to PDFParser
> ------------------------
>
> Key: TIKA-1436
> URL: https://issues.apache.org/jira/browse/TIKA-1436
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.6
> Reporter: Stefano Fornari
> Labels: parser, pdf
> Attachments: ste-20140927.patch
>
>
> With regard to the thread "[PDFParser] - read limited number of characters"
> on Mar 29, I would like to propose the attached patch. I noticed that in Tika
> 1.6 there has been some work on better handling of the
> WriteLimitReachedException condition, but I believe it could be improved
> even further.