[jira] [Commented] (TIKA-1436) improvement to PDFParser

Stefano Fornari (JIRA) Sat, 07 Feb 2015 01:48:50 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310638#comment-14310638
 ]


Stefano Fornari commented on TIKA-1436:
---------------------------------------

Oops, I did not notice this needed some background. As per the mentioned thread 
on the mailing list, which I am quoting below for your convenience, I believe 
there was consensus that the current pattern is not the best and is 
difficult to understand. However, I am not sure what you mean by the many 
unrelated changes to methods/variables. I had a quick look at the patch and 
could not find any. Can you please point them out?

Thanks in advance,

    > On #2, I expected the code you presented would not work. And in fact the
    > pattern is quite odd, isn't it? What is the reason for throwing the
    > exception if limiting the text read is a legal use case? (I am asking just
    > to understand the background).

    Yes, the pattern is a bit awkward and generally shouldn't be
    recommended as it uses an exception to control the flow of the
    program. However, in this case we considered it worth doing as the
    alternative would have been far more complicated.

    Basically we wanted to avoid having to modify each parser
    implementation (even those implemented outside Tika...) to keep track
    of how much content has already been extracted and instead do that
    just once in the WriteOutContentHandler class. However, the only way
    for the WriteOutContentHandler to signal that parsing should be
    stopped is by throwing a SAXException, which is what we're doing here.
    By catching the exception and inspecting it with isWriteLimitReached()
    the client can determine whether this is what happened.

    BR,

    Jukka Zitting
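
To make the pattern described above concrete, here is a minimal sketch of it 
using only the SAX classes shipped with the JDK. It is not Tika's code: 
LimitingHandler, LimitReachedException, fakeParse and extract are hypothetical 
names invented for illustration, and fakeParse merely stands in for a parser 
that knows nothing about the limit. Only the general idea (the handler counts 
characters and throws a SAXException to stop parsing, and the client catches it 
and checks whether the limit was the cause) is taken from the message above.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WriteLimitSketch {

    /** Hypothetical marker exception; not Tika's actual class. */
    static class LimitReachedException extends SAXException {
        LimitReachedException(String message) {
            super(message);
        }
    }

    /** Hypothetical handler that collects at most 'limit' characters. */
    static class LimitingHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();
        private final int limit;

        LimitingHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int remaining = limit - out.length();
            if (length > remaining) {
                // Keep what still fits, then signal the parser to stop.
                out.append(ch, start, remaining);
                throw new LimitReachedException("write limit of " + limit + " reached");
            }
            out.append(ch, start, length);
        }

        String text() {
            return out.toString();
        }
    }

    /** Stand-in for a parser: emits text in 4-character chunks, unaware of any limit. */
    static void fakeParse(String content, DefaultHandler handler) throws SAXException {
        char[] chars = content.toCharArray();
        for (int i = 0; i < chars.length; i += 4) {
            handler.characters(chars, i, Math.min(4, chars.length - i));
        }
    }

    /** Client side: a limit-reached exception is an expected outcome, not a failure. */
    static String extract(String content, int limit) {
        LimitingHandler handler = new LimitingHandler(limit);
        try {
            fakeParse(content, handler);
        } catch (LimitReachedException e) {
            // Expected when the text is longer than the limit; keep the partial text.
        } catch (SAXException e) {
            throw new IllegalStateException("genuine parse error", e);
        }
        return handler.text();
    }

    public static void main(String[] args) {
        System.out.println(extract("hello world", 5));
    }
}
```

The sketch also shows why the pattern reads oddly on the client side: a legal 
outcome (text truncated at the limit) arrives through a catch block and has to 
be told apart from real parse errors.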

> improvement to PDFParser
> ------------------------
>
>                 Key: TIKA-1436
>                 URL: https://issues.apache.org/jira/browse/TIKA-1436
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Stefano Fornari
>              Labels: parser, pdf
>         Attachments: ste-20140927.patch
>
>
> With regard to the thread "[PDFParser] - read limited number of characters" 
> on Mar 29, I would like to propose the attached patch. I noticed that in Tika 
> 1.6 there has been some work on better handling of the 
> WriteLimitReachedException condition, but I believe it could be improved 
> even further. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
