RE: Not-yet-broken breaking changes for Tika 2?

Allison, Timothy B. Wed, 07 Feb 2018 07:00:39 -0800

Do we worry about properly closing tags on an exception?

<body>
        <div parser="parser1">
                <p>
kaboom
        <div parser="parser2">
....


My focus is normally text, so broken tags aren't a problem for me... but what about others?
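
One way to keep the output well-formed in the situation above is to track which elements are still open and close them before the next parser starts writing. The following is only a minimal sketch under that assumption: TagBalancingHandler, depth() and closeDownTo() are hypothetical names, not existing Tika classes or methods, and the code uses only the org.xml.sax API that ships with the JDK.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): forwards SAX events to a delegate
// and remembers which elements have been opened but not yet closed.
class TagBalancingHandler extends DefaultHandler {
    private final ContentHandler delegate;
    // Each entry is {uri, localName, qName} of an element that is still open.
    private final Deque<String[]> open = new ArrayDeque<>();

    TagBalancingHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        open.push(new String[] {uri, localName, qName});
        delegate.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        open.poll(); // tolerate an unmatched end tag instead of throwing
        delegate.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        delegate.characters(ch, start, length);
    }

    // Number of elements currently open.
    int depth() {
        return open.size();
    }

    // Emit endElement events, innermost first, until only targetDepth
    // elements remain open.
    void closeDownTo(int targetDepth) throws SAXException {
        while (open.size() > targetDepth) {
            String[] e = open.pop();
            delegate.endElement(e[0], e[1], e[2]);
        }
    }
}
```

A caller would note depth() just before handing the handler to parser1 and, if parser1 throws, call closeDownTo() with that value before opening the div for parser2. The partial content from parser1 stays in the output, but the tags around it are balanced.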

-----Original Message-----
From: Luís Filipe Nassif [mailto:[email protected]] 
Sent: Monday, February 5, 2018 5:34 PM
To: [email protected]
Subject: Re: Not-yet-broken breaking changes for Tika 2?

From a forensic point of view, it is better to just say that we are trying
another parser and not reset the content handler, because the first parser may
have extracted relevant content before the exception.

To avoid spooling everything to temp files in order to re-read the stream, I
think we can create an optional setinputstreamfactory() method in
TikaInputStream, so that users can implement an InputStreamFactory interface
with a getInputStream method if they do not want to pay the performance cost
of temp files for everything.

Luis
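
The factory idea above could look roughly like the sketch below. It is only an illustration of the proposal: the InputStreamFactory shape is taken from the description in the message, while ReopenableSource, canReopen() and reopen() are hypothetical stand-ins for whatever TikaInputStream would do internally, and the camel-case setInputStreamFactory spelling is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;

// Sketch of the proposed interface: the user knows how to obtain a fresh
// stream over the same bytes (reopen a file, re-query a store, ...).
interface InputStreamFactory {
    InputStream getInputStream() throws IOException;
}

// Hypothetical stand-in for the part of TikaInputStream that would hold the
// optional factory. It is not the real TikaInputStream API.
class ReopenableSource {
    private InputStreamFactory factory;

    // Optional: callers who never set a factory keep today's behaviour.
    void setInputStreamFactory(InputStreamFactory factory) {
        this.factory = factory;
    }

    // True if a second parser can get a fresh stream without a temp file.
    boolean canReopen() {
        return factory != null;
    }

    // Returns a new stream positioned at the start of the data.
    InputStream reopen() throws IOException {
        if (factory == null) {
            throw new IllegalStateException(
                    "no InputStreamFactory set; spool to a temp file instead");
        }
        return factory.getInputStream();
    }
}
```

For a file on disk, the factory could be as small as `() -> java.nio.file.Files.newInputStream(path)`, so each fallback parser reads from the start without any spooling.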

On Feb 5, 2018 4:52 PM, "Chris Mattmann" <[email protected]> wrote:

I think we should just say, OK, now we're trying a different parser...



On 2/5/18, 9:51 AM, "Allison, Timothy B." <[email protected]> wrote:

    To my mind, the real challenge is what to do with content that should be
    ignored...

    If the strategy is back-off-on-exception (try the DOCX parser, but if
    there's an exception, use the Zip parser), what do we do with the SAX
    elements that have already been written?  Do we need a new handler type
    that has a reset() method?

    Or do we just say, hey, now we're trying a different parser...
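
For the reset() option, a minimal sketch might buffer each parser's SAX events and only forward them once that parser has finished without an exception. Everything named here (ResettableHandler, SketchParser, FallbackRunner) is hypothetical and not Tika API; SketchParser is a simplified stand-in for a real parser interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler with a reset() method: records events, replays later.
class ResettableHandler extends DefaultHandler {
    // A recorded SAX event that can be replayed against another handler.
    interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final List<Event> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        Attributes copy = new AttributesImpl(atts); // the caller may reuse atts
        events.add(t -> t.startElement(uri, localName, qName, copy));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add(t -> t.endElement(uri, localName, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        char[] copy = Arrays.copyOfRange(ch, start, start + length);
        events.add(t -> t.characters(copy, 0, copy.length));
    }

    // Discard everything recorded so far.
    void reset() {
        events.clear();
    }

    // Forward the recorded events, in order, to the real handler.
    void replayTo(ContentHandler target) throws SAXException {
        for (Event e : events) {
            e.replay(target);
        }
    }
}

// Simplified, hypothetical parser shape used only for this sketch.
interface SketchParser {
    void parse(byte[] data, ContentHandler handler) throws Exception;
}

class FallbackRunner {
    // Try each parser in order; only the first one that succeeds is emitted.
    static void parse(byte[] data, ContentHandler out, List<SketchParser> parsers)
            throws Exception {
        Exception last = null;
        ResettableHandler buffer = new ResettableHandler();
        for (SketchParser p : parsers) {
            try {
                p.parse(data, buffer);
                buffer.replayTo(out);
                return;
            } catch (Exception e) {
                buffer.reset(); // drop the failed parser's partial output
                last = e;
            }
        }
        if (last != null) {
            throw last;
        }
    }
}
```

The trade-off is that buffering holds a whole document's events in memory and throws away whatever the failed parser managed to extract, which is exactly the content the forensic use case wants to keep.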


    -----Original Message-----
    From: Mattmann, Chris A (1761) [mailto:[email protected]]
    Sent: Monday, February 5, 2018 12:29 PM
    To: [email protected]
    Subject: Re: Not-yet-broken breaking changes for Tika 2?

    Our solution is just to run the parser 2x... Yes, I get that it will
    induce overhead, but as a start, why not?
    In short, just run through the stream 2x...

    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    Chris Mattmann, Ph.D.
    Associate Chief Technology and Innovation Officer, OCIO
    Manager, Advanced IT Research and Open Source Projects Office (1761)
    Manager, NSF and Open Source Programs and Applications Office (8212)
    NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
    Office: 180-503E, Mailstop: 180-502
    Email: [email protected]
    WWW:  http://sunset.usc.edu/~mattmann/
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    Director, Information Retrieval and Data Science Group (IRDS)
    Adjunct Associate Professor, Computer Science Department
    University of Southern California, Los Angeles, CA 90089 USA
    WWW: http://irds.usc.edu/
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


    On 2/5/18, 9:25 AM, "Nick Burch" <[email protected]> wrote:

        On Mon, 5 Feb 2018, Chris Mattmann wrote:
        > Let's have a go at implementing it! You know my thoughts (make it like
        > OODT ;) )

        I'm still keen to hear how we can do the text content like OODT!

        I have tried to copy the OODT model for the proposed metadata case
        though :)

        Nick

        > On 2/5/18, 8:37 AM, "Nick Burch" <[email protected]> wrote:
        >
        >    Ping - anyone got any thoughts on the proposed metadata parser
        >    stuff, and any ideas on the content part?
        >
        >    On Tue, 2 Jan 2018, Nick Burch wrote:
        >    > On Thu, 26 Oct 2017, Chris Mattmann wrote:
        >    >> On collision, the precedence order defines what key takes
        >    >> precedence and _overwrites_ the other. Overwrite is but one
        >    >> option (you could save *all* the values; it’s a multi-valued
        >    >> key structure, so…)
        >    >
        >    > OK, I think that's fine. I've had a go at updating the wiki for
        >    > the metadata case:
        >    > https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
        >    > And example Tika Config settings for it
        >    > https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
        >    > If people are happy with how that sounds/looks, I can have a stab
        >    > at implementing it, as I *think* it's quite easy
        >    >
        >    >
        >    > However... that still leaves the Content (XHTML SAX events) case
        >    > to solve!
        >    >
        >    > Anyone have any ideas on how we can append to or cancel/reset the
        >    > Content Handler series of SAX events when we move onto a second+
        >    > parser for a file?
        >    >
        >    > Thanks
        >    > Nick
        >    >
        >    >> On 10/26/17, 9:43 AM, "Nick Burch" <[email protected]> wrote:
        >    >>
        >    >>    On Thu, 26 Oct 2017, Chris Mattmann wrote:
        >    >>    > My general approach to conflicting metadata is simply to
        >    >>    > define precedence orders.
        >    >>    >
        >    >>    > For example here is one documented from OODT:
        >    >>    >
        >    >>    > https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
        >    >>    >
        >    >>    > We can do similar things with Tika, e.g.,
        >    >>    >
        >    >>    > [CoreMetadata.PROPERTIES]
        >    >>    > [ImageParser.METADATA]
        >    >>    > [TikaOCR.METADATA]
        >    >>
        >    >>    What happens if two different parsers both output the same
        >    >>    bit of metadata though? eg Tim's example of one giving
        >    >>    dc:creator of Tim and the second giving dc:creator of Chris?
        >    >>
        >    >>
        >    >>    Secondly, what about the XHTML sax events stream? I think
        >    >>    that's probably the harder case...
        >    >>
        >    >>    Nick
        >
        >
        >
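
The precedence-order idea quoted above (on a key collision the higher-precedence source wins, or alternatively all values are kept because the structure is multi-valued) can be sketched as below. This is an illustrative sketch only: MetadataMerger and its Policy enum are hypothetical names, and plain maps stand in for Tika's Metadata class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: merges per-parser metadata by precedence order.
class MetadataMerger {
    enum Policy {
        FIRST_WINS, // the highest-precedence value for a key is the only one kept
        KEEP_ALL    // every value is kept, ordered by precedence
    }

    // sources must be ordered from highest to lowest precedence.
    static Map<String, List<String>> merge(
            List<Map<String, List<String>>> sources, Policy policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> source : sources) {
            for (Map.Entry<String, List<String>> e : source.entrySet()) {
                List<String> existing = merged.get(e.getKey());
                if (existing == null) {
                    // First time this key is seen: copy the values.
                    merged.put(e.getKey(), new ArrayList<>(e.getValue()));
                } else if (policy == Policy.KEEP_ALL) {
                    existing.addAll(e.getValue());
                }
                // FIRST_WINS: a lower-precedence value never replaces an
                // existing one, so nothing to do here.
            }
        }
        return merged;
    }
}
```

With the dc:creator example from the thread, putting the parser that reports Tim first and the one that reports Chris second yields only Tim under FIRST_WINS, and both values under KEEP_ALL.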
