Mine too, but I know it is important for many use cases. Maybe we could add
some tracking of open tags to XHTMLContentHandler, plus a new method to close
them?
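
To make the idea concrete, here is a minimal sketch of that kind of tracking, written against plain SAX so it runs with the JDK alone. The class name OpenTagTrackingHandler and the methods closeOpenElements() and openCount() are hypothetical; nothing here is existing Tika API, and only three SAX events are forwarded to keep it short.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical decorator: remembers which elements are open so they can be closed after a failure. */
class OpenTagTrackingHandler extends DefaultHandler {
    private final ContentHandler delegate;
    // One {uri, localName, qName} entry per element that has been started but not yet ended.
    private final Deque<String[]> open = new ArrayDeque<>();

    OpenTagTrackingHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        delegate.startElement(uri, localName, qName, atts);
        // Record the element only after the delegate accepted it.
        open.push(new String[] {uri, localName, qName});
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        delegate.endElement(uri, localName, qName);
        open.poll(); // drop the innermost open element, if any
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        delegate.characters(ch, start, length);
    }

    /** Emits endElement for everything still open, innermost first. */
    public void closeOpenElements() throws SAXException {
        while (!open.isEmpty()) {
            String[] e = open.pop();
            delegate.endElement(e[0], e[1], e[2]);
        }
    }

    /** Number of elements currently open. */
    public int openCount() {
        return open.size();
    }
}
```

A caller would invoke closeOpenElements() in the catch block around the failing parser, before handing the same handler to the next one.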

2018-02-07 12:59 GMT-02:00 Allison, Timothy B. <talli...@mitre.org>:

> Do we worry about properly closing tags on an exception?
>
> <body>
>         <div parser="parser1">
>                 <p>
> kaboom
>         <div parser="parser2">
> ....
>
> My focus is normally text so broken tags aren't a problem for me...but
> others?
>
> -----Original Message-----
> From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com]
> Sent: Monday, February 5, 2018 5:34 PM
> To: dev@tika.apache.org
> Subject: Re: Not-yet-broken breaking changes for Tika 2?
>
> For a forensic use case, it is better to just say we are trying another
> parser and not reset the content handler, because the first parser can
> extract relevant content before the exception.
>
> To avoid spooling everything to temp files just to re-read the stream, I
> think we can create an optional setInputStreamFactory() method in
> TikaInputStream, so users can implement an InputStreamFactory interface
> with a getInputStream method if they do not want to pay the performance
> hit of temp files for everything.
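
A rough sketch of what that proposal could look like, assuming the interface has the single getInputStream method described above. The interface is declared locally for illustration only, and ReopenDemo and readFirstByteTwice are hypothetical names; how TikaInputStream would actually accept such a factory is left open.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical interface from the proposal: returns a fresh stream over the same data on each call. */
interface InputStreamFactory {
    InputStream getInputStream() throws IOException;
}

class ReopenDemo {
    /**
     * Reads the first byte twice, opening a new stream for each attempt
     * instead of spooling the original stream to a temp file.
     */
    static int[] readFirstByteTwice(InputStreamFactory factory) throws IOException {
        int[] firstBytes = new int[2];
        for (int attempt = 0; attempt < 2; attempt++) {
            try (InputStream in = factory.getInputStream()) {
                firstBytes[attempt] = in.read();
            }
        }
        return firstBytes;
    }
}
```

A user whose data already lives somewhere re-readable (a file, a database blob, a byte array) could supply the factory as a lambda, for example `() -> new ByteArrayInputStream(bytes)`.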
>
> Luis
>
> On Feb 5, 2018 4:52 PM, "Chris Mattmann" <mattm...@apache.org>
> wrote:
>
> I think we should just say, OK now we're trying a different parser....
>
>
>
> On 2/5/18, 9:51 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
>     To my mind, the real challenge is what to do with content that should
> be ignored...
>
>     If the strategy is back-off-on-exception (try the DOCX parser, but if
> there's an exception, use the Zip parser), what do we do with the sax
> elements that have already been written?  Do we need a new handler type
> that has a reset() method?
>
>     Or do we just say, hey, now we're trying a different parser...
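
The two options above (a handler with reset(), or simply carrying on with a different parser) can be sketched side by side. This is an illustration under stated assumptions only: ResettableTextHandler, ParseAttempt and FallbackRunner are hypothetical names, ParseAttempt is a stand-in rather than Tika's Parser interface, and only character events are buffered.

```java
import java.util.List;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that buffers text so a failed attempt can be thrown away. */
class ResettableTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    /** Discards everything written so far. */
    void reset() {
        text.setLength(0);
    }

    String text() {
        return text.toString();
    }
}

/** Hypothetical stand-in for a parser; not Tika's Parser interface. */
interface ParseAttempt {
    void parse(DefaultHandler handler) throws SAXException;
}

class FallbackRunner {
    /**
     * Tries each attempt in order and stops at the first success.
     * On failure it either resets the handler or keeps the partial output,
     * depending on resetOnFailure.
     */
    static boolean run(List<ParseAttempt> attempts, ResettableTextHandler handler,
                       boolean resetOnFailure) {
        for (ParseAttempt attempt : attempts) {
            try {
                attempt.parse(handler);
                return true;
            } catch (SAXException e) {
                if (resetOnFailure) {
                    handler.reset();
                }
            }
        }
        return false;
    }
}
```

With resetOnFailure set to false, whatever the first parser wrote before failing stays in the output and the second parser appends to it; with true, only the successful parser's output survives.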
>
>
>     -----Original Message-----
>     From: Mattmann, Chris A (1761) [mailto:chris.a.mattm...@jpl.nasa.gov]
>     Sent: Monday, February 5, 2018 12:29 PM
>     To: dev@tika.apache.org
>     Subject: Re: Not-yet-broken breaking changes for Tika 2?
>
>     Our solution is just to run the parser 2x....yes I get it will induce
> overhead, but as a start, why not?
>     In short just run through the stream 2x....
>
>     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> ++++++++++++++
>     Chris Mattmann, Ph.D.
>     Associate Chief Technology and Innovation Officer, OCIO Manager,
> Advanced IT Research and Open Source Projects Office (1761) Manager, NSF
> and Open Source Programs and Applications Office (8212) NASA Jet Propulsion
> Laboratory Pasadena, CA 91109 USA
>     Office: 180-503E, Mailstop: 180-502
>     Email: chris.a.mattm...@nasa.gov
>     WWW:  http://sunset.usc.edu/~mattmann/
>     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> ++++++++++++++
>     Director, Information Retrieval and Data Science Group (IRDS) Adjunct
> Associate Professor, Computer Science Department University of Southern
> California, Los Angeles, CA 90089 USA
>     WWW: http://irds.usc.edu/
>     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> ++++++++++++++
>
>
>     On 2/5/18, 9:25 AM, "Nick Burch" <apa...@gagravarr.org> wrote:
>
>         On Mon, 5 Feb 2018, Chris Mattmann wrote:
>         > Let's have a go at implementing it! You know my thoughts (make
>         > it like OODT ;) )
>
>         I'm still keen to hear how we can do the text content like OODT!
>
>         I have tried to copy the OODT model for the proposed metadata case
> though
>         :)
>
>         Nick
>
>         > On 2/5/18, 8:37 AM, "Nick Burch" <apa...@gagravarr.org> wrote:
>         >
>         >    Ping - anyone got any thoughts on the proposed metadata parser
> stuff, and
>         >    any ideas on the content part?
>         >
>         >    On Tue, 2 Jan 2018, Nick Burch wrote:
>         >    > On Thu, 26 Oct 2017, Chris Mattmann wrote:
>         >    >> On collision, the precedence order defines which key takes
>         >    >> precedence and _overwrites_ the other. Overwrite is but one
>         >    >> option (you could save *all* the values; it’s a multi-valued
>         >    >> key structure, so…)
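
The two collision strategies mentioned here, overwrite by precedence or keep every value, can be sketched with plain maps. This is an illustration only: MetadataMerge and its methods are hypothetical names, and plain java.util maps stand in for Tika's Metadata class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MetadataMerge {
    /** Later maps in the list have higher precedence and overwrite earlier values for the same key. */
    static Map<String, String> overwrite(List<Map<String, String>> lowestToHighest) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> m : lowestToHighest) {
            merged.putAll(m);
        }
        return merged;
    }

    /** Keeps every value for a key, listed in precedence order (lowest first). */
    static Map<String, List<String>> keepAll(List<Map<String, String>> lowestToHighest) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, String> m : lowestToHighest) {
            for (Map.Entry<String, String> e : m.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return merged;
    }
}
```

For the dc:creator example raised in this thread, where one parser reports Tim and a higher-precedence parser reports Chris, overwrite leaves only Chris while keepAll retains both values in order.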
>         >    >
>         >    > OK, I think that's fine. I've had a go at updating the wiki
> for the metadata
>         >    > case:
>         >    > https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
>         >    > And example Tika Config settings for it
>         >    > https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
>         >    > If people are happy with how that sounds/looks, I can have a
> stab at
>         >    > implementing it, as I *think* it's quite easy
>         >    >
>         >    >
>         >    > However... that still leaves the Content (XHTML SAX events)
>         >    > case to solve!
>         >    >
>         >    > Anyone have any ideas on how we can append to or
> cancel/reset the Content
>         >    > Handler series of SAX events when we move onto a second+
> parser for a file?
>         >    >
>         >    > Thanks
>         >    > Nick
>         >    >
>         >    >> On 10/26/17, 9:43 AM, "Nick Burch" <apa...@gagravarr.org>
> wrote:
>         >    >>
>         >    >>    On Thu, 26 Oct 2017, Chris Mattmann wrote:
>         >    >>    > My general approach to conflicting metadata is simply
> to define
>         >    >>    > precedence orders.
>         >    >>    >
>         >    >>    > For example here is one documented from OODT:
>         >    >>    >
>         >    >>    >
>         >    >> https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
>         >    >>    >
>         >    >>    > We can do similar things with Tika, e.g.,
>         >    >>    >
>         >    >>    > [CoreMetadata.PROPERTIES]
>         >    >>    > [ImageParser.METADATA]
>         >    >>    > [TikaOCR.METADATA]
>         >    >>
>         >    >>    What happens if two different parsers both output the
> same bit of
>         >    >> metadata
>         >    >>    though? eg Tim's example of one giving dc:creator of Tim
> and the second
>         >    >>    giving dc:creator of Chris?
>         >    >>
>         >    >>
>         >    >>    Secondly, what about the XHTML sax events stream? I
> think that's
>         >    >> probably
>         >    >>    the harder case...
>         >    >>
>         >    >>    Nick
>         >
>         >
>         >
>
