Why don’t we just store N copies of the stream, and parse it N times?

Of course that’s the ugly way, but currently the way I’ve hacked this in all of
my projects is simply to call Tika N times OUTSIDE of Tika. Why don’t we just
use that as the weakest baseline and work backwards from there?
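
A minimal sketch of that baseline, assuming the document fits in memory. The
StreamParser interface and the MultiParseSketch class are hypothetical
stand-ins for illustration, not Tika APIs; the only point is that the stream
is read once and every parser gets its own fresh copy.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MultiParseSketch {

    /** Hypothetical stand-in for a parser; this is NOT a Tika interface. */
    public interface StreamParser {
        String parse(InputStream in) throws IOException;
    }

    /** Reads the source once, then hands each parser its own copy of the bytes. */
    public static List<String> parseWithEach(InputStream source, List<StreamParser> parsers)
            throws IOException {
        // Assumption: the whole document fits in memory (readAllBytes needs Java 9+).
        byte[] bytes = source.readAllBytes();
        List<String> results = new ArrayList<>();
        for (StreamParser parser : parsers) {
            // A fresh stream per parser, so no parser sees a half-consumed stream.
            try (InputStream copy = new ByteArrayInputStream(bytes)) {
                results.add(parser.parse(copy));
            }
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        // Two toy "parsers": one returns the text, one reports the byte count.
        StreamParser text = in -> new String(in.readAllBytes(), StandardCharsets.UTF_8);
        StreamParser length = in -> "length=" + in.readAllBytes().length;

        InputStream source = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(parseWithEach(source, List.of(text, length)));
    }
}
```

For large files the byte array could be swapped for a temp file that is
reopened per parser; the shape of the loop stays the same.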

Chris




On 10/26/17, 3:56 AM, "Nick Burch" <[email protected]> wrote:

    Hi All
    
    Based on the plan on the wiki 
    <https://wiki.apache.org/tika/Tika2_0RoadMap> 
    <https://wiki.apache.org/tika/Tika2_0MigrationGuide>, we still have a 
    major breaking change or two planned for Tika 2 that we haven't yet 
    "broken". (There's also removing some deprecated stuff etc)
    
    
    As I understand it, the biggest breaking TODO change is around having 
    multiple parsers available + active for a given format. This could be to 
    support fallback parsers, eg "try this fancy new parser, but if it fails 
    retry with this simpler one" or "try this xml parser, if that fails just 
    try strings". A related but different case is to cleanly support multiple 
    parsers covering different aspects, eg OCR an image plus extract metadata, 
    or NER on the contents of a scientific PDF + text + metadata + NER of the 
    OCR of embedded images in the PDF.
    
    Currently, we can't cleanly do the former, and the latter is (badly) 
    handled via one parser (eg OCR or NER) having an embedded hard-coded 
    reference to another (eg Image or PDF).
    
    
    We've got some details on the proposed plans and ideas on the wiki:
    https://wiki.apache.org/tika/CompositeParserDiscussion
    
    The biggest stumbling block, as I see it, is how to let multiple parsers 
    interact with the SAX content handler. For the fallback case, that's how 
    to say "sorry, ignore all that XML we already sent, we're starting again 
    with this XML now". For the multiple parser case, it's how we could have 
    the image parser "finish" the (empty) XHTML but then have the OCR one send 
    some text, or have the NER parser get at the XHTML text of the PDF + OCR 
    of embedded images to enhance with the entities.
    
    
    What do we think for this? Can we come up with a solution to let this go 
    forward? Is there a pattern from elsewhere we can follow?
    
    Or do we need to cancel this for 2.x, ponder it for another 1-2 years, and 
    do this stuff in Tika 3 instead?
    
    Nick
    

