Hi,

I started using Tika a couple of days ago, so it could very well be that I'm using the wrong ContentHandler, Parser configuration and whatnot. I hope that's the case, and that there's a simple fix to the following problem:
I index documents (for this discussion, PPT files) and then search and produce search highlights using Lucene. I've noticed that PowerPoint documents produce rather long highlights. I use Lucene's PostingsHighlighter, which breaks the content using BreakIterator.getSentenceInstance ( http://docs.oracle.com/javase/6/docs/api/java/text/BreakIterator.html). PPT documents often (I guess) do not contain sentence terminators (e.g. '.') at the end of bullets, and this results in very long sentences.

I wrote a simple program which parses a PPT file with one slide that looks like this:

    Slide title
    - Short bullet
    - Long bullet which will eventually end with a dot, but not just yet.
    - Long bullet which doesn't end with a dot, not now and not ever
    - A bullet which is split into multiple lines

That's it, very simple. What I would expect (or hoped!) is that 5 sentences will be output: one for the slide's title and one for each bullet. Instead, if I parse the file with BodyContentHandler and then invoke the sentence BreakIterator, I get this:

    * * * Slide Title Short bullet Long bullet which will end eventually with a dot, but not just yet.
    ++++++
    Long bullet which doesn’t end with a dot, not now and not ever A bullet which is split into multiple lines
    ++++++

The '++++++' lines are markers that I print after each sentence the BreakIterator detects. Here's the code which invokes the iterator:

    // requires: import java.text.BreakIterator;
    BreakIterator iterator = BreakIterator.getSentenceInstance();
    iterator.setText(content);
    for (int start = iterator.first(), end = iterator.next();
         end != BreakIterator.DONE;
         start = end, end = iterator.next()) {
      System.out.println(content.substring(start, end));
      System.out.println("++++++");
    }

As you can see, the only sentence break inside the slide text comes after the bullet which ends with a dot '.'. If I remove that '.', the sentence-end marker disappears as well.
I then thought that perhaps I should get the "raw" output from Tika, so I followed the TikaCLI code and used a TransformerHandler (with method "xml") to get the output XML. I hoped that I could then replace whatever markers Tika emits, be it <br/> or </p>, with sentence breaks, but I don't see such markers between the bullets:

    ...
    <body><div class="slideShow"><div class="slide"><p class="slide-master-content">*<br/> *<br/> *<br/> </p>
    <p class="slide-content">Slide Title<br/> Short bullet Long bullet which will eventually end with a dot, but not just yet. ++++++ Long bullet which doesn’t end with a dot, not now and not never A bullet which is split into multiple lines<br/> </p>
    </div> </div> <div class="slideNotes"/> </body>

Is there a way I can make Tika output sentence boundaries for such bullets? Or maybe output a marker which I can then replace with a valid sentence break (there are a few I can pick from, according to http://www.unicode.org/reports/tr29/#Sentence_Boundaries)?

I did notice there are \n characters in the output text, but I don't think replacing every \n with a '.' is a very generic solution, as the multi-line bullet shows.

Shai
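
To make the marker-replacement idea concrete, here is a minimal sketch. It is not something Tika provides: the class BulletBreaks and its methods are hypothetical names. It swaps each '\n' for U+2029 (PARAGRAPH SEPARATOR), which UAX #29 lists as a sentence boundary, instead of inserting a '.'. Because the swap is one character for one character, the text length and all offsets stay the same, which should matter for highlighting. Two assumptions to verify: that the JDK's sentence BreakIterator actually breaks after U+2029 on your runtime, and that treating every '\n' as a boundary is acceptable. A bullet that wraps across several lines would still be split, which is the limitation mentioned above.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or Lucene.
final class BulletBreaks {

    // Unicode PARAGRAPH SEPARATOR; UAX #29 treats it as a sentence boundary.
    static final char PARAGRAPH_SEPARATOR = '\u2029';

    // Replaces each line feed with U+2029 and each carriage return with a
    // space, so the result has exactly the same length as the input.
    static String markLineBreaks(String content) {
        StringBuilder sb = new StringBuilder(content.length());
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (c == '\n') {
                sb.append(PARAGRAPH_SEPARATOR);
            } else if (c == '\r') {
                sb.append(' ');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Same iteration pattern as in the message above, collecting the
    // detected sentences instead of printing them.
    static List<String> sentences(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance();
        iterator.setText(text);
        List<String> result = new ArrayList<>();
        for (int start = iterator.first(), end = iterator.next();
             end != BreakIterator.DONE;
             start = end, end = iterator.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the text extracted by BodyContentHandler.
        String content = "Slide Title\nShort bullet\n"
                + "Long bullet which doesn't end with a dot, not now and not ever";
        for (String sentence : sentences(markLineBreaks(content))) {
            System.out.println(sentence);
            System.out.println("++++++");
        }
    }
}
```

If the iterator on a given JDK does not break after U+2029, the same one-for-one replacement could be tried with another character from the UAX #29 list, or the text could be split on line breaks before it is handed to the sentence iterator.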
