Hi

I started using Tika a couple of days ago, so it could very well be
that I'm using the wrong ContentHandler, Parser configuration and whatnot.
I hope I am, and that there's a simple fix for the following problem:

I index documents (for this discussion, PPT files) and then search and
produce search highlights (using Lucene). I've noticed that PowerPoint
documents produce rather long highlights. I use Lucene's
PostingsHighlighter, which breaks the content using
BreakIterator.getSentenceInstance() (
http://docs.oracle.com/javase/6/docs/api/java/text/BreakIterator.html), and
for PPT documents, which often (I guess) do not contain sentence breaks
(e.g. '.') at the end of bullets, this results in very long sentences.

I wrote a simple program which parses a PPT file with one slide that looks
like this:

Slide title

   - Short bullet
   - Long bullet which will eventually end with a dot, but not just yet.
   - Long bullet which doesn't end with a dot, not now and not ever
   - A bullet which
   is split into
   multiple lines

That's it, very simple. What I expected (or hoped for!) is 5 sentences in
the output: one for the slide's title and one for each bullet. Instead,
if I parse the file with BodyContentHandler and then invoke the
sentence BreakIterator, I get this:

*
*
*

Slide Title
Short bullet
Long bullet which will end eventually with a dot, but not just yet.

++++++
Long  bullet which doesn’t end with a dot, not now and not ever
A bullet which is split into multiple lines





++++++

The '++++++' are marks that I print after each sentence the BreakIterator
detects. Here's the code which invokes the iterator:

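    import java.text.BreakIterator;

    // content is the text extracted by Tika's BodyContentHandler
    BreakIterator iterator = BreakIterator.getSentenceInstance();
    iterator.setText(content);
    for (int start = iterator.first(), end = iterator.next();
         end != BreakIterator.DONE;
         start = end, end = iterator.next()) {
        // print each detected sentence, followed by a marker line
        System.out.println(content.substring(start, end));
        System.out.println("++++++");
    }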

As you can see, only the bullet which ends with a dot '.' results in a
sentence break. And if I remove the '.', that break disappears as well.
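
To reproduce this without Tika or a PPT file, the same loop can be run on a
hand-written string. This is only a sketch: the class name, the split()
helper and the sample text are made up for illustration, and the sample is
an approximation of the extracted text, not the exact BodyContentHandler
output. It also pins the locale to Locale.ROOT, which is an assumption on my
side to keep runs comparable across machines.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class for reproducing the issue; not part of Tika or Lucene.
class SentenceSplitDemo {

    // Returns the segments that the JDK sentence BreakIterator finds in text.
    static List<String> split(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.ROOT);
        iterator.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
             start = end, end = iterator.next()) {
            sentences.add(text.substring(start, end));
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Approximation of the text extracted from the slide (assumed, not real output).
        String content = "Slide Title\n"
                + "Short bullet\n"
                + "Long bullet which will eventually end with a dot, but not just yet.\n"
                + "Long bullet which doesn't end with a dot, not now and not ever\n"
                + "A bullet which is split into multiple lines\n";
        for (String sentence : split(content)) {
            System.out.println(sentence);
            System.out.println("++++++");
        }
    }
}
```

Removing the '.' from the sample string and running it again makes it easy to
compare the two cases side by side.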

I then thought perhaps I should get the "raw" output from Tika, and
followed the TikaCLI code to use a TransformerHandler (with method "xml") in
order to get the output XML. I thought that by doing that I could
replace whatever markers Tika emits with sentence breaks, be it <br/> or
</p>, but I don't see such markers between the bullets:

...
<body><div class="slideShow"><div class="slide"><p
class="slide-master-content">*<br/>
*<br/>
*<br/>
</p>
<p class="slide-content">Slide Title<br/>
Short bullet
Long bullet which will eventually end with a dot, but not just yet.

++++++
Long  bullet which doesn’t end with a dot, not now and not never
A bullet which is split into multiple lines<br/>
</p>
</div>
</div>
<div class="slideNotes"/>
</body>
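
For reference, the marker replacement I had in mind would look roughly like
the handler below. It is a sketch only: BreakInsertingHandler is a made-up
name, and the choice of U+2029 (PARAGRAPH SEPARATOR, one of the separators
listed in UAX #29) as the break character is an assumption that would need
to be checked against what the JDK BreakIterator actually does with it. It
uses only the SAX classes that ship with the JDK.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects text and appends a break character
// whenever a <p> or <br> element ends. Not part of Tika.
class BreakInsertingHandler extends DefaultHandler {

    // U+2029 PARAGRAPH SEPARATOR, assumed here to act as a sentence break.
    static final char BREAK = (char) 0x2029;

    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Fall back to qName when the event source is not namespace aware.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("p".equals(name) || "br".equals(name)) {
            text.append(BREAK);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }
}
```

Such a handler could be passed to Parser.parse(stream, handler, metadata,
context) in place of BodyContentHandler. Given the XML above, though, it
would only add breaks after the title and at the end of the slide, since no
element ends between the individual bullets.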

Is there a way I can make Tika output sentence boundaries for such bullets?
Or maybe output a marker which I can then replace with a valid sentence break
(there are a few I can pick from, according to
http://www.unicode.org/reports/tr29/#Sentence_Boundaries)?

I did notice there are \n characters in the output text, but I don't think
replacing every \n with a '.' is a very generic solution, as the multi-line
bullet shows.

Shai
