https://issues.apache.org/bugzilla/show_bug.cgi?id=51803

--- Comment #3 from Yegor Kozlov <[email protected]> 2011-11-07 13:43:00 UTC ---
MasterSlide.getTextRuns is used in other places and should return all runs
including boilerplate ones from placeholders. I think the correct fix would be
as follows. Line 224 in PowerPointExtractor.java invokes textRunsToText,
but that method cannot distinguish placeholder runs from normal text:

                    textRunsToText(ret, master.getTextRuns());

It would be better to rewrite it to iterate over the shapes in the master sheet:

                    for(Shape sh : master.getShapes()){
                        if(sh instanceof TextShape){
                            if(MasterSheet.isPlaceholder(sh)) {
                                // don't bother about boilerplate text on master sheets
                                continue;
                            }
                            TextShape tsh = (TextShape)sh;
                            String text = tsh.getText();
                            ret.append(text);
                            if (!text.endsWith("\n")) {
                                ret.append("\n");
                            }
                        }
                    }
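
To try the filtering logic in isolation, without a POI build, here is a
minimal self-contained sketch. Note the assumptions: Shape, TextShape and
isPlaceholder below are simplified stand-ins written only for this example
(the placeholder flag is a plain boolean), not the real HSLF classes, and
extractMasterText is a hypothetical helper name. Only the loop structure
mirrors the snippet above.

```java
import java.util.ArrayList;
import java.util.List;

public class MasterTextSketch {
    /** Hypothetical stand-in for an HSLF shape; NOT the real POI class. */
    static class Shape {}

    /** Hypothetical stand-in for a text-bearing shape. */
    static class TextShape extends Shape {
        final String text;
        final boolean placeholder; // assumption: placeholder status is a simple flag here

        TextShape(String text, boolean placeholder) {
            this.text = text;
            this.placeholder = placeholder;
        }

        String getText() {
            return text;
        }
    }

    /** Stand-in for MasterSheet.isPlaceholder; here it only reads the flag. */
    static boolean isPlaceholder(Shape sh) {
        return sh instanceof TextShape && ((TextShape) sh).placeholder;
    }

    /** Collects text from non-placeholder text shapes, one entry per line. */
    static String extractMasterText(List<Shape> shapes) {
        StringBuilder ret = new StringBuilder();
        for (Shape sh : shapes) {
            if (!(sh instanceof TextShape)) {
                continue; // shapes without text contribute nothing
            }
            if (isPlaceholder(sh)) {
                continue; // skip boilerplate placeholder text
            }
            String text = ((TextShape) sh).getText();
            ret.append(text);
            if (!text.endsWith("\n")) {
                ret.append("\n"); // make sure every shape's text ends its line
            }
        }
        return ret.toString();
    }

    public static void main(String[] args) {
        List<Shape> shapes = new ArrayList<>();
        shapes.add(new TextShape("Click to edit Master title style", true));
        shapes.add(new TextShape("Company confidential", false));
        shapes.add(new Shape());
        shapes.add(new TextShape("Footer line\n", false));
        System.out.print(extractMasterText(shapes));
    }
}
```

With the sample shapes in main, only the two non-placeholder texts should
be emitted, each terminated by exactly one newline.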

Any volunteers to help me with testing? I have never worked with text extractors
and don't want to accidentally break things. I guess the best approach would be
to apply this fix, then build and test from inside Tika.

Yegor

(In reply to comment #2)
> I think there is still a problem here: with the example PPT I
> attached, I see boiler-plate text when I run PowerPointExtract (which
> does set the flag to include master slide text, in its static main
> method).
> 
> I see code in HSLF for detecting that a given Shape is a placeholder
> (MasterSheet.isPlaceholder), so it seems possible we can avoid
> extracting such text?  But I'm not familiar enough with the APIs, eg
> when Sheet.findTextRuns is invoked for a MasterSlide, how can it get
> the Shape for each run and then skip its text if it's a placeholder?
