https://issues.apache.org/bugzilla/show_bug.cgi?id=51803
--- Comment #3 from Yegor Kozlov <[email protected]> 2011-11-07 13:43:00 UTC ---
MasterSlide.getTextRuns is used in other places and should return all runs, including boilerplate ones from placeholders.

I think the correct fix would be as follows:

Line 224 in PowerPointExtractor.java invokes textRunsToText, but this method can't tell placeholder runs from normal text:

    textRunsToText(ret, master.getTextRuns());

It is better to rewrite it and iterate over the shapes in the master sheet:

    for (Shape sh : master.getShapes()) {
        if (sh instanceof TextShape) {
            if (MasterSheet.isPlaceholder(sh)) {
                // don't bother about boilerplate text on master sheets
                continue;
            }
            TextShape tsh = (TextShape) sh;
            String text = tsh.getText();
            ret.append(text);
            if (!text.endsWith("\n")) {
                ret.append("\n");
            }
        }
    }

Any volunteers to help me with testing? I have never worked with text extractors and don't want to accidentally break things. I guess the best approach would be to apply this fix, then build and test from inside Tika.

Yegor

(In reply to comment #2)
> I think there is still a problem here: with the example PPT I
> attached, I see boilerplate text when I run PowerPointExtractor (which
> does set the flag to include master slide text, in its static main
> method).
>
> I see code in HSLF for detecting that a given Shape is a placeholder
> (MasterSheet.isPlaceholder), so it seems possible we can avoid
> extracting such text? But I'm not familiar enough with the APIs, e.g.
> when Sheet.findTextRuns is invoked for a MasterSlide, how can it get
> the Shape for each run and then skip its text if it's a placeholder?
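For anyone who wants to try the filtering logic without a POI build at hand, here is a minimal, self-contained sketch of the loop proposed in comment #3. Everything in it is a stand-in: the Shape, TextShape and PictureShape types and the isPlaceholder helper are hypothetical simplifications written for this example, not the HSLF classes, and the null check on the text is an added assumption (a shape may carry no text) that is not part of the proposed patch.

```java
import java.util.Arrays;
import java.util.List;

public class PlaceholderFilterSketch {

    /** Stand-in for a shape on a master sheet (hypothetical, not the HSLF Shape). */
    public interface Shape {}

    /** Stand-in for a text-bearing shape; the placeholder flag replaces the real lookup. */
    public static final class TextShape implements Shape {
        final String text;          // may be null when the shape holds no text
        final boolean placeholder;  // true for "Click to edit ..." style boilerplate

        public TextShape(String text, boolean placeholder) {
            this.text = text;
            this.placeholder = placeholder;
        }
    }

    /** Stand-in for any shape without text. */
    public static final class PictureShape implements Shape {}

    /** Hypothetical equivalent of the placeholder check mentioned in the thread. */
    public static boolean isPlaceholder(Shape sh) {
        return sh instanceof TextShape && ((TextShape) sh).placeholder;
    }

    /** Collects text from non-placeholder text shapes, one newline-terminated entry each. */
    public static String extractMasterText(List<? extends Shape> shapes) {
        StringBuilder ret = new StringBuilder();
        for (Shape sh : shapes) {
            if (!(sh instanceof TextShape)) {
                continue; // pictures, lines, etc. carry no text
            }
            if (isPlaceholder(sh)) {
                continue; // skip boilerplate text on master sheets
            }
            String text = ((TextShape) sh).text;
            if (text == null) {
                continue; // assumption: empty text shapes are skipped
            }
            ret.append(text);
            if (!text.endsWith("\n")) {
                ret.append("\n");
            }
        }
        return ret.toString();
    }

    public static void main(String[] args) {
        List<Shape> shapes = Arrays.<Shape>asList(
                new TextShape("Click to edit Master title style", true),
                new TextShape("Confidential", false),
                new PictureShape(),
                new TextShape("Page footer\n", false));
        System.out.print(extractMasterText(shapes));
    }
}
```

With the sample shapes in main, only the two non-placeholder text shapes contribute to the result. The real change would still need testing against actual PPT files, ideally from inside Tika as suggested above.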
