[ https://issues.apache.org/jira/browse/TIKA-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707366#comment-16707366 ]
Tim Allison commented on TIKA-2550:
-----------------------------------

[~lfcnassif], yes, I was worried about breaking things, and I'm willing to revert this and find a different solution.

I just added a unit test to confirm that script elements are still being extracted when the HTMLParser is configured to extract them and the ToTextHandler is being used.

I also checked legacy behavior, and scripts are not coming through in the ToTextHandler from HTML files with scripts, so there's no change in behavior there.

But still, this could break things. Let me know if I should revert this and create a new handler or otherwise fix the extraction so that we're not getting style info in the "text" for Java source files.

> ToTextHandler includes <style/> element content
> -----------------------------------------------
>
>                 Key: TIKA-2550
>                 URL: https://issues.apache.org/jira/browse/TIKA-2550
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Trivial
>             Fix For: 2.0.0, 1.20
>
>
> When using the ToTextHandler to process .java files, the <style/> element content is included, e.g.:
> {noformat}
> testFile
>     code {
>       color: rgb(0,0,0); font-family: monospace; font-size: 12px; white-space: nowrap;
>     }
>     .java_plain {
>       color: rgb(0,0,0);
>     }
>     .java_keyword {
>       color: rgb(0,0,0); font-weight: bold;
>     }
>     .java_javadoc_tag {
>       color: rgb(147,147,147); background-color: rgb(247,247,247); font-style: italic; font-weight: bold;
>     }
>     h1 {
>       font-family: sans-serif; font-size: 16pt; font-weight: bold; color: rgb(0,0,0); background: rgb(210,210,210); border: solid 1px black; padding: 5px; text-align: center;
>     }
>     .java_type {
>       color: rgb(0,44,221);
>     }
>     .java_literal {
>       color: rgb(188,0,0);
>     }
>     .java_javadoc_comment {
>       color: rgb(147,147,147); background-color: rgb(247,247,247); font-style: italic;
>     }
>     .java_operator {
>       color: rgb(0,124,31);
>     }
>     .java_separator {
>       color: rgb(0,33,255);
>     }
>     .java_comment {
>       color: rgb(147,147,147); background-color: rgb(247,247,247);
>     }
> testFile/*************************************************************************
>  * Compilation: javac HelloWorld.java
>  * Execution: java HelloWorld
>  *
>  * Prints "Hello, World". By tradition, this is everyone's first program.
>  *
>  *************************************************************************/
> public class HelloWorld {
>     public static void main(String[] args) {
>         System.out.println("Hello, World");
>     }
> }
> {noformat}
> Is this what we want as the default behavior?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
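
For readers following the thread, the behavior under discussion (keeping <style/> content out of the extracted text while leaving script content alone) can be illustrated with a minimal sketch. This is not Tika's ToTextHandler or the committed fix. The class name StyleSkippingTextSketch and its extractText helper are hypothetical, it uses only the JDK's built-in SAX parser, and it assumes well-formed XHTML input.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch only: NOT Tika's ToTextHandler.
// Collects character data but drops anything inside a <style> element.
class StyleSkippingTextSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int styleDepth = 0; // greater than 0 while inside <style>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default JDK parser is not namespace-aware, so qName holds the tag name.
        if ("style".equalsIgnoreCase(qName)) {
            styleDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("style".equalsIgnoreCase(qName) && styleDepth > 0) {
            styleDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only <style> is suppressed; <script> content still passes through.
        if (styleDepth == 0) {
            text.append(ch, start, length);
        }
    }

    // Assumed helper: parses a well-formed XHTML string and returns the collected text.
    static String extractText(String xhtml) {
        StyleSkippingTextSketch handler = new StyleSkippingTextSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>testFile</title>"
                + "<style>code { color: rgb(0,0,0); }</style></head>"
                + "<body><script>var x = 1;</script>"
                + "<p>public class HelloWorld {}</p></body></html>";
        System.out.println(extractText(xhtml));
    }
}
```

In this sketch the CSS rule is dropped from the output while the title, the script body, and the Java source text are kept. That mirrors the concern raised in the comment: suppressing style content should not also suppress script content for callers who have configured the parser to extract it.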