[ https://issues.apache.org/jira/browse/TIKA-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071942#comment-14071942 ]
Tyler Palsulich edited comment on TIKA-1373 at 7/23/14 4:52 PM:
----------------------------------------------------------------

The only SAX event in SourceCodeParser is {{xhtml.element("p", codeAsHtml);}}. codeAsHtml is formatted by jhighlight, a syntax highlighter. So, it _looks_ like --text isn't returning the text, but it's just that the text content is HTML.

I'm not sure how we can turn the jhighlight HTML tags into SAX events. Tika HtmlParser? Something like

{code}
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
Renderer renderer = getRenderer(type.toString());
String content = out.toString();
String codeAsHtml = renderer.highlight(name, content, charset.name(), false);
// Parse the highlighted HTML (not the raw source) so its tags become SAX events.
HtmlParser htmlParser = new HtmlParser();
htmlParser.parse(new ByteArrayInputStream(codeAsHtml.getBytes()), xhtml, metadata, context);
{code}

was (Author: tpalsulich):
The only SAX event in SourceCodeParser is {{xhtml.element("p", codeAsHtml);}}. codeAsHtml is formatted by jhighlight, a syntax highlighter. So, it _looks_ like the --text is returning the text, but it's just that the text content is html.

I'm not sure how we can turn the jhighlight html tags into SAX events. Tika HtmlParser? Something like

{code}
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
Renderer renderer = getRenderer(type.toString());
String content = out.toString();
String codeAsHtml = renderer.highlight(name, content, charset.name(), false);
HtmlParser htmlParser = new HtmlParser();
htmlParser.parse(new ByteArrayInputStream(content.getBytes()), xhtml, metadata, context);
{code}

> AutoDetectParser extracts no text when SourceCodeParser is selected
> -------------------------------------------------------------------
>
>                 Key: TIKA-1373
>                 URL: https://issues.apache.org/jira/browse/TIKA-1373
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.5
>            Reporter: Andrés Aguilar-Umaña
>
> When using the AutoDetectParser in Java code, and the SourceCodeParser is
> selected (i.e. for Java files), the handler gets no text.
> I have this test program:
> {code}
> String data = "public class HelloWorld {}";
> ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
> Parser autoDetectParser = new AutoDetectParser();
> BodyContentHandler bch = new BodyContentHandler(50);
> ParseContext parseContext = new ParseContext();
> Metadata metadata = new Metadata();
> metadata.set(Metadata.CONTENT_TYPE, "text/x-java-source");
> try {
>     autoDetectParser.parse(bais, bch, metadata, parseContext);
> } catch (Exception e) {
>     e.printStackTrace();
> }
> System.out.println("Text extracted: " + bch.toString());
> {code}
> It returns (using the SourceCodeParser):
> {code}
> Text extracted:
> {code}
> But when I use this code:
> {code}
> String data = "public class HelloWorld {}";
> ByteArrayInputStream bais = new ByteArrayInputStream(data.getBytes());
> Parser autoDetectParser = new AutoDetectParser();
> BodyContentHandler bch = new BodyContentHandler(50);
> ParseContext parseContext = new ParseContext();
> Metadata metadata = new Metadata();
> metadata.set(Metadata.CONTENT_TYPE, "text/plain");
> try {
>     autoDetectParser.parse(bais, bch, metadata, parseContext);
> } catch (Exception e) {
>     e.printStackTrace();
> }
> System.out.println("Text extracted: " + bch.toString());
> {code}
> The Text Parser is used and I get:
> {code}
> Text extracted: public class HelloWorld {}
> {code}
> I have also tested this command:
> {code}
> > java -jar tika-app-1.5.jar -t D:\text.java
> (no text)
> {code}

--
This message was sent by Atlassian JIRA
(v6.2#6252)
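
To illustrate the point made in the comment above (markup handed to a handler as one block of characters stays markup, while parsed markup turns its tags into SAX events), here is a minimal sketch using only the JDK's SAX classes. It does not use Tika or jhighlight: the class name MarkupAsSaxEvents, the TextCollector handler, and the sample markup string are all made up for illustration, and the sample is assumed to be well-formed XML, which real highlighter output may not be.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Tika or jhighlight.
public class MarkupAsSaxEvents {

    /** Keeps only character data, roughly what a text-only handler would keep. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Case 1: the whole markup string is emitted as the text of a single <p> element. */
    public static String asSingleTextNode(String markup) throws Exception {
        TextCollector handler = new TextCollector();
        char[] chars = markup.toCharArray();
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        handler.characters(chars, 0, chars.length); // tags travel as plain characters
        handler.endElement("", "p", "p");
        handler.endDocument();
        return handler.text.toString();
    }

    /** Case 2: the markup is parsed first, so tags become element events. */
    public static String asParsedEvents(String markup) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(markup)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample; real highlighter output will look different.
        String markup = "<p><span class=\"java_keyword\">public</span> class HelloWorld {}</p>";
        System.out.println(asSingleTextNode(markup)); // the markup itself, tags included
        System.out.println(asParsedEvents(markup));   // public class HelloWorld {}
    }
}
```

In the first case the collected text still contains the span tags; in the second only the source text survives. A strict XML parser is used here purely to keep the sketch dependency-free; for markup that is not well-formed, an HTML parser would be needed instead.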