Kristian Rickert created TIKA-4770:
--------------------------------------

             Summary: Add a Markdown parser with structured XHTML output
                 Key: TIKA-4770
                 URL: https://issues.apache.org/jira/browse/TIKA-4770
             Project: Tika
          Issue Type: Task
          Components: parser
            Reporter: Kristian Rickert


Markdown files are already detected as {{text/markdown}} ({{\*.md}} / 
{{\*.markdown}} globs in tika-mimetypes.xml), but no parser claims the type, so 
they fall through to {{TXTParser}} and come back as flat text.

Add a {{MarkdownParser}} to {{tika-parser-text-module}} using commonmark-java 
(already a Tika dependency behind {{ToMarkdownContentHandler}}, TIKA-4730). It
parses the Markdown AST and emits structured XHTML: {{h1..h6}},
{{ul}}/{{ol}}/{{li}}, {{blockquote}}, {{pre}}/{{code}} with a language class,
GFM tables as {{table}}/{{thead}}/{{tbody}}/{{tr}}/{{th}}/{{td}} with
alignment, {{em}}/{{strong}}/{{del}}, links, images, {{hr}}. Raw HTML in the
source is emitted as escaped text so nothing is injected. Encoding detection
goes through {{AutoDetectReader}}, consistent with {{TXTParser}}.
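
To illustrate two of the points above (raw HTML emitted as escaped text, and
{{pre}}/{{code}} with a language class), here is a minimal, dependency-free
sketch. It is not the proposed implementation: the class and method names are
hypothetical, the {{language-}} class prefix is an assumption borrowed from the
usual CommonMark rendering convention, and wrapping raw HTML in a {{p}} is an
assumption too. A real parser would more likely hand the literal to a SAX
handler's {{characters()}} call and leave escaping to serialization.

```java
// Hypothetical sketch: none of these names come from the Tika codebase.
final class MarkdownXhtmlSketch {

    // Escape the characters that are significant in XHTML text and attribute values.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Raw HTML found in the Markdown source is treated as plain text, so no
    // markup from the document reaches the output. The <p> wrapper is an assumption.
    static String rawHtmlBlock(String literal) {
        return "<p>" + escape(literal) + "</p>";
    }

    // Fenced code block: the first word of the info string, if any, becomes a
    // class on <code>. The "language-" prefix is an assumed convention.
    static String codeBlock(String info, String literal) {
        String trimmed = info == null ? "" : info.trim();
        String lang = trimmed.isEmpty() ? "" : trimmed.split("\\s+")[0];
        String attr = lang.isEmpty() ? "" : " class=\"language-" + escape(lang) + "\"";
        return "<pre><code" + attr + ">" + escape(literal) + "</code></pre>";
    }

    public static void main(String[] args) {
        System.out.println(rawHtmlBlock("<script>alert(1)</script>"));
        System.out.println(codeBlock("java", "List<String> names;"));
    }
}
```

The escaping is applied to the language name as well as to the code body, since
the info string is attacker-controlled input in the same way raw HTML is.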




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
