[
https://issues.apache.org/jira/browse/TIKA-4770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18093316#comment-18093316
]
Hudson commented on TIKA-4770:
------------------------------
ABORTED: Integrated in Jenkins build Tika » tika-main-jdk17 #1453 (See
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk17/1453/])
TIKA-4770: Add a Markdown parser with structured, lossless XHTML output (#2922)
(github:
[https://github.com/apache/tika/commit/aca20dc144a398420bc2193ad97303c81fe2fff1])
* (add)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-datauri-commons/src/test/java/org/apache/tika/parser/datauri/DataURISchemeParserTest.java
* (add)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/java/org/apache/tika/parser/markdown/MarkdownParserTest.java
* (edit) tika-bom/pom.xml
* (add)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-datauri-commons/pom.xml
* (add)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-datauri-commons/src/main/java/org/apache/tika/parser/datauri/DataURIScheme.java
* (delete)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/test/java/org/apache/tika/parser/html/DataURISchemeParserTest.java
* (edit)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/pom.xml
* (delete)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/main/java/org/apache/tika/parser/html/DataURISchemeParseException.java
* (add)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-datauri-commons/src/main/java/org/apache/tika/parser/datauri/DataURISchemeParseException.java
* (edit)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
* (add)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-datauri-commons/src/main/java/org/apache/tika/parser/datauri/DataURISchemeUtil.java
* (delete)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/main/java/org/apache/tika/parser/html/DataURISchemeUtil.java
* (edit)
tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/config/TikaEncodingDetectorTest.java
* (delete)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/main/java/org/apache/tika/parser/html/DataURIScheme.java
* (add)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/main/java/org/apache/tika/parser/markdown/MarkdownParser.java
* (edit)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/pom.xml
* (edit)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/pom.xml
* (add)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/resources/test-documents/testMARKDOWN.md
* (add)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-text-module/src/test/resources/test-documents/testMARKDOWN_dataURI.md
> Add a Markdown parser with structured XHTML output
> --------------------------------------------------
>
> Key: TIKA-4770
> URL: https://issues.apache.org/jira/browse/TIKA-4770
> Project: Tika
> Issue Type: Task
> Components: parser
> Reporter: Kristian Rickert
> Priority: Major
>
> Markdown files are already detected as {{text/markdown}} ({{\*.md}} /
> {{\*.markdown}} globs in tika-mimetypes.xml), but no parser claims the type,
> so they fall through to {{TXTParser}} and come back as flat text.
> Add a {{MarkdownParser}} to {{tika-parser-text-module}} using commonmark-java
> (already a Tika dependency behind {{ToMarkdownContentHandler}}, TIKA-4730).
> It parses the Markdown source into an AST and emits structured XHTML: {{h1..h6}},
> {{ul}}/{{ol}}/{{li}}, {{blockquote}}, {{pre}}/{{code}} with a language class,
> GFM tables as {{table}}/{{thead}}/{{tbody}}/{{tr}}/{{th}}/{{td}} with
> alignment, {{em}}/{{strong}}/{{del}}, links, images, {{hr}}. Raw HTML in the
> source is emitted as escaped text so nothing is injected. Encoding detection
> uses {{AutoDetectReader}}, consistent with {{TXTParser}}.
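
To illustrate the "raw HTML is emitted as escaped text" point in the description, here is a minimal, standalone sketch. It is not the code from the commit: the class name RawHtmlEscapeSketch and the method escapeText are hypothetical, and the real parser may well rely on its XHTML content handler to escape text nodes rather than doing this by hand. The sketch only shows the intended effect, namely that markup characters from the Markdown source end up as character data in the output instead of as live tags.

```java
// Hypothetical illustration only; not taken from the TIKA-4770 commit.
final class RawHtmlEscapeSketch {

    // Replace the characters that would otherwise be read as markup,
    // so the raw HTML survives as visible text in the XHTML output.
    static String escapeText(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':
                    sb.append("&amp;");
                    break;
                case '<':
                    sb.append("&lt;");
                    break;
                case '>':
                    sb.append("&gt;");
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // An inline HTML fragment as it might appear in a .md file.
        String rawHtml = "<b>hi</b> & bye";
        // prints: &lt;b&gt;hi&lt;/b&gt; &amp; bye
        System.out.println(escapeText(rawHtml));
    }
}
```

The ampersand is handled in the same pass as the angle brackets, so an entity that is already present in the source (for example "&amp;lt;") is escaped once more and shows up literally in the output, which is consistent with treating raw HTML purely as text.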
--
This message was sent by Atlassian Jira
(v8.20.10#820010)