[ https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002216#comment-15002216 ]
Markus Jelsma commented on TIKA-980:
------------------------------------
Hello Nick, the identity mapper is required because without it, tags such as
time, meta and many others are not passed to the content handler, so no
properties can be extracted from them.
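To illustrate the point, here is a minimal sketch of the difference between a
whitelisting mapper and an identity mapper. The class, the SAFE set and the
method names are made up for this example and are not Tika's own mapper
classes; the real whitelist is larger and may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Simplified stand-in for an HTML tag mapper (hypothetical, not Tika's classes).
class MapperSketch {
    // Assumed whitelist of "safe" elements; anything else is dropped.
    static final Set<String> SAFE = Set.of("p", "a", "div", "h1", "title");

    // Whitelisting mapper: returns null for tags that should not be forwarded.
    static final UnaryOperator<String> WHITELIST = name -> {
        String lower = name.toLowerCase(Locale.ROOT);
        return SAFE.contains(lower) ? lower : null;
    };

    // Identity mapper: every tag is forwarded under its lower-cased name.
    static final UnaryOperator<String> IDENTITY =
        name -> name.toLowerCase(Locale.ROOT);

    // Returns the tags that would reach the content handler.
    static List<String> forwarded(UnaryOperator<String> mapper, List<String> tags) {
        List<String> out = new ArrayList<>();
        for (String tag : tags) {
            String mapped = mapper.apply(tag);
            if (mapped != null) {
                out.add(mapped);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tags = List.of("div", "time", "meta", "p");
        System.out.println(forwarded(WHITELIST, tags)); // [div, p]
        System.out.println(forwarded(IDENTITY, tags));  // [div, time, meta, p]
    }
}
```

With the whitelisting variant, time and meta never reach the handler, so any
itemprop attributes on them are lost before extraction starts.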
Regarding mapping microdata properties to regular metadata, keep in mind that
microdata is nested: the same property name can occur many times in different
nested blocks (see the unit test).
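A rough sketch of why flat mapping is lossy, assuming a hypothetical Item
class (not the Any23-derived Item* classes from the patch) and invented sample
values: an event and its nested location both carry a "name" property, and a
naive flatten into single-valued metadata keeps only one of them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NestedItemsSketch {
    // Hypothetical item scope: property values are Strings or nested Items.
    static final class Item {
        final String type;
        final Map<String, List<Object>> props = new LinkedHashMap<>();

        Item(String type) {
            this.type = type;
        }

        Item add(String name, Object value) {
            props.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
            return this;
        }
    }

    // Naive flattening: later values silently overwrite earlier ones.
    static void flatten(Item item, Map<String, String> out) {
        for (Map.Entry<String, List<Object>> e : item.props.entrySet()) {
            for (Object value : e.getValue()) {
                if (value instanceof Item) {
                    flatten((Item) value, out);
                } else {
                    out.put(e.getKey(), (String) value);
                }
            }
        }
    }

    // Invented sample data: an event whose location has its own "name".
    static Item sample() {
        Item event = new Item("Event").add("name", "ApacheCon Europe");
        event.add("location", new Item("Place").add("name", "Example City"));
        return event;
    }

    public static void main(String[] args) {
        Map<String, String> flat = new LinkedHashMap<>();
        flatten(sample(), flat);
        System.out.println(flat); // {name=Example City}
    }
}
```

The event's own name is gone after flattening, which is why any mapping to
regular metadata would need some scheme for qualifying property names by their
enclosing scope.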
There is also the problem of TIKA-1782: if the top itemscope on the body tag is
moved to the html tag, extraction should still work, but it doesn't appear to.
> MicrodataContentHandler for Apache Tika
> ---------------------------------------
>
> Key: TIKA-980
> URL: https://issues.apache.org/jira/browse/TIKA-980
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Markus Jelsma
> Assignee: Ken Krugler
> Fix For: 1.12
>
> Attachments: TIKA-980-1.3-1.patch, TIKA-980-1.3-2.patch,
> TIKA-980-1.3-3.patch, TIKA-980-1.3-4.patch, TIKA-980-1.3-5.patch
>
>
> ContentHandler for Apache Tika capable of building a data structure
> containing Microdata item scopes and item properties. The Item* classes are
> borrowed from the Apache Any23 project and are slightly modified to
> accommodate this SAX-based extractor vs the original DOM-based extractor.
> The provided unit test outputs two item scopes about the Europe and NA
> ApacheCon events and each has a nested property.
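
For readers unfamiliar with the SAX approach described above, the following is
a minimal sketch of building nested item scopes from SAX events. It uses only
the JDK's SAX parser, so it assumes well-formed XHTML input (itemscope="" rather
than a bare attribute). The class and field names are hypothetical, the sample
markup is invented, and this is not the handler from the attached patches.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class MicrodataSaxSketch extends DefaultHandler {
    // Hypothetical item scope; values are Strings or nested Scopes.
    static final class Scope {
        final String type;
        final int depth; // element depth at which the scope was opened
        final Map<String, List<Object>> props = new LinkedHashMap<>();

        Scope(String type, int depth) {
            this.type = type;
            this.depth = depth;
        }

        void add(String name, Object value) {
            props.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
    }

    // Invented sample markup: one event with a nested place.
    static final String SAMPLE =
        "<html><body itemscope=\"\" itemtype=\"http://schema.org/Event\">"
        + "<span itemprop=\"name\">ApacheCon Europe</span>"
        + "<div itemprop=\"location\" itemscope=\"\" itemtype=\"http://schema.org/Place\">"
        + "<span itemprop=\"name\">Example City</span></div>"
        + "</body></html>";

    final List<Scope> roots = new ArrayList<>();
    private final Deque<Scope> open = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    private int depth = 0;
    private String textProp; // property whose text content is being collected
    private int textDepth;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        depth++;
        String prop = atts.getValue("itemprop");
        if (atts.getValue("itemscope") != null) {
            // New scope: attach to the enclosing scope if it is a property of it.
            Scope scope = new Scope(atts.getValue("itemtype"), depth);
            if (prop != null && !open.isEmpty()) {
                open.peek().add(prop, scope);
            } else {
                roots.add(scope);
            }
            open.push(scope);
        } else if (prop != null && !open.isEmpty()) {
            String content = atts.getValue("content");
            if (content != null) {
                open.peek().add(prop, content); // e.g. meta elements
            } else {
                textProp = prop;
                textDepth = depth;
                text.setLength(0);
            }
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (textProp != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (textProp != null && depth == textDepth) {
            open.peek().add(textProp, text.toString().trim());
            textProp = null;
        }
        if (!open.isEmpty() && open.peek().depth == depth) {
            open.pop(); // the element that opened this scope has closed
        }
        depth--;
    }

    // Parses well-formed XHTML and returns the top-level scopes.
    static List<Scope> parse(String xhtml) {
        try {
            MicrodataSaxSketch handler = new MicrodataSaxSketch();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.roots;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```

Calling MicrodataSaxSketch.parse(MicrodataSaxSketch.SAMPLE) yields one root
scope whose "location" property holds the nested scope. Because the stack is
keyed on element depth rather than on tag names, this sketch does not care
which element carries the top-level itemscope.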