[ https://issues.apache.org/jira/browse/TIKA-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000680#comment-15000680 ]
Nick Burch commented on TIKA-980:
---------------------------------
Taking a look at {{TIKA-980-1.3-5.patch}}, there are some {{System.out}} calls in
the unit test which would need removing or replacing with asserts, for starters.
My only other question: is a special ContentHandler with strict rules on input
(it needs HTML mappers set on the context to work), which returns objects,
the right way to go? Or should we be trying to map these Microdata blocks into
the regular Metadata (with a suitable set of keys/prefixes)? Can someone who
knows the Microdata world well comment on why it's been done as it has, and not
via Metadata properties?
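
To make the Metadata alternative concrete, here is a rough sketch of what flattening one item scope into prefixed, multi-valued keys could look like. Everything in it is an assumption for illustration: the class name, the {{microdata}} prefix, the {{prefix:index:property}} key scheme and the sample values are hypothetical, and a plain Map stands in for Tika's Metadata so the snippet runs on its own. It is not taken from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MicrodataFlattenSketch {

    // Hypothetical key prefix; not an existing Tika constant.
    static final String PREFIX = "microdata";

    /**
     * Flattens a single item scope into prefixed, multi-valued keys.
     * The index distinguishes several item scopes found on one page.
     */
    static Map<String, List<String>> flatten(int index, String itemType,
                                             Map<String, List<String>> properties) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        String base = PREFIX + ":" + index + ":";

        // Record the item type under its own key.
        List<String> type = new ArrayList<>();
        type.add(itemType);
        out.put(base + "itemtype", type);

        // Copy each property, keeping every value it carries.
        for (Map.Entry<String, List<String>> e : properties.entrySet()) {
            out.put(base + e.getKey(), new ArrayList<>(e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative values only; not the data from the patch's unit test.
        Map<String, List<String>> props = new LinkedHashMap<>();
        props.put("name", Arrays.asList("ApacheCon Europe"));
        props.put("url", Arrays.asList("http://example.org/eu"));

        Map<String, List<String>> flat =
                flatten(0, "http://schema.org/Event", props);
        for (Map.Entry<String, List<String>> e : flat.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Note that a flat scheme like this has no obvious place for nested item scopes; it would need some path convention in the key, which may be part of the answer to the question above.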
> MicrodataContentHandler for Apache Tika
> ---------------------------------------
>
> Key: TIKA-980
> URL: https://issues.apache.org/jira/browse/TIKA-980
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Markus Jelsma
> Assignee: Ken Krugler
> Fix For: 1.12
>
> Attachments: TIKA-980-1.3-1.patch, TIKA-980-1.3-2.patch,
> TIKA-980-1.3-3.patch, TIKA-980-1.3-4.patch, TIKA-980-1.3-5.patch
>
>
> ContentHandler for Apache Tika capable of building a data structure
> containing Microdata item scopes and item properties. The Item* classes are
> borrowed from the Apache Any23 project and are slightly modified to
> accommodate this SAX-based extractor vs the original DOM-based extractor.
> The provided unit test outputs two item scopes about the Europe and NA
> ApacheCon events and each has a nested property.