[
https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178399#comment-16178399
]
ASF GitHub Bot commented on TIKA-2400:
--------------------------------------
thammegowda commented on a change in pull request #208: Fix for TIKA-2400
Standardizing current Object Recognition REST parsers
URL: https://github.com/apache/tika/pull/208#discussion_r140670009
##########
File path:
tika-parsers/src/main/java/org/apache/tika/parser/recognition/ObjectRecognitionParser.java
##########
@@ -140,29 +133,17 @@ public synchronized void parse(InputStream stream,
ContentHandler handler, Metad
         for (RecognisedObject object : objects) {
             if (object instanceof CaptionObject) {
                 if (xhtmlStartVal == null) xhtmlStartVal = "captions";
-                LOG.debug("Add {}", object);
-                String mdValue = String.format(Locale.ENGLISH, "%s (%.5f)",
-                        object.getLabel(), object.getConfidence());
-                metadata.add(MD_KEY_IMG_CAP, mdValue);
-                acceptedObjects.add(object);
+                String mdVal = String.format(Locale.ENGLISH, "%s (%.5f)", object.getLabel(), object.getConfidence());
Review comment:
> would be great if we can store object.getLabel() and
> object.getConfidence() into separate metadata fields.
IMHO, that complicates the metadata key-values. If we split them, we get two
parallel arrays, one of labels and one of confidences, and users then have to
match each label to its confidence by array index. A proper solution depends on
something that is still an open issue in Tika, i.e., support for complex data
structures such as JSON in metadata. Until then, the full info is captured in
the XHTML content, so it should be fine.
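To make the index-matching concern concrete, here is a minimal sketch. It does not use Tika's Metadata class: a plain `Map<String, List<String>>` stands in for a multi-valued metadata store, and the class name, key names, and sample captions are all hypothetical. Only the `"%s (%.5f)"` format string comes from the diff above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class CaptionMetadataSketch {

    // Hypothetical stand-in for a multi-valued metadata store: key -> list of string values.
    public static void add(Map<String, List<String>> md, String key, String value) {
        md.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Combined form, same format string as the diff: each value is self-describing.
    public static String combined(String label, double confidence) {
        return String.format(Locale.ENGLISH, "%s (%.5f)", label, confidence);
    }

    // Split form: the consumer has to re-pair the two lists by position.
    // Assumes both keys exist and both lists have the same length and order.
    public static List<String> rejoin(Map<String, List<String>> md,
                                      String labelKey, String confidenceKey) {
        List<String> labels = md.get(labelKey);
        List<String> confidences = md.get(confidenceKey);
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < labels.size(); i++) {
            pairs.add(labels.get(i) + " (" + confidences.get(i) + ")");
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Made-up recognition results, for illustration only.
        String[] labels = {"a dog on a beach", "a dog running"};
        double[] confidences = {0.91234, 0.45678};

        Map<String, List<String>> md = new LinkedHashMap<>();
        for (int i = 0; i < labels.length; i++) {
            // One key, one value per caption.
            add(md, "CAPTION", combined(labels[i], confidences[i]));
            // Two keys, two parallel lists.
            add(md, "CAPTION_LABEL", labels[i]);
            add(md, "CAPTION_CONFIDENCE",
                    String.format(Locale.ENGLISH, "%.5f", confidences[i]));
        }

        System.out.println(md.get("CAPTION"));
        System.out.println(rejoin(md, "CAPTION_LABEL", "CAPTION_CONFIDENCE"));
    }
}
```

With the combined key, a consumer reads each value as is. With the split keys, every consumer needs its own `rejoin`-style step and has to trust that both lists keep the same order and length.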
----------------------------------------------------------------
> Standardizing current Object Recognition REST parsers
> -----------------------------------------------------
>
> Key: TIKA-2400
> URL: https://issues.apache.org/jira/browse/TIKA-2400
> Project: Tika
> Issue Type: Sub-task
> Components: parser
> Reporter: Thejan Wijesinghe
> Priority: Minor
> Fix For: 1.17
>
>
> # This involves adding apiBaseUris and refactoring the current Object Recognition
> REST parsers.
> # Refactoring the Dockerfiles related to those parsers.
> # Moving the logic related to checking minimum confidence into the servers.