[jira] [Commented] (TIKA-1344) Ability to generate self-contained HTML with images
[ https://issues.apache.org/jira/browse/TIKA-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14374986#comment-14374986 ]

Nick Burch commented on TIKA-1344:
----------------------------------

This might be a good one to add as an example, based on the recursing example we have, but showing how to marry the content handler changes with parser resources.

Ability to generate self-contained HTML with images
---------------------------------------------------

Key: TIKA-1344
URL: https://issues.apache.org/jira/browse/TIKA-1344
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Andrew Skiba
Labels: easyfix, patch
Attachments: word.patch
Original Estimate: 1h
Remaining Estimate: 1h

In the current code, the images from Word documents are referenced by embedded:xxx links in the generated HTML. This causes browsers to display an x icon instead of the image.

The proposed patch encodes the images using Data URIs if the -Dtika.parsers.urlimages system property is set.
http://en.wikipedia.org/wiki/Data_URI_scheme

So the default behavior is the same, but users of the library can optionally generate self-contained HTML with correct images.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
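To illustrate the Data URI encoding that the issue description refers to, here is a minimal Java sketch using only the JDK. The class name `DataUris` and the method `toDataUri` are hypothetical; they are not taken from word.patch or from Tika, and the sketch only shows the general `data:<mime type>;base64,<payload>` form.

```java
import java.util.Base64;

/** Hypothetical helper for illustration only; not part of Tika or word.patch. */
final class DataUris {
    private DataUris() {}

    /** Builds a base64 data URI of the form data:<mimeType>;base64,<payload>. */
    static String toDataUri(String mimeType, byte[] content) {
        String encoded = Base64.getEncoder().encodeToString(content);
        return "data:" + mimeType + ";base64," + encoded;
    }

    public static void main(String[] args) {
        byte[] placeholder = {1, 2, 3}; // stand-in bytes, not a real image
        // An img tag carrying its own content instead of an embedded:xxx link.
        System.out.println("<img src=\"" + toDataUri("image/png", placeholder) + "\"/>");
    }
}
```

Because the payload travels inside the attribute value, the HTML grows by roughly a third of each image's size, which is one reason to keep the behavior optional.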
[jira] [Commented] (TIKA-1344) Ability to generate self-contained HTML with images
[ https://issues.apache.org/jira/browse/TIKA-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372082#comment-14372082 ]

Tyler Palsulich commented on TIKA-1344:
---------------------------------------

[~gagravarr], can we close this one off? Thank you, [~skibaa]!
[jira] [Commented] (TIKA-1344) Ability to generate self-contained HTML with images
[ https://issues.apache.org/jira/browse/TIKA-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037164#comment-14037164 ]

Nick Burch commented on TIKA-1344:
----------------------------------

I think the plan was always that people would have their content handler rewrite these with URLs that match where they wrote the embedded images to.

One such example is in Alfresco:
http://svn.alfresco.com/repos/alfresco-open-mirror/alfresco/HEAD/root/projects/repository/source/java/org/alfresco/repo/rendition/executer/HTMLRenderingEngine.java

Rather than hacking it for one kind of file format, maybe it would be better to have a generic content handler wrapper which would capture the embedded images when the parser offers to recurse into them, decide if they're small enough, encode them, then rewrite the HTML to be a data link?
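The rewriting half of such a generic wrapper might look roughly like the sketch below. It uses only the JDK's SAX classes (`XMLFilterImpl`, `AttributesImpl`) and no Tika API. The class name `EmbeddedImageRewriter` and the name-to-data-URI map are hypothetical, and the sketch assumes that the generated XHTML references images as `img` elements whose `src` starts with `embedded:`.

```java
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

/**
 * Hypothetical content handler wrapper (not a Tika class): forwards all SAX
 * events to a delegate, replacing embedded:xxx image sources with data URIs.
 */
final class EmbeddedImageRewriter extends XMLFilterImpl {
    private static final String PREFIX = "embedded:";
    private final Map<String, String> dataUrisByName;

    EmbeddedImageRewriter(ContentHandler delegate, Map<String, String> dataUrisByName) {
        setContentHandler(delegate);
        this.dataUrisByName = dataUrisByName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        int srcIndex = atts.getIndex("src");
        if ("img".equals(localName) && srcIndex >= 0) {
            String src = atts.getValue(srcIndex);
            if (src.startsWith(PREFIX)) {
                // Images that were never captured (for example, too large) keep their link.
                String dataUri = dataUrisByName.get(src.substring(PREFIX.length()));
                if (dataUri != null) {
                    AttributesImpl rewritten = new AttributesImpl(atts);
                    rewritten.setValue(srcIndex, dataUri);
                    atts = rewritten;
                }
            }
        }
        super.startElement(uri, localName, qName, atts);
    }
}
```

One caveat of this shape is ordering: the map must already hold an image's data URI when its `img` element is emitted, so a real implementation would need to confirm when the parser hands over each embedded resource relative to the surrounding markup.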
[jira] [Commented] (TIKA-1344) Ability to generate self-contained HTML with images
[ https://issues.apache.org/jira/browse/TIKA-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037207#comment-14037207 ]

Andrew Skiba commented on TIKA-1344:
------------------------------------

As far as I understand, the handler has no access to Picture.getContent().

I also prepared another patch for PDF, and it looks different, because the PDF parser does not use org.apache.poi.

Maybe I am missing your point. In the Alfresco example I also see the private method handleImage, which is called from parse().
[jira] [Commented] (TIKA-1344) Ability to generate self-contained HTML with images
[ https://issues.apache.org/jira/browse/TIKA-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037212#comment-14037212 ]

Nick Burch commented on TIKA-1344:
----------------------------------

You won't fetch the Picture directly. Instead, you'll register a recursing Parser, which'll get called with all the embedded resources, and you'd generate the data URL from that. This approach should work for all parsers.
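The capturing side described here, where something is called once per embedded resource and produces a data URL from it, can be sketched without any Tika types. In the sketch below, `EmbeddedImageCollector` and its `offer` method are hypothetical stand-ins for whatever callback the registered recursing Parser would receive; the size limit reflects the earlier suggestion to inline only images that are small enough. How the resource name and MIME type would be obtained from Tika is not shown and is left as an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical collector (not a Tika class) fed once per embedded resource. */
final class EmbeddedImageCollector {
    private final int maxBytes;
    private final Map<String, String> dataUrisByName = new LinkedHashMap<>();

    EmbeddedImageCollector(int maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Stores a data URI for small images; returns false if the resource was skipped. */
    boolean offer(String name, String mimeType, InputStream stream) throws IOException {
        if (mimeType == null || !mimeType.startsWith("image/")) {
            return false; // only images are inlined
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = stream.read(chunk)) != -1) {
            if (buffer.size() + read > maxBytes) {
                return false; // too large to inline; leave the embedded: link alone
            }
            buffer.write(chunk, 0, read);
        }
        String payload = Base64.getEncoder().encodeToString(buffer.toByteArray());
        dataUrisByName.put(name, "data:" + mimeType + ";base64," + payload);
        return true;
    }

    /** Data URIs collected so far, keyed by embedded resource name. */
    Map<String, String> dataUrisByName() {
        return dataUrisByName;
    }
}
```

Since the collector only sees a stream, a name, and a MIME type, nothing in it depends on the source format, which matches the point that the approach should work for all parsers.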
[jira] [Commented] (TIKA-1344) Ability to generate self-contained HTML with images
[ https://issues.apache.org/jira/browse/TIKA-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037236#comment-14037236 ]

Andrew Skiba commented on TIKA-1344:
------------------------------------

Can you give an example in existing code of such a recursing Parser? My familiarity with the Tika code base is not sufficient for fixing this patch.
[jira] [Commented] (TIKA-1344) Ability to generate self-contained HTML with images
[ https://issues.apache.org/jira/browse/TIKA-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037243#comment-14037243 ]

Nick Burch commented on TIKA-1344:
----------------------------------

There are some examples on the wiki at https://wiki.apache.org/tika/RecursiveMetadata. Alternatively, ask on the dev list or reviewboard for advice once you've got some code in place.