[jira] [Commented] (TIKA-1344) Ability to generate self-contained HTML with images

2015-03-22 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14374986#comment-14374986
 ] 

Nick Burch commented on TIKA-1344:
--

This might be a good one to add as an example, based on the recursing example 
we have, but showing how to marry the content handler changes with parser 
resources.

 Ability to generate self-contained HTML with images
 ---

 Key: TIKA-1344
 URL: https://issues.apache.org/jira/browse/TIKA-1344
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Andrew Skiba
  Labels: easyfix, patch
 Attachments: word.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 In the current code, the images from Word documents are referenced by 
 embedded:xxx links in the generated HTML. This causes browsers to display 
 an x icon instead of the image.
 The proposed patch encodes the images as Data URIs if the 
 -Dtika.parsers.urlimages system property is set. 
 http://en.wikipedia.org/wiki/Data_URI_scheme
 So the default behavior is the same, but users of the library can optionally 
 generate self-contained HTML with correct images.
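
As a rough illustration of the behavior described above, the following minimal sketch uses only the JDK. It is not the attached word.patch: the class and method names are hypothetical, and it assumes that merely setting the tika.parsers.urlimages property (to any value) opts in. When the property is absent, the embedded: reference is returned unchanged.

```java
import java.util.Base64;

// Hypothetical helper, not part of Tika or of the attached patch.
class DataUriSketch {
    // Property name taken from the issue description.
    static final String OPT_IN_PROPERTY = "tika.parsers.urlimages";

    // Builds a base64 data URI for the given media type and raw image bytes.
    static String toDataUri(String mediaType, byte[] data) {
        return "data:" + mediaType + ";base64,"
                + Base64.getEncoder().encodeToString(data);
    }

    // Returns a data URI when the opt-in property is set (assumption: any
    // value counts), otherwise the default embedded: reference.
    static String imageSrc(String embeddedName, String mediaType, byte[] data) {
        if (System.getProperty(OPT_IN_PROPERTY) != null) {
            return toDataUri(mediaType, data);
        }
        return "embedded:" + embeddedName;
    }

    public static void main(String[] args) {
        byte[] placeholderBytes = {1, 2, 3}; // stand-in for real image bytes
        System.out.println(imageSrc("image1.png", "image/png", placeholderBytes));
    }
}
```

Running it with -Dtika.parsers.urlimages=true switches the printed value from the embedded: link to the data URI.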



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1344) Ability to generate self-contained HTML with images

2015-03-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372082#comment-14372082
 ] 

Tyler Palsulich commented on TIKA-1344:
---

[~gagravarr], can we close this one off? Thank you, [~skibaa]!



[jira] [Commented] (TIKA-1344) Ability to generate self-contained HTML with images

2014-06-19 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037164#comment-14037164
 ] 

Nick Burch commented on TIKA-1344:
--

I think the plan was always that people would have their content handler 
rewrite these with URLs that match where they wrote the embedded images to.

One such example is in Alfresco - 
http://svn.alfresco.com/repos/alfresco-open-mirror/alfresco/HEAD/root/projects/repository/source/java/org/alfresco/repo/rendition/executer/HTMLRenderingEngine.java

Rather than hacking it for one kind of file format, maybe it would be better to 
have a generic content handler wrapper which would capture the embedded images 
when the parser offers to recurse into them, decide if they're small enough, 
encode them, then rewrite the HTML to use a data link?
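
The rewriting half of that idea could look roughly like the sketch below. It relies only on the JDK's SAX classes, since Tika parsers emit their XHTML as SAX events. The class names, the size threshold and the map of already-captured images are assumptions made for illustration; how the images get captured during recursion is not shown here.

```java
import java.util.Base64;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical wrapper: forwards all SAX events to a downstream handler,
// replacing <img src="embedded:NAME"> with a data URI when NAME is a known,
// sufficiently small image.
class DataUriImageHandler extends XMLFilterImpl {
    // Simple holder for a captured image (assumed to be filled elsewhere).
    static class Image {
        final String mediaType;
        final byte[] data;
        Image(String mediaType, byte[] data) {
            this.mediaType = mediaType;
            this.data = data;
        }
    }

    private static final String PREFIX = "embedded:";
    private final Map<String, Image> images;
    private final int maxBytes;

    DataUriImageHandler(ContentHandler downstream, Map<String, Image> images, int maxBytes) {
        setContentHandler(downstream);
        this.images = images;
        this.maxBytes = maxBytes;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        Attributes out = atts;
        int srcIndex = atts.getIndex("src");
        boolean isImg = "img".equals(localName) || "img".equals(qName);
        if (isImg && srcIndex >= 0 && atts.getValue(srcIndex).startsWith(PREFIX)) {
            Image image = images.get(atts.getValue(srcIndex).substring(PREFIX.length()));
            // Leave unknown or oversized images as embedded: references.
            if (image != null && image.data.length <= maxBytes) {
                AttributesImpl copy = new AttributesImpl(atts);
                copy.setValue(srcIndex, "data:" + image.mediaType + ";base64,"
                        + Base64.getEncoder().encodeToString(image.data));
                out = copy;
            }
        }
        super.startElement(uri, localName, qName, out);
    }
}
```

Because the wrapper only inspects img elements and their src attribute, it is independent of the source file format, which is the point of doing this at the handler level rather than inside one parser.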


--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1344) Ability to generate self-contained HTML with images

2014-06-19 Thread Andrew Skiba (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037207#comment-14037207
 ] 

Andrew Skiba commented on TIKA-1344:


As far as I understand, the handler has no access to Picture.getContent().

I also prepared another patch for PDF, and it looks different, because the PDF 
parser does not use org.apache.poi.

Maybe I'm missing your point. In the Alfresco example I also see the private 
method handleImage, which is called from parse().



[jira] [Commented] (TIKA-1344) Ability to generate self-contained HTML with images

2014-06-19 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037212#comment-14037212
 ] 

Nick Burch commented on TIKA-1344:
--

You won't fetch the Picture directly. Instead, you'll register a recursing 
Parser, which will get called with all the embedded resources, and you'd 
generate the data URL from that. This approach should work for all parsers.
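
To make the capturing side concrete, here is a JDK-only sketch of what such a callback might do with each embedded resource it is handed. The onEmbeddedResource method is a hypothetical stand-in for whatever hook the recursing parser exposes, not a real Tika signature, and the size cap is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical collector: turns small embedded images into data URLs,
// keyed by the embedded resource name.
class EmbeddedImageCollector {
    private final int maxBytes;
    private final Map<String, String> dataUris = new LinkedHashMap<>();

    EmbeddedImageCollector(int maxBytes) {
        this.maxBytes = maxBytes;
    }

    // Stand-in for the recursion callback; returns true if the resource
    // was an image small enough to be recorded as a data URL.
    boolean onEmbeddedResource(String name, String mediaType, InputStream stream)
            throws IOException {
        if (mediaType == null || !mediaType.startsWith("image/")) {
            return false; // only images are inlined
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = stream.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
            if (buffer.size() > maxBytes) {
                return false; // too large to inline, keep the embedded: link
            }
        }
        dataUris.put(name, "data:" + mediaType + ";base64,"
                + Base64.getEncoder().encodeToString(buffer.toByteArray()));
        return true;
    }

    Map<String, String> dataUris() {
        return dataUris;
    }

    public static void main(String[] args) throws IOException {
        EmbeddedImageCollector collector = new EmbeddedImageCollector(64 * 1024);
        byte[] placeholderBytes = {1, 2, 3}; // stand-in for real image bytes
        collector.onEmbeddedResource("image1.png", "image/png",
                new ByteArrayInputStream(placeholderBytes));
        System.out.println(collector.dataUris());
    }
}
```

Since the collector sees only a name, a media type and a byte stream, nothing in it depends on POI or on any particular document format.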



[jira] [Commented] (TIKA-1344) Ability to generate self-contained HTML with images

2014-06-19 Thread Andrew Skiba (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037236#comment-14037236
 ] 

Andrew Skiba commented on TIKA-1344:


Can you point to an example of such a recursing Parser in the existing code? 
My familiarity with the Tika code base is not sufficient to fix this patch.



[jira] [Commented] (TIKA-1344) Ability to generate self-contained HTML with images

2014-06-19 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037243#comment-14037243
 ] 

Nick Burch commented on TIKA-1344:
--

There are some examples on the wiki at 
https://wiki.apache.org/tika/RecursiveMetadata. Alternatively, ask on the dev 
list or reviewboard for advice once you've got some code in place.
