[ https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485749#comment-13485749 ]
Ray Gauss II commented on TIKA-775:
-----------------------------------

Hi Jörg,

Note that the embed.diff file attached to the issue is more current and replaces the previous patch.txt files. I've also changed just a few things since posting embed.diff, primarily around error handling. I'll post another diff soon with Javadoc additions mentioned below.

1) I'm not sure exactly what you mean here. The Parser interface only guarantees a parse method and supported types. It says nothing about requiring the entire content to be extracted by the implementation. The parser interface also makes no specification about how the given input stream must be read or processed, so each implementation can do that however it sees fit. Similarly the Embedder.embed method says nothing about requiring or preventing content from being updated, so if a particular embedder implementation wants to update the content itself I suppose there's no reason it couldn't.

2) This is intentionally somewhat vague (but perhaps too much so) as each embedder may implement this slightly differently, though we should have a suggested approach, and in general I think that approach should favor preserving the source file's metadata unless explicitly specified. I will add some of this to the Javadoc but for your specific questions I think the answers would be:

- Q: Does it always update all metadata in the file, i.e. does it delete properties that are not in the Metadata object?
- A: Embedder implementations should only attempt to update metadata fields present in the given Metadata object

- Q: How are empty properties set?
- A: Embedder implementations should set properties as empty when the corresponding field in the Metadata object is an empty string, i.e. ""

- Q: How do I delete properties?
- A: Embedder implementations should nullify or delete properties corresponding to fields with a null value in the given Metadata object.

- Q: Where does the embedding take place?
- A: That's up to the embedder implementation and particular file format.

- Q: Does the embed method update properties in all metadata containers?
- A: Embedder implementations should set the property corresponding to a particular field in the given Metadata object in all metadata containers whenever possible and appropriate for the file format at the time. If a particular metadata container falls out of use and/or is superseded by another (such as IIM vs XMP for IPTC) it is up to the implementation to decide if and when to cease embedding in the alternate container.

- Q: What happens for properties where the file format specific fields have a fixed length or different encodings?
- A: Embedder implementations should attempt to embed as much of the metadata as accurately as possible. An implementation may choose a strict approach and throw an exception if a value to be embedded exceeds the length allowed or may choose to truncate the value.

For that last one we could consider adding a second embed method to Embedder which also accepts a boolean isStrict parameter, which would allow a single implementation to operate in a mode where it throws exceptions on bad data vs. doing something like truncating. Implementations could always implement that themselves so I'm not sure we need it in the interface.

3 and 5) The client is in control of the output stream as the client is responsible for creating it and passing it to the embed method. The Embedder needs the given input stream to read the source data and writes the final data with metadata embedded to the given output stream. As such, consumers of the embed method dictate what that output stream is, which will probably be a temp file in most cases, and the client can refrain from writing to the actual source file if it receives an exception. See the ExternalEmbedderTest for an example of creating a temp file output stream for the embedder to write to.
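
To make the temp file approach a bit more concrete, here is a rough, self-contained sketch of how a client might protect the source file. It does not use the actual classes from the patch: SimpleEmbedder, embedInPlace and the two toy embedders are hypothetical names made up for illustration, and a plain Map stands in for the Metadata object.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;

class SafeEmbedDemo {

    /** Hypothetical stand-in for the proposed Embedder; not the signature from the patch. */
    interface SimpleEmbedder {
        void embed(Map<String, String> metadata, InputStream source, OutputStream target)
                throws IOException;
    }

    /** Embeds into a temp file and replaces the source only if embedding succeeded. */
    public static void embedInPlace(SimpleEmbedder embedder, Map<String, String> metadata,
                                    Path source) throws IOException {
        Path temp = Files.createTempFile("embed-", ".tmp");
        try {
            try (InputStream in = Files.newInputStream(source);
                 OutputStream out = Files.newOutputStream(temp)) {
                embedder.embed(metadata, in, out);
            }
            // Reached only when embed() returned normally and both streams closed cleanly.
            Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // Cleans up after a failure; does nothing after a successful move.
            Files.deleteIfExists(temp);
        }
    }

    /** Toy embedder: writes a "# description: ..." header line, then copies the source. */
    static final SimpleEmbedder HEADER_EMBEDDER = (metadata, source, target) -> {
        String header = "# description: " + metadata.get("description") + "\n";
        target.write(header.getBytes(StandardCharsets.UTF_8));
        source.transferTo(target);
    };

    /** Toy embedder that fails after writing partial output. */
    static final SimpleEmbedder FAILING_EMBEDDER = (metadata, source, target) -> {
        target.write("partial".getBytes(StandardCharsets.UTF_8));
        throw new IOException("simulated embed failure");
    };

    /** Runs one successful and one failing embed; returns the final source content. */
    public static String runDemo() {
        try {
            Path source = Files.createTempFile("source-", ".txt");
            try {
                Files.writeString(source, "body\n");
                Map<String, String> metadata = Map.of("description", "demo");

                embedInPlace(HEADER_EMBEDDER, metadata, source);
                String afterSuccess = Files.readString(source);

                try {
                    embedInPlace(FAILING_EMBEDDER, metadata, source);
                } catch (IOException expected) {
                    // The failed attempt only ever wrote to the temp file.
                }
                String afterFailure = Files.readString(source);
                return afterSuccess.equals(afterFailure) ? afterSuccess : "source was modified";
            } finally {
                Files.deleteIfExists(source);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(runDemo());
    }
}
```

The point of the sketch is just that the embedder never sees the real destination: the source is replaced only after embed returns without an exception, so a failure partway through leaves the original untouched.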

4) Yes, parser implementations could choose to implement the Embedder interface as well. That was the reason for naming getSupportedEmbedTypes differently than Parser's existing getSupportedTypes method.

If the above doesn't answer your concerns I'm more than happy to flesh things out further.

Regards,

Ray

> Embed Capabilities
> ------------------
>
> Key: TIKA-775
> URL: https://issues.apache.org/jira/browse/TIKA-775
> Project: Tika
> Issue Type: Improvement
> Components: general, metadata
> Affects Versions: 1.0
> Environment: The default ExternalEmbedder requires that sed be
> installed.
> Reporter: Ray Gauss II
> Labels: embed, patch
> Fix For: 1.3
>
> Attachments: embed.diff, tika-core-embed-patch.txt,
> tika-parsers-embed-patch.txt
>
>
> This patch defines and implements the concept of embedding tika metadata into
> a file stream, the reverse of extraction.
> In the tika-core project an interface defining an Embedder and a generic sed
> ExternalEmbedder implementation meant to be extended or configured are added.
> These classes are essentially a reverse flow of the existing Parser and
> ExternalParser classes.
> In the tika-parsers project an ExternalEmbedderTest unit test is added which
> uses the default ExternalEmbedder (calls sed) to embed a value placed in
> Metadata.DESCRIPTION then verify the operation by parsing the resulting
> stream.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira