[ 
https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485749#comment-13485749
 ] 

Ray Gauss II commented on TIKA-775:
-----------------------------------


Hi Jörg,

Note that the embed.diff file attached to the issue is more current and 
replaces the previous patch.txt files.  I've also changed just a few things 
since posting embed.diff, primarily around error handling.  I'll post another 
diff soon with Javadoc additions mentioned below.

1) I'm not sure exactly what you mean here.  The Parser interface only 
guarantees a parse method and supported types.  It says nothing about requiring 
the entire content to be extracted by the implementation.  The Parser interface 
also makes no specification about how the given input stream must be read or 
processed, so each implementation can do that however it sees fit.  Similarly 
the Embedder.embed method says nothing about requiring or preventing content 
from being updated, so if a particular embedder implementation wants to update 
the content itself I suppose there's no reason it couldn't.

2) This is intentionally somewhat vague (but perhaps too much so) as each 
embedder may implement this slightly differently, though we should have a 
suggested approach, and in general I think that approach should favor 
preserving the source file's metadata unless a change is explicitly requested.  
I will add some of this to the Javadoc but for your specific questions I think 
the answers would be:

- Q: Does it always update all metadata in the file, i.e. does it delete 
properties that are not in the Metadata object?
- A: Embedder implementations should only attempt to update metadata fields 
present in the given Metadata object; properties with no corresponding field 
are left untouched.

- Q: How are empty properties set?
- A: Embedder implementations should set properties as empty when the 
corresponding field in the Metadata object is an empty string, i.e. ""

- Q: How do I delete properties?
- A: Embedder implementations should nullify or delete properties corresponding 
to fields with a null value in the given Metadata object.

- Q: Where does the embedding take place?
- A: That's up to the embedder implementation and particular file format.

- Q: Does the embed method update properties in all metadata containers?
- A: Embedder implementations should set the property corresponding to a 
particular field in the given Metadata object in all metadata containers 
whenever possible and appropriate for the file format at the time.  If a 
particular metadata container falls out of use and/or is superseded by another 
(such as IIM vs XMP for IPTC) it is up to the implementation to decide if and 
when to cease embedding in the alternate container.

- Q: What happens for properties where the file format specific fields have a 
fixed length or different encodings?
- A: Embedder implementations should attempt to embed as much of the metadata 
as accurately as possible.  An implementation may choose a strict approach and 
throw an exception if a value to be embedded exceeds the length allowed or may 
choose to truncate the value.
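
To make the suggested update semantics above a bit more concrete, here is a 
minimal sketch in plain Java.  It uses a Map<String, String> as a stand-in for 
both the given Metadata object and the properties already in the file; the 
class and method names are hypothetical and are not part of the patch.

```java
import java.util.HashMap;
import java.util.Map;

class MetadataUpdateSketch {

    // Hypothetical helper: returns a copy of the existing properties with
    // the suggested semantics applied.
    //   field absent from updates -> property preserved
    //   field set to ""           -> property set to empty
    //   field set to null         -> property deleted
    static Map<String, String> applyUpdates(Map<String, String> existing,
                                            Map<String, String> updates) {
        Map<String, String> result = new HashMap<>(existing);
        for (Map.Entry<String, String> entry : updates.entrySet()) {
            if (entry.getValue() == null) {
                result.remove(entry.getKey());
            } else {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> existing = new HashMap<>();
        existing.put("title", "Old title");
        existing.put("description", "Old description");
        existing.put("creator", "Jane");

        Map<String, String> updates = new HashMap<>();
        updates.put("title", "");          // set to empty
        updates.put("description", null);  // delete
        // "creator" is not mentioned, so it is preserved

        System.out.println(applyUpdates(existing, updates));
    }
}
```

A real embedder would apply the same three rules to each metadata container 
it writes to rather than to an in-memory map.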

For that last one we could consider adding a second embed method to Embedder 
which also accepts a boolean isStrict parameter which would allow a single 
implementation to operate in a mode where it would throw exceptions on bad data 
vs. doing something like truncating.  Implementations could always implement 
that themselves so I'm not sure we need it in the interface.
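
As a rough illustration of what such a strict flag could mean for a single 
fixed-length field, consider the sketch below.  The fitToField helper is 
hypothetical, and it measures length in Java chars for simplicity, whereas a 
real file format would more likely limit the encoded byte length.

```java
class StrictModeSketch {

    // Hypothetical helper: makes a value fit a field of at most maxLength
    // chars. In strict mode an oversized value is rejected; otherwise it
    // is truncated to the first maxLength chars.
    static String fitToField(String value, int maxLength, boolean isStrict) {
        if (value.length() <= maxLength) {
            return value;
        }
        if (isStrict) {
            throw new IllegalArgumentException(
                    "Value of length " + value.length()
                    + " exceeds field limit of " + maxLength);
        }
        return value.substring(0, maxLength);
    }

    public static void main(String[] args) {
        // Lenient mode truncates the oversized value.
        System.out.println(fitToField("a long caption", 6, false));
        try {
            // Strict mode refuses the same value.
            fitToField("a long caption", 6, true);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```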

3 and 5) The client is in control of the output stream as the client is 
responsible for creating it and passing it to the embed method.  The Embedder 
needs the given input stream to read the source data and writes the final data 
with metadata embedded to the given output stream.  As such, consumers of the 
embed method dictate what that output stream is, which will probably be a 
temp file in most cases, and the client can refrain from writing to the 
actual source file if it receives an exception.  See the 
ExternalEmbedderTest for an example of creating a temp file output stream for 
the embedder to write to.
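
The client-side pattern described above might look roughly like the following 
sketch.  StreamEmbedder is a hypothetical, stripped-down stand-in for the 
embedder (the embed method in the patch also receives the Metadata object), 
and embedSafely is an illustrative helper, not code from the patch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class SafeEmbedSketch {

    // Hypothetical stand-in: reads the source data, writes the result.
    interface StreamEmbedder {
        void embed(InputStream in, OutputStream out) throws IOException;
    }

    // Embeds into a temp file and replaces the source only on success,
    // so a failing embedder never leaves a half-written source file.
    static void embedSafely(Path source, StreamEmbedder embedder)
            throws IOException {
        Path temp = Files.createTempFile("embed", ".tmp");
        try {
            try (InputStream in = Files.newInputStream(source);
                 OutputStream out = Files.newOutputStream(temp)) {
                embedder.embed(in, out);
            }
            // Reached only if embed completed without an exception.
            Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; cleans up after a failure.
            Files.deleteIfExists(temp);
        }
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("source", ".txt");
        Files.write(source, "hello".getBytes(StandardCharsets.UTF_8));

        // Toy "embedder" that prepends a marker byte to the content.
        embedSafely(source, (in, out) -> {
            out.write('X');
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
        });

        System.out.println(new String(Files.readAllBytes(source),
                StandardCharsets.UTF_8));
        Files.deleteIfExists(source);
    }
}
```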

4) Yes, parser implementations could choose to implement the Embedder interface 
as well.  That was the reason for naming getSupportedEmbedTypes differently 
than Parser's existing getSupportedTypes method.
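
A small sketch of why the distinct method name matters: a single class can 
then advertise one set of types for parsing and a narrower one for embedding.  
SimpleParser and SimpleEmbedder below are hypothetical, simplified stand-ins 
that use plain strings for media types; they are not the actual Tika 
interfaces.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class DualRoleSketch {

    // Hypothetical, simplified stand-ins for the two interfaces.
    interface SimpleParser {
        Set<String> getSupportedTypes();
    }

    interface SimpleEmbedder {
        Set<String> getSupportedEmbedTypes();
    }

    // One class playing both roles. Because the method names differ,
    // it can parse more types than it is able to embed into.
    static class ImageHandler implements SimpleParser, SimpleEmbedder {
        @Override
        public Set<String> getSupportedTypes() {
            return new HashSet<>(Arrays.asList("image/jpeg", "image/png"));
        }

        @Override
        public Set<String> getSupportedEmbedTypes() {
            return Collections.singleton("image/jpeg");
        }
    }

    public static void main(String[] args) {
        ImageHandler handler = new ImageHandler();
        System.out.println("parse: " + handler.getSupportedTypes());
        System.out.println("embed: " + handler.getSupportedEmbedTypes());
    }
}
```

Had both interfaces declared getSupportedTypes, the class would have been 
forced to return the same set for both roles.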


If the above doesn't answer your concerns I'm more than happy to flesh things 
out further.

Regards,

Ray
                
> Embed Capabilities
> ------------------
>
>                 Key: TIKA-775
>                 URL: https://issues.apache.org/jira/browse/TIKA-775
>             Project: Tika
>          Issue Type: Improvement
>          Components: general, metadata
>    Affects Versions: 1.0
>         Environment: The default ExternalEmbedder requires that sed be 
> installed.
>            Reporter: Ray Gauss II
>              Labels: embed, patch
>             Fix For: 1.3
>
>         Attachments: embed.diff, tika-core-embed-patch.txt, 
> tika-parsers-embed-patch.txt
>
>
> This patch defines and implements the concept of embedding tika metadata into 
> a file stream, the reverse of extraction.
> In the tika-core project an interface defining an Embedder and a generic sed 
> ExternalEmbedder implementation meant to be extended or configured are added. 
>  These classes are essentially a reverse flow of the existing Parser and 
> ExternalParser classes.
> In the tika-parsers project an ExternalEmbedderTest unit test is added which 
> uses the default ExternalEmbedder (calls sed) to embed a value placed in 
> Metadata.DESCRIPTION then verify the operation by parsing the resulting 
> stream.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
