[jira] [Commented] (TIKA-775) Embed Capabilities

JIRA Sun, 28 Oct 2012 10:03:13 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485665#comment-13485665
 ]


Jörg Ehrlich commented on TIKA-775:
-----------------------------------

Hi Ray,

I think it would be great if Tika could also write Metadata back to files, and 
it would be great to start on this sooner rather than later.
But I have a couple of comments regarding your proposed implementation:

1) Right now the Parsers do both content and metadata extraction. The proposed 
embedder does only Metadata embedding, which is fine because updating of 
content would be out of scope for Tika.
But if we introduce separate APIs to embed just metadata, I think it would make 
sense to also introduce APIs to only extract metadata. Actually, at Adobe we had 
to stop using Tika to retrieve Metadata from specific file formats because it 
always parses the whole content, which is simply too heavy an operation to scale 
in a larger system.
So I planned to get started on a new API and adjustments to parsers to just 
retrieve Metadata from files, but I have not had time for this yet. I guess it 
would make sense to synchronize these two new APIs, right?
Being able to just parse Metadata from files is actually also very important 
for the embedding of it, which I will explain further down.
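
To make the idea of a metadata-only read concrete, here is a minimal sketch. 
Everything in it is an assumption for illustration: MetadataExtractor is a 
hypothetical interface (not an existing Tika API), a plain Map stands in for 
Tika's Metadata class, and the "toy" format (key=value header lines, a blank 
line, then the content) is invented. The point is only that the extractor stops 
reading once the metadata section ends.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical metadata-only counterpart to a full parser; not an existing Tika interface. */
interface MetadataExtractor {
    Map<String, String> extractMetadata(InputStream stream) throws IOException;
}

/** Invented toy format: ASCII "key=value" header lines, one blank line, then the content. */
class ToyHeaderExtractor implements MetadataExtractor {
    @Override
    public Map<String, String> extractMetadata(InputStream stream) throws IOException {
        Map<String, String> metadata = new LinkedHashMap<>();
        StringBuilder line = new StringBuilder();
        int b;
        while ((b = stream.read()) != -1) {
            if (b != '\n') {
                line.append((char) b); // ASCII-only header assumed
                continue;
            }
            if (line.length() == 0) {
                break; // blank line ends the header; the content is never read
            }
            int eq = line.indexOf("=");
            if (eq > 0) {
                metadata.put(line.substring(0, eq), line.substring(eq + 1));
            }
            line.setLength(0);
        }
        return metadata;
    }
}
```

Real formats rarely keep all metadata at the front of the file, so an actual 
implementation would need format-specific seeking, but the API shape could stay 
the same.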

2) Your documentation does not really specify in detail the behavior of the 
metadata update that should happen.
Does it always update all metadata in the file, i.e. does it delete properties 
that are not in the Metadata object? Or does it only update those properties 
that are provided in the Metadata object? How do I delete properties then? Do I 
make the property empty? But in most metadata containers an empty value is a 
valid property value and should not delete the property.
Where does the embedding take place? A lot of file formats have several 
metadata containers with similar properties. Does the embed method update all 
of them? Or just the ones the parsers were looking at? What happens in case of 
inconsistencies? Do you read/write from specific fields or do you reconcile all 
of them together?
What happens for properties where the file format specific fields have a fixed 
length or different encodings? Do you just write as much as possible and the 
rest is simply ignored? 

For all such questions, you have to think about whether it makes sense to let 
the client configure the embedder, to provide a callback API so that the client 
can decide when specific scenarios arise, or to have the embedder always make a 
best guess on the client's behalf.
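
One possible way to pin down the update and delete questions above, sketched 
under assumptions: MetadataUpdate, its Mode enum and the explicit removal set 
are hypothetical names, not part of the proposed patch or of Tika, and plain 
Maps stand in for Metadata objects. Deletion is expressed separately from the 
values, so an empty string stays a legitimate value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical description of update semantics; none of these names exist in Tika. */
class MetadataUpdate {
    enum Mode {
        MERGE,       // keep properties that are not mentioned in the update
        REPLACE_ALL  // drop every property that is not mentioned in the update
    }

    static Map<String, String> apply(Map<String, String> existing,
                                     Map<String, String> changes,
                                     Set<String> removals,
                                     Mode mode) {
        Map<String, String> result = new LinkedHashMap<>();
        if (mode == Mode.MERGE) {
            result.putAll(existing);
        }
        result.putAll(changes);              // an empty value is stored, not treated as a delete
        result.keySet().removeAll(removals); // deletion is always explicit
        return result;
    }
}
```

Whether the mode is a configuration option or decided through a callback is 
exactly the design choice raised above; the sketch only shows that the two 
behaviors can be named and selected explicitly.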

In any such case, it is usually important for the client to get the original 
metadata from the file before writing it back, so that no properties are 
wrongly deleted or changed. It is even more important for the Embedder, 
as in most cases it would have to read the metadata anyway in order to know 
how to update the file properly. It usually has to check if an in-place update 
of metadata can happen or if the whole file has to be restructured because the 
metadata chunks have grown too large to fit where they were before.
That's why I think it would be important to have a get-only-metadata API and 
Parser capabilities available before starting to write it back.
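
The in-place check described above could look roughly like this. It is a sketch 
under assumptions: InPlaceCheck and padToSlot are hypothetical names, the slot 
size would have to come from reading the existing file, and padding with spaces 
is only one convention (similar in spirit to the padding XMP packets reserve 
for later edits).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Optional;

/** Hypothetical helper; decides between an in-place update and a full rewrite. */
class InPlaceCheck {
    /**
     * Returns the metadata padded with spaces to exactly slotSize bytes,
     * or empty if it does not fit and the file has to be restructured.
     */
    static Optional<byte[]> padToSlot(String serializedMetadata, int slotSize) {
        byte[] payload = serializedMetadata.getBytes(StandardCharsets.UTF_8);
        if (payload.length > slotSize) {
            return Optional.empty(); // metadata has grown too large for the existing slot
        }
        byte[] block = Arrays.copyOf(payload, slotSize);
        Arrays.fill(block, payload.length, slotSize, (byte) ' ');
        return Optional.of(block);
    }
}
```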

3) This also leads me to the topic of error recovery and safe updating of 
files. I think the documentation should be clearer about what the Embedder 
will do in case of an error and what is expected of the client.
There are all sorts of reasons the embedding could fail. If that happens, the 
original file usually ends up being corrupt and lost for the user. So it 
usually makes sense (for smaller files) to do a safe update, which means 
writing the update to a new file and then swapping it with the original one 
after the update was successful.
But what about scenarios where a partial update is possible? You often have 
files where just specific metadata sections are corrupt because some tool did 
not follow the spec and wrote them incorrectly. But the rest of the file is 
still ok, so other parts could still be updated. Do you want to provide a 
callback API for the client to be able to react to error scenarios and decide 
what to do? The embedder could do a best-guess action, but that is usually 
quite dangerous for the user's files.
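
A minimal sketch of the safe update described above (write to a new file, swap 
only after success), using only java.nio.file. SafeUpdate and StreamRewriter 
are hypothetical names; the rewriter stands in for whatever embedder does the 
actual work. It does not address partial updates or callbacks.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical callback that writes the updated file; stands in for an embedder. */
interface StreamRewriter {
    void rewrite(InputStream original, OutputStream updated) throws IOException;
}

class SafeUpdate {
    /** Writes the update to a temp file next to the target and swaps it in only on success. */
    static void update(Path target, StreamRewriter rewriter) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "embed-", ".tmp");
        try {
            try (InputStream in = Files.newInputStream(target);
                 OutputStream out = Files.newOutputStream(temp)) {
                rewriter.rewrite(in, out);
            }
            // Reached only if rewriting succeeded; the original is untouched until here.
            // StandardCopyOption.ATOMIC_MOVE could be used instead where the file system supports it.
            Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(temp); // no-op after a successful move
        }
    }
}
```

The temp file is created in the same directory as the target so that the final 
move stays on one file system. The cost is a full copy of the file, which is 
why this is mainly attractive for smaller files.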

4) I take it that the expectation is that all parsers could also potentially 
implement the Embedder interface, so that both reading and writing are handled 
in one place? Otherwise you probably end up with all sorts of inconsistencies 
between the two implementations regarding what metadata fields are read from 
where and what should be updated when, etc.

5) Why do you pass in an InputStream? That would mean the Embedder has to open 
its own OutputStream to be able to write. That would imply that Tika knows 
how to properly create OutputStreams in the client's environment. Wouldn't it 
be better to leave the client in control here? And why do you want to return 
the InputStream?
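
For comparison, a signature that leaves the client in control of the output 
could look like the sketch below. This is not the signature from the patch: 
StreamEmbedder and ToyHeaderEmbedder are hypothetical, a Map stands in for 
Tika's Metadata, and the toy format (key=value header lines, a blank line, then 
the content) is invented for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical alternative signature: the client supplies both streams. */
interface StreamEmbedder {
    void embed(Map<String, String> metadata, InputStream original, OutputStream output)
            throws IOException;
}

/** Invented toy format: "key=value" header lines, a blank line, then the original content. */
class ToyHeaderEmbedder implements StreamEmbedder {
    @Override
    public void embed(Map<String, String> metadata, InputStream original, OutputStream output)
            throws IOException {
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            output.write((e.getKey() + "=" + e.getValue() + "\n").getBytes(StandardCharsets.UTF_8));
        }
        output.write('\n');
        byte[] buffer = new byte[8192];
        int n;
        while ((n = original.read(buffer)) != -1) {
            output.write(buffer, 0, n); // content is copied through unchanged
        }
    }
}
```

With this shape the embedder never needs to know where the result ends up 
(file, temp file, network, memory); the client decides and also owns closing 
the streams.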

6) I also agree with Jukka's comments that for such an important new feature we 
should give this some more thought. I think your proposal works ok for the 
external embedder scenario, but I am not so sure about other scenarios.

Sorry that I did not speak up earlier. This issue has been around for quite a 
while.
Regards
Jörg
                
> Embed Capabilities
> ------------------
>
>                 Key: TIKA-775
>                 URL: https://issues.apache.org/jira/browse/TIKA-775
>             Project: Tika
>          Issue Type: Improvement
>          Components: general, metadata
>    Affects Versions: 1.0
>         Environment: The default ExternalEmbedder requires that sed be 
> installed.
>            Reporter: Ray Gauss II
>              Labels: embed, patch
>             Fix For: 1.3
>
>         Attachments: embed.diff, tika-core-embed-patch.txt, 
> tika-parsers-embed-patch.txt
>
>
> This patch defines and implements the concept of embedding tika metadata into 
> a file stream, the reverse of extraction.
> In the tika-core project an interface defining an Embedder and a generic sed 
> ExternalEmbedder implementation meant to be extended or configured are added. 
>  These classes are essentially a reverse flow of the existing Parser and 
> ExternalParser classes.
> In the tika-parsers project an ExternalEmbedderTest unit test is added which 
> uses the default ExternalEmbedder (calls sed) to embed a value placed in 
> Metadata.DESCRIPTION then verify the operation by parsing the resulting 
> stream.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
