> so that means that using tika metadata indexing with schemaless mode
> is, well, useless?
Yes. 

> I know of nobody using "schemaless" for production for the simple reason that
> it makes the best guess it can based on the _first_ time it sees a particular
> field. There's absolutely no way to guarantee that that doc is representative
> of all docs.
> And if you want to really get weird, some programs allow custom attributes.

Agreed. It makes no sense to go schemaless with Tika's metadata.
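
To make the "first doc wins" problem concrete, here is a minimal sketch in
Python. It is not the actual guessing logic of any indexer; the type names,
the guessing rules, and the FirstSeenSchema class are all hypothetical and
exist only to illustrate the failure mode.

```python
from datetime import datetime

# Hypothetical guessing rules, tried in order. Real indexers use their own.
PARSERS = {"long": int, "double": float, "date": datetime.fromisoformat}


def guess_type(value):
    """Guess a field type from one string value."""
    for type_name, parse in PARSERS.items():
        try:
            parse(value)
            return type_name
        except ValueError:
            continue
    return "string"


def fits(value, type_name):
    """Return True if the value can be stored in a field of this type."""
    if type_name == "string":
        return True
    try:
        PARSERS[type_name](value)
        return True
    except ValueError:
        return False


class FirstSeenSchema:
    """Locks each field to the type guessed from the first value seen."""

    def __init__(self):
        self.types = {}

    def add(self, doc):
        """Register a doc; return the fields whose values no longer fit."""
        rejected = []
        for field, value in doc.items():
            locked = self.types.setdefault(field, guess_type(value))
            if not fits(value, locked):
                rejected.append(field)
        return rejected


schema = FirstSeenSchema()
schema.add({"pages": "12"})                  # "pages" is now locked to "long"
conflicts = schema.add({"pages": "about 40"})  # this value no longer fits
```

If the two docs arrive in the opposite order, "pages" is locked to "string"
instead and nothing is rejected, but the field can no longer be treated as
a number. Either way the outcome depends on which doc happened to come first.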

> In the Tika case you've also got the problem that there's no universal
> metadata definition. What's "author" in one type of doc might be "editor" in
> another. Or "most_recent_edit" might be "last_edited" and even if these are
> dates the format won't necessarily be the same.

We do try to normalize across file formats to Dublin Core when possible --
dc:creator, dc:created. We also try to normalize date formats for those
metadata items that we know are dates (dc:created, etc.). If you find issues
with normalization or can recommend areas for improvement, please let us know!
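
For anyone who wants to see the general idea, here is a rough sketch in
Python of that kind of normalization. This is not Tika's code. The alias
table, the list of date formats, and the choice to treat timestamps without
a zone as UTC are all assumptions made for illustration, and the exact key
strings Tika emits may differ by version.

```python
from datetime import datetime, timezone

# Hypothetical alias table: format-specific keys -> Dublin Core style keys.
ALIASES = {
    "author": "dc:creator",
    "creator": "dc:creator",
    "creation-date": "dc:created",
    "created": "dc:created",
}

# Keys whose values we know are dates and therefore try to normalize.
DATE_KEYS = {"dc:created"}

# Hypothetical input formats. "%d/%m/%Y" assumes day-first order, which is
# exactly the kind of ambiguity that makes date normalization hard.
DATE_FORMATS = (
    "%Y-%m-%dT%H:%M:%SZ",
    "%Y-%m-%d %H:%M:%S",
    "%d/%m/%Y",
    "%Y-%m-%d",
)


def normalize_date(raw):
    """Return an ISO 8601 UTC string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
        # Assumption: values without a time zone are treated as UTC.
        return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return None


def normalize(metadata):
    """Map known aliases to common keys and normalize known date values."""
    out = {}
    for key, value in metadata.items():
        target = ALIASES.get(key.lower(), key)
        if target in DATE_KEYS:
            # Keep the raw value when it cannot be parsed.
            value = normalize_date(value) or value
        out[target] = value
    return out


result = normalize({"Author": "A. Writer", "created": "2015-06-01 10:30:00"})
```

Keys that are not in the alias table pass through unchanged, which is where
custom attributes would still end up as unpredictable field names.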


