>> so that means that using tika metadata indexing with schemaless mode
>> is, well, useless?

Yes.

> I know of nobody using "schemaless" for production for the simple reason
> that it makes the best guess it can based on the _first_ time it sees a
> particular field. There's absolutely no way to guarantee that that doc is
> representative of all docs.
>
> And if you want to really get weird, some programs allow custom attributes.

Agreed. It makes no sense to go schemaless with Tika's metadata.

> In the Tika case you've also got the problem that there's no universal
> metadata definition. What's "author" in one type of doc might be "editor"
> in another. Or "most_recent_edit" might be "last_edited" and even if these
> are dates the format won't necessarily be the same.

We do try to normalize across file formats to Dublin Core when possible
(dc:creator, dc:created). We also try to normalize date formats for those
metadata items that we know are dates (dc:created, etc.).

If you find issues with normalization or can recommend areas for
improvement, please let us know!
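To illustrate the alternative to schemaless guessing, here is a rough Python sketch of mapping raw metadata keys onto a small, fixed set of fields before indexing. The alias table, the list of date layouts, and the function names are all made up for this example; they are not Tika's actual mapping or API. The sketch also assumes that timestamps without a zone are UTC, which may not hold for your documents.

```python
from datetime import datetime, timezone

# Hypothetical alias table: NOT Tika's real mapping, only an illustration
# of folding format-specific keys onto Dublin Core style names.
ALIASES = {
    "author": "dc:creator",
    "editor": "dc:creator",
    "creator": "dc:creator",
    "created": "dc:created",
    "creation_date": "dc:created",
}

# Fields whose values must be dates in the explicit schema.
DATE_FIELDS = {"dc:created"}

# Date layouts assumed to appear in raw metadata; extend as needed.
DATE_FORMATS = (
    "%Y-%m-%dT%H:%M:%SZ",
    "%Y-%m-%d %H:%M:%S",
    "%d/%m/%Y",
    "%Y-%m-%d",
)


def normalize_date(value):
    """Return value as an ISO 8601 UTC string, or None if no layout matches."""
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
        # Assumption: values without a zone are treated as UTC.
        return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return None


def normalize_metadata(raw):
    """Map raw metadata keys onto the fixed field set; drop unknown keys."""
    doc = {}
    for key, value in raw.items():
        field = ALIASES.get(key.strip().lower())
        if field is None:
            continue  # unknown or custom attribute: ignored, not guessed at
        if field in DATE_FIELDS:
            value = normalize_date(value)
            if value is None:
                continue  # unparseable date: skip rather than index a bad value
        doc.setdefault(field, value)  # first alias seen wins
    return doc
```

With a fixed target like this, the field types can be declared once in the schema, and custom or unexpected attributes never get a chance to define a field by accident.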
