I know of nobody using "schemaless" in production, for the simple reason that it makes the best guess it can at a field's type based on the _first_ time it sees that field. There's absolutely no way to guarantee that that doc is representative of all docs.
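A minimal sketch of that failure mode, assuming a toy type guesser. This is not Solr's actual field-guessing code; `guess_type`, `accepts` and `ToySchema` are hypothetical names, and the guessing rules are deliberately simplified:

```python
from datetime import datetime


def guess_type(value: str) -> str:
    """Guess a field type from one raw string value (toy rules, not Solr's)."""
    for name, parser in (("long", int), ("double", float)):
        try:
            parser(value)
            return name
        except ValueError:
            pass
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "text"


def accepts(field_type: str, value: str) -> bool:
    """Return True if a field locked to field_type can hold value."""
    if field_type == "text":
        return True
    if field_type == "double":
        return guess_type(value) in ("long", "double")
    return guess_type(value) == field_type


class ToySchema:
    """Locks each field's type the first time the field is seen."""

    def __init__(self):
        self.fields = {}

    def index(self, doc: dict) -> None:
        for name, value in doc.items():
            if name not in self.fields:
                # The first doc decides the type for every later doc.
                self.fields[name] = guess_type(value)
            elif not accepts(self.fields[name], value):
                raise ValueError(
                    f"field {name!r} is {self.fields[name]}, cannot index {value!r}"
                )


schema = ToySchema()
schema.index({"pages": "12"})         # "pages" is now locked as long
try:
    schema.index({"pages": "12-15"})  # a later doc is not representative
    rejected = False
except ValueError:
    rejected = True
```

Whichever doc happens to arrive first fixes the type, so the same two docs indexed in the opposite order would both be accepted as text.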
In the Tika case you've also got the problem that there's no universal metadata definition. What's "author" in one type of doc might be "editor" in another. Or "most_recent_edit" might be "last_edited", and even if these are dates the format won't necessarily be the same. And if you want to really get weird, some programs allow custom attributes. So I might define a date field in my doc, and another user in a totally different type of doc may happen to define the same metadata tag as a string.

I'd also ask what the utility is of having a bunch of fields in your schema that you don't even know are there. How can anyone intelligently search them?

What I've seen done is to define a bunch of fields that are common across the docs you'll be ingesting (author, last_edited and the like) and map them all from whatever comes out of the semi-structured doc (i.e. you map "editor" in a PDF doc into "author" in your index). Then define a catch-all dynamic field "*" that's text-based and throw everything that you don't know what to do with into that field. Or just throw away everything you don't recognize, which you can also do with a dynamic field (see "ignored" in managed-schema).

Best,
Erick

On Mon, Apr 25, 2016 at 6:34 AM, Peter Blokland (JIRA) <[email protected]> wrote:
>
> [ https://issues.apache.org/jira/browse/SOLR-8017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256344#comment-15256344 ]
>
> Peter Blokland commented on SOLR-8017:
> --------------------------------------
>
> so that means that using tika metadata indexing with schemaless mode is,
> well, useless ? because that's where this is happening to me. I don't _want_
> to anticipate (i.e. create a schema) for all possible metadata coming from
> tika (I wouldn't even know how). and as it stands, I cannot be sure all my
> documents will be indexed :(
>
>> solr.PointType can't deal with coordination in format like (0.9504547, 1.0,
>> 1.0890503)
>> --------------------------------------------------------------------------------------
>>
>> Key: SOLR-8017
>> URL: https://issues.apache.org/jira/browse/SOLR-8017
>> Project: Solr
>> Issue Type: Improvement
>> Affects Versions: 5.2
>> Reporter: wangshanshan
>> Priority: Minor
>>
>> In jpg picture files there will be some fields like media_white_point,
>> media_black_point, which in format like (0.9504547, 1.0, 1.0890503).
>> But solr.PointType can't deal with the "(", it just splits by comma and let
>> Double.parse deal with a string like "(0.9504547".
>> In this case, a NumberFormatException will be raised.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
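
To make the mapping approach from the reply above concrete, here is a minimal sketch in plain Python, independent of Solr and Tika. The alias table, the `map_metadata` function and the `catch_all` field name are all hypothetical; in a real index the catch-all would be the text-based "*" dynamic field, and dropping unknown keys corresponds to the "ignored" dynamic field.

```python
# Hypothetical alias table: raw metadata key -> common index field.
ALIASES = {
    "author": "author",
    "editor": "author",
    "last_edited": "last_edited",
    "most_recent_edit": "last_edited",
}


def map_metadata(raw: dict, keep_unknown: bool = True) -> dict:
    """Map raw extracted metadata onto a small set of known fields.

    Unknown keys go into one text field ("catch_all", an assumed name)
    when keep_unknown is True, and are dropped otherwise.
    """
    doc = {}
    leftovers = []
    for key, value in raw.items():
        target = ALIASES.get(key.strip().lower())
        if target is not None:
            # Several raw keys may map to the same field, so collect a list.
            doc.setdefault(target, []).append(str(value))
        elif keep_unknown:
            leftovers.append(f"{key}: {value}")
    if leftovers:
        doc["catch_all"] = " ".join(leftovers)
    return doc


raw = {"Editor": "Ann", "media_white_point": "(0.9504547, 1.0, 1.0890503)"}
kept = map_metadata(raw)
dropped = map_metadata(raw, keep_unknown=False)
```

Because the unknown value is only ever stored as text, a value such as the media_white_point tuple from the JIRA issue never reaches a typed field. Date values would still need their formats normalized before indexing; that step is left out here.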
