I know of nobody using "schemaless" in production, for the simple
reason that it guesses each field's type from the _first_ value it
sees for that field. There's absolutely no way to guarantee that that
doc is representative of all docs.
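
To make the failure mode concrete, here is a toy sketch in Python. It
is not Solr's actual guessing logic; the type names, the parsing rules
and the "page_count" field are assumptions made up for illustration.
It only shows the general problem: once a field's type is locked in by
the first doc, a later doc with a different kind of value gets rejected.

```python
def guess_type(value):
    """Guess a field type from one string value (toy rules, not Solr's)."""
    for type_name, parser in (("long", int), ("double", float)):
        try:
            parser(value)
            return type_name
        except ValueError:
            pass
    return "string"


# Parsers used to check later values against the type guessed earlier.
PARSERS = {"long": int, "double": float, "string": str}


def index_docs(docs):
    """Lock each field's type on first sight; collect docs that no longer fit."""
    schema, rejected = {}, []
    for doc in docs:
        # The first value seen for a field decides its type for good.
        for field, value in doc.items():
            schema.setdefault(field, guess_type(value))
        try:
            for field, value in doc.items():
                PARSERS[schema[field]](value)
        except ValueError:
            rejected.append(doc)
    return schema, rejected


# Hypothetical docs: the first one makes "page_count" look numeric.
docs = [{"page_count": "12"}, {"page_count": "twelve"}]
schema, rejected = index_docs(docs)
```

Here the second doc lands in `rejected` because "page_count" was
already typed as a number by the first doc.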

In the Tika case you've also got the problem that there's no universal
metadata definition. What's "author" in one type of doc might be
"editor" in another. Or "most_recent_edit" might be "last_edited" and
even if these are dates the format won't necessarily be the same.

And if you want to really get weird, some programs allow custom
attributes. So I might define a date field in my doc and another user
in a totally different type of doc may happen to define the same
metadata tag as a string.

I'd also ask what the utility is of having a bunch of fields in your
schema that you don't even know are there. How can anyone
intelligently search them?

What I've seen done is define a set of fields you expect to be common
across the docs you'll be ingesting (author, last_edited and the like)
and map them all from whatever comes out of the semi-structured doc
(e.g. you map "editor" in a PDF doc into "author" in your index). Then
define a catch-all dynamic field "*" that's text-based and throw
everything that you don't know what to do with into that field.

Or just throw everything you don't recognize away, which you can also
do with a dynamic field (see "ignored" in managed-schema).
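
A rough sketch of the two options above, written as a client-side
step in Python that runs before the doc is sent to Solr. That
placement is an assumption; the same routing can be done with dynamic
fields in the schema. FIELD_MAP, to_index_doc and the "catch_all"
field name are hypothetical names for illustration, and the sample
metadata is made up.

```python
# Hypothetical mapping from source metadata keys to the common index fields.
FIELD_MAP = {
    "author": "author",
    "editor": "author",
    "most_recent_edit": "last_edited",
    "last_edited": "last_edited",
}


def to_index_doc(metadata, keep_unknown=True):
    """Map known keys to common fields; send the rest to a catch-all or drop them."""
    doc = {}
    for key, value in metadata.items():
        target = FIELD_MAP.get(key.lower())
        if target is not None:
            doc.setdefault(target, []).append(value)
        elif keep_unknown:
            # Unknown keys are flattened to plain text in one catch-all field.
            doc.setdefault("catch_all", []).append(f"{key}: {value}")
        # With keep_unknown=False, unrecognized keys are simply discarded.
    return doc


# Made-up metadata resembling what might come out of a PDF or JPG.
sample = {
    "editor": "A. Writer",
    "media_white_point": "(0.9504547, 1.0, 1.0890503)",
}
kept = to_index_doc(sample)
dropped = to_index_doc(sample, keep_unknown=False)
```

With keep_unknown=True the odd "media_white_point" value is stored as
plain text instead of being parsed; with keep_unknown=False it never
reaches the index at all.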

Best,
Erick

On Mon, Apr 25, 2016 at 6:34 AM, Peter Blokland (JIRA) <[email protected]> wrote:
>
>     [ 
> https://issues.apache.org/jira/browse/SOLR-8017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256344#comment-15256344
>  ]
>
> Peter Blokland commented on SOLR-8017:
> --------------------------------------
>
> so that means that using tika metadata indexing with schemaless mode is, 
> well, useless ? because that's where this is happening to me. I don't _want_ 
> to anticipate (i.e. create a schema) for all possible metadata coming from 
> tika (I wouldn't even know how). and as it stands, I cannot be sure all my 
> documents will be indexed :(
>
>> solr.PointType can't deal with coordination in format like (0.9504547, 1.0, 
>> 1.0890503)
>> --------------------------------------------------------------------------------------
>>
>>                 Key: SOLR-8017
>>                 URL: https://issues.apache.org/jira/browse/SOLR-8017
>>             Project: Solr
>>          Issue Type: Improvement
>>    Affects Versions: 5.2
>>            Reporter: wangshanshan
>>            Priority: Minor
>>
>> In jpg picture files there will be some fields like media_white_point, 
>> media_black_point, which in format like (0.9504547, 1.0, 1.0890503).
>> But solr.PointType can't deal with the "(", it just splits by comma and lets 
>> Double.parse  deal with a string like "(0.9504547".
>> In this case, a NumberFormatException will be raised.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
