[
https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923947#action_12923947
]
Andrzej Bialecki commented on NUTCH-923:
-----------------------------------------
My point was simply that if you want to build your data schema dynamically,
based on the actual input data, then you need to be aware that this process is
inherently risky. Today we could perhaps deal with "lang" and
LanguageIdentifier, but tomorrow we may be dealing with dc.author or cc.license
or something else, and then we will face the same issue, i.e. a potentially
unlimited number of fields created from the data.
I don't have a good answer to this problem. On one hand this functionality is
useful; on the other hand it is inherently risky in the presence of
less-than-ideal data, which is always a possibility. Perhaps introducing some
sort of validation mechanism would make this safer to use.
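
One possible shape for such a validation mechanism, as a minimal sketch only:
expand the ${...} placeholder in the destination name, but accept a value only
if it appears on a configured allow-list, and otherwise route the content to a
single catch-all field. Everything here is an assumption made for illustration
(the names ALLOWED_VALUES, FALLBACK and resolve_dest, the "other" suffix, and
the choice of Python); none of it is existing Nutch or Solr API.

```python
import re

# Hypothetical allow-list: for each source field used in a placeholder,
# the values for which the Solr schema actually defines a field.
ALLOWED_VALUES = {"lang": {"en", "fr", "de"}}

# Assumed catch-all suffix, e.g. title_other, for anything not allowed.
FALLBACK = "other"

# Matches placeholders such as ${lang} or ${dc.author}.
PLACEHOLDER = re.compile(r"\$\{([\w.]+)\}")


def resolve_dest(dest_template, doc_fields):
    """Expand placeholders in a dest field name, validating each value."""

    def substitute(match):
        name = match.group(1)
        # Normalise the raw value so "FR " and "fr" map to the same field.
        value = str(doc_fields.get(name, "")).strip().lower()
        if value in ALLOWED_VALUES.get(name, set()):
            return value
        # Missing, unknown or garbage values never create a new field.
        return FALLBACK

    return PLACEHOLDER.sub(substitute, dest_template)


if __name__ == "__main__":
    print(resolve_dest("title_${lang}", {"lang": "en"}))
    print(resolve_dest("title_${lang}", {"lang": "x-garbage-123"}))
```

With this approach the number of fields is bounded by the allow-list plus one
fallback per template, regardless of what the input data contains.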
> Multilingual support for Solr-index-mapping
> -------------------------------------------
>
> Key: NUTCH-923
> URL: https://issues.apache.org/jira/browse/NUTCH-923
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Affects Versions: 1.2
> Reporter: Matthias Agethle
> Assignee: Markus Jelsma
> Priority: Minor
>
> It would be useful to extend the mapping possibilities when indexing to Solr.
> One useful feature would be to use the detected language of the HTML page
> (for example via the language-identifier plugin) and send the content to
> corresponding language-aware Solr fields.
> The mapping file could be as follows:
> <field dest="lang" source="lang"/>
> <field dest="title_${lang}" source="title" />
> so that the title field gets mapped to title_en for English pages and
> title_fr for French pages.
> What do you think? Could this be useful also to others?
> Or are there already other solutions out there?