[jira] [Commented] (NUTCH-923) Multilingual support for Solr-index-mapping
[ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428003#comment-13428003 ]

Luca Cavanna commented on NUTCH-923:

That's brilliant. Thanks, Markus, for your insight.

Multilingual support for Solr-index-mapping
---
Key: NUTCH-923
URL: https://issues.apache.org/jira/browse/NUTCH-923
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: 1.2
Reporter: Matthias Agethle
Assignee: Markus Jelsma
Priority: Minor
Attachments: patch-923-nutch-release-1.2.txt

It would be useful to extend the mapping possibilities when indexing to Solr. One useful feature would be to use the detected language of the HTML page (for example, via the language-identifier plugin) and send the content to the corresponding language-aware Solr fields.

The mapping file could be as follows:

<field dest="lang" source="lang"/>
<field dest="title_${lang}" source="title"/>

so that the title field gets mapped to title_en for English pages and title_fr for French pages.

What do you think? Could this also be useful to others? Or are there already other solutions out there?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
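To make the proposal in the issue description concrete, here is a minimal sketch of how a `${lang}` placeholder in a `dest` attribute might be expanded per document. It is written in Python for brevity rather than in Nutch's Java, and `expand_dest` and `apply_mapping` are hypothetical names for illustration; this is not the attached patch or any existing Nutch API.

```python
import re

# Hypothetical helper: expands ${name} placeholders in a mapping's dest
# attribute using values taken from the document being indexed.
_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def expand_dest(dest_template, doc):
    """Return the destination field name, or None if a placeholder has no value."""
    missing = False

    def substitute(match):
        nonlocal missing
        value = doc.get(match.group(1))
        if not value:
            missing = True
            return ""
        return value

    expanded = _PLACEHOLDER.sub(substitute, dest_template)
    return None if missing else expanded


def apply_mapping(mapping, doc):
    """Apply (dest_template, source) pairs, mirroring the proposed mapping file."""
    out = {}
    for dest_template, source in mapping:
        if source not in doc:
            continue  # nothing to copy for this mapping entry
        dest = expand_dest(dest_template, doc)
        if dest is not None:
            out[dest] = doc[source]
    return out


mapping = [("lang", "lang"), ("title_${lang}", "title")]
english = apply_mapping(mapping, {"lang": "en", "title": "Hello"})
french = apply_mapping(mapping, {"lang": "fr", "title": "Bonjour"})
```

Under these assumptions, `english` contains a `title_en` field and `french` a `title_fr` field. A document without a `lang` value simply gets no language-specific title field in this sketch; whether to skip, fall back, or fail in that case is a design choice the issue leaves open.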
[jira] [Commented] (NUTCH-923) Multilingual support for Solr-index-mapping
[ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276954#comment-13276954 ]

Markus Jelsma commented on NUTCH-923:

Solr now ships with a LangId update request processor that can both detect languages and send values to the proper language-specific field. You can either do language identification in Nutch or delegate it to Solr.

https://wiki.apache.org/solr/LanguageDetection
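As a rough illustration of delegating the work to Solr, the sketch below assembles the `langid.*` parameters documented for Solr's language identification processor. The helper name `langid_update_params` is hypothetical, and the sketch assumes that an update chain containing the processor is already configured in solrconfig.xml and that your Solr version accepts these values as request parameters; they can otherwise be set as defaults in the processor configuration. Nothing is sent over the network here.

```python
from urllib.parse import urlencode


def langid_update_params(source_fields, lang_field="lang", whitelist=None, fallback=None):
    """Build langid.* parameters for an update request (illustrative helper)."""
    params = {
        "langid": "true",                       # enable detection
        "langid.fl": ",".join(source_fields),   # fields to run detection on
        "langid.langField": lang_field,         # where the detected code is stored
        "langid.map": "true",                   # map e.g. title -> title_en
    }
    if whitelist:
        params["langid.whitelist"] = ",".join(whitelist)
    if fallback:
        params["langid.fallback"] = fallback
    return params


params = langid_update_params(["title", "content"], whitelist=["en", "de"], fallback="en")
query_string = urlencode(params)
```

With `langid.whitelist` and `langid.fallback` set, the number of generated language-specific fields stays bounded on the Solr side, which addresses the concern raised elsewhere in this thread.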
[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping
[ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988005#action_12988005 ]

bronco commented on NUTCH-923:

This is a really useful feature, and it matches the Drupal Apache Solr multilanguage module. So how can I implement it? At the moment I only need English and German, so can we restrict it to these languages at first and, if the language identifier is not clear, just use a fallback language like English?

At least subscribing. +1
[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping
[ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924145#action_12924145 ]

Matthias Agethle commented on NUTCH-923:

What about querying Solr for the configured fields (perhaps one can do this using LukeRequestHandler, I'm not sure)? When sending data to Solr, one could check whether each field exists in the Solr schema; if it does not, don't add the field and log a warning.

The other thing that comes to mind is: what are valid field names in Solr? Obviously letters, numbers and so on, but is there any validation in Solr? One could use this to check whether a dynamically generated field name is compliant with Solr (and in this way exclude control characters in field names, as Andrzej mentioned).
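The two checks suggested in this comment could look roughly like the following sketch. It assumes the set of schema field names has already been obtained somehow (for instance from Solr, as the comment suggests) and is passed in as a plain set; `filter_fields` and the name pattern are illustrative assumptions, not Solr's own validation.

```python
import logging
import re

log = logging.getLogger("solr-mapping-sketch")

# Assumption: a field name is acceptable if it uses only letters, digits and
# underscores and does not start with a digit.
VALID_FIELD_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def filter_fields(doc, schema_fields):
    """Keep only fields with a well-formed name that the schema is known to contain."""
    kept = {}
    for name, value in doc.items():
        if not VALID_FIELD_NAME.fullmatch(name):
            log.warning("Skipping field with invalid name: %r", name)
        elif name not in schema_fields:
            log.warning("Skipping field missing from Solr schema: %s", name)
        else:
            kept[name] = value
    return kept


schema_fields = {"lang", "title_en", "title_de"}  # would come from Solr in practice
doc = {"lang": "fr", "title_fr": "Bonjour", "title_\x00x": "junk"}
kept = filter_fields(doc, schema_fields)
```

Note that the schema check only helps when the schema lists explicit fields. If the schema declares a dynamic field pattern for the prefix, every generated name would match it, so an explicit list or a separate limit is still needed.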
[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping
[ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924154#action_12924154 ]

Andrzej Bialecki commented on NUTCH-923:

This doesn't solve the problem of a potentially unbounded number of fields. Compliance is one thing, and you can clean invalid characters out of field names, but sanity is another: if you have {{title_*}} in your Solr schema, then theoretically you are allowed to create an unlimited number of fields with this prefix, and Solr won't complain.
[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping
[ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923879#action_12923879 ]

Markus Jelsma commented on NUTCH-923:

This is a very useful feature. +1
[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping
[ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923896#action_12923896 ]

Andrzej Bialecki commented on NUTCH-923:

This sounds useful, though the implementation needs to keep the following in mind:

* You _assume_ that the lang field will have a nice, predictable value, but unless you sanitize the values you can't assume anything. For example, one page I saw had its language metadata set to a random string 8 kB long with various control chars and '\0'-s.
* Again, if you don't sanitize and control the total number of unique values in the source field, you could end up with a number of fields approaching infinity, and Solr would melt down.
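Both points above can be sketched together: reject values that are not short alphabetic strings, and cap how many distinct values are ever accepted. The class name `BoundedLangValues` and its limits are assumptions chosen for illustration; suitable limits depend on the deployment.

```python
import re


class BoundedLangValues:
    """Sanitizes lang values and caps how many distinct ones are ever accepted."""

    def __init__(self, max_distinct=50, max_length=8, fallback="unknown"):
        self.max_distinct = max_distinct
        self.max_length = max_length
        self.fallback = fallback
        self.seen = set()

    def clean(self, raw):
        if not isinstance(raw, str):
            return self.fallback
        value = raw.strip().lower()
        # Reject empty, oversized, or non-alphabetic values (control chars, '\0', ...).
        if not value or len(value) > self.max_length or not re.fullmatch(r"[a-z]+", value):
            return self.fallback
        if value not in self.seen:
            if len(self.seen) >= self.max_distinct:
                return self.fallback  # cap reached: no new field names
            self.seen.add(value)
        return value


langs = BoundedLangValues(max_distinct=2)
results = [langs.clean(v) for v in ["en", "FR", "x\x00y" * 2000, "de", "en"]]
```

In this sketch the number of generated fields can never exceed `max_distinct` plus one for the fallback, regardless of what the crawled pages declare. A fixed whitelist of languages would give the same guarantee with more predictable results.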
[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping
[ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923919#action_12923919 ]

Markus Jelsma commented on NUTCH-923:

Andrzej is right. The LanguageIndexingFilter can return a value based on the value found in the HTTP header, which can be garbage. But shouldn't the filter itself make sure that either `unknown` or a valid ISO-639-2 value is set? This way client code can safely rely on the value of the lang field instead of sanitizing it. If more components come along that do something with the lang field, must they also sanitize it on their own?
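A minimal sketch of normalizing the value once, at the source, so that downstream components can rely on it. `normalize_lang` is a hypothetical function, written in Python for illustration rather than as a change to LanguageIndexingFilter. It only checks the shape of the value (a two- or three-letter code, optionally followed by subtags such as a region); validating against the actual list of ISO 639 codes would require a lookup table that is not shown here.

```python
import re

# Assumption: a value counts as well-formed if it is a two- or three-letter
# code, optionally followed by subtags such as a region ("en-US", "pt_BR").
_LANG_TAG = re.compile(r"([A-Za-z]{2,3})(?:[-_][A-Za-z0-9]{2,8})*")


def normalize_lang(raw):
    """Return a lowercase language code, or 'unknown' if the value looks invalid."""
    if not isinstance(raw, str):
        return "unknown"
    match = _LANG_TAG.fullmatch(raw.strip())
    return match.group(1).lower() if match else "unknown"
```

For example, `normalize_lang("en-US")` yields `en`, while a header value full of control characters yields `unknown`.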