[jira] [Commented] (NUTCH-923) Multilingual support for Solr-index-mapping

2012-08-03 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428003#comment-13428003
 ] 

Luca Cavanna commented on NUTCH-923:


That's brilliant. Thanks Markus for your insight.

 Multilingual support for Solr-index-mapping
 ---

 Key: NUTCH-923
 URL: https://issues.apache.org/jira/browse/NUTCH-923
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.2
Reporter: Matthias Agethle
Assignee: Markus Jelsma
Priority: Minor
 Attachments: patch-923-nutch-release-1.2.txt


 It would be useful to extend the mapping possibilities when indexing to Solr.
 One useful feature would be to use the detected language of the HTML page 
 (for example via the language-identifier plugin) and send the content to 
 corresponding language-aware Solr fields.
 The mapping file could be as follows:
 <field dest="lang" source="lang"/>
 <field dest="title_${lang}" source="title"/>
 so that the title field gets mapped to title_en for English pages and 
 title_fr for French pages.
 What do you think? Could this be useful also to others?
 Or are there already other solutions out there?
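
A minimal sketch of the proposed placeholder expansion, written in Python for brevity rather than in Nutch's Java. The names MAPPINGS, expand_dest and map_document are hypothetical and not part of Nutch or of the attached patch; the sketch only illustrates how a dest such as title_${lang} could be resolved per document.

```python
import re

# Hypothetical mapping entries mirroring the proposal: (dest template, source field).
MAPPINGS = [
    ("lang", "lang"),
    ("title_${lang}", "title"),
]

# Matches placeholders of the form ${name}.
_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def expand_dest(template, doc):
    """Replace each ${name} in the template with the value of doc[name]."""
    return _PLACEHOLDER.sub(lambda m: str(doc[m.group(1)]), template)


def map_document(doc, mappings=MAPPINGS):
    """Copy each source field that is present to its expanded destination field."""
    return {
        expand_dest(dest, doc): doc[source]
        for dest, source in mappings
        if source in doc
    }
```

A document without a lang value raises KeyError in this sketch; a real implementation would have to decide on a default destination for that case.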

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-923) Multilingual support for Solr-index-mapping

2012-05-16 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276954#comment-13276954
 ] 

Markus Jelsma commented on NUTCH-923:
-

Solr now has a LangId request processor on board that can both detect languages 
and send values to the proper field. You can either do language identification 
in Nutch or delegate it to Solr.

https://wiki.apache.org/solr/LanguageDetection





[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

2011-01-28 Thread bronco (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988005#action_12988005
 ] 

bronco commented on NUTCH-923:
--

This is a really useful feature, and it matches the Drupal Apache Solr 
multilanguage module. So how can I implement it? At the moment I only need 
English and German, so can we restrict it to these languages at first and, 
if the language identifier is not clear, just use a fallback language like 
English?

At the very least, subscribing: +1
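
The restriction asked for here could look roughly like the sketch below (Python for brevity). ALLOWED_LANGS, FALLBACK_LANG and resolve_lang are hypothetical names chosen for illustration, not Nutch or Solr settings; the assumption is that the language identifier hands over a short code such as "en" or "de", or nothing at all.

```python
# Assumption: only these two languages get their own fields for now.
ALLOWED_LANGS = {"en", "de"}
# Assumption: English is the fallback when detection is unclear or unsupported.
FALLBACK_LANG = "en"


def resolve_lang(detected):
    """Return the detected language if it is allowed, otherwise the fallback."""
    if detected is None:
        return FALLBACK_LANG
    code = detected.strip().lower()
    return code if code in ALLOWED_LANGS else FALLBACK_LANG
```

With a fixed allow-list the set of generated field names is known in advance (here title_en and title_de), so both can be declared explicitly in the Solr schema.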




[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

2010-10-23 Thread Matthias Agethle (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924145#action_12924145
 ] 

Matthias Agethle commented on NUTCH-923:


What about querying Solr for the configured fields (perhaps one can do this 
using LukeRequestHandler, I'm not sure)?
When sending data to Solr, one could check whether each field exists in the 
Solr schema; if not, don't add the field and log a warning.
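
A sketch of that check in Python, assuming the set of schema field names has already been fetched from Solr by some means (the fetch itself is left out, since the exact request is uncertain here). filter_known_fields and the logger name are hypothetical.

```python
import logging

log = logging.getLogger("solr-mapping")


def filter_known_fields(doc, schema_fields):
    """Keep only fields declared in the schema; log a warning for each dropped one."""
    kept = {}
    for name, value in doc.items():
        if name in schema_fields:
            kept[name] = value
        else:
            log.warning("Dropping field %r: not present in the Solr schema", name)
    return kept
```

This only compares against explicitly declared fields; it says nothing about names that would match a dynamic field pattern.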

The other thing that comes to mind is: what are valid field names in Solr? 
Obviously letters, numbers and so on, but is there any validation in Solr?
One could use this to check whether a dynamically generated field name is 
compliant with Solr (and in this way exclude control characters in field 
names, as Andrzej mentioned).
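
A client-side name check could be sketched as follows (Python for brevity). The pattern is an assumption based on the naming recommendation in the Solr documentation (alphanumeric or underscore characters, not starting with a digit); whether Solr itself rejects other names is exactly the open question above. is_valid_field_name is a hypothetical helper.

```python
import re

# Assumption: letters, digits and underscores only, not starting with a digit.
_FIELD_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_field_name(name):
    """Return True if the whole name matches the conservative pattern above."""
    return _FIELD_NAME.fullmatch(name) is not None
```

fullmatch is used so that a trailing newline or control character cannot slip past the end of the pattern.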




[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

2010-10-23 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924154#action_12924154
 ] 

Andrzej Bialecki commented on NUTCH-923:
-

This doesn't solve the problem of a potentially unbounded number of fields. 
Compliance is one thing, and you can clean up field names from invalid 
characters, but sanity is another: if you have {{title_*}} in your Solr 
schema, then theoretically you are allowed to create an unlimited number of 
fields with this prefix, and Solr won't complain.
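
One way to keep the number of generated fields bounded is a hard cap on distinct suffixes, sketched below in Python. FieldBudget, its parameters and the "unknown" overflow suffix are hypothetical and not taken from Nutch or the patch; the state is per process, so in a distributed indexing job a fixed allow-list of languages would likely be the simpler control.

```python
class FieldBudget:
    """Hand out prefixed field names, allowing at most max_fields distinct suffixes."""

    def __init__(self, prefix, max_fields, fallback_suffix="unknown"):
        self.prefix = prefix
        self.max_fields = max_fields
        self.fallback_suffix = fallback_suffix
        self._seen = set()

    def field_for(self, suffix):
        """Return prefix+suffix while the budget lasts, else the fallback field."""
        if suffix in self._seen or len(self._seen) < self.max_fields:
            self._seen.add(suffix)
            return self.prefix + suffix
        return self.prefix + self.fallback_suffix
```

Suffixes that arrive after the cap is reached all land in a single overflow field, so the total number of fields stays at most max_fields plus one.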




[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

2010-10-22 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923879#action_12923879
 ] 

Markus Jelsma commented on NUTCH-923:
-

This is a very useful feature. +1




[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

2010-10-22 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923896#action_12923896
 ] 

Andrzej Bialecki commented on NUTCH-923:
-

This sounds useful, though the implementation needs to keep the following in 
mind:
* you _assume_ that the lang field will have a nice predictable value, but 
unless you sanitize the values you can't assume anything. For example, one 
page I saw had its language metadata set to a random 8 kB string with various 
control chars and '\0'-s.

* again, if you don't sanitize and control the total number of unique values in 
the source field, you could end up with a number of fields approaching 
infinity, and Solr would melt down...
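
The sanitizing step described in the first point could be sketched like this (Python for brevity). sanitize_lang, MAX_RAW_LENGTH and the two-to-three-letter shape check are assumptions made for illustration; the check only constrains the shape of the value and does not verify that it is a registered language code.

```python
import re

# Assumption: a usable code is two or three lowercase ASCII letters.
_LANG_CODE = re.compile(r"[a-z]{2,3}")
# Assumption: anything longer than this is treated as garbage outright.
MAX_RAW_LENGTH = 32


def sanitize_lang(raw):
    """Reduce raw language metadata to a short code, or "unknown" if it looks wrong."""
    if not isinstance(raw, str) or len(raw) > MAX_RAW_LENGTH:
        return "unknown"
    # Keep only the primary part of values such as "en-US" or "en_US".
    primary = re.split(r"[-_]", raw.strip().lower(), maxsplit=1)[0]
    return primary if _LANG_CODE.fullmatch(primary) else "unknown"
```

An 8 kB string is rejected by the length check, and a value containing control characters fails the pattern, so both end up as "unknown".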




[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

2010-10-22 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923919#action_12923919
 ] 

Markus Jelsma commented on NUTCH-923:
-

Andrzej is right. The LanguageIndexingFilter can return a value based on the 
value found in the HTTP header, which can be garbage. But shouldn't the 
filter itself make sure that either `unknown` or a valid ISO-639-2 value is set?

This way client code can safely rely on the value of the lang field instead of 
sanitizing it. What if more components come along that do something with the 
lang field? Must they also sanitize on their own?
