[ 
https://issues.apache.org/jira/browse/NUTCH-994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-994:
--------------------------------

    Attachment: NUTCH-994-all.patch

This patch:
* changes non-analyzed field types to their Trie-based equivalents. No high 
precision is used because few or no range queries are expected on data 
generated by Nutch.
* removes RemoveDuplicatesTokenFilterFactory from the URL field type. There is 
no stemmer involved that can blow up TF/IDF.
* adds a cc field for the creativecommons plugin. Not sure whether it should be 
tokenized to allow for more flexible search.
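
As a rough sketch of what changes like these can look like in schema.xml: the 
fragment below is illustrative only and is not copied from the attached patch. 
The type names, the choice of precisionStep="0" (one reading of "no high 
precision"), and the attributes on the cc field are all assumptions; it also 
assumes the stock "string" type is defined elsewhere in the schema.

```xml
<!-- Illustrative only: names and attributes are assumptions,
     not taken from the attached patch. -->
<types>
  <!-- precisionStep="0" indexes one term per value, with no extra
       precision levels; this suits fields that are rarely used in
       range queries. -->
  <fieldType name="long" class="solr.TrieLongField"
             precisionStep="0" omitNorms="true"
             positionIncrementGap="0"/>
  <fieldType name="float" class="solr.TrieFloatField"
             precisionStep="0" omitNorms="true"
             positionIncrementGap="0"/>
  <fieldType name="date" class="solr.TrieDateField"
             precisionStep="0" omitNorms="true"
             positionIncrementGap="0"/>
</types>
<fields>
  <!-- Hypothetical cc field, kept as an untokenized string here;
       a tokenized text type would allow more flexible search. -->
  <field name="cc" type="string" stored="true" indexed="true"
         multiValued="true"/>
</fields>
```

A larger precisionStep value would only matter if range queries on these fields 
became common, at the cost of more indexed terms per value.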

For clarity I have added the fields created by the plugins that come with the 
release. I haven't found any in parse-swf. I also didn't add fields from the 
urlmeta plugin, since it is unclear which field names it produces.

I also didn't add the tag field for the microformats-reltag plugin, because it 
collides with the same field name used by the feed plugin. Any thoughts on 
this? Which one should change?

I'd still like to change date fields that do not use the date field type to use 
a proper date field type. This depends on NUTCH-985. The same goes for the feed 
plugin, if we still want to ship it in the release (julian?).

I kept the 80-column `wordwrap` although it fills less than half of my 
screen ;)

> Fine tune Solr schema
> ---------------------
>
>                 Key: NUTCH-994
>                 URL: https://issues.apache.org/jira/browse/NUTCH-994
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.3, 2.0
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.3, 2.0
>
>         Attachments: NUTCH-994-all.patch
>
>
> The supplied schema is old and doesn't use more advanced fieldTypes such as 
> Trie based (since Solr 1.4) and perhaps other improvements. We need to fine 
> tune the schema.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
