[
https://issues.apache.org/jira/browse/NUTCH-994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-994:
--------------------------------
Attachment: NUTCH-994-all.patch
This patch:
* changes non-analyzed field types to their Trie-based equivalents. No high
precision is used because few or no range queries are expected on data
generated by Nutch.
* removes RemoveDuplicatesTokenFilterFactory from the URL field type. There is
no stemmer involved that can blow up TF/IDF.
* adds a cc field for the creativecommons plugin. I'm not sure whether it
should be tokenized to allow for more flexible search.
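As a rough sketch, the three changes above could look something like this in
schema.xml. The type names, attribute values, the url analyzer chain and the
multiValued setting are assumptions for illustration and are not copied from
the attached patch. precisionStep="0" is the Solr setting that disables
indexing at additional precision levels, and the "string" type is assumed to
be a solr.StrField defined elsewhere in the schema.

```xml
<!-- Illustrative only: names and attributes are assumptions, not the patch. -->

<!-- Trie-based numeric and date types; precisionStep="0" indexes no extra
     precision terms, which suits fields that rarely see range queries. -->
<fieldType name="long" class="solr.TrieLongField" precisionStep="0"
    omitNorms="true" positionIncrementGap="0"/>
<fieldType name="float" class="solr.TrieFloatField" precisionStep="0"
    omitNorms="true" positionIncrementGap="0"/>
<fieldType name="date" class="solr.TrieDateField" precisionStep="0"
    omitNorms="true" positionIncrementGap="0"/>

<!-- URL type without RemoveDuplicatesTokenFilterFactory (assumed chain). -->
<fieldType name="url" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"/>
  </analyzer>
</fieldType>

<!-- cc field for the creativecommons plugin, untokenized in this sketch. -->
<field name="cc" type="string" stored="true" indexed="true"
    multiValued="true"/>
```

Switching the cc field to a tokenized text type would be a one-attribute
change if more flexible search turns out to be wanted.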
For clarity I have added the fields created by the plugins that come with the
release. I haven't found any in parse-swf. I also didn't add fields from the
urlmeta plugin since it is unclear which field names it will produce.
I also didn't add the tag field for the microformats-reltag plugin; it collides
with the same field name used by the feed plugin. Any thoughts on this? What
should we change?
I'd still like to change date fields that do not use the date field type to use
a proper date field type. This depends on NUTCH-985, the same goes for the feed
plugin, if we still want to ship it in the release (julian?).
I kept the 80-column `wordwrap` although it fills up less than half of my
screens ;)
> Fine tune Solr schema
> ---------------------
>
> Key: NUTCH-994
> URL: https://issues.apache.org/jira/browse/NUTCH-994
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Affects Versions: 1.3, 2.0
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.3, 2.0
>
> Attachments: NUTCH-994-all.patch
>
>
> The supplied schema is old and doesn't use more advanced fieldTypes such as
> Trie based (since Solr 1.4) and perhaps other improvements. We need to fine
> tune the schema.