[ https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534869 ]
Enis Soztutar commented on NUTCH-442:
-------------------------------------

Using Nutch with Solr has been a much-requested feature, so it will be very useful when this makes it into trunk. I have spent some time reviewing the patch, which I find quite elegant. Some improvements to the patch would be:
- make NutchDocument implement VersionedWritable instead of Writable, and delegate version checking to the superclass
- refactor the getDetails() methods in HitDetailer to Searcher (it is not likely that a class would implement Searcher but not HitDetailer)
- use Searcher; delete HitDetailer and SearchBean
- rename the XXXBean classes so that they do not include "bean" (I think it is confusing to have bean objects that have non-trivial functionality)
- refactor LuceneSearchBean.VERSION to RPCSearchBean
- remove unrelated changes from the patch (the changes in NGramProfile, HTMLLanguageParser, LanguageIdentifier, ...; correct me if I'm wrong)

As far as I can see, we do not need any metadata for the Solr backend, and only need the Store, Index and Vector options for the Lucene backend, so I think we can simplify NutchDocument#metadata. We may implement:
{code}
class FieldMeta {
  o.a.l.document.Field.Store store;
  o.a.l.document.Field.Index index;
  o.a.l.document.Field.TermVector tv;
}

FieldMeta[] IndexingFilter.getFields();

class NutchDocument {
  ...
  private ArrayList<Field> fieldMeta;
  ...
}
{code}
Alternatively, we may wish to keep the add methods of NutchDocument compatible with o.a.l.document.Document, keeping the metadata up to date as we add new fields, using this info in LuceneWriter but ignoring it in SolrWriter. This will be slightly slower, but the API will be much more intuitive.
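To make the second alternative more concrete, here is a minimal, self-contained sketch of the idea. It is not code from the patch. The Store, Index and TermVector enums are hypothetical stand-ins for the o.a.l.document.Field options so that the example compiles without Lucene, and the add() signature, LuceneWriterSketch and SolrWriterSketch are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for o.a.l.document.Field.Store / Index / TermVector,
// so the sketch compiles without Lucene on the classpath.
enum Store { YES, NO }
enum Index { NO, TOKENIZED, UN_TOKENIZED }
enum TermVector { NO, YES }

// Per-field indexing options, following the FieldMeta idea in the comment.
final class FieldMeta {
    final Store store;
    final Index index;
    final TermVector tv;

    FieldMeta(Store store, Index index, TermVector tv) {
        this.store = store;
        this.index = index;
        this.tv = tv;
    }

    @Override
    public String toString() {
        return store + "/" + index + "/" + tv;
    }
}

// Simplified document: add() records the metadata as each field is added.
class NutchDocument {
    private final Map<String, List<String>> fields = new LinkedHashMap<>();
    private final Map<String, FieldMeta> fieldMeta = new LinkedHashMap<>();

    void add(String name, String value, Store store, Index index, TermVector tv) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        // The most recent options for a field name win in this sketch.
        fieldMeta.put(name, new FieldMeta(store, index, tv));
    }

    Map<String, List<String>> getFields() { return fields; }

    FieldMeta getMeta(String name) { return fieldMeta.get(name); }
}

// Lucene-style writer: consults the metadata for every field value.
class LuceneWriterSketch {
    List<String> write(NutchDocument doc) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : doc.getFields().entrySet()) {
            FieldMeta meta = doc.getMeta(e.getKey());
            for (String value : e.getValue()) {
                out.add(e.getKey() + "=" + value + " [" + meta + "]");
            }
        }
        return out;
    }
}

// Solr-style writer: ignores the metadata (the Solr schema decides instead).
class SolrWriterSketch {
    List<String> write(NutchDocument doc) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : doc.getFields().entrySet()) {
            for (String value : e.getValue()) {
                out.add(e.getKey() + "=" + value);
            }
        }
        return out;
    }
}

class FieldMetaSketch {
    public static void main(String[] args) {
        NutchDocument doc = new NutchDocument();
        doc.add("title", "Hello Nutch", Store.YES, Index.TOKENIZED, TermVector.NO);
        doc.add("url", "http://example.org/", Store.YES, Index.UN_TOKENIZED, TermVector.NO);
        System.out.println(new LuceneWriterSketch().write(doc));
        System.out.println(new SolrWriterSketch().write(doc));
    }
}
```

In this sketch the per-field bookkeeping happens on every add() call, which is where the "slightly slower" cost would come from, while callers keep an API shaped like the Lucene one.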
> Integrate Solr/Nutch
> --------------------
>
>                 Key: NUTCH-442
>                 URL: https://issues.apache.org/jira/browse/NUTCH-442
>             Project: Nutch
>          Issue Type: New Feature
>         Environment: Ubuntu linux
>            Reporter: rubdabadub
>         Attachments: NUTCH_442_v3.patch, RFC_multiple_search_backends.patch, schema.xml
>
>
> Hi:
> I tried out Sami's patch for Solr/Nutch, which can be found here
> (http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html),
> and I can confirm it worked :-) That led me to request the following:
> I would be very grateful if this could be included in Nutch 0.9, as I
> am trying to eliminate my Python-based crawler, which posts documents to Solr.
> As I am in a corporate environment, I can't install the trunk version in the
> production environment, so I am asking for this to be included in the 0.9 release.
> I hope my wish will be granted.
> I look forward to getting some feedback.
> Thank you.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.