[jira] Commented: (NUTCH-442) Integrate Solr/Nutch

Enis Soztutar (JIRA) Mon, 15 Oct 2007 08:34:20 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534869
 ]


Enis Soztutar commented on NUTCH-442:
-------------------------------------

Using Nutch with Solr has been a much-requested feature, so it will be very 
useful when this makes it into trunk. I have spent some time reviewing the 
patch, which I find quite elegant. 

Some improvements to the patch would be:
- make NutchDocument implement VersionedWritable instead of Writable, and 
delegate version checking to the superclass
- move the getDetails() methods from HitDetailer into Searcher (it is not 
likely that a class would implement Searcher but not HitDetailer)
- use Searcher, and delete HitDetailer and SearchBean
- rename the XXXBean classes so that they do not include "bean" (I think it is 
confusing to have bean objects that have non-trivial functionality)
- move LuceneSearchBean.VERSION to RPCSearchBean
- remove unrelated changes from the patch (the changes in NGramProfile, 
HTMLLanguageParser, LanguageIdentifier, ...; correct me if I'm wrong)
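
To illustrate the first point, here is a minimal, self-contained sketch of the delegation pattern. VersionedWritableSketch is a hypothetical stand-in that mimics what Hadoop's VersionedWritable does (write a version byte first, check it on read), so the example compiles without Hadoop on the classpath. The NutchDocumentSketch class, its single url field and the toBytes/fromBytes helpers are assumptions for illustration, not the layout used in the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for org.apache.hadoop.io.VersionedWritable.
// The real class throws VersionMismatchException (an IOException subclass).
abstract class VersionedWritableSketch {
    public abstract byte getVersion();

    public void write(DataOutput out) throws IOException {
        out.writeByte(getVersion());          // version byte always goes first
    }

    public void readFields(DataInput in) throws IOException {
        byte found = in.readByte();
        if (found != getVersion()) {
            throw new IOException("Version mismatch: expected "
                + getVersion() + ", found " + found);
        }
    }
}

// Assumed document shape: one field is enough to show the delegation.
class NutchDocumentSketch extends VersionedWritableSketch {
    private static final byte VERSION = 1;
    private String url = "";

    NutchDocumentSketch() {}
    NutchDocumentSketch(String url) { this.url = url; }

    @Override public byte getVersion() { return VERSION; }

    @Override public void write(DataOutput out) throws IOException {
        super.write(out);                     // superclass writes the version
        out.writeUTF(url);
    }

    @Override public void readFields(DataInput in) throws IOException {
        super.readFields(in);                 // superclass checks the version
        url = in.readUTF();
    }

    String getUrl() { return url; }

    // Illustration-only helpers for a serialization round trip.
    static byte[] toBytes(NutchDocumentSketch doc) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            doc.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static NutchDocumentSketch fromBytes(byte[] bytes) {
        try {
            NutchDocumentSketch doc = new NutchDocumentSketch();
            doc.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return doc;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this arrangement the document class no longer carries its own version check; it only declares its version and calls super in write() and readFields().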

As far as I can see, we do not need any metadata for the Solr backend, and 
only need the Store, Index and TermVector options for the Lucene backend, so I 
think we can simplify NutchDocument#metadata. We may implement:
{code}
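// Per-field options; only the Lucene backend needs them.
class FieldMeta {
  o.a.l.document.Field.Store store;
  o.a.l.document.Field.Index index;
  o.a.l.document.Field.TermVector tv;
}

// Proposed addition to the IndexingFilter interface.
FieldMeta[] IndexingFilter.getFields();

class NutchDocument {
  ...
  private ArrayList<FieldMeta> fieldMeta;
  ...
}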

{code}

Or, alternatively, we may wish to keep the add methods of NutchDocument 
compatible with o.a.l.document.Document, keeping the metadata up to date as we 
add new fields, using this info in LuceneWriter but ignoring it in SolrWriter. 
This will be slightly slower, but the API will be much more intuitive. 
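
A rough sketch of this second alternative, under stated assumptions: the Store, Index and TermVector enums below are simplified stand-ins for the Lucene Field options, and MetaTrackingDocument, LuceneWriterSketch and SolrWriterSketch are hypothetical names used only to show where the metadata is recorded and who reads it. The two writers just format strings here; they do not talk to Lucene or Solr.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for o.a.l.document.Field.Store / Index / TermVector.
enum Store { YES, NO }
enum Index { NO, TOKENIZED, UN_TOKENIZED }
enum TermVector { NO, YES }

final class FieldMeta {
    final Store store;
    final Index index;
    final TermVector tv;

    FieldMeta(Store store, Index index, TermVector tv) {
        this.store = store;
        this.index = index;
        this.tv = tv;
    }
}

// Hypothetical document whose add() takes Lucene-style options and
// records them as metadata while the field is being added.
class MetaTrackingDocument {
    private final Map<String, List<String>> fields = new LinkedHashMap<>();
    private final Map<String, FieldMeta> fieldMeta = new LinkedHashMap<>();

    void add(String name, String value, Store store, Index index, TermVector tv) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        fieldMeta.put(name, new FieldMeta(store, index, tv)); // last call wins
    }

    Map<String, List<String>> getFields() { return fields; }
    FieldMeta getMeta(String name) { return fieldMeta.get(name); }
}

// The Lucene-side writer consults the metadata for every field.
class LuceneWriterSketch {
    static List<String> describe(MetaTrackingDocument doc) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : doc.getFields().entrySet()) {
            FieldMeta m = doc.getMeta(e.getKey());
            for (String value : e.getValue()) {
                lines.add(e.getKey() + "=" + value
                    + " [" + m.store + "/" + m.index + "/" + m.tv + "]");
            }
        }
        return lines;
    }
}

// The Solr-side writer ignores the metadata and sends names and values only.
class SolrWriterSketch {
    static List<String> describe(MetaTrackingDocument doc) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : doc.getFields().entrySet()) {
            for (String value : e.getValue()) {
                lines.add(e.getKey() + "=" + value);
            }
        }
        return lines;
    }
}
```

The extra map update on every add() is the small cost mentioned above; in exchange, callers use the same add signature regardless of the backend.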

> Integrate Solr/Nutch
> --------------------
>
>                 Key: NUTCH-442
>                 URL: https://issues.apache.org/jira/browse/NUTCH-442
>             Project: Nutch
>          Issue Type: New Feature
>         Environment: Ubuntu linux
>            Reporter: rubdabadub
>         Attachments: NUTCH_442_v3.patch, RFC_multiple_search_backends.patch, 
> schema.xml
>
>
> Hi:
> I tried out Sami's patch regarding Solr/Nutch, which can be found here 
> (http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html),
> and I can confirm it worked :-) That led me to request the following:
> I would be very, very grateful if this could be included in Nutch 0.9, as I 
> am trying to eliminate my Python-based crawler, which posts documents to Solr. 
> As I am in a corporate environment, I can't install the trunk version in the 
> production environment, so I am asking for this to be included in the 0.9 
> release. I hope my wish will be granted.
> I look forward to getting some feedback.
> Thank you.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
