[
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13758358#comment-13758358
]
Tom Hill commented on NUTCH-1517:
---------------------------------
I've attached a patch that adds CloudSearch as a pluggable indexing back-end.
A slightly verbose description of how to test:
1. Create a CloudSearch domain
Note the document endpoint.
I created the following fields in the domain:
anchor          Active  text     (Result)
author          Active  literal  (Search Result)
boost           Active  literal  (Search Result)
cache           Active  literal  (Search Result)
content         Active  text     (Result)
content_length  Active  literal  (Search Result)
digest          Active  literal  (Search Result)
feed            Active  literal  (Search Result)
host            Active  literal  (Search Result)
id              Active  literal  (Search Result)
lang            Active  literal  (Search Result)
published_date  Active  uint     ()
segment         Active  literal  (Search Result)
subcollection   Active  literal  (Search Result)
tag             Active  literal  (Search Result)
text            Active  text     (Result)
title           Active  text     (Result)
tstamp          Active  uint     ()
type            Active  literal  (Search Result)
updated_date    Active  uint     ()
url             Active  text     (Result)
2. Checkout nutch
git clone https://github.com/apache/nutch
3. Switch to 1.7 branch
git checkout -t origin/branch-1.7
4. Apply attached patch
I created it with: git diff remotes/origin/branch-1.7 --no-prefix > indexer-cloudsearch.patch
Applied it with: patch -p0 -i ~/code/nutch/indexer-cloudsearch.patch
5. Edit conf/nutch-default.xml
Add the document endpoint under the cloudsearch parameters (add http:// on the front and /2011-02-01/documents/batch on the end).
Change the line with "indexer-solr" to "indexer-cloudsearch". A sketch of both edits follows this step.
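For concreteness, a minimal sketch of the two edits. The endpoint property name below is a placeholder (check the patch for the actual cloudsearch key it reads); the plugin.includes value is the stock 1.7 default with indexer-solr swapped for indexer-cloudsearch.

    <!-- placeholder name: use whatever cloudsearch.* key the patch defines -->
    <property>
      <name>cloudsearch.endpoint</name>
      <value>http://doc-mydomain-xxxxxxxxxxxxxxxxxxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com/2011-02-01/documents/batch</value>
    </property>

    <!-- swap indexer-solr for indexer-cloudsearch in plugin.includes -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-cloudsearch|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>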
6. Build nutch
Just run "ant" in the top directory.
This builds the "runtime" directory, with "local" under it.
7. cd to nutch/runtime/local
8. Do step three of the tutorial at http://wiki.apache.org/nutch/NutchTutorial
1) You've done step #1 already.
2) Step 2 I didn't have to do; it was all correct already.
3) Do step 3, but stop before 3.1.
a) Then do this: bin/nutch crawl urls -dir crawl -depth 3 -topN 5
b) Skip 3.2 through 5.x.
4) Skip tutorial step 4.
5) Skip tutorial step 5.
6) Do parts of step 6:
Check that the domain is ready.
Then just run this one line:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
Don't worry about the URL; it's ignored. The real URL comes from nutch-default.xml (set above).
(This is a hack, since I'm not sure how to integrate properly. Hopefully someone can help here.)
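As an optional sanity check (not in the original steps), you can query the domain's search endpoint directly. The hostname below is a placeholder, and the 2011-02-01 query syntax (q plus return-fields) is assumed:

    curl "http://search-mydomain-xxxxxxxxxxxxxxxxxxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com/2011-02-01/search?q=nutch&return-fields=title,url"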
9. Check logs/hadoop.log
It should show the adds sent to CloudSearch. Errors show up there, too.
You might have to set the logging level to INFO in nutch/runtime/local/conf/log4j.properties, e.g. as sketched below.
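A minimal sketch of that log4j.properties tweak, assuming the plugin logs under the org.apache.nutch package (adjust the logger name if the patch logs elsewhere):

    # make sure Nutch classes log at INFO so the CloudSearch adds are visible
    log4j.logger.org.apache.nutch=INFO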
> CloudSearch indexer
> -------------------
>
> Key: NUTCH-1517
> URL: https://issues.apache.org/jira/browse/NUTCH-1517
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Reporter: Julien Nioche
> Fix For: 1.9
>
> Attachments: 0023883254_1377197869_indexer-cloudsearch.patch
>
>
> Once we have made the indexers pluggable, we should add a plugin for Amazon
> CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a
> JSON-based representation, the Search Data Format (SDF), which we could reuse
> for a file-based indexer.