[
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13758358#comment-13758358
]
Tom Hill commented on NUTCH-1517:
---------------------------------
I've attached a patch that adds CloudSearch as a pluggable indexing back-end.
A slightly verbose description of how to test:
1. Create a CloudSearch domain
Note the document endpoint.
I created the following fields in the domain:
anchor          Active  text     (Result)
author          Active  literal  (Search Result)
boost           Active  literal  (Search Result)
cache           Active  literal  (Search Result)
content         Active  text     (Result)
content_length  Active  literal  (Search Result)
digest          Active  literal  (Search Result)
feed            Active  literal  (Search Result)
host            Active  literal  (Search Result)
id              Active  literal  (Search Result)
lang            Active  literal  (Search Result)
published_date  Active  uint     ()
segment         Active  literal  (Search Result)
subcollection   Active  literal  (Search Result)
tag             Active  literal  (Search Result)
text            Active  text     (Result)
title           Active  text     (Result)
tstamp          Active  uint     ()
type            Active  literal  (Search Result)
updated_date    Active  uint     ()
url             Active  text     (Result)
2. Checkout nutch
git clone https://github.com/apache/nutch
3. Switch to 1.7 branch
git checkout -t origin/branch-1.7
4. Apply attached patch
I created it with: git diff remotes/origin/branch-1.7 --no-prefix > indexer-cloudsearch.patch
Applied it with: patch -p0 -i ~/code/nutch/indexer-cloudsearch.patch
5. Edit conf/nutch-default.xml
Add the document endpoint under the cloudsearch parameters (add http:// on the front and /2011-02-01/documents/batch on the end).
Change the line with "indexer-solr" to "indexer-cloudsearch". A sketch of both edits follows this step.
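For concreteness, a minimal sketch of the two edits. The endpoint property name below is a placeholder (check the patch for the actual cloudsearch key it reads); the plugin.includes value is the stock 1.7 default with indexer-solr swapped for indexer-cloudsearch.

    <!-- placeholder name: use whatever cloudsearch.* key the patch defines -->
    <property>
      <name>cloudsearch.endpoint</name>
      <value>http://doc-mydomain-xxxxxxxxxxxxxxxxxxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com/2011-02-01/documents/batch</value>
    </property>

    <!-- swap indexer-solr for indexer-cloudsearch in plugin.includes -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-cloudsearch|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>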
6. Build nutch
Just run "ant" in the top directory.
This builds the "runtime" directory, with "local" under it.
7. cd to nutch/runtime/local
8. Do step three of the tutorial at http://wiki.apache.org/nutch/NutchTutorial
1) You've done step #1 already.
2) Step 2 I didn't have to do; it was all correct already.
3) Do step 3, but stop before 3.1.
a) Then do this: bin/nutch crawl urls -dir crawl -depth 3 -topN 5
b) Skip 3.2 through 5.x.
4) Skip tutorial step 4.
5) Skip tutorial step 5.
6) Do parts of step 6:
Check that the domain is ready.
Then just run this one line:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
Don't worry about the URL; it's ignored. The real URL comes from nutch-default.xml (set above).
(This is a hack, since I'm not sure how to integrate properly. Hopefully someone can help here.)
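As an optional sanity check (not in the original steps), you can query the domain's search endpoint directly. The hostname below is a placeholder, and the 2011-02-01 query syntax (q plus return-fields) is assumed:

    curl "http://search-mydomain-xxxxxxxxxxxxxxxxxxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com/2011-02-01/search?q=nutch&return-fields=title,url"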
9. Check logs/hadoop.log
It should show the adds sent to CloudSearch. Errors show up there, too.
You might have to set the logging level to INFO in nutch/runtime/local/conf/log4j.properties, e.g. as sketched below.
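A minimal sketch of that log4j.properties tweak, assuming the plugin logs under the org.apache.nutch package (adjust the logger name if the patch logs elsewhere):

    # make sure Nutch classes log at INFO so the CloudSearch adds are visible
    log4j.logger.org.apache.nutch=INFO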
> CloudSearch indexer
> -------------------
>
> Key: NUTCH-1517
> URL: https://issues.apache.org/jira/browse/NUTCH-1517
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Reporter: Julien Nioche
> Fix For: 1.9
>
> Attachments: 0023883254_1377197869_indexer-cloudsearch.patch
>
>
> Once we have made the indexers pluggable, we should add a plugin for Amazon
> CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a
> JSON-based representation, the Search Data Format (SDF), which we could reuse
> for a file-based indexer.