Thank you Dave, very helpful. -Ledio
-----Original Message-----
From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 20, 2005 7:24 AM
To: [email protected]
Subject: RE: [Nutch-dev] distributed search

Hi Rafi,

Not sure if anyone answered this, but I think you're just after the segslice command:

$ nutch segslice
SegmentSlicer (-local | -ndfs <namenode:port>) -o outputDir [-max count] [-fix]
  [-nocontent] [-noparsedata] [-noparsetext]
  [-filterUrlBy (+|-)perl5pattern] [-logLevel logLevel]
  (-dir segments | seg1 seg2 ...)

NOTE: at least one segment dir name is required, or the '-dir' option.
      outputDir is always required.

  -o outputDir   output directory for segments
  -max count     (optional) output multiple segments, each with a maximum of 'count' entries
  -fix           (optional) automatically fix corrupted segments
  -nocontent     (optional) ignore content data
  -noparsedata   (optional) ignore parse_data data
  -noparsetext   (optional) ignore parse_text data
  -filterUrlBy   (optional) filter entries by matching their URLs against a perl5 pattern.
                 Prefix '+' means: default to skip, match to save.
                 Prefix '-' means: default to save, match to skip.
                 If no pattern is given, no filtering is done (all entries are saved).
  -logLevel      (optional) logging level
  -dir segments  directory containing multiple segments
  seg1 seg2 ...  segment directories

HTH,
DaveG

-----Original Message-----
From: Ledio Ago [mailto:[EMAIL PROTECTED]
Sent: Monday, December 19, 2005 8:25 PM
To: [email protected]
Subject: RE: [Nutch-dev] distributed search

Rafi,

Based on what you're saying, this tool splits a fetchlist into several fetchlists so that we can crawl/fetch the URLs from different fetchers, right? If so, that is not what I'm after. I'm trying to split an existing index into smaller partitions, so that I can make those partitions searchable from multiple Nutch searchers, i.e. distributed search.
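[For illustration, an invocation of the segslice command above might look like the following; the directory names and the entry count are placeholders, not values taken from this thread:]

```
# Hypothetical example: slice the segments under 'segments/' into multiple
# output segments of at most 500000 entries each, writing to 'segments-split/'.
bin/nutch segslice -local -o segments-split -max 500000 -dir segments
```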
Thanks,
Ledio

-----Original Message-----
From: Rafi Iz [mailto:[EMAIL PROTECTED]
Sent: Monday, December 19, 2005 4:49 PM
To: [email protected]
Subject: Re: [Nutch-dev] distributed search

Check the next command:

FetchListTool (-local | -ndfs <namenode:port>) <db> <segment_dir>
  [-refetchonly] [-topN N] [-cutoff cutoffscore]
  [-numFetchers numFetchers] [-adddays numDays]

This command calls a function named emitMultipleLists, which emits several fetchlists so that you can fetch across several machines, e.g.:

bin/nutch org.apache.nutch.tools.FetchListTool ......

Rafi

>From: Stefan Groschupf <[EMAIL PROTECTED]>
>Reply-To: [email protected]
>To: [email protected]
>Subject: Re: [Nutch-dev] distributed search
>Date: Tue, 20 Dec 2005 00:38:22 +0100
>
>>By the way, is there an easy way to split the index I already have?
>>I would hate to recrawl all of the 1.9MM URLs again and waste bandwidth.
>
>Well, I do not know of any tool that comes with Nutch, or any other tool,
>that does it, though maybe there is one.
>But writing a Java class that creates two smaller indexes from one large
>one is very easy, an hour's work at most.
>Just check any of the existing Lucene tutorials, the Lucene javadoc, or
>the Lucene book.
>BTW, Erik Hatcher's book "Lucene in Action" is a MUST for all Nutch users.
>:-)
>
>Stefan
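[The index-splitting class Stefan describes could be sketched roughly as below. This is an assumption-laden illustration, not code from the thread: the class name, argument layout, and the first-half/second-half split are all hypothetical, and it targets the Lucene 1.x API in use at the time. One caveat: IndexReader.document(i) returns only *stored* fields, so fields that were indexed but not stored would not survive this kind of copy.]

```
// Hypothetical sketch: split one Lucene index into two smaller ones by
// copying documents. args[0] = source index, args[1]/args[2] = output dirs.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class IndexSplitter {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(args[0]);
        IndexWriter left  = new IndexWriter(args[1], new StandardAnalyzer(), true);
        IndexWriter right = new IndexWriter(args[2], new StandardAnalyzer(), true);
        int max = reader.maxDoc();
        for (int i = 0; i < max; i++) {
            if (reader.isDeleted(i)) continue;   // skip deleted documents
            // send the first half of the docs to one index, the rest to the other
            (i < max / 2 ? left : right).addDocument(reader.document(i));
        }
        left.optimize();  left.close();
        right.optimize(); right.close();
        reader.close();
    }
}
```

[Each resulting index could then be served by its own Nutch search server for distributed search.]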
