One reason to store content is if you want to show snippets in search
results.
Another reason is if you want to have a "cached" feature where you can
give the user the page as it looked when you crawled it (it may since
have disappeared).
There is a way to tell Nutch to fetch only the beginning of a file;
it's this property in the Nutch configuration:
<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>
This is from nutch-default.xml in 0.9; I don't know whether it has
changed in 1.0.
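For example, to fetch only the first 1000 bytes of each file (as asked
below), you could override it in nutch-site.xml; a sketch, and note
there is a separate http.content.limit property for pages fetched over
HTTP:
<!-- nutch-site.xml: only the first 1000 bytes of each file are
     downloaded; the parser then only sees (and indexes) that part -->
<property>
  <name>file.content.limit</name>
  <value>1000</value>
</property>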
On Mar 3, 2009, at 8:27 PM, [email protected] wrote:
Hi,
I also noticed that we can disable storing the content of pages, which
I do. I wonder why someone needs to store content? Also, in the case of
files, is there a way to tell Nutch not to download the whole file but,
let's say, only 1000 bytes from the beginning, and parse and index the
information only in that part of the files?
Thanks.
A.
-----Original Message-----
From: yanky young <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 6:41 pm
Subject: Re: what is needed to index for about 10000 domains
Hi:
Doug Cutting wrote a wiki page about the hardware requirements for
Nutch; you can check it out:
http://wiki.apache.org/nutch/HardwareRequirements
good luck
yanky
2009/3/4 John Martyniak <[email protected]>
Regarding the machine, you could run it on anything; it all depends on
what kind of performance you want. So yes, you could run it on the
machine that
you have or something like the linux machine that I have. And the
DSL
connection should be fine; you just need to make sure that it stays
up the
whole time, because if not it will start erring out, and you will
have to
re-fetch that whole segment as there is no way to pick up from
where you
left off.
The only reason that I merged the segments was that I had many of
them, and
I wanted to build a big one before I started creating new ones. Another
advantage of merging is that you can use it to clear out unwanted
urls.
For example, I had a bunch of .js files in there that I didn't want
to have
as part of the index, so I cleared them out.
I used "bin/nutch mergesegs".
Regarding merging the other parts, I have never used that, but I don't
think it is necessary unless you have multiple linkdbs, etc. In my case
I do not.
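(If you ever did have several, there are separate merge tools for those
as well; a sketch, with placeholder paths:)
# merge two crawl DBs into a new one
bin/nutch mergedb crawl/merged_crawldb crawl/crawldb1 crawl/crawldb2
# likewise for link DBs
bin/nutch mergelinkdb crawl/merged_linkdb crawl/linkdb1 crawl/linkdb2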
-John
On Mar 3, 2009, at 7:14 PM, [email protected] wrote:
Hi,
I will need to index all links in the domains then. Do you think a
Linux box like yours with a DSL connection is OK to index the domains I
have?
Why only segments? I thought we needed to merge all the subfolders
under the crawl folder. What did you use for merging them?
Thanks.
A.
-----Original Message-----
From: John Martyniak <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 3:21 pm
Subject: Re: what is needed to index for about 10000 domains
Well, the way that Nutch works is that you would inject your list of
domains into the DB, and that would be the starting point. Since Nutch
uses a crawler, it would grab those pages, determine if there are any
links on those pages, and then add them to the DB. So the next time you
generated your URLs to fetch, it would take your original list, plus
the ones that it found, to generate the new segment.
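Roughly, one round of that cycle looks like this (a sketch along the
lines of the 0.9 tutorial; the crawl/ and urls/ paths are just
placeholders):
# seed the crawl db with the list of domains (one URL per line in urls/)
bin/nutch inject crawl/crawldb urls
# generate a fetch list (a new segment) from the crawl db
bin/nutch generate crawl/crawldb crawl/segments
# fetch the newest segment
segment=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $segment
# fold the newly found links back into the crawl db, so the next
# generate picks up both the original list and the new URLs
bin/nutch updatedb crawl/crawldb $segment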
If you wanted to limit it to only pages contained on your 10000
domains, you could use the regex-urlfilter.txt file in the conf
directory to limit it to your list. But you would have to create a
regular expression for each one.
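Something along these lines in conf/regex-urlfilter.txt, one accept
line per domain (example.com / example.org are placeholders):
# accept pages on the listed domains...
+^http://([a-z0-9-]+\.)*example\.com/
+^http://([a-z0-9-]+\.)*example\.org/
# ...and reject everything else
-.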
I am not familiar with the merge script on the wiki, but I have merged
segments before and it did work. But that was on Linux; I don't think
that should make a difference though.
-John
On Mar 3, 2009, at 5:10 PM, [email protected] wrote:
Hi,
Thanks for the reply. I have a list of those domains only; I am not
sure how many pages they have. Is a DSL connection sufficient to run
Nutch in my case? Did you run Nutch for all of your pages at once, or
separately for a given subset of them? Btw, yesterday I tried to use
the merge shell script that we have on the wiki. It gave a lot of
errors. I ran it on Cygwin though.
Thanks.
A.
-----Original Message-----
From: John Martyniak <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 1:44 pm
Subject: Re: what is needed to index for about 10000 domains
I think that in order to answer that question, it is necessary to know
how many total pages are being indexed.
I currently have ~3.5 million pages indexed, and the segment
directories are around 45GB. The response time is relatively fast.
In the test site it is running on a dual processor Dell 1850 with 3GB
of RAM.
-John
On Mar 3, 2009, at 3:44 PM, [email protected] wrote:
Hello,
I use nutch-0.9 and need to index about 10000 domains. I want to know
the minimum requirements for hardware and memory.
Thanks in advance.
Alex.