I never tried to test this configuration. What about asking nutch to download a certain amount of bytes from the end of files?

-----Original Message-----
From: Jasper Kamperman <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 8:32 pm
Subject: Re: what is needed to index for about 10000 domains

One reason to store content is if you want to show snippets in search results. Another reason is if you want to have a "cached" feature where you can give the user the page as it looked when you crawled it (it may since have disappeared). There is a way to tell nutch to look at only the beginning of a file; it's this section in your config xml:

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>

This is from the nutch-default.xml in 0.9; I don't know whether it has changed in 1.0.
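A related knob: file.content.limit governs the file:// protocol. For pages fetched over HTTP the analogous property, at least as I recall the 0.9 nutch-default.xml, is http.content.limit, with the same semantics; it can be overridden in conf/nutch-site.xml, and the switch for not storing content at all that A. mentions below is presumably fetcher.store.content. A sketch, to be verified against your own copy of nutch-default.xml:

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all.
  </description>
</property>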
On Mar 3, 2009, at 8:27 PM, [email protected] wrote:

Hi,

I also noticed that we can disable storing the content of pages, which I do. I wonder why someone needs to store content? Also, in the case of files, is there a way to tell nutch not to download the whole file but, say, 1000 bytes from the beginning, and to parse and index only the information in that part of the file?

Thanks.
A.

-----Original Message-----
From: yanky young <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 6:41 pm
Subject: Re: what is needed to index for about 10000 domains

Hi:

Doug Cutting wrote a wiki page about the hardware requirements of nutch; you can check it out:

http://wiki.apache.org/nutch/HardwareRequirements

good luck

yanky

2009/3/4 John Martyniak <[email protected]>:

Regarding the machine, you could run it on anything; it all depends what kind of performance you want. So yes, you could run it on the machine that you have, or on something like the linux machine that I have. And the DSL connection should be fine; you just need to make sure that it stays up the whole time, because if not it will start erroring out, and you will have to re-fetch that whole segment, as there is no way to pick up from where you left off.

The only reason that I merged the segments was that I had many of them and I wanted to build a big one before I started creating new ones. Another advantage of merging is that you can use it to clear out unwanted urls. For example, I had a bunch of .js files in there that I didn't want to have as part of the index, so I cleared them out.

I used "bin/nutch mergesegs".

Regarding merging the other parts, I have never used that, but I don't think it is necessary unless you have multiple linkdbs, etc.; in my case I do not.

-John
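The invocation, roughly, for Nutch 0.9 is sketched below; crawl/merged_segments and crawl/segments are placeholder paths, and running bin/nutch mergesegs with no arguments prints the exact usage for your version. The optional -filter switch applies the configured URL filters during the merge, which is one way to drop unwanted urls like the .js files mentioned above.

  # merge all segments under crawl/segments into one new segment
  # written beneath crawl/merged_segments; -filter runs the configured
  # URL filters over each record, dropping urls they reject
  bin/nutch mergesegs crawl/merged_segments -dir crawl/segments -filter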
On Mar 3, 2009, at 7:14 PM, [email protected] wrote:

Hi,

I will need to index all links in the domains then. Do you think a linux box like yours with a DSL connection is OK to index the domains I have?

Why only the segments? I thought we needed to merge all the sub folders under the crawl folder. What did you use for merging them?

Thanks.
A.

-----Original Message-----
From: John Martyniak <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 3:21 pm
Subject: Re: what is needed to index for about 10000 domains

Well, the way that nutch works is that you inject your list of domains into the DB, and that is the starting point. Since nutch uses a crawler, it grabs those pages, determines whether there are any links on those pages, and then adds them to the DB. So the next time you generate your urls to fetch, it takes your original list, plus the ones that it found, to generate the new segment.

If you wanted to limit it to only pages contained on your 10000 domains, you could use the regex-urlfilter.txt file in the conf directory to limit it to your list. But you would have to create a regular expression for each one.

I am not familiar with the merge script on the wiki, but I have merged segments before and it did work. That was on Linux, though I don't think that should make a difference.

-John
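To make the inject/generate cycle concrete, one fetch round with the Nutch 0.9 command line tools looks roughly like this; the directory names and the timestamped segment name are placeholders:

  bin/nutch inject crawl/crawldb urls                 # seed the crawldb with the 10000 domains
  bin/nutch generate crawl/crawldb crawl/segments     # select urls to fetch into a new segment
  bin/nutch fetch crawl/segments/20090303123456       # fetch the generated segment
  bin/nutch updatedb crawl/crawldb crawl/segments/20090303123456   # add newly found links to the crawldb

And the conf/regex-urlfilter.txt entries for restricting the crawl to such a whitelist might look like the sketch below; example.com and example.org stand in for the real domains, rules are tried top to bottom, and the first match decides:

  # accept pages on the listed domains, including their subdomains
  +^http://([a-z0-9-]+\.)*example\.com/
  +^http://([a-z0-9-]+\.)*example\.org/
  # reject everything else
  -.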
On Mar 3, 2009, at 5:10 PM, [email protected] wrote:

Hi,

Thanks for the reply. I have only the list of those domains; I am not sure how many pages they have. Is a DSL connection sufficient to run nutch in my case? Did you run nutch for all of your pages at once, or separately for a given subset of them? Btw, yesterday I tried to use the merge shell script that we have on the wiki. It gave a lot of errors. I ran it on cygwin, though.

Thanks.
A.

-----Original Message-----
From: John Martyniak <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 1:44 pm
Subject: Re: what is needed to index for about 10000 domains

I think that in order to answer that question, it is necessary to know how many total pages are being indexed.

I currently have ~3.5 million pages indexed, and the segment directories are around 45GB. The response time is relatively fast.

In the test site it is running on a dual processor Dell 1850 with 3GB of RAM.

-John

On Mar 3, 2009, at 3:44 PM, [email protected] wrote:

Hello,

I use nutch-0.9 and need to index about 10000 domains. I want to know the minimum hardware and memory requirements.

Thanks in advance.
Alex.
