I never tried to test this configuration. What about asking nutch to download a certain amount of bytes from the end of files?

-----Original Message-----
From: Jasper Kamperman <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 8:32 pm
Subject: Re: what is needed to index for about 10000 domains

One reason to store content is if you want to show snippets in search results. Another reason is if you want to have a "cached" feature where you can give the user the page as it looked when you crawled it (it may since have disappeared). There is a way to tell nutch to look at only the beginning of a file; it's this section in your config xml:

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>

This is from the nutch-default.xml in 0.9; I don't know whether it has changed in 1.0.
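A related knob: file.content.limit governs the file:// protocol. For pages fetched over HTTP the analogous property, at least as I recall the 0.9 nutch-default.xml, is http.content.limit, with the same semantics; it can be overridden in conf/nutch-site.xml, and the switch for not storing content at all that A. mentions below is presumably fetcher.store.content. A sketch, to be verified against your own copy of nutch-default.xml:

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all.
  </description>
</property>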
On Mar 3, 2009, at 8:27 PM, [email protected] wrote:

Hi,

I also noticed that we can disable storing the content of pages, which I do. I wonder why someone needs to store content? Also, in the case of files, is there a way to tell nutch not to download the whole file but, say, 1000 bytes from the beginning, and to parse and index only the information in that part of the file?

Thanks.
A.

-----Original Message-----
From: yanky young <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 6:41 pm
Subject: Re: what is needed to index for about 10000 domains

Hi:

Doug Cutting wrote a wiki page about the hardware requirements of nutch; you can check it out:

http://wiki.apache.org/nutch/HardwareRequirements

good luck

yanky

2009/3/4 John Martyniak <[email protected]>:

Regarding the machine, you could run it on anything; it all depends what kind of performance you want. So yes, you could run it on the machine that you have, or on something like the linux machine that I have. And the DSL connection should be fine; you just need to make sure that it stays up the whole time, because if not it will start erroring out, and you will have to re-fetch that whole segment, as there is no way to pick up from where you left off.

The only reason that I merged the segments was that I had many of them and I wanted to build a big one before I started creating new ones. Another advantage of merging is that you can use it to clear out unwanted urls. For example, I had a bunch of .js files in there that I didn't want to have as part of the index, so I cleared them out.

I used "bin/nutch mergesegs".

Regarding merging the other parts, I have never used that, but I don't think it is necessary unless you have multiple linkdbs, etc.; in my case I do not.

-John
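The invocation, roughly, for Nutch 0.9 is sketched below; crawl/merged_segments and crawl/segments are placeholder paths, and running bin/nutch mergesegs with no arguments prints the exact usage for your version. The optional -filter switch applies the configured URL filters during the merge, which is one way to drop unwanted urls like the .js files mentioned above.

  # merge all segments under crawl/segments into one new segment
  # written beneath crawl/merged_segments; -filter runs the configured
  # URL filters over each record, dropping urls they reject
  bin/nutch mergesegs crawl/merged_segments -dir crawl/segments -filter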
On Mar 3, 2009, at 7:14 PM, [email protected] wrote:

Hi,

I will need to index all links in the domains then. Do you think a linux box like yours with a DSL connection is OK to index the domains I have?

Why only the segments? I thought we needed to merge all the sub folders under the crawl folder. What did you use for merging them?

Thanks.
A.

-----Original Message-----
From: John Martyniak <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 3:21 pm
Subject: Re: what is needed to index for about 10000 domains

Well, the way that nutch works is that you inject your list of domains into the DB, and that is the starting point. Since nutch uses a crawler, it grabs those pages, determines whether there are any links on those pages, and then adds them to the DB. So the next time you generate your urls to fetch, it takes your original list, plus the ones that it found, to generate the new segment.

If you wanted to limit it to only pages contained on your 10000 domains, you could use the regex-urlfilter.txt file in the conf directory to limit it to your list. But you would have to create a regular expression for each one.

I am not familiar with the merge script on the wiki, but I have merged segments before and it did work. That was on Linux, though I don't think that should make a difference.

-John
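To make the inject/generate cycle concrete, one fetch round with the Nutch 0.9 command line tools looks roughly like this; the directory names and the timestamped segment name are placeholders:

  bin/nutch inject crawl/crawldb urls                 # seed the crawldb with the 10000 domains
  bin/nutch generate crawl/crawldb crawl/segments     # select urls to fetch into a new segment
  bin/nutch fetch crawl/segments/20090303123456       # fetch the generated segment
  bin/nutch updatedb crawl/crawldb crawl/segments/20090303123456   # add newly found links to the crawldb

And the conf/regex-urlfilter.txt entries for restricting the crawl to such a whitelist might look like the sketch below; example.com and example.org stand in for the real domains, rules are tried top to bottom, and the first match decides:

  # accept pages on the listed domains, including their subdomains
  +^http://([a-z0-9-]+\.)*example\.com/
  +^http://([a-z0-9-]+\.)*example\.org/
  # reject everything else
  -.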
On Mar 3, 2009, at 5:10 PM, [email protected] wrote:

Hi,

Thanks for the reply. I have only the list of those domains; I am not sure how many pages they have. Is a DSL connection sufficient to run nutch in my case? Did you run nutch for all of your pages at once, or separately for a given subset of them? Btw, yesterday I tried to use the merge shell script that we have on the wiki. It gave a lot of errors. I ran it on cygwin, though.

Thanks.
A.

-----Original Message-----
From: John Martyniak <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 1:44 pm
Subject: Re: what is needed to index for about 10000 domains

I think that in order to answer that question, it is necessary to know how many total pages are being indexed.

I currently have ~3.5 million pages indexed, and the segment directories are around 45GB. The response time is relatively fast.

In the test site it is running on a dual processor Dell 1850 with 3GB of RAM.

-John

On Mar 3, 2009, at 3:44 PM, [email protected] wrote:

Hello,

I use nutch-0.9 and need to index about 10000 domains. I want to know the minimum hardware and memory requirements.

Thanks in advance.
Alex.
