I don't think there is a standard way to tell Nutch to download the end of a file (since you have to download the beginning before you get to the end).

But if I understand what you want, you could write a custom IndexingFilter plugin that indexes only the tail of a file; see http://wiki.apache.org/nutch/WritingPluginExample-0.9 for a start on how to do this.

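Very roughly, such a filter could look like the sketch below. This is only a sketch: the class name, the "tail" field, and the 1000-character cutoff are made up for illustration, and the filter() signature is from memory of the 0.9-era API, so check it against the wiki example (which also covers the plugin.xml wiring).

// Hypothetical tail-only indexing filter for Nutch 0.9 (names are examples).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class TailIndexingFilter implements IndexingFilter {

  private static final int TAIL_CHARS = 1000;  // how much of the end to index
  private Configuration conf;

  public Document filter(Document doc, Parse parse, Text url,
                         CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    String text = parse.getText();
    if (text != null && text.length() > TAIL_CHARS) {
      // keep only the last TAIL_CHARS characters of the parsed text
      String tail = text.substring(text.length() - TAIL_CHARS);
      doc.add(new Field("tail", tail, Field.Store.YES, Field.Index.TOKENIZED));
    }
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}

Note that this only changes what gets indexed; the fetcher still downloads the file (up to file.content.limit), for the reason above.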

On Mar 3, 2009, at 8:48 PM, [email protected] wrote:


I never tried to test this configuration. What about asking Nutch to download a certain number of bytes from the end of files?

-----Original Message-----
From: Jasper Kamperman <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 8:32 pm
Subject: Re: what is needed to index for about 10000 domains

One reason to store content is if you want to show snippets in search results.

Another reason is if you want to have a "cached" feature where you can give the user the page as it looked when you crawled it (it may since have disappeared).

There is a way to tell Nutch to look at only the beginning of a file; it's this section of the Nutch config:

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>

This is from nutch-default.xml in 0.9; I don't know whether it has changed in 1.0.

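If you want a smaller cutoff (e.g. the 1000 bytes you mention), the usual approach, as far as I know, is to override the property in conf/nutch-site.xml rather than edit nutch-default.xml, along these lines (1000 is just an example value):

<property>
  <name>file.content.limit</name>
  <value>1000</value>
</property>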

On Mar 3, 2009, at 8:27 PM, [email protected] wrote:

Hi,

I also noticed that we can disable storing the content of pages, which I use. I wonder why someone needs to store content? Also, in the case of files, is there a way to tell Nutch not to download the whole file but, say, only 1000 bytes from the beginning, and parse and index information only in that part of the file?

Thanks.
A.

-----Original Message-----
From: yanky young <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 6:41 pm
Subject: Re: what is needed to index for about 10000 domains

Hi:

Doug Cutting wrote a wiki page about the hardware requirements of Nutch; you can check it out:

http://wiki.apache.org/nutch/HardwareRequirements

Good luck,
yanky

2009/3/4 John Martyniak <[email protected]>

Regarding the machine, you could run it on anything; it all depends what kind of performance you want. So yes, you could run it on the machine that you have, or something like the Linux machine that I have. And the DSL connection should be fine; you just need to make sure that it stays up the whole time, because if it doesn't, it will start erroring out and you will have to re-fetch that whole segment, as there is no way to pick up from where you left off.

The only reason that I merged the segments was that I had many of them, and I wanted to build a big one before I started creating new ones. Another advantage of merging is that you can use it to clear out unwanted URLs; for example, I had a bunch of .js files in there that I didn't want to have as part of the index, so I cleared them out.

I used "bin/nutch mergesegs".


Regarding merging the other parts, I have never used that, but I don't think it is necessary unless you have multiple linkdbs, etc.; in my case I do not.

-John

On Mar 3, 2009, at 7:14 PM, [email protected] wrote:

Hi,

I will need to index all links in those domains then. Do you think a Linux box like yours with a DSL connection is OK to index the domains I have?

Why only segments? I thought we need to merge all subfolders under the crawl folder. What did you use for merging them?

Thanks.
A.

-----Original Message-----
From: John Martyniak <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 3:21 pm
Subject: Re: what is needed to index for about 10000 domains

Well, the way that Nutch works is that you would inject your list of domains into the DB, and that would be the starting point. Since Nutch uses a crawler, it would grab those pages, determine if there are any links on those pages, and then add them to the DB. So the next time that you generated your URLs to fetch, it would take your original list, plus the ones that it found, to generate the new segment. The basic command cycle is sketched below.

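(Roughly, one round of that cycle with the 0.9 command-line tools looks like this; crawl/ and urls/ are just example paths, and <segment> stands for the timestamped directory that the generate step creates:

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch crawl/segments/<segment>
bin/nutch updatedb crawl/crawldb crawl/segments/<segment>

Repeating generate/fetch/updatedb is what pulls in the newly discovered links each round.)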

If you wanted to limit it to only pages contained in your 10000 domains, you could use the regex-urlfilter.txt file in the conf directory to limit it to your list, but you would have to create a regular expression for each one.

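(For example, entries along these lines; example.com and example.org are placeholder domains, and the exact regex style is up to you:

# accept pages on the listed domains
+^http://([a-z0-9-]+\.)*example\.com/
+^http://([a-z0-9-]+\.)*example\.org/
# reject everything else
-.

The final "-." line rejects anything not matched above, so keep it last.)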

I am not familiar with the merge script on the wiki, but I have merged segments before and it did work. That was on Linux, but I don't think that should make a difference.

-John

On Mar 3, 2009, at 5:10 PM, [email protected] wrote:

Hi,

Thanks for the reply. I have a list of those domains only; I am not sure how many pages they have. Is a DSL connection sufficient to run Nutch in my case? Did you run Nutch for all of your pages at once, or separately for a given subset of them? By the way, yesterday I tried to use the merge shell script that we have on the wiki. It gave a lot of errors. I ran it on Cygwin though.

Thanks.
A.

-----Original Message-----
From: John Martyniak <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 1:44 pm
Subject: Re: what is needed to index for about 10000 domains

I think that in order to answer that question, it is necessary to know how many total pages are being indexed.

I currently have ~3.5 million pages indexed, and the segment directories are around 45GB. The response time is relatively fast.

On the test site it is running on a dual-processor Dell 1850 with 3GB of RAM.

-John

On Mar 3, 2009, at 3:44 PM, [email protected] wrote:

Hello,

I use nutch-0.9 and need to index about 10000 domains. I want to know the minimum hardware and memory requirements.

Thanks in advance.
Alex.
