I don't think there is a standard way to tell Nutch to download the
end of a file (since you have to download the beginning before you get
to the end).
But if I understand what you want, you could write a custom
IndexingFilter that indexes only the tail of a file; see http://wiki.apache.org/nutch/WritingPluginExample-0.9
for a start on how to do this.
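To make the idea concrete, here is a minimal sketch in Python rather than in Nutch's Java plugin API. It only illustrates the "keep the tail" step that such a filter could apply to content that has already been fetched in full. The function name tail_bytes and the 1000-byte default are assumptions for illustration, not part of Nutch.

```python
def tail_bytes(content: bytes, n: int = 1000) -> bytes:
    """Return the last n bytes of content (all of it if it is shorter)."""
    if n <= 0:
        # content[-0:] would return everything, so handle this case explicitly.
        return b""
    return content[-n:]


if __name__ == "__main__":
    page = b"header " * 10 + b"the part we care about"
    print(tail_bytes(page, 22))
```

Note that this saves nothing on the download side: the whole file is still fetched, and only the indexed portion shrinks.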
On Mar 3, 2009, at 8:48 PM, [email protected] wrote:
I never tried to test this configuration. What about asking nutch to
download a certain number of bytes from the end of files?
-----Original Message-----
From: Jasper Kamperman <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 8:32 pm
Subject: Re: what is needed to index for about 10000 domains
One reason to store content is if you want to show snippets in
search results.

Another reason is if you want to have a "cached" feature where you
can give the user the page as it looked when you crawled it (it may
since have disappeared).

There is a way to tell nutch to look at only the beginning of a
file; it's this section in your config.xml:

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>

This is from the nutch-default.xml in 0.9; I don't know whether it has
changed in 1.0.
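The rule in that description can be restated as a small Python sketch. This is only my reading of the description text, not Nutch's actual implementation, and the function name apply_content_limit is an assumption for illustration.

```python
def apply_content_limit(content: bytes, limit: int = 65536) -> bytes:
    """Mimic the documented rule: truncate when limit >= 0, else keep all."""
    if limit >= 0:
        return content[:limit]
    return content


if __name__ == "__main__":
    big = b"x" * 70000
    print(len(apply_content_limit(big)))       # default limit of 65536
    print(len(apply_content_limit(big, -1)))   # negative limit: no truncation
```

So setting the value to 1000 would keep roughly the first 1000 bytes of each file, and a negative value would keep everything.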

On Mar 3, 2009, at 8:27 PM, [email protected] wrote:

Hi,

I also noticed that we can disable storing the content of pages, which I
use. I wonder why someone needs to store content? Also, in the case
of files, is there a way to tell nutch not to download the whole
file but, let's say, 1000 bytes from the beginning, and to parse and
index information only in that part of the file?

Thanks.
A.

-----Original Message-----
From: yanky young <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 6:41 pm
Subject: Re: what is needed to index for about 10000 domains

Hi:

Doug Cutting once wrote a wiki page about the hardware requirements of
nutch; you can check it out:

http://wiki.apache.org/nutch/HardwareRequirements

good luck

yanky

2009/3/4 John Martyniak <[email protected]>

Regarding the machine, you could run it on anything; it all depends on what
kind of performance you want. So yes, you could run it on the machine that
you have, or something like the Linux machine that I have. And the DSL
connection should be fine; you just need to make sure that it stays up the
whole time, because if not it will start erroring out, and you will have to
re-fetch that whole segment, as there is no way to pick up from where you
left off.

The only reason that I merged the segments was that I had many of them, and
I wanted to build a big one before I started creating new ones. Another
advantage of merging is that you can use it to clear out unwanted urls.
For example, I had a bunch of .js files in there that I didn't want to have
as part of the index, so I cleared them out.

I used "bin/nutch mergesegs".

Regarding merging the other parts, I have never done it, but I don't think
that it is necessary unless you have multiple linkdbs, etc.; in my case I do
not.

-John

On Mar 3, 2009, at 7:14 PM, [email protected] wrote:

Hi,

I will need to index all links in the domains, then. Do you think a Linux
box like yours with a DSL connection is OK to index the domains I have?

Why only segments? I thought we need to merge all subfolders under the crawl
folder. What did you use for merging them?

Thanks.
A.

-----Original Message-----
From: John Martyniak <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 3:21 pm
Subject: Re: what is needed to index for about 10000 domains

Well, the way that nutch works is that you would inject your list of
domains into the DB, and that would be the starting point. Since nutch uses
a crawler, it would grab those pages, determine if there are any links on
those pages, and then add them to the DB. So the next time that you
generated your urls to fetch, it would take your original list, plus the
ones that it found, to generate the new segment.
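The cycle described above can be sketched as a toy model in Python. This is not Nutch code: the dictionary standing in for the crawl DB, the links map standing in for the web, and the function name crawl are all assumptions made for illustration. The sketch also assumes that a page fetched in an earlier round is not fetched again, which is a simplification.

```python
def crawl(seeds, links, rounds):
    """Toy model of the inject / generate / fetch / update cycle.

    links maps each URL to the URLs found on that page. It is a purely
    hypothetical stand-in for the web.
    """
    db = {url: "unfetched" for url in seeds}  # inject the seed list
    segments = []
    for _ in range(rounds):
        # generate: pick every URL in the DB that has not been fetched yet
        segment = sorted(u for u, state in db.items() if state == "unfetched")
        if not segment:
            break
        segments.append(segment)
        for url in segment:
            db[url] = "fetched"  # fetch the page
            for found in links.get(url, []):
                db.setdefault(found, "unfetched")  # update the DB with new links
    return segments


if __name__ == "__main__":
    web = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
    for number, segment in enumerate(crawl(["a"], web, 3), start=1):
        print(number, segment)
```

Each round produces one segment, and the DB grows with whatever links the previous round discovered, which is why the crawl spreads beyond the injected list unless it is filtered.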

If you wanted to limit it to only pages contained on your 10000 domains,
you could use the regex-urlfilter.txt file in the conf directory to limit it
to your list. But you would have to create a regular expression for each
one.
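Writing 10000 expressions by hand is tedious, so they could be generated from the domain list. Below is a minimal Python sketch under a few assumptions: that each filter line is a "+" (accept) or "-" (reject) followed by a regular expression, that the first matching line wins, and that subdomains should be accepted too. The function name urlfilter_lines and the sample domain are hypothetical.

```python
import re


def urlfilter_lines(domains):
    """Build one accept rule per domain, then reject everything else."""
    lines = []
    for domain in domains:
        escaped = re.escape(domain)  # so "." matches a literal dot
        lines.append(rf"+^https?://([a-z0-9-]+\.)*{escaped}/")
    lines.append("-.")  # assumed catch-all: reject any URL not accepted above
    return lines


if __name__ == "__main__":
    for line in urlfilter_lines(["example.com"]):
        print(line)
```

The printed lines would go into conf/regex-urlfilter.txt in place of the default accept-everything rule. Check the comments in your own copy of that file before relying on the ordering assumption.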

I am not familiar with the merge script on the wiki, but I have merged
segments before and it did work. That was on Linux, but I don't think that
should make a difference.

-John

On Mar 3, 2009, at 5:10 PM, [email protected] wrote:

Hi,

Thanks for the reply. I have a list of those domains only; I am not sure
how many pages they have. Is a DSL connection sufficient to run nutch in
my case? Did you run nutch for all of your pages at once, or separately for
a given subset of them? Btw, yesterday I tried to use the merge shell script
that we have on the wiki. It gave a lot of errors. I ran it on cygwin, though.

Thanks.
A.

-----Original Message-----
From: John Martyniak <[email protected]>
To: [email protected]
Sent: Tue, 3 Mar 2009 1:44 pm
Subject: Re: what is needed to index for about 10000 domains

I think that in order to answer that question, it is necessary to know
how many total pages are being indexed.

I currently have ~3.5 million pages indexed, and the segment
directories are around 45GB. The response time is relatively fast.

In the test site it is running on a dual processor Dell 1850 with 3GB
of RAM.
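Those two figures give a rough per-page storage number that can be scaled to another crawl. The Python sketch below is only back-of-the-envelope arithmetic: it assumes "45GB" means decimal gigabytes, and the 100 pages per domain used in the example is a purely hypothetical figure, since the real page count is unknown.

```python
GB = 10**9  # decimal gigabytes, an assumption about how "45GB" was measured


def bytes_per_page(segment_bytes, pages):
    """Average segment storage per indexed page."""
    return segment_bytes / pages


def estimate_segment_gb(pages, per_page):
    """Scale the per-page average to a different page count."""
    return pages * per_page / GB


if __name__ == "__main__":
    per_page = bytes_per_page(45 * GB, 3_500_000)
    pages = 10_000 * 100  # hypothetical: 100 pages per domain
    print(round(per_page), "bytes per page")
    print(round(estimate_segment_gb(pages, per_page), 1), "GB of segments")
```

Actual sizes depend heavily on the kind of content crawled, so treat the result as an order-of-magnitude guide only.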

-John

On Mar 3, 2009, at 3:44 PM, [email protected] wrote:

Hello,

I use nutch-0.9 and need to index about 10000 domains. I want to
know the minimum hardware and memory requirements.

Thanks in advance.
Alex.