Good call. That's another limit where it'd be nice to see a log
message when it's exceeded. I'll try to add a patch to NUTCH-182
tomorrow for this.
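To make the idea concrete, here is a minimal sketch of such a warning. It is not the NUTCH-182 patch; the class name, method name, and logger name are hypothetical, and it assumes the convention that a limit of zero or less means "no limit".

```java
import java.util.logging.Logger;

// Hypothetical helper, not Nutch code: warns when a configured limit is exceeded.
final class LimitLogSketch {
    private static final Logger LOG = Logger.getLogger("limit-sketch");

    // Returns true (and logs a warning) if `actual` exceeds a positive `limit`.
    // Assumption: limit <= 0 means "unlimited", so nothing is logged.
    static boolean warnIfExceeded(String property, long actual, long limit) {
        if (limit > 0 && actual > limit) {
            LOG.warning(property + " exceeded: got " + actual
                    + ", limit is " + limit + "; data will be truncated");
            return true;
        }
        return false;
    }
}
```

A caller would pass the property name along with the observed and configured values, for example `warnIfExceeded("db.max.outlinks.per.page", 356, 100)`.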
--Matt
On Jan 19, 2006, at 11:39 PM, Fuad Efendi wrote:
<property>
<name>file.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content, in bytes.
If this value is larger than zero, content longer than it will be
truncated; otherwise (zero or negative), no truncation at all.
</description>
</property>
(default is 65536)
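The rule in the description can be sketched as follows. This is only an illustration of the stated semantics, with a hypothetical class and method name, and not the code Nutch actually runs.

```java
import java.util.Arrays;

// Hypothetical helper illustrating the content-limit rule described above.
final class ContentLimitSketch {
    // limit > 0: keep at most `limit` bytes; zero or negative: no truncation.
    static byte[] applyLimit(byte[] content, int limit) {
        if (limit > 0 && content.length > limit) {
            return Arrays.copyOf(content, limit); // keep only the first `limit` bytes
        }
        return content; // returned unchanged
    }
}
```

With the default of 65536, a 100000-byte page would be cut to its first 65536 bytes, so any links past that point never reach the parser; with -1 the whole page is kept.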
-----Original Message-----
From: Jack Tang
Hi,
Please change the value of the "db.max.outlinks.per.page" property
(default is 100) to, say, 1000.
<property>
<name>db.max.outlinks.per.page</name>
<value>1000</value>
<description>The maximum number of outlinks that we'll process
for a page.
</description>
</property>
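As a rough illustration of what such a cap does, here is a minimal sketch. It assumes the cap keeps the first N links in document order; the class and method names are hypothetical and Nutch's actual handling may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Nutch code: caps a page's outlinks at a maximum.
final class OutlinkCapSketch {
    // Assumption: links beyond `maxOutlinks` are simply dropped, order preserved.
    static List<String> capOutlinks(List<String> outlinks, int maxOutlinks) {
        if (outlinks.size() <= maxOutlinks) {
            return outlinks; // under the cap, nothing to drop
        }
        return new ArrayList<>(outlinks.subList(0, maxOutlinks));
    }
}
```

Under this assumption, a page with 356 links keeps 100 of them at the default setting and all 356 once the value is raised to 1000.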
/Jack
On 1/20/06, Nguyen Ngoc Giang <[EMAIL PROTECTED]> wrote:
Hi everyone,
I found that the getOutlinks function in html-parser/DOMContentUtils.java
doesn't work correctly in some cases. An example is this website:
http://blog.donews.com/boyla/. The function returns only 170 records,
while in fact the page contains many more (Firefox returns 356 links!).
When I compare the hyperlink list with the one returned by Firefox, the
order is exactly identical, meaning that the 170th link from getOutlinks
is the same as the 170th link from Firefox. Therefore, it seems that the
algorithm is correct, but there is a bug somewhere. There is no
threshold at this point, since the max outlinks parameter is applied at
the updatedb step. Even when I increase the max outlinks to 1000, the
problem remains.
Any suggestions are much appreciated.
Regards,
Giang
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
--
Matt Kangas / [EMAIL PROTECTED]