[Nutch-dev] 0

2006-06-05 Thread 中广公司
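Hello! Our company works as a tax agent and can issue national-tax and local-tax invoices on others' behalf at a discount. All invoices issued can be verified and deducted with the tax authorities before payment. If interested, call: 13928434892, Mr. Yang Maolin (杨茂林) ___ Nutch-developers mailing list Nutch-developers@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nutch-developers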

Re: [Nutch-dev] search engine spam detector

2006-06-05 Thread sboomer
Hello!!! -----Original Message----- From: Stefan Groschupf [mailto:[EMAIL PROTECTED] Sent: Sunday, June 04, 2006 9:15 PM To: nutch-dev@lucene.apache.org Subject: search engine spam detector Hi, an interesting tool: http://tool.motoricerca.info/spam-detector/ Stefan

[Nutch-dev] summary

2006-06-05 Thread anton
My Nutch instance processed the pages http://www.abc-internet.net/lavinia-lingerie/Lingerie.htm and http://www.abc-internet.net/pamperedpassions-pampered_passions/Lingerie.htm. When I search for the term "lingerie", Nutch returns results with a bad summary (... Lingerie, Lingerie, Lingerie,

Re: [Nutch-dev] summary

2006-06-05 Thread Andrzej Bialecki
[EMAIL PROTECTED] wrote: My Nutch instance processed the pages http://www.abc-internet.net/lavinia-lingerie/Lingerie.htm and http://www.abc-internet.net/pamperedpassions-pampered_passions/Lingerie.htm. When I search for the term "lingerie", Nutch returns results with a bad summary (... Lingerie,

Re: [Nutch-dev] summary

2006-06-05 Thread Sylvain FURMANEK
Hello, Your pages contain the following text: <body topmargin=0> <!-- Lingerie , Lingerie , Lingerie , Lingerie , Lingerie , Lingerie , Lingerie , Lingerie , Lingerie , Lingerie , Lingerie , Lingerie , Lingerie , Lingerie , Lingerie , Lingerie , Lingerie , Lingerie , Lingerie , Lingerie , Lingerie

[Nutch-dev] parse OutOfMemoryError?

2006-06-05 Thread Uygar Yüzsüren
Hi Everyone, I am using MapReduce and DFS for a crawl + index operation. When parsing relatively small segments (about 50,000 - 60,000 URLs), everything goes fine. But, when I try to parse a larger segment (600,000 - 700,000 URLs), my job is stopped by OutOfMemoryError at tasktrackers during the

Re: [Nutch-dev] search engine spam detector

2006-06-05 Thread Andrzej Bialecki
Stefan Groschupf wrote: The idea to have something like this as a nutch-module (dropping pages or ranking them very low) might come up :-) This will be a very long way. I collect some thoughts and a list of web spam related papers in my blog.

[Nutch-dev] [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Scott Ganyo (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12414762 ] Scott Ganyo commented on NUTCH-258: --- For the record: I strongly object to closing this issue for the following reasons: 1) Having a *side-effect* of the entire system stop

[Nutch-dev] [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12414763 ] Stefan Groschupf commented on NUTCH-258: Scott, I agree with you. However, we need a clean patch to solve the problem; we cannot just comment things out of the code.

Re: [Nutch-dev] [Nutch-cvs] svn commit: r411594 - /lucene/nutch/trunk/contrib/web2/plugins/build.xml

2006-06-05 Thread Sami Siren
It emulates a feature with the same name from the Google appliance. http://www.google.com/enterprise/mini/end_user_features.html -- Sami Siren [EMAIL PROTECTED] wrote: Hi, What exactly does this plugin do? I haven't seen it mentioned and the README.txt doesn't really describe it. Thanks, Otis

Re: [Nutch-dev] [Nutch-cvs] svn commit: r411594 - /lucene/nutch/trunk/contrib/web2/plugins/build.xml

2006-06-05 Thread Andrzej Bialecki
Sami Siren wrote: It emulates a feature with the same name from the Google appliance. http://www.google.com/enterprise/mini/end_user_features.html Are you sure there is no trademark infringement here? Perhaps we should call it something else, just to avoid any potential legal unpleasantries ... --

Re: [Nutch-dev] [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Chris Mattmann
Folks, Before I (or someone else) reopens the issue, I think it's important to understand the implications: 1) Having a *side-effect* of the entire system stop processing after merely logging a message at a certain event level is a poor practice. I'm not sure that the Fetcher quitting is a *

[Nutch-dev] [jira] Created: (NUTCH-300) Clustering API improvements

2006-06-05 Thread Andrzej Bialecki (JIRA)
Clustering API improvements --- Key: NUTCH-300 URL: http://issues.apache.org/jira/browse/NUTCH-300 Project: Nutch Type: Improvement Versions: 0.8-dev Reporter: Andrzej Bialecki Priority: Minor This patch adds support for

[Nutch-dev] Discounted invoice issuing

2006-06-05 Thread 李娜
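Hello: Because our company has a large volume of input invoices, we cannot use up our fixed monthly tax quota. To reduce losses, we have some surplus ordinary invoices that we can issue on others' behalf at a discount. Scope: merchandise sales invoices, advertising invoices, transport invoices, other service invoices, catering invoices, construction and installation invoices, etc. We solemnly promise that all invoices used were obtained by the respective companies from the tax bureau and can be checked online or verified for deduction at the tax bureau. We charge 2% for ordinary invoices and 6% for VAT invoices. If your company has needs in the following areas, we will provide the most convenient service: 1. your company has a shortfall in input invoices or deductions; 2. customers push prices down and margins are thin; 3. formal invoices are needed to reimburse purchases; 4. other tax-related needs. If your company has doubts about our invoices, you may verify them before paying!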

Re: [Nutch-dev] [Nutch-cvs] svn commit: r411594 - /lucene/nutch/trunk/contrib/web2/plugins/build.xml

2006-06-05 Thread Sami Siren
Hmm... I didn't think about that. Are there more opinions about this? -- Sami Siren Are you sure there is no trademark infringement here? Perhaps we should call it something else, just to avoid any potential legal unpleasantries ...

[Nutch-dev] [jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-05 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ] Stefan Groschupf updated NUTCH-289: --- Attachment: ipInCrawlDatumDraftV1.patch To keep the discussion alive, I have attached a _first draft_ for storing the IP in the CrawlDatum, for public discussion.

Re: [Nutch-dev] [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Andrzej Bialecki
Chris Mattmann wrote: Folks, Before I (or someone else) reopens the issue, I think it's important to understand the implications: I vote for re-opening. See below. 1) Having a *side-effect* of the entire system stop processing after merely logging a message at a certain event

Re: [Nutch-dev] [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Chris Mattmann
Hi Andrzej, The main problem, as Scott observed, is that the static flag affects all instances of the task executing inside the same JVM. If there are several Fetcher tasks (or any other tasks that check for SEVERE flag!), belonging to different jobs, all of them will quit. This is

Re: [Nutch-dev] [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Andrzej Bialecki
Chris Mattmann wrote: +1 So, to summarize, the proposed resolution is: * add flag field in Configuration instance to signify whether or not a SEVERE error has been logged within a task's context Yes, preferably define this as a public static final String-s in NutchConfiguration, both

Re: [Nutch-dev] [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Stefan Groschupf
I have a proposal for a simple solution: set a flag in the current Configuration instance, and check for this flag. The Configuration instance provides a task-specific context persisting throughout the lifetime of a task - but limited only to that task. Voila - problem solved. We get
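The idea in this message can be sketched in a few lines of Java. This is only an illustration, not the patch discussed in NUTCH-258: `TaskConfig` is a hypothetical stand-in for the per-task Configuration instance, and the property name `task.severe.logged`, the `FetcherTask` class, and its methods are invented for the example. It shows the flag living in a task's own configuration object instead of in a JVM-wide static field, so one task's SEVERE message does not stop the others.

```java
import java.util.HashMap;
import java.util.Map;

public class SevereFlagSketch {

    /** Hypothetical stand-in for a per-task Configuration object. */
    static final class TaskConfig {
        private final Map<String, String> props = new HashMap<>();

        void setBoolean(String name, boolean value) {
            props.put(name, Boolean.toString(value));
        }

        boolean getBoolean(String name, boolean defaultValue) {
            String v = props.get(name);
            return v == null ? defaultValue : Boolean.parseBoolean(v);
        }
    }

    /** Hypothetical property name, defined once as a constant. */
    static final String SEVERE_LOGGED_KEY = "task.severe.logged";

    /** Hypothetical task that records SEVERE events in its own config. */
    static final class FetcherTask {
        private final TaskConfig conf;

        FetcherTask(TaskConfig conf) {
            this.conf = conf;
        }

        void logSevere(String message) {
            System.err.println("SEVERE: " + message);
            // The flag is set on this task's config only, not on a static field.
            conf.setBoolean(SEVERE_LOGGED_KEY, true);
        }

        boolean shouldStop() {
            return conf.getBoolean(SEVERE_LOGGED_KEY, false);
        }
    }

    public static void main(String[] args) {
        // Two tasks in the same JVM, each with its own config instance.
        FetcherTask a = new FetcherTask(new TaskConfig());
        FetcherTask b = new FetcherTask(new TaskConfig());

        a.logSevere("simulated failure in task A");

        System.out.println("A should stop: " + a.shouldStop());
        System.out.println("B should stop: " + b.shouldStop());
    }
}
```

In this sketch only task A sees the flag after it logs a SEVERE message; task B keeps running because it holds a separate `TaskConfig`. With a static flag, both would have stopped.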

[Nutch-dev] [jira] Updated: (NUTCH-300) Clustering API improvements

2006-06-05 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-300?page=all ] Andrzej Bialecki updated NUTCH-300: Attachment: patch.txt Clustering API improvements --- Key: NUTCH-300 URL:

[Nutch-dev] [jira] Reopened: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-258?page=all ] Chris A. Mattmann reopened NUTCH-258: - Assign To: Chris A. Mattmann The issue was found to be a real problem with the Fetcher. Here's the proposed solution: * add flag field

[Nutch-dev] 意乐实业有限公司

2006-06-05 Thread 张先生
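Dear manager, hello! Every month our company has surplus invoices that we issue on others' behalf at a special discount, such as: ordinary invoices, merchandise sales, customs-collected tax, construction and installation, services, inland river transport, advertising, electronics, hardware, machinery, and so on. We apologize for any disturbance. Thank you! If needed, please call: 13266768808. Contact: Mr. Zhang (张先生), 意乐实业有限公司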

[Nutch-dev] [jira] Resolved: (NUTCH-201) add support for subcollections

2006-06-05 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-201?page=all ] Sami Siren resolved NUTCH-201: -- Resolution: Fixed just committed this add support for subcollections -- Key: NUTCH-201 URL:

[Nutch-dev] [jira] Resolved: (NUTCH-298) if a 404 for a robots.txt is returned a NPE is thrown

2006-06-05 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ] Jerome Charron resolved NUTCH-298: -- Resolution: Fixed Committed + some unit tests to reproduce. Thanks Stefan. As you mentioned in a previous mail, I agree that the RobotRulesParser