[Nutch-dev] [jira] Commented: (NUTCH-343) Index MP3 SHA1 hashes

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-343?page=comments#action_12428920 ] Stefan Groschupf commented on NUTCH-343: Thanks for the contribution, also that your patch has a test. :-) Just a small comment from taking a first look to

[Nutch-dev] [jira] Commented: (NUTCH-342) Nutch commands log to nutch/logs/hadoop.logs by default

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-342?page=comments#action_12428922 ] Stefan Groschupf commented on NUTCH-342: We should cleanup logging in nutch in general asap! The way things are configured by today is everything else than

[Nutch-dev] [jira] Commented: (NUTCH-347) Build: plugins' Jars not found

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-347?page=comments#action_12428915 ] Stefan Groschupf commented on NUTCH-347: Please submit this patch! Thanks! Build: plugins' Jars not found --

[Nutch-dev] some questions

2006-08-18 Thread anton
I suggest to use nutch 0.8 on several computers with DFS. But I'm worried about nutch's requirements to HDD free space. For example, suppose I have 1) server with job tracker and namenode 2) 5 servers with task trackers and 20 Gb HDDs 3) 5 servers with datanode and 20 Gb HDDs also

[Nutch-dev] [jira] Updated: (NUTCH-341) IndexMerger now deletes entire workingdir after completing

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ] Stefan Groschupf updated NUTCH-341: --- Attachment: doNotDeleteTmpIndexMergeDirV1.patch +1. I agree it makes completely no sense to be required creating a tmp folder manually and nutch deletes

[Nutch-dev] [jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ] Stefan Groschupf updated NUTCH-337: --- Attachment: respectFetcherParsePropertyV1.patch Hi Jeremy, thanks for catching this. Attached a fix. Should be easy for a contributor to commit this to

[Nutch-dev] [jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ] Stefan Groschupf updated NUTCH-337: --- Priority: Major (was: Trivial) Fetcher ignores the fetcher.parse value configured in config file

[Nutch-dev] [jira] Updated: (NUTCH-336) Harvested links shouldn't get db.score.injected in addition to inbound contributions

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-336?page=all ] Stefan Groschupf updated NUTCH-336: --- Priority: Critical (was: Minor) I think that is a fundamental problem since I observe there are many pages e.g. presentation slides that have exactly

[Nutch-dev] [jira] Commented: (NUTCH-345) Add support for Content-Encoding: deflated

2006-08-18 Thread Pascal Beis (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-345?page=comments#action_12428942 ] Pascal Beis commented on NUTCH-345: --- The DeflateUtils are called by HttpBase in the lib-http plugin, which in turn is called by HttpResponse in the protocol-http

Re: [Nutch-dev] [jira] Resolved: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-08-18 Thread Andrzej Bialecki
Stefan Groschupf (JIRA) wrote: [ http://issues.apache.org/jira/browse/NUTCH-322?page=all ] Stefan Groschupf resolved NUTCH-322. Resolution: Duplicate duplicate of NUTCH-353 ??? If anything, NUTCH-353 is a duplicate of this issue, as it was

[Nutch-dev] [jira] Reopened: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-08-18 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-322?page=all ] Andrzej Bialecki reopened NUTCH-322: - Assignee: Andrzej Bialecki Re-opening - this issue is not resolved yet. Fetcher discards ProtocolStatus, doesn't store redirected

[Nutch-dev] [jira] Commented: (NUTCH-345) Add support for Content-Encoding: deflated

2006-08-18 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-345?page=comments#action_12428961 ] Andrzej Bialecki commented on NUTCH-345: - Looks ok to me. Minor addition - protocol-httpclient Http.java and HttpResponse.java should be modified too, to

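[Nutch-dev] 互联无限 is recruiting local franchise owners

2006-08-18 Thread asd
互联无限 is a network platform that combines advanced concepts such as the proximity network, wireless and wired internet, and content interconnection. 互联无限 mainly provides mobile site building and local information search, which falls under the proximity-network concept. It mainly covers local product and service information on clothing, food, housing, transport and other aspects of daily life, so the development of this business and its customer service should be locally based. Compared with 互联无限, other portal sites have a high barrier to entry, with a typical franchise fee of 500,000 RMB; they lack wireless internet and do not provide an SMS value-added service channel.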

[Nutch-dev] [jira] Updated: (NUTCH-341) IndexMerger now deletes entire workingdir after completing

2006-08-18 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ] Andrzej Bialecki updated NUTCH-341: Attachment: patch-v2.txt I propose another variant of this patch. This version allows you to run multiple mergers at the same time, with the same

[Nutch-dev] Adding Database Field

2006-08-18 Thread Levent Ulutas
Hi there, I'm from Germany. My english isn't so good. i'm a beginner and i have a Question about Nutch. I want to add a new Field Price (String) in the Database. I need it because i want to search for prices from some products and i want to sort the price in the result.. Can someone help

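[Nutch-dev] Invoices issued on your behalf

2006-08-18 Thread Mr. Wen
Dear company management and finance staff, greetings! Because our company has a large volume of input invoices, we have surplus invoices each month that can be issued to outside parties, and we are also entrusted by many companies to act as agent for issuing their surplus invoices, in order to reduce unnecessary losses, for mutual benefit, and to meet your company's urgent needs in business operations and bookkeeping. Our company enjoys preferential tax policies, with a fixed monthly tax under state tax incentives, has long cooperated with many enterprises in provinces and cities across the country, and has accumulated rich experience in tax filing and bookkeeping. We are now offering an (invoice issuing) service to outside parties, covering a wide range of industries: domestic goods sales, value-added tax, construction, advertising, transport, services, etc. The tax rate is low and the invoices are absolutely genuine. If your enterprise (company) is in any of the following situations: 1. the company has no preferential tax policy and pays relatively high taxes; 2. the company sells goods externally but has temporarily not obtained official invoices;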

[Nutch-dev] [jira] Commented: (NUTCH-341) IndexMerger now deletes entire workingdir after completing

2006-08-18 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-341?page=comments#action_12429029 ] Sami Siren commented on NUTCH-341: -- +1 for v2 IndexMerger now deletes entire workingdir after completing

[Nutch-dev] [jira] Commented: (NUTCH-338) Remove the text parser as an option for parsing PDF files in parse-plugins.xml

2006-08-18 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-338?page=comments#action_12429033 ] Chris A. Mattmann commented on NUTCH-338: - Hi Andrzej, A patch is available that you can apply quickly to remove the text parser as an option for pdf.

[Nutch-dev] [jira] Resolved: (NUTCH-347) Build: plugins' Jars not found

2006-08-18 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-347?page=all ] Sami Siren resolved NUTCH-347. -- Fix Version/s: 0.9.0 Resolution: Fixed Assignee: Sami Siren committed Build: plugins' Jars not found --

[Nutch-dev] [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-08-18 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12429035 ] Chris A. Mattmann commented on NUTCH-258: - Hi Folks, A patch is available on this issue. Has anyone who was experiencing the original problem tried out

[Nutch-dev] [jira] Resolved: (NUTCH-338) Remove the text parser as an option for parsing PDF files in parse-plugins.xml

2006-08-18 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-338?page=all ] Sami Siren resolved NUTCH-338. -- Resolution: Fixed This is now committed, thank you. The patch was broken, hopefully I got it right. Remove the text parser as an option for parsing PDF files in

[Nutch-dev] [jira] Commented: (NUTCH-338) Remove the text parser as an option for parsing PDF files in parse-plugins.xml

2006-08-18 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-338?page=comments#action_12429042 ] Chris A. Mattmann commented on NUTCH-338: - Hi Sami, Thanks much. It's weird that it was broken seeing as it was a one line patch, however, I tried it

[Nutch-dev] [jira] Commented: (NUTCH-338) Remove the text parser as an option for parsing PDF files in parse-plugins.xml

2006-08-18 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-338?page=comments#action_12429044 ] Sami Siren commented on NUTCH-338: -- yeah, svn diff from commandline is the winner. Remove the text parser as an option for parsing PDF files in parse-plugins.xml

Re: [Nutch-dev] Thoughts on Parser design and dependencies

2006-08-18 Thread Andrzej Bialecki
Jukka Zitting wrote: The Parser interface is also bound to the ideas of fetching content from the network and indexing it using a standard content model through the Content and Parse dependencies. For the Tika project I'd like to look for ways to generalize this, as neither of these ideas

[Nutch-dev] [jira] Closed: (NUTCH-341) IndexMerger now deletes entire workingdir after completing

2006-08-18 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ] Andrzej Bialecki closed NUTCH-341. --- Fix Version/s: 0.8.1 0.9.0 Resolution: Fixed Fixed. Thanks! IndexMerger now deletes entire workingdir after completing

Re: [Nutch-dev] 0.8 not loading plugins

2006-08-18 Thread Chris Stephens
By manually copying the custom-meta directory in build/plugin to plugin/ I was able to get at least some debug output in my log. It doesn't really tell me much, any idea why it wouldn't be loading the plugin when it has the correct entry in my nutch-site.xml? 2006-08-18 13:34:35,007 DEBUG

Re: [Nutch-dev] Thoughts on Parser design and dependencies

2006-08-18 Thread Sami Siren
Andrzej Bialecki wrote: Jukka Zitting wrote: The Parser interface is also bound to the ideas of fetching content from the network and indexing it using a standard content model through the Content and Parse dependencies. For the Tika project I'd like to look for ways to generalize this, as

Re: [Nutch-dev] Thoughts on Parser design and dependencies

2006-08-18 Thread Andrzej Bialecki
Sami Siren wrote: Andrzej Bialecki wrote: Jukka Zitting wrote: The Parser interface is also bound to the ideas of fetching content from the network and indexing it using a standard content model through the Content and Parse dependencies. For the Tika project I'd like to look for ways to

Re: [Nutch-dev] Thoughts on Parser design and dependencies

2006-08-18 Thread Sami Siren
Andrzej Bialecki wrote: Sami Siren wrote: Andrzej Bialecki wrote: Jukka Zitting wrote: The Parser interface is also bound to the ideas of fetching content from the network and indexing it using a standard content model through the Content and Parse dependencies. For the Tika project I'd

Re: [Nutch-dev] Thoughts on Parser design and dependencies

2006-08-18 Thread Andrzej Bialecki
Sami Siren wrote: Original motivation for this was http headers and meta tags, which can have multiple values. Another case is the language identification, where the same key may have multiple values, coming from different sources. Additionally, MapWritable supports any Writable, which is

[Nutch-dev] [jira] Updated: (NUTCH-105) Network error during robots.txt fetch causes file to be ignored

2006-08-18 Thread Greg Kim (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-105?page=all ] Greg Kim updated NUTCH-105: --- Attachment: RobotRulesParser.patch This patch will not cache the robots.txt on network errors/delays; currently we cache EMPTY_RULES (allows everything) for a host X on

[Nutch-dev] [jira] Updated: (NUTCH-105) Network error during robots.txt fetch causes file to be ignored

2006-08-18 Thread Greg Kim (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-105?page=all ] Greg Kim updated NUTCH-105: --- Affects Version/s: 0.8.1 0.9.0 Network error during robots.txt fetch causes file to be ignored

[Nutch-dev] architecture question/thoughts

2006-08-18 Thread bruce
hi... i'm playing around with an app that parses websites and extracts information, returning certain information to my system. my primary issue has to do with how i might architect the system to place the information into my database. i'm using/testing with mysql. my question has to do with how

[Nutch-dev] [jira] Updated: (NUTCH-338) Remove the text parser as an option for parsing PDF files in parse-plugins.xml

2006-08-18 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-338?page=all ] Sami Siren updated NUTCH-338: - Fix Version/s: 0.8.1 Remove the text parser as an option for parsing PDF files in parse-plugins.xml