[
http://issues.apache.org/jira/browse/NUTCH-343?page=comments#action_12428920 ]
Stefan Groschupf commented on NUTCH-343:
Thanks for the contribution, and for including a test with your patch. :-)
Just a small comment from taking a first look to
[
http://issues.apache.org/jira/browse/NUTCH-342?page=comments#action_12428922 ]
Stefan Groschupf commented on NUTCH-342:
We should clean up logging in Nutch in general ASAP!
The way things are configured today is anything but
[
http://issues.apache.org/jira/browse/NUTCH-347?page=comments#action_12428915 ]
Stefan Groschupf commented on NUTCH-347:
Please submit this patch!
Thanks!
Build: plugins' Jars not found
--
I plan to use Nutch 0.8 on several computers with DFS, but I'm worried
about Nutch's free disk space requirements.
For example, suppose I have
1) a server with the job tracker and namenode
2) 5 servers with task trackers and 20 GB HDDs
3) 5 servers with datanodes, also with 20 GB HDDs
[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ]
Stefan Groschupf updated NUTCH-341:
---
Attachment: doNotDeleteTmpIndexMergeDirV1.patch
+1.
I agree it makes absolutely no sense to be required to create a tmp folder
manually and nutch deletes
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ]
Stefan Groschupf updated NUTCH-337:
---
Attachment: respectFetcherParsePropertyV1.patch
Hi Jeremy, thanks for catching this. Attached a fix. Should be easy for a
contributor to commit this to
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ]
Stefan Groschupf updated NUTCH-337:
---
Priority: Major (was: Trivial)
Fetcher ignores the fetcher.parse value configured in config file
[ http://issues.apache.org/jira/browse/NUTCH-336?page=all ]
Stefan Groschupf updated NUTCH-336:
---
Priority: Critical (was: Minor)
I think that is a fundamental problem, since I observe there are many pages, e.g.
presentation slides, that have exactly
[
http://issues.apache.org/jira/browse/NUTCH-345?page=comments#action_12428942 ]
Pascal Beis commented on NUTCH-345:
---
The DeflateUtils are called by HttpBase in the lib-http plugin, which in turn
is called by
HttpResponse in the protocol-http
Stefan Groschupf (JIRA) wrote:
[ http://issues.apache.org/jira/browse/NUTCH-322?page=all ]
Stefan Groschupf resolved NUTCH-322.
Resolution: Duplicate
duplicate of NUTCH-353
??? If anything, NUTCH-353 is a duplicate of this issue, as it was
[ http://issues.apache.org/jira/browse/NUTCH-322?page=all ]
Andrzej Bialecki reopened NUTCH-322:
-
Assignee: Andrzej Bialecki
Re-opening - this issue is not resolved yet.
Fetcher discards ProtocolStatus, doesn't store redirected
[
http://issues.apache.org/jira/browse/NUTCH-345?page=comments#action_12428961 ]
Andrzej Bialecki commented on NUTCH-345:
-
Looks ok to me. Minor addition - protocol-httpclient Http.java and
HttpResponse.java should be modified too, to
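互联无限 is a network platform that combines advanced concepts such as the "proximity network", wireless and wired interconnection, and content interconnection. It mainly provides mobile website building and local information search, which fall under the proximity-network concept. It mainly covers local product and service information on clothing, food, housing, transport, and other aspects of daily life, so the development of this business and its customer service should be locally based. Compared with 互联无限, other portal sites have a high entry threshold, with a typical franchise fee of 500,000 RMB; they lack wireless interconnection and do not provide an SMS value-added service channel.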
[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ]
Andrzej Bialecki updated NUTCH-341:
Attachment: patch-v2.txt
I propose another variant of this patch. This version allows you to run
multiple mergers at the same time, with the same
Hi there,
I'm from Germany. My English isn't so good.
I'm a beginner and I have a question about Nutch.
I want to add a new field, Price (String), to the database.
I need it because I want to search for the prices of some products and I want to
sort the results by price.
Can someone help
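Dear company management and finance staff,
Hello!
Because our company has a large volume of input invoices, we have surplus invoices every month that can be issued on behalf of others,
and we are also entrusted by many companies to act as an agent for issuing their surplus invoices,
in order to reduce unnecessary losses, for mutual benefit, and to relieve your company's urgent needs in business operations, bookkeeping adjustments, and accounting.
Our company enjoys preferential tax policies and a fixed monthly tax rate under national tax incentives,
has long cooperated with many enterprises in provinces and cities across the country, and has accumulated rich experience in tax filing and accounting.
We are now offering an invoice-issuing service to the public, and the industries we cover are broad:
domestic merchandise sales, value-added tax, construction, advertising, transport, services, and so on,
with low rates and absolutely genuine invoices.
If your enterprise (company) is in any of the following situations:
1. The company has no preferential tax policy and its tax burden is on the high side;
2. The company sells goods externally but has not yet obtained official invoices;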
[
http://issues.apache.org/jira/browse/NUTCH-341?page=comments#action_12429029 ]
Sami Siren commented on NUTCH-341:
--
+1 for v2
IndexMerger now deletes entire workingdir after completing
[
http://issues.apache.org/jira/browse/NUTCH-338?page=comments#action_12429033 ]
Chris A. Mattmann commented on NUTCH-338:
-
Hi Andrzej,
A patch is available that you can apply quickly to remove the text parser as
an option for pdf.
[ http://issues.apache.org/jira/browse/NUTCH-347?page=all ]
Sami Siren resolved NUTCH-347.
--
Fix Version/s: 0.9.0
Resolution: Fixed
Assignee: Sami Siren
committed
Build: plugins' Jars not found
--
[
http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12429035 ]
Chris A. Mattmann commented on NUTCH-258:
-
Hi Folks,
A patch is available on this issue. Has anyone who was experiencing the
original problem tried out
[ http://issues.apache.org/jira/browse/NUTCH-338?page=all ]
Sami Siren resolved NUTCH-338.
--
Resolution: Fixed
This is now committed, thank you.
The patch was broken, hopefully I got it right.
Remove the text parser as an option for parsing PDF files in
[
http://issues.apache.org/jira/browse/NUTCH-338?page=comments#action_12429042 ]
Chris A. Mattmann commented on NUTCH-338:
-
Hi Sami,
Thanks much. It's weird that it was broken, seeing as it was a one-line patch;
however, I tried it
[
http://issues.apache.org/jira/browse/NUTCH-338?page=comments#action_12429044 ]
Sami Siren commented on NUTCH-338:
--
yeah, svn diff from commandline is the winner.
Remove the text parser as an option for parsing PDF files in parse-plugins.xml
Jukka Zitting wrote:
The Parser interface is also bound to the ideas of fetching content
from the network and indexing it using a standard content model
through the Content and Parse dependencies. For the Tika project I'd
like to look for ways to generalize this, as neither of these ideas
[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ]
Andrzej Bialecki closed NUTCH-341.
---
Fix Version/s: 0.8.1
0.9.0
Resolution: Fixed
Fixed. Thanks!
IndexMerger now deletes entire workingdir after completing
By manually copying the custom-meta directory in build/plugin to
plugin/ I was able to get at least some debug output in my log. It
doesn't really tell me much, any idea why it wouldn't be loading the
plugin when it has the correct entry in my nutch-site.xml?
2006-08-18 13:34:35,007 DEBUG
Andrzej Bialecki wrote:
Jukka Zitting wrote:
The Parser interface is also bound to the ideas of fetching content
from the network and indexing it using a standard content model
through the Content and Parse dependencies. For the Tika project I'd
like to look for ways to generalize this, as
Sami Siren wrote:
Andrzej Bialecki wrote:
Jukka Zitting wrote:
The Parser interface is also bound to the ideas of fetching content
from the network and indexing it using a standard content model
through the Content and Parse dependencies. For the Tika project I'd
like to look for ways to
Andrzej Bialecki wrote:
Sami Siren wrote:
Andrzej Bialecki wrote:
Jukka Zitting wrote:
The Parser interface is also bound to the ideas of fetching content
from the network and indexing it using a standard content model
through the Content and Parse dependencies. For the Tika project I'd
Sami Siren wrote:
Original motivation for this was http headers and meta tags, which
can have multiple values. Another case is the language
identification, where the same key may have multiple values, coming
from different sources. Additionally, MapWritable supports any
Writable, which is
[ http://issues.apache.org/jira/browse/NUTCH-105?page=all ]
Greg Kim updated NUTCH-105:
---
Attachment: RobotRulesParser.patch
This patch will not cache the robots.txt on network errors/delays; currently
we cache EMPTY_RULES (allows everything) for a host X on
[ http://issues.apache.org/jira/browse/NUTCH-105?page=all ]
Greg Kim updated NUTCH-105:
---
Affects Version/s: 0.8.1
0.9.0
Network error during robots.txt fetch causes file to be ignored
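The behaviour described in the NUTCH-105 comment above (do not cache anything when the robots.txt fetch fails for network reasons, so a later request retries) might be sketched roughly as follows. This is not the attached RobotRulesParser.patch: the class RobotRulesCacheSketch, the FetchResult type, and the choice to answer the failing request with allow-all rules are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; this is not Nutch's RobotRulesParser.
class RobotRulesCacheSketch {

    // Stand-in for the EMPTY_RULES mentioned above: rules that allow everything.
    static final String EMPTY_RULES = "";

    // Hypothetical outcome of one robots.txt fetch attempt.
    static final class FetchResult {
        final boolean networkError;
        final String rules;

        private FetchResult(boolean networkError, String rules) {
            this.networkError = networkError;
            this.rules = rules;
        }

        static FetchResult ok(String rules) {
            return new FetchResult(false, rules);
        }

        static FetchResult failed() {
            return new FetchResult(true, EMPTY_RULES);
        }
    }

    // Per-host cache of successfully fetched rules.
    private final Map<String, String> cache = new HashMap<>();

    String rulesFor(String host, Function<String, FetchResult> fetcher) {
        String cached = cache.get(host);
        if (cached != null) {
            return cached;
        }
        FetchResult result = fetcher.apply(host);
        if (result.networkError) {
            // Assumption: answer this one request with allow-all rules,
            // but store nothing, so the next request fetches again.
            return EMPTY_RULES;
        }
        cache.put(host, result.rules);
        return result.rules;
    }

    boolean isCached(String host) {
        return cache.containsKey(host);
    }
}
```

With this shape, a host whose robots.txt timed out once is not treated as "allows everything" for the lifetime of the cache; only a successful fetch populates it.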
hi...
i'm playing around with an app that parses websites and extracts
information, returning certain information to my system.
my primary issue has to do with how i might architect the system to place
the information into my database. i'm using/testing with mysql. my question
has to do with how
[ http://issues.apache.org/jira/browse/NUTCH-338?page=all ]
Sami Siren updated NUTCH-338:
-
Fix Version/s: 0.8.1
Remove the text parser as an option for parsing PDF files in parse-plugins.xml