[
https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508820
]
Sami Siren commented on NUTCH-392:
--
But why is parse_text_block's size so close to parse_text
data of parse_text
[
https://issues.apache.org/jira/browse/NUTCH-499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508449
]
Sami Siren commented on NUTCH-499:
--
+1, seems good to me
Refactor LinkDb and LinkDbMerger to reuse code
[
https://issues.apache.org/jira/browse/NUTCH-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508222
]
Sami Siren commented on NUTCH-434:
--
You missed one ObjectWritable in Indexer (the one that hit my head too hard
[
https://issues.apache.org/jira/browse/NUTCH-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508239
]
Sami Siren commented on NUTCH-434:
--
Now there is a good chance that you knew all this :). If your point was that
[
https://issues.apache.org/jira/browse/NUTCH-496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren updated NUTCH-496:
-
Attachment: nutch-496.txt
This patch changes LanguageIdentifier to have NGramProfile per thread instead
[
https://issues.apache.org/jira/browse/NUTCH-496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501266
]
Sami Siren commented on NUTCH-496:
--
I believe the problem is even more severe. Now several threads share the
[
https://issues.apache.org/jira/browse/NUTCH-161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren updated NUTCH-161:
-
Fix Version/s: 1.0.0
Assignee: Sami Siren
Summary: Change Plain text parser to use
[
https://issues.apache.org/jira/browse/NUTCH-161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren resolved NUTCH-161.
--
Resolution: Fixed
I just committed a fix for this, thanks KuroSaka!
Change Plain text parser to use
[
https://issues.apache.org/jira/browse/NUTCH-482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren resolved NUTCH-482.
--
Resolution: Fixed
Fix Version/s: 1.0.0
Remove redundant plugin lib-log4j
[
https://issues.apache.org/jira/browse/NUTCH-483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren resolved NUTCH-483.
--
Resolution: Fixed
Fix Version/s: 1.0.0
Assignee: Sami Siren
remove redundant
[
https://issues.apache.org/jira/browse/NUTCH-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren resolved NUTCH-457.
--
Resolution: Fixed
Create top level dist directory and checkin KEYS file to subversion be
standard
[
https://issues.apache.org/jira/browse/NUTCH-484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren resolved NUTCH-484.
--
Resolution: Fixed
committed and updated site, thanks Gal
Nutch Nightly API link is broken in site
Remove redundant plugin lib-log4j
-
Key: NUTCH-482
URL: https://issues.apache.org/jira/browse/NUTCH-482
Project: Nutch
Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Sami Siren
remove redundant commons-logging jar from ontology plugin
-
Key: NUTCH-483
URL: https://issues.apache.org/jira/browse/NUTCH-483
Project: Nutch
Issue Type: Bug
Affects Versions:
[
https://issues.apache.org/jira/browse/NUTCH-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495229
]
Sami Siren commented on NUTCH-472:
--
Not sure how to turn source code in description into a patch file, but the
[
https://issues.apache.org/jira/browse/NUTCH-456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren resolved NUTCH-456.
--
Resolution: Fixed
committed with minor modifications (used StringBuilder instead of StringBuffer,
[
https://issues.apache.org/jira/browse/NUTCH-446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren reassigned NUTCH-446:
Assignee: Sami Siren
RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt
[
https://issues.apache.org/jira/browse/NUTCH-446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren resolved NUTCH-446.
--
Resolution: Fixed
I just committed this, keep the patches coming Doğacan!
RobotRulesParser should
[
https://issues.apache.org/jira/browse/NUTCH-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren updated NUTCH-469:
-
Attachment: NUTCH-469-2007-05-09.txt.gz
Thanks for putting this together, I briefly checked through the
[
https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494531
]
Sami Siren commented on NUTCH-477:
--
I don't feel strongly about this but could enums be used instead of static
[
https://issues.apache.org/jira/browse/NUTCH-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494534
]
Sami Siren commented on NUTCH-472:
--
have a patch?
NullPointerException in ZipTextExtractor if no MIME type for
[
https://issues.apache.org/jira/browse/NUTCH-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494537
]
Sami Siren commented on NUTCH-476:
--
md5 sum (or any other configurable digest) is already calculated in fetcher
or
[
https://issues.apache.org/jira/browse/NUTCH-446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492850
]
Sami Siren commented on NUTCH-446:
--
+1
RobotRulesParser should ignore Crawl-delay values of other bots in
[
https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491305
]
Sami Siren commented on NUTCH-471:
--
Isn't the DCL declared to be broken?
We could perhaps instead instantiate
[
https://issues.apache.org/jira/browse/NUTCH-473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren resolved NUTCH-473.
--
Resolution: Duplicate
duplicate of NUTCH-456
ExcelExtractor performance bad due to String
[
https://issues.apache.org/jira/browse/NUTCH-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren resolved NUTCH-432.
--
Resolution: Fixed
Fix Version/s: 0.9.0
The problem above has been fixed by ab.
JAVA_PLATFORM
[
https://issues.apache.org/jira/browse/NUTCH-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren reopened NUTCH-432:
--
After this got applied there's this error printed on console when run on FC5:
bin/nutch: line 152:
[
https://issues.apache.org/jira/browse/NUTCH-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479419
]
Sami Siren commented on NUTCH-457:
--
+1
Create top level dist directory and checkin KEYS file to subversion be
[
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474466
]
Sami Siren commented on NUTCH-247:
--
I am not seeing how this would grow into multiple sets of checking rules.
[
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474257
]
Sami Siren commented on NUTCH-247:
--
Setting even a bogus agent name is an insignificant effort compared to the
[
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474269
]
Sami Siren commented on NUTCH-247:
--
I am OK with the efforts making things more user friendly but still doing
[
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473990
]
Sami Siren commented on NUTCH-247:
--
Agent name has actually only relevance in http. IMO not setting agent name
[
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473148
]
Sami Siren commented on NUTCH-443:
--
Didn't know this, will change this too. (Why is Nutch not using this class
in
[
https://issues.apache.org/jira/browse/NUTCH-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467916
]
Sami Siren commented on NUTCH-258:
--
I haven't noticed this being a problem for me, so no objections from here.
[
https://issues.apache.org/jira/browse/NUTCH-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467491
]
Sami Siren commented on NUTCH-433:
--
ok, now it is committed, sorry.
java.io.EOFException in newer nightlies in
[
https://issues.apache.org/jira/browse/NUTCH-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren reassigned NUTCH-433:
Assignee: Sami Siren
java.io.EOFException in newer nightlies in mergesegs or indexing from
Replace usage of ObjectWritable with something based on GenericWritable
---
Key: NUTCH-434
URL: https://issues.apache.org/jira/browse/NUTCH-434
Project: Nutch
Issue Type:
[
https://issues.apache.org/jira/browse/NUTCH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465493
]
Sami Siren commented on NUTCH-61:
-
Haven't looked at the patch (tm)
How would one manage segments after something like
[
https://issues.apache.org/jira/browse/NUTCH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465540
]
Sami Siren commented on NUTCH-61:
-
ok, so in my usual use case where there are far more urls than I can fetch this
[
https://issues.apache.org/jira/browse/NUTCH-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren resolved NUTCH-430.
--
Resolution: Fixed
Fix Version/s: 0.9.0
committed in revision 495732 with additional whitespace
integer overflow in HashComparator.compare
--
Key: NUTCH-430
URL: https://issues.apache.org/jira/browse/NUTCH-430
Project: Nutch
Issue Type: Bug
Components: generator
Affects Versions:
[
https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464347
]
Sami Siren commented on NUTCH-422:
--
Is there a reason for the two jakarta-regexp jars (v 1.2 and 1.3) in source
[
https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464351
]
Sami Siren commented on NUTCH-422:
--
a couple more points:
-source files use tabs for indentation
-headers of files
[
https://issues.apache.org/jira/browse/NUTCH-428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren resolved NUTCH-428.
--
Resolution: Fixed
Fix Version/s: 0.9.0
Most probably you don't have agent name configured in
[
https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463059
]
Sami Siren commented on NUTCH-420:
--
The feather 'Licensed for inclusion in ASF works' is missing from 2nd patch.
[
https://issues.apache.org/jira/browse/NUTCH-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren resolved NUTCH-325.
--
Resolution: Fixed
just committed this with additional junit testcase. Thanks Stefan!
UrlFilters.java
[
https://issues.apache.org/jira/browse/NUTCH-422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren reassigned NUTCH-422:
Assignee: Sami Siren
index-extra plugin creates additional fields in the index, based on
[
https://issues.apache.org/jira/browse/NUTCH-421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren reassigned NUTCH-421:
Assignee: Sami Siren
Allow predeterminate running order of index filters
[
https://issues.apache.org/jira/browse/NUTCH-421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren resolved NUTCH-421.
--
Resolution: Fixed
Fix Version/s: 0.9.0
Thanks Alan,
I just committed this with additionali
[
http://issues.apache.org/jira/browse/NUTCH-418?page=comments#action_12460282 ]
Sami Siren commented on NUTCH-418:
--
We should perhaps include the rest of changes made in NUTCH-362.
Fixes parsing of XHTML (e.g. title)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=all ]
Sami Siren updated NUTCH-272:
-
Thanks Doug, that makes more sense now. Running URLFilters.filter() during
Generate seems very handy,
albeit costly for large crawls. (Should have an option to turn
[
http://issues.apache.org/jira/browse/NUTCH-415?page=comments#action_12458814 ]
Sami Siren commented on NUTCH-415:
--
Please also consider the performance implications. If this marking will add
significant performance overhead then it would be
[
http://issues.apache.org/jira/browse/NUTCH-248?page=comments#action_12457437 ]
Sami Siren commented on NUTCH-248:
--
Seems like the latest Java has built-in support
http://java.sun.com/javase/6/docs/api/java/net/IDN.html
add support for
[
http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12453975 ]
Sami Siren commented on NUTCH-339:
--
perhaps that exception is just a consequence of something else like this:
2006-11-27 07:35:09,434 INFO fetcher.Fetcher2 -
[
http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12454045 ]
Sami Siren commented on NUTCH-339:
--
I am running with 300 threads, and in parsing mode
thread dump shows:
191 threads waiting on condition
at
[
http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12453798 ]
Sami Siren commented on NUTCH-339:
--
When running a test fetch with Fetcher2 I encountered this error after fetching
a few thousand pages (of a 1 million segment):
[
http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12452522 ]
Sami Siren commented on NUTCH-339:
--
patch applies ok, but there's this error when I try to compile:
compile:
[echo] Compiling plugin: lib-http
[javac]
[
http://issues.apache.org/jira/browse/NUTCH-251?page=comments#action_12452321 ]
Sami Siren commented on NUTCH-251:
--
Are you thinking of something like UI extension point like in contrib/web2 ?
not necessarily, that was also a quick hack I
Fix LinkDB Usage - implementation mismatch
--
Key: NUTCH-404
URL: http://issues.apache.org/jira/browse/NUTCH-404
Project: Nutch
Issue Type: Bug
Components: linkdb
Reporter: Sami
[ http://issues.apache.org/jira/browse/NUTCH-404?page=all ]
Sami Siren resolved NUTCH-404.
--
Fix Version/s: 0.9.0
Resolution: Fixed
fixed
Fix LinkDB Usage - implementation mismatch
--
Key:
[ http://issues.apache.org/jira/browse/NUTCH-403?page=all ]
Sami Siren resolved NUTCH-403.
--
Fix Version/s: 0.9.0
Resolution: Fixed
Committed to trunk with change to name of conf parameter.
Make URL filtering optional in Generator
Make URL filtering optional in Generator
Key: NUTCH-403
URL: http://issues.apache.org/jira/browse/NUTCH-403
Project: Nutch
Issue Type: Improvement
Components: generator
[ http://issues.apache.org/jira/browse/NUTCH-403?page=all ]
Sami Siren updated NUTCH-403:
-
Attachment: nutch-generate-optional-filtering.patch
Attached patch adds option -noFilter to crawl command (and additional parameter
to java api) to control if
[ http://issues.apache.org/jira/browse/NUTCH-403?page=all ]
Sami Siren updated NUTCH-403:
-
The command that is altered is generate (Generator) not crawl.
Make URL filtering optional in Generator
Key:
[ http://issues.apache.org/jira/browse/NUTCH-388?page=all ]
Sami Siren resolved NUTCH-388.
--
Fix Version/s: 0.9.0
Resolution: Fixed
This is now fixed (rev 476617). Thanks for reporting it!
nutch-default.xml has outdated example for urlfilter.order
[
http://issues.apache.org/jira/browse/NUTCH-400?page=comments#action_12449440 ]
Sami Siren commented on NUTCH-400:
--
I updated headers and added missing headers to .java files in trunk.
There are still plenty of (.xml, .jsp, html, properties)
[ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]
Sami Siren resolved NUTCH-395.
--
Fix Version/s: 0.9.0
Resolution: Fixed
applied to trunk with some additional whitespace changes.
Increase fetching speed
---
[
http://issues.apache.org/jira/browse/NUTCH-401?page=comments#action_12449485 ]
Sami Siren commented on NUTCH-401:
--
Shouldn't this directory be configurable? I found it because of permission
issues (/tmp isn't globally writable to catch stuff
[ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]
Sami Siren updated NUTCH-395:
-
Attachment: NUTCH-395-trunk-metadata-only-2.patch
Additional change to Content cuts down time needed in effective fetching. Now
seeing speeds like 45 pages/sec also
[ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]
Sami Siren updated NUTCH-395:
-
Affects Version/s: 0.9.0
Increase fetching speed
---
Key: NUTCH-395
URL:
[ http://issues.apache.org/jira/browse/NUTCH-395?page=all ]
Sami Siren updated NUTCH-395:
-
Attachment: NUTCH-395-trunk-metadata-only.patch
Here's a first stab at svn trunk version of nutch that just optimizes the use
of metadata and splits it into two
[
http://issues.apache.org/jira/browse/NUTCH-398?page=comments#action_12448949 ]
Sami Siren commented on NUTCH-398:
--
Did anyone try to use single machine but not with local mode but with nutch
acting like one node? Maybe this is workaround
Change CommandRunner to use concurrent api from jdk
---
Key: NUTCH-399
URL: http://issues.apache.org/jira/browse/NUTCH-399
Project: Nutch
Issue Type: Task
Reporter: Sami Siren
[ http://issues.apache.org/jira/browse/NUTCH-399?page=all ]
Sami Siren resolved NUTCH-399.
--
Fix Version/s: 0.9.0
Resolution: Fixed
Change CommandRunner to use concurrent api from jdk
---
Update add missing license headers
Key: NUTCH-400
URL: http://issues.apache.org/jira/browse/NUTCH-400
Project: Nutch
Issue Type: Task
Affects Versions: 0.8.2, 0.9.0
Reporter: Sami Siren
[
http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12448795 ]
Sami Siren commented on NUTCH-395:
--
have you measured what made the biggest impact on performance - changes to
Metadata, or
changes to IO in FetcherOutput?
did
[
http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12445956 ]
Sami Siren commented on NUTCH-395:
--
have you measured what made the biggest impact on performance - changes to
Metadata, or
changes to IO in FetcherOutput?
did
[
http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12445999 ]
Sami Siren commented on NUTCH-395:
--
settings. I.e. if someone created a segment with high max # of outlinks, you
should still be able
to read it and process
Increase fetching speed
---
Key: NUTCH-395
URL: http://issues.apache.org/jira/browse/NUTCH-395
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.8.1
Reporter: Sami
[ http://issues.apache.org/jira/browse/NUTCH-379?page=all ]
Sami Siren updated NUTCH-379:
-
Fix Version/s: (was: 0.8.1)
(was: 0.8)
cannot fix released versions
ParseUtil does not pass through the content's URL to the
Add methods to control runtime behaviour of NutchBean
-
Key: NUTCH-376
URL: http://issues.apache.org/jira/browse/NUTCH-376
Project: Nutch
Issue Type: Improvement
Affects Versions:
Link to 0.8.x apidocs broken on website
---
Key: NUTCH-375
URL: http://issues.apache.org/jira/browse/NUTCH-375
Project: Nutch
Issue Type: Bug
Components: documentation
Reporter: Sami
[ http://issues.apache.org/jira/browse/NUTCH-375?page=all ]
Sami Siren resolved NUTCH-375.
--
Resolution: Fixed
this was fixed by copying apidocs from 0.8.1 to
/www/lucene.apache.org/nutch/apidocs-0.8.x/
as soon as next rsync occurs it should be fine,
[
http://issues.apache.org/jira/browse/NUTCH-351?page=comments#action_12438013 ]
Sami Siren commented on NUTCH-351:
--
As the plugin name says it by using a protocol-forwardproxy acts as a protocol
plugin and does not need additional protocol
[ http://issues.apache.org/jira/browse/NUTCH-266?page=all ]
Sami Siren closed NUTCH-266.
hadoop bug when doing updatedb
--
Key: NUTCH-266
URL: http://issues.apache.org/jira/browse/NUTCH-266
[ http://issues.apache.org/jira/browse/NUTCH-105?page=all ]
Sami Siren closed NUTCH-105.
Network error during robots.txt fetch causes file to be ignored
---
Key: NUTCH-105
[ http://issues.apache.org/jira/browse/NUTCH-338?page=all ]
Sami Siren closed NUTCH-338.
Remove the text parser as an option for parsing PDF files in parse-plugins.xml
--
[ http://issues.apache.org/jira/browse/NUTCH-344?page=all ]
Sami Siren closed NUTCH-344.
Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
-
Key:
[ http://issues.apache.org/jira/browse/NUTCH-318?page=all ]
Sami Siren closed NUTCH-318.
log4j not proper configured, readdb doesnt give any information
---
Key: NUTCH-318
[ http://issues.apache.org/jira/browse/NUTCH-370?page=all ]
Sami Siren closed NUTCH-370.
Resolution: Duplicate
actually this is a duplicate of #361
Generator looses urls when run with LocalJobRunner
--
[ http://issues.apache.org/jira/browse/NUTCH-370?page=all ]
Sami Siren updated NUTCH-370:
-
Summary: Generator looses urls when run with LocalJobRunner (was:
Generator loosed urls when run with LocalJobRunner)
Generator looses urls when run with
[ http://issues.apache.org/jira/browse/NUTCH-105?page=all ]
Sami Siren resolved NUTCH-105.
--
Resolution: Fixed
This is now committed, thanks!
Network error during robots.txt fetch causes file to be ignored
[ http://issues.apache.org/jira/browse/NUTCH-367?page=all ]
Sami Siren resolved NUTCH-367.
--
Fix Version/s: 0.9.0
Resolution: Fixed
Assignee: Sami Siren
I just committed a fix for this together with testcase, thanks for reporting it.
[
http://issues.apache.org/jira/browse/NUTCH-368?page=comments#action_1243 ]
Sami Siren commented on NUTCH-368:
--
IMO a place for stuff like this is in hadoop more than nutch and I would like
to see this implemented there.
Mainly because I
[
http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12435175 ]
Sami Siren commented on NUTCH-365:
--
looks ok to me,
the ugly (with &amp;) regexps could perhaps be put inside <![CDATA[ ]]> elements
in generator there's
+ try {
+
Remove parse-text from unsupported filetypes in parse-plugins.xml
-
Key: NUTCH-362
URL: http://issues.apache.org/jira/browse/NUTCH-362
Project: Nutch
Issue Type: Bug
[
http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12433169 ]
Sami Siren commented on NUTCH-361:
--
The / by 0 was due to bug in testcase. Now the testcase fails about 50% of
time. I also noticed that the number of reduce
[
http://issues.apache.org/jira/browse/NUTCH-208?page=comments#action_12433175 ]
Sami Siren commented on NUTCH-208:
--
This looks like a good addition to Nutch, couple of comments:
-The added comments in HttpResponse should be removed.
-Any
[
http://issues.apache.org/jira/browse/NUTCH-273?page=comments#action_12433183 ]
Sami Siren commented on NUTCH-273:
--
+1 for not following redirects immediately - simplify fetcher logic.
I would also like to see a flexible (configurable?)
[
http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12433185 ]
Sami Siren commented on NUTCH-339:
--
Andrzej,
are you still working with this or should I proceed as I originally planned ;)
Refactor nutch to allow fetcher