Re: [Nutch-dev] Next Nutch release

2007-01-19 Thread Stefan Groschupf
Hi Andrzej, thank you for taking the time to comment, I highly value your comments. * I guess that for each case where Nutch seems inappropriate I could give you a counter-example of Nutch being used commercially with much success. I guess it depends on a particular application and

Re: [Nutch-dev] Next Nutch release

2007-01-18 Thread Stefan Groschupf
Hi, I just finished reading all the source code of the Nutch GUI. Personally, I don't like putting a lot of code snippets into JSP files, since it takes a lot of time when refactoring. So how about adopting Velocity/FreeMarker with servlets? In general I agree it is the view layer and should

Re: [Nutch-dev] Next Nutch release

2007-01-18 Thread Stefan Groschupf
working with hadoop he/she should feel free to update the patch and post it in the hadoop jira. Stefan On 18.01.2007, at 15:39, Doug Cutting wrote: Stefan Groschupf wrote: We run the gui in several production environments with patched hadoop code - since this is from our point of view

Re: [Nutch-dev] Next Nutch release

2007-01-18 Thread Stefan Groschupf
don't know if it is the right time to do this job. On 1/19/07, Stefan Groschupf wrote: Hi, I just finished reading all the source code of the Nutch GUI. Personally, I don't like putting a lot of code snippets into JSP files, since it takes a lot of time when refactoring. So how about

Re: [Nutch-dev] Next Nutch release

2007-01-17 Thread Stefan Groschupf
Hi, great to hear people are still working on things. It shows once more that getting something in early would save some effort. :) Just some random comments. We run the gui in several production environments with patched hadoop code - since this is from our point of view the clean approach.

Re: [Nutch-dev] [jira] Created: (NUTCH-408) Plugin development documentation

2006-11-25 Thread Stefan Groschupf
Did you ever browse this: http://wiki.media-style.com/display/nutchDocu/Home Nothing big, but it will give you some ideas, also about plugins. On 25.11.2006, at 06:32, Armel T. Nene wrote: I agree with you that documentation is vital, not just for extending the current version but also for

Re: [Nutch-dev] What's the status of Nutch-GUI?

2006-11-21 Thread Stefan Groschupf
I guess a non-official hadoop jar is out of the question (as it goes on so rapidly). What are the modifications required, couldn't we start without them? Well, then we would have an admin gui that does not work for local installations but only for distributed installations. See:

Re: [Nutch-dev] Fetcher freezes

2006-11-03 Thread Stefan Groschupf
Hi, try to run without any regular expression filter and check if this helps. Let me know if this solves the problem. You may want to do a thread dump and send the log to the list to check where exactly the fetcher freezes. Stefan On 03.11.2006, at 15:53, Aisha wrote: Hi, I don't know why but

Re: [Nutch-dev] How could I test my modify to NutchAnalysis.jj?

2006-09-10 Thread Stefan Groschupf
There is an Eclipse JavaCC plugin. It compiles the grammar for you and you can easily write test code. However, it has its own issues, so you may just want to generate the java files with the nutch ant script and then write unit tests against these files. HTH Stefan On 10.09.2006, at 00:49, heack

Re: [Nutch-dev] Patch Available status?

2006-08-31 Thread Stefan Groschupf
Another alternative would be to construct a new workflow that just adds the Patch Available status and still permits issues to be re-opened. +1

Re: [Nutch-dev] Missing pages anchor text

2006-08-29 Thread Stefan Groschupf
Hi Doug, I'm pretty sure that your problem is related to the deduping of your index. In general the hash of the content of a page is used as key for the dedup tool. We ran into the forwarding problem in another case as well. https://issues.apache.org/jira/browse/NUTCH-353 So maybe we

Re: [Nutch-dev] [Nutch Wiki] Update of RunNutchInEclipse by UrosG

2006-08-29 Thread Stefan Groschupf
Hi, + You may have problems with some imports in parse-mp3 and parse-rtf plugins. Because of incompatibility with the Apache license they were left out of the sources. You can find them here: + + http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ + +

Re: [Nutch-dev] Checking if crawl dir exists ...

2006-08-25 Thread Stefan Groschupf
Hi Michi, what is your motivation for that? Stefan On 25.08.2006, at 06:52, Michael Wechner wrote: Hi I think it would be very useful if the NutchBean would check if the crawl dir exists and throw at least a warning in case it doesn't: Index:

Re: [Nutch-dev] [Fwd: Re: [Nutch Wiki] Update of RenaudRichardet by RenaudRichardet]

2006-08-24 Thread Stefan Groschupf
Hi Renaud, I think you meant editing: http://wiki.apache.org/nutch/RunNutchInEclipse , not http://wiki.apache.org/nutch/RenaudRichardet , right? Right! Sorry for the misunderstanding. I have no idea about your personal page so it would be a bad move to edit it. :-) Thanks again

Re: [Nutch-dev] [Nutch Wiki] Update of RenaudRichardet by RenaudRichardet

2006-08-23 Thread Stefan Groschupf
Hi Renaud, I updated your page with some more details, I hope that is ok for you. Thanks for creating it. Stefan On 23.08.2006, at 11:51, Apache Wiki wrote: Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has

Re: [Nutch-dev] Junit testing, was: Re: [jira] Updated: (NUTCH-357) crawling simulation

2006-08-22 Thread Stefan Groschupf
One must also remember that proper junit testing can be used to verify functionality. There's lot of code currently that is not guarded by unit tests and I hereby invite everybody to participate in this endless effort and make Nutch unit tests better ;) I completely agree!!! Nutch has

[Nutch-dev] [jira] Commented: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-21 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-354?page=comments#action_12429496 ] Stefan Groschupf commented on NUTCH-354: Since this issue is already closed I can not attach the patch file, so I attach it as text within this comment

[Nutch-dev] Fwd: [webspam-announces] Web Spam Collection Announced

2006-08-21 Thread Stefan Groschupf
Hi, maybe some people will find this posting interesting. Webspam is one of the biggest issues for nutch for whole web crawls from my POV. Greetings, Stefan During AIRWeb'06 we announced the availability of the collection. We are currently planning a Web Spam challenge based on the

[Nutch-dev] [jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak

2006-08-21 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12429534 ] Stefan Groschupf commented on NUTCH-356: Hi Enrico, there will be as many PluginRepositories as Configuration objects. So in case you create many

[Nutch-dev] [jira] Created: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-19 Thread Stefan Groschupf (JIRA)
Versions: 0.8 Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.8.1, 0.9.0 MapWritables recycle entries from their internal linked list for performance reasons. The nextEntry of an entry is not reset in case a recyclable entry is found. This can cause wrong
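The recycling problem described in this report can be illustrated with a minimal, self-contained sketch. The class below is a hypothetical stand-in written for this listing and is not the Nutch MapWritable source; the names RecyclingListSketch, Entry, free and countReachable are assumptions. It only shows why the next pointer of a reused entry has to be cleared.

```java
// Hypothetical stand-in for a container that recycles linked-list entries.
// Not the Nutch MapWritable implementation; names are illustrative only.
class RecyclingListSketch {
    static final class Entry {
        String key;
        String value;
        Entry next;
    }

    private Entry head;  // first live entry
    private Entry tail;  // last live entry
    private Entry free;  // head of the chain of entries kept for reuse
    private int size;

    void put(String key, String value) {
        Entry e;
        if (free != null) {
            e = free;          // reuse an old entry
            free = free.next;  // the rest of the old chain stays on the free list
            e.next = null;     // the reset: without it the stale chain is reachable again
        } else {
            e = new Entry();
        }
        e.key = key;
        e.value = value;
        if (head == null) {
            head = e;
        } else {
            tail.next = e;
        }
        tail = e;
        size++;
    }

    // Moves all live entries onto the free list instead of discarding them.
    void clear() {
        if (tail != null) {
            tail.next = free;
            free = head;
        }
        head = null;
        tail = null;
        size = 0;
    }

    int size() {
        return size;
    }

    // Walks the live chain; with the reset in put() this always equals size().
    int countReachable() {
        int n = 0;
        for (Entry e = head; e != null; e = e.next) {
            n++;
        }
        return n;
    }
}
```

If the line marked "the reset" is removed, a list that held three entries, was cleared, and then received one new entry would still expose the two stale entries when iterated.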

[Nutch-dev] [jira] Updated: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-19 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-354?page=all ] Stefan Groschupf updated NUTCH-354: --- Attachment: resetNextEntryInMapWritableV1.patch Resets the next Entry of a recycled entry. MapWritable, nextEntry is not reset when Entries

[Nutch-dev] [jira] Commented: (NUTCH-343) Index MP3 SHA1 hashes

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-343?page=comments#action_12428920 ] Stefan Groschupf commented on NUTCH-343: Thanks for the contribution, also that your patch has a test. :-) Just a small comment from taking a first look

[Nutch-dev] [jira] Commented: (NUTCH-342) Nutch commands log to nutch/logs/hadoop.logs by default

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-342?page=comments#action_12428922 ] Stefan Groschupf commented on NUTCH-342: We should cleanup logging in nutch in general asap! The way things are configured by today is everything else than

[Nutch-dev] [jira] Commented: (NUTCH-347) Build: plugins' Jars not found

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-347?page=comments#action_12428915 ] Stefan Groschupf commented on NUTCH-347: Please submit this patch! Thanks! Build: plugins' Jars not found

[Nutch-dev] [jira] Updated: (NUTCH-341) IndexMerger now deletes entire workingdir after completing

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ] Stefan Groschupf updated NUTCH-341: --- Attachment: doNotDeleteTmpIndexMergeDirV1.patch +1. I agree it makes no sense at all to be required to create a tmp folder manually and nutch deletes

[Nutch-dev] [jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ] Stefan Groschupf updated NUTCH-337: --- Attachment: respectFetcherParsePropertyV1.patch Hi Jeremy, thanks for catching this. Attached a fix. Should be easy for a contributor to commit

[Nutch-dev] [jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ] Stefan Groschupf updated NUTCH-337: --- Priority: Major (was: Trivial) Fetcher ignores the fetcher.parse value configured in config file

[Nutch-dev] [jira] Updated: (NUTCH-336) Harvested links shouldn't get db.score.injected in addition to inbound contributions

2006-08-18 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-336?page=all ] Stefan Groschupf updated NUTCH-336: --- Priority: Critical (was: Minor) I think that is a fundamental problem since I observe there are many pages e.g. presentation slides that have exactly

[Nutch-dev] [jira] Updated: (NUTCH-350) urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-350?page=all ] Stefan Groschupf updated NUTCH-350: --- Attachment: protocolRetryV5.patch This patch will dramatically increase the number of successfully fetched pages of an intranet crawl over time

[Nutch-dev] [jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12428858 ] Stefan Groschupf commented on NUTCH-322: I think this is a serious problem. Page A does a server-side redirect to Page B. Page A is never written to the output

[Nutch-dev] [jira] Created: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-08-17 Thread Stefan Groschupf (JIRA)
: 0.8.1, 0.9.0 Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.8.1 Attachments: doNotRefecthForwarderPagesV1.patch Pages that do a serverside forward are not written with a status change back into the crawlDb. Also the nextFetchTime

[Nutch-dev] [jira] Resolved: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-322?page=all ] Stefan Groschupf resolved NUTCH-322. Resolution: Duplicate duplicate of NUTCH-353 Fetcher discards ProtocolStatus, doesn't store redirected pages

[Nutch-dev] [jira] Updated: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-353?page=all ] Stefan Groschupf updated NUTCH-353: --- Attachment: doNotRefecthForwarderPagesV1.patch Since we discussed that nutch needs to be more polite we should fix that asap. pages that serverside

[Nutch-dev] [jira] Commented: (NUTCH-346) Improve readability of logs/hadoop.log

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-346?page=comments#action_12428917 ] Stefan Groschupf commented on NUTCH-346: +1 I agree, can you please create a patch file and attach it to this bug. Thanks Improve readability of logs

[Nutch-dev] [jira] Commented: (NUTCH-345) Add support for Content-Encoding: deflated

2006-08-17 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-345?page=comments#action_12428918 ] Stefan Groschupf commented on NUTCH-345: Shouldn't the DeflateUtils also be part of the protocol-http plugin? Also since it is a larger contribution

[Nutch-dev] [jira] Commented: (NUTCH-349) Port Nutch to use Hadoop Text instead of UTF8

2006-08-16 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-349?page=comments#action_12428537 ] Stefan Groschupf commented on NUTCH-349: My vote goes to #2. Having a tool that needs to be started manually would be better than complicating the already

[Nutch-dev] [jira] Updated: (NUTCH-348) Generator is building fetch list using *lowest* scoring URLs

2006-08-16 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-348?page=all ] Stefan Groschupf updated NUTCH-348: --- Attachment: sortPatchV1.patch What do people think about this kind of solution? Generator is building fetch list using *lowest* scoring URLs

[Nutch-dev] [jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-08-16 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12428542 ] Stefan Groschupf commented on NUTCH-233: Hi Otis, yes for a serious whole web crawl I need to change this reg ex first. It only hangs with some random urls

[Nutch-dev] [jira] Created: (NUTCH-332) doubling score causes by page internal anchors.

2006-07-28 Thread Stefan Groschupf (JIRA)
Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.8-dev When a page has no outlinks but several links to itself, e.g. it has a set of anchors, the scores of the page are distributed to its outlinks. But all these outlinks point back to the page. This causes

[Nutch-dev] [jira] Commented: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-26 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423539 ] Stefan Groschupf commented on NUTCH-318: Yes this happens only in a distributed environment. Please also see my last mail in the hadoop dev list. I think

[Nutch-dev] [jira] Commented: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-26 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423433 ] Stefan Groschupf commented on NUTCH-318: Shouldn't that be fixed in .8, since as of today this tool just produces no output?! log4j not proper configured

[Nutch-dev] [jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-07-25 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12423438 ] Stefan Groschupf commented on NUTCH-233: I think this should be fixed in .8 too, since everybody that does a real whole web crawl with over 100 million pages

[Nutch-dev] segread vs. readseg

2006-07-24 Thread Stefan Groschupf
Hi developers, we have commands like readdb and readlinkdb, but segread. Wouldn't it be more consistent to name the command readseg instead of segread? ... just a thought. Stefan

Re: [Nutch-dev] segread vs. readseg

2006-07-24 Thread Stefan Groschupf
I like it! On 24.07.2006, at 16:10, Andrzej Bialecki wrote: Stefan Neufeind wrote: Andrzej Bialecki wrote: Stefan Groschupf wrote: Hi developers, we have commands like readdb and readlinkdb, but segread. Wouldn't it be more consistent to name the command readseg instead of segread? ... just

Re: [Nutch-dev] tests failing

2006-07-23 Thread Stefan Groschupf
Hi Sami, I cannot confirm this problem: jacy:~/nutch-trunk-tmp joa$ svn update . At revision 424865. [...] test: BUILD SUCCESSFUL Total time: 2 minutes 6 seconds So it works for me. Stefan On 23.07.2006, at 13:27, Sami Siren wrote: Svn trunk gave me failed on testcase

[Nutch-dev] result comparison tool?

2006-07-23 Thread Stefan Groschupf
Hi, I remember there was a search result comparison tool within nutch. Is that still alive? How to use it / find it? I was not able to find it by browsing the trunk sources. Is there any such tool people can suggest to compare search results with yahoo or google results to play with

[Nutch-dev] [jira] Created: (NUTCH-329) CrawlDbReader processTopNJob does not set jobNames

2006-07-23 Thread Stefan Groschupf (JIRA)
Reporter: Stefan Groschupf Priority: Minor Fix For: 0.8-dev processTopNJob runs two jobs and neither has a job name set.

[Nutch-dev] [jira] Created: (NUTCH-325) UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes

2006-07-22 Thread Stefan Groschupf (JIRA)
-325 Project: Nutch Issue Type: Bug Affects Versions: 0.8-dev Reporter: Stefan Groschupf Priority: Minor Fix For: 0.8-dev In URLFilters constructor we use an array as long as we have filters defined in the urlfilter.order property

[Nutch-dev] [jira] Updated: (NUTCH-325) UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes

2006-07-22 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-325?page=all ] Stefan Groschupf updated NUTCH-325: --- Attachment: UrlFiltersNPE.patch A patch that uses an ArrayList instead of an array and puts only entries into the list when the entry is not null. Means
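The approach named in this patch description (a list instead of a fixed-length array, with null entries skipped) might look roughly like the sketch below. It is not the attached UrlFiltersNPE.patch; the class OrderedFilterSketch, the method orderFilters and the map of loaded filters are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not the Nutch URLFilters class.
class OrderedFilterSketch {
    // Returns the loaded filters in the configured order.
    // Names listed in the order property but not loaded are skipped,
    // so the result never contains null slots.
    static <T> List<T> orderFilters(String[] orderedNames, Map<String, T> loaded) {
        List<T> result = new ArrayList<>();
        for (String name : orderedNames) {
            T filter = loaded.get(name);
            if (filter != null) {
                result.add(filter);
            }
        }
        return result;
    }
}
```

With a fixed-length array sized by the order property, a name that is missing from the loaded plugins would leave a null slot, and iterating over the array would then throw a NullPointerException; the list variant avoids that by construction.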

[Nutch-dev] log when blocked by robots.txt

2006-07-22 Thread Stefan Groschupf
Hi Developers, another thing in the discussion about being more polite. I suggest that we log a message in case a requested URL was blocked by a robots.txt. Ideally we would only log this message in case only the currently used agent name is blocked and it is not a general blocking of all

[Nutch-dev] nutch-extensionpoints not in plugin.includes

2006-07-20 Thread Stefan Groschupf
Hi developers, in nutch-default.xml property plugin.includes we say: "In any case you need at least include the nutch-extensionpoints plugin." But we do not include it by default. <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-

Re: [Nutch-dev] nutch-extensionpoints not in plugin.includes

2006-07-20 Thread Stefan Groschupf
I may - but since you know the details of the plugin subsystem, tell me what _should_ be there? I.e. should we really include it in the plugin.includes list, or not? This is a philosophical question. I personally prefer strict definitions, since an application's behavior is better

[Nutch-dev] [jira] Created: (NUTCH-323) CrawlDatum.set just reference a mapWritable of a other object but not copy it.

2006-07-19 Thread Stefan Groschupf (JIRA)
Issue Type: Bug Affects Versions: 0.8-dev Reporter: Stefan Groschupf Priority: Critical Fix For: 0.8-dev Using CrawlDatum.set(aOtherCrawlDatum) copies the data from one CrawlDatum to another. However, only a reference to the MapWritable is passed. Means both

[Nutch-dev] [jira] Updated: (NUTCH-323) CrawlDatum.set just reference a mapWritable of a other object but not copy it.

2006-07-19 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-323?page=all ] Stefan Groschupf updated NUTCH-323: --- Attachment: MapWritableCopyConstructor.patch The attached patch adds a copy constructor to the MapWritable and uses it in the CrawlDatum.set method. However
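The difference between passing a reference and copying the map, as discussed in NUTCH-323, can be shown with a small stand-alone sketch. DatumCopySketch is a hypothetical class using java.util.HashMap in place of MapWritable; it is not the CrawlDatum code or the attached patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a set() method that copies metadata
// instead of sharing it. Not the Nutch CrawlDatum class.
class DatumCopySketch {
    private float score;
    private Map<String, String> metaData = new HashMap<>();

    void setScore(float score) {
        this.score = score;
    }

    float getScore() {
        return score;
    }

    Map<String, String> getMetaData() {
        return metaData;
    }

    // Copies the fields of the other datum. The map is duplicated through
    // HashMap's copy constructor, so later changes to one datum's metadata
    // do not show up in the other.
    void set(DatumCopySketch other) {
        this.score = other.score;
        this.metaData = new HashMap<>(other.metaData);
    }
}
```

Writing this.metaData = other.metaData instead would make both objects share one map, which is the behavior the issue reports. The copy here is shallow with respect to the values; that is sufficient for immutable strings, while mutable values would need their own copies.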

[Nutch-dev] [jira] Created: (NUTCH-324) db.score.link.internal and db.score.link.external are ignored

2006-07-19 Thread Stefan Groschupf (JIRA)
Components: fetcher Reporter: Stefan Groschupf Priority: Critical Configuration properties db.score.link.external and db.score.link.internal are ignored. In case of e.g. message board webpages or pages that have larger navigation menus on each page having a lower impact

[Nutch-dev] [jira] Updated: (NUTCH-324) db.score.link.internal and db.score.link.external are ignored

2006-07-19 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-324?page=all ] Stefan Groschupf updated NUTCH-324: --- Attachment: InternalAndExternalLinkScoreFactor.patch Multiply the score of a page during distributeScoreToOutlink with db.score.link.internal

[Nutch-dev] [jira] Resolved: (NUTCH-319) OPICScoringFilter should use logging API instead of printStackTrace

2006-07-19 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-319?page=all ] Stefan Groschupf resolved NUTCH-319. Resolution: Won't Fix Sorry, that is bogus since it is written to the logging stream. OPICScoringFilter should use logging API instead

[Nutch-dev] db.max.inlinks

2006-07-18 Thread Stefan Groschupf
Hi, shouldn't db.max.inlinks be in the nutch-default.xml configuration? Stefan

Re: [Nutch-dev] db.max.inlinks

2006-07-18 Thread Stefan Groschupf
Andrzej, ... in LinkDB line 114, in the configure method, and it is used in lines 168 and 176. Stefan On 18.07.2006, at 16:02, Andrzej Bialecki wrote: Stefan Groschupf wrote: Hi, shouldn't db.max.inlinks be in the nutch-default.xml configuration? Where is this used? -- Best regards

[Nutch-dev] OPICScoringFilter Metadata transport scores as String

2006-07-15 Thread Stefan Groschupf
Hi, OPICScoringFilter line 91: content.getMetadata().set(Fetcher.SCORE_KEY, "" + datum.getScore()); and lines 96 and 102: we set and get the fetch score as Strings. :-o. Wouldn't it be better to have the Metadata support floats as well instead of serializing and parsing strings? In general wouldn't it

[Nutch-dev] [jira] Created: (NUTCH-319) OPICScoringFilter should use logging API instead of printStackTrace

2006-07-15 Thread Stefan Groschupf (JIRA)
Affects Versions: 0.8-dev Reporter: Stefan Groschupf Assigned To: Andrzej Bialecki Priority: Trivial Fix For: 0.8-dev OPICScoringFilter line 107 should be a logging call, not an e.printStackTrace(LogUtil.getWarnStream(LOG)), shouldn't it?

Re: [Nutch-dev] Crawl error

2006-07-10 Thread Stefan Groschupf
changes in version 0.8. The problem is the log message does not say what file is not found. So, it's hard to debug. Any idea? Thanks, AJ On 7/9/06, Stefan Groschupf wrote: Try to put the conf folder on your classpath in eclipse and set the environment variables

[Nutch-dev] [jira] Created: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-10 Thread Stefan Groschupf (JIRA)
: Stefan Groschupf Priority: Critical Fix For: 0.8-dev In the latest .8 sources the readdb command doesn't dump any information anymore. This is related to the misconfigured log4j.properties file. changing: log4j.rootLogger=INFO,DRFA to: log4j.rootLogger=INFO,DRFA,stdout dumps

Re: [Nutch-dev] Crawl error

2006-07-09 Thread Stefan Groschupf
Try to put the conf folder on your classpath in eclipse and set the environment variables that are set in bin/nutch. Btw, please do not crosspost. Thanks. Stefan On 09.07.2006, at 21:47, AJ Chen wrote: I checked out the 0.8 code from trunk and tried to set it up in eclipse. When

Re: [Nutch-dev] Nutch based directory and crawler based on keyword

2006-07-09 Thread Stefan Groschupf
Hi, this question is difficult to answer, and maybe there are more experts on the nutch user list than on the developer list. In nutch 0.8 you can use the new scoring api to change the scoring of a page for being scheduled for crawling based on its scores. Have a look at the opic score

Re: [Nutch-dev] Error with Hadoop-0.4.0

2006-07-07 Thread Stefan Groschupf
Hi Jérôme, I have the same problem in a distributed environment! :-( So I think I can confirm this is a bug. We should fix that. Stefan On 06.07.2006, at 08:54, Jérôme Charron wrote: Hi, I encountered some problems with Nutch trunk version. In fact it seems to be related to changes related to

Re: [Nutch-dev] Error with Hadoop-0.4.0

2006-07-07 Thread Stefan Groschupf
We tried your suggested fix: Injector by mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir)) and this worked without any problem. Thanks for catching that, this saved us a lot of time. Stefan On 07.07.2006, at 16:08, Jérôme Charron wrote: I have the same problem on

Re: [Nutch-dev] 0.8 release

2006-07-05 Thread Stefan Groschupf
+1, but I really would love to see NUTCH-293 as part of nutch .8 since this is all about being more polite. Thanks. Stefan On 05.07.2006, at 03:46, Doug Cutting wrote: +1 Piotr Kosiorowski wrote: +1. P. Andrzej Bialecki wrote: Sami Siren wrote: How would folks feel about releasing 0.8

[Nutch-dev] <noindex>do not index</noindex>

2006-06-22 Thread Stefan Groschupf
Hi, as far as I can see nutch's html parser only supports the meta tag noindex (<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">) but there is an unofficial html <noindex> tag. http://www.webmasterworld.com/forum10003/2703.htm Maybe this would be another thing to make nutch more polite. Also please

[Nutch-dev] [jira] Created: (NUTCH-307) wrong configured log4j.properties

2006-06-19 Thread Stefan Groschupf (JIRA)
wrong configured log4j.properties - Key: NUTCH-307 URL: http://issues.apache.org/jira/browse/NUTCH-307 Project: Nutch Type: Bug Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.8-dev In nutch/conf is only one

Re: [Nutch-dev] how to manipulate with MapWritable metaData in CrawlDatum structure

2006-06-12 Thread Stefan Groschupf
Hi Feng, MapWritable is a kind of hash map. You can put in any key-value pair, but the keys and values need to be Writables: http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/Writable.html You can use UTF8 as string key and value, or ByteWritable as key and UTF8 as values, etc.

Re: [Nutch-dev] nutch-default.xml configuration

2006-06-12 Thread Stefan Groschupf
Hi Lourival, this means all pages older than 30 days are potential candidates for a fetch list that is created by the segment generation process. Stefan On 12.06.2006, at 16:33, Lourival Júnior wrote: Hi all! I have a question about nutch-default.xml configuration file. There is a

[Nutch-dev] [jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-12 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ] Stefan Groschupf updated NUTCH-289: --- Attachment: ipInCrawlDatumDraftV5.patch Release Candidate 1 of this patch. This patch contains: + add IP Address to CrawlDatum Version 5 (as byte[4

[Nutch-dev] [jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-07 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ] Stefan Groschupf updated NUTCH-289: --- Attachment: ipInCrawlDatumDraftV4.patch Attached a patch that always uses only 4 bytes for the ip. This means we ignore ipv6. This saves us a 4 byte

[Nutch-dev] [jira] Commented: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-07 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415171 ] Stefan Groschupf commented on NUTCH-293: Any comments? There was already a posting in the nutch agent mailing list, where someone had banned nutch since nutch does

Re: [Nutch-dev] svn commit: r411943 - in /lucene/nutch/trunk/lib: commons-logging-1.0.4.jar hadoop-0.2.1.jar hadoop-0.3.1.jar log4j-1.2.13.jar

2006-06-06 Thread Stefan Groschupf
As far as I understand hadoop uses commons logging. Should we switch to using commons logging as well? On 06.06.2006, at 11:02, Jérôme Charron wrote: URL: http://svn.apache.org/viewvc?rev=411943&view=rev Log: Updating to Hadoop release 0.3.1. Hadoop now uses Jakarta Commons Logging, configured

[Nutch-dev] classloading problem hadoop .3.1

2006-06-06 Thread Stefan Groschupf
Hi, is there a known problem with hadoop .3.1 and nutch classloading or job file usage? I wrote a custom tool and want to start it via: bin/nutch myclass crawldb 1000 But found only the following exception in the task reporter messages: java.lang.RuntimeException: java.lang.RuntimeException:

[Nutch-dev] [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12414763 ] Stefan Groschupf commented on NUTCH-258: Scott, I agree with you. However we need a clean patch to solve the problem, we can not just comment things out of the code

[Nutch-dev] [jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-05 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ] Stefan Groschupf updated NUTCH-289: --- Attachment: ipInCrawlDatumDraftV1.patch To keep the discussion alive attached a _first draft_ for storing the ip in the crawlDatum for public discussion

Re: [Nutch-dev] [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Stefan Groschupf
I have a proposal for a simple solution: set a flag in the current Configuration instance, and check for this flag. The Configuration instance provides a task-specific context persisting throughout the lifetime of a task - but limited only to that task. Voila - problem solved. We get

[Nutch-dev] [jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned a NPE is thrown

2006-06-04 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ] Stefan Groschupf updated NUTCH-298: --- Summary: if a 404 for a robots.txt is returned a NPE is thrown (was: if a 404 for a robots.txt is returned no page is fetched at all from the host

[Nutch-dev] search engine spam detector

2006-06-04 Thread Stefan Groschupf
Hi, an interesting tool: http://tool.motoricerca.info/spam-detector/ Stefan

Re: [Nutch-dev] search engine spam detector

2006-06-04 Thread Stefan Groschupf
The idea to have something like this as a nutch-module (dropping pages or ranking them very low) might come up :-) This will be a very long way. I collect some thoughts and a list of web spam related papers in my blog. http://www.find23.net/Web-Site/blog/521BA1CD-14C4-4E84-A072-

[Nutch-dev] [jira] Created: (NUTCH-297) sandbox svn folder

2006-06-03 Thread Stefan Groschupf (JIRA)
sandbox svn folder -- Key: NUTCH-297 URL: http://issues.apache.org/jira/browse/NUTCH-297 Project: Nutch Type: Sub-task Reporter: Stefan Groschupf Assigned to: Doug Cutting Priority: Trivial Having a svn sandbox repository would allow

[Nutch-dev] [jira] Created: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host

2006-06-03 Thread Stefan Groschupf (JIRA)
: Stefan Groschupf Fix For: 0.8-dev What happens: If no RobotRuleSet is in the cache for a host, we try to fetch the robots.txt. In case the http response code is not 200 or 403 but for example 404 we do robotRules = EMPTY_RULES; (line: 402) EMPTY_RULES is a RobotRuleSet created

[Nutch-dev] [jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host

2006-06-03 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ] Stefan Groschupf updated NUTCH-298: --- Attachment: fixNpeRobotRuleSet.patch Fixes the NPE in RobotRuleSet that happens in case we use an empty RuleSet. if a 404 for a robots.txt is returned no page

[Nutch-dev] RobotRuleSet

2006-06-03 Thread Stefan Groschupf
Hi, just posted a fix for an NPE in case an empty RobotRuleSet is used. The patch only contains a two-line fix, since I learned that this is the best way to get things committed sooner. :) However I really don't like the RobotRuleSet implementation since entries are copied between an ArrayList and a

[Nutch-dev] [jira] Commented: (NUTCH-282) Showing too few results on a page (Paging not correct)

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-282?page=comments#action_12414435 ] Stefan Groschupf commented on NUTCH-282: Is that related to host grouping we discussed? Can we in this case close this bug? Showing too few results on a page (Paging

[Nutch-dev] [jira] Commented: (NUTCH-286) Handling common error-pages as 404

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-286?page=comments#action_12414439 ] Stefan Groschupf commented on NUTCH-286: This is difficult to realize since the HTTP error code is read from the response in the fetcher and set into the protocol

[Nutch-dev] [jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12414443 ] Stefan Groschupf commented on NUTCH-292: +1, Can someone create a clean patch file? OpenSearchServlet: OutOfMemoryError: Java heap space

[Nutch-dev] [jira] Commented: (NUTCH-291) OpenSearchServlet should return date as well as lastModified

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-291?page=comments#action_12414445 ] Stefan Groschupf commented on NUTCH-291: lastModified will only be indexed if you switch on the index-more plugin. If you think you should change the way lastmodified

[Nutch-dev] [jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414448 ] Stefan Groschupf commented on NUTCH-290: If a parser throws an exception: Fetcher, 261: try { parse = this.parseUtil.parse(content); parseStatus

[Nutch-dev] [jira] Closed: (NUTCH-287) Exception when searching with sort

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-287?page=all ] Stefan Groschupf closed NUTCH-287: -- Resolution: Won't Fix http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04696.html Exception when searching with sort

[Nutch-dev] [jira] Closed: (NUTCH-284) NullPointerException during index

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-284?page=all ] Stefan Groschupf closed NUTCH-284: -- Resolution: Won't Fix Yes, I was missing index-basic. NullPointerException during index - Key: NUTCH-284

[Nutch-dev] [jira] Commented: (NUTCH-284) NullPointerException during index

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12414453 ] Stefan Groschupf commented on NUTCH-284: Please try to discuss such things on the user mailing list first, then open an issue. Maintaining the issue tracking is very time

[Nutch-dev] [jira] Commented: (NUTCH-281) cached.jsp: base-href needs to be outside comments

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-281?page=comments#action_12414454 ] Stefan Groschupf commented on NUTCH-281: Can you submit a patch file? cached.jsp: base-href needs to be outside comments

[Nutch-dev] [jira] Commented: (NUTCH-275) Fetcher not parsing XHTML-pages at all

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-275?page=comments#action_12414456 ] Stefan Groschupf commented on NUTCH-275: Should we switch off mime.type.magic by default? Some people were reporting the same problems. Fetcher not parsing XHTML

[Nutch-dev] [jira] Commented: (NUTCH-274) Empty row in/at end of URL-list results in error

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-274?page=comments#action_12414457 ] Stefan Groschupf commented on NUTCH-274: Should we fix this in TextInputFormat of Hadoop to ignore empty lines or in the Injector? Empty row in/at end of URL-list

[Nutch-dev] [jira] Updated: (NUTCH-274) Empty row in/at end of URL-list results in error

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-274?page=all ] Stefan Groschupf updated NUTCH-274: --- Attachment: ignoreEmpthyLineDuringInjectV1.patch Ignore empty lines during injecting. Thanks for spotting this Stefan! Empty row in/at end of URL-list
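
The idea of skipping empty lines while reading a URL list can be sketched in Java as below. This is not the attached patch and does not use the Nutch or Hadoop APIs: `SeedListSketch` and `seedUrls` are hypothetical names, and the input is assumed to be the already-split lines of a seed file.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; illustrates filtering blank lines from a URL list. */
final class SeedListSketch {
    /** Returns the trimmed, non-empty lines in their original order. */
    static List<String> seedUrls(List<String> lines) {
        List<String> urls = new ArrayList<>();
        for (String line : lines) {
            if (line == null) {
                continue; // tolerate missing entries
            }
            String url = line.trim();
            if (url.isEmpty()) {
                continue; // blank or whitespace-only row, e.g. at end of file
            }
            urls.add(url);
        }
        return urls;
    }
}
```

Filtering at this step keeps a trailing blank row in the list from ever reaching URL handling, whichever component ends up owning the check.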

[Nutch-dev] [jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414469 ] Stefan Groschupf commented on NUTCH-290: As far as I understand the code, the next parser is only used if the previous parser returns with an unsuccessful parsing status
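
The fallback behaviour described in this comment (try the next parser only when the previous one reports an unsuccessful status) might look roughly like the Java sketch below. All names here (`ParserChainSketch`, `Result`, `parseWithFallback`) are hypothetical and are not the Nutch parse API; parsers are modelled as plain functions from content to a result.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical parser chain; not the Nutch ParseUtil implementation. */
final class ParserChainSketch {
    /** Hypothetical result type: a success flag plus the extracted text. */
    static final class Result {
        final boolean success;
        final String text;

        private Result(boolean success, String text) {
            this.success = success;
            this.text = text;
        }

        static Result ok(String text) {
            return new Result(true, text);
        }

        static Result failed() {
            return new Result(false, "");
        }
    }

    /** Tries parsers in order and stops at the first successful result. */
    static Result parseWithFallback(List<Function<String, Result>> parsers,
                                    String content) {
        Result last = Result.failed(); // returned as-is if the list is empty
        for (Function<String, Result> parser : parsers) {
            last = parser.apply(content);
            if (last.success) {
                return last; // later parsers are never consulted
            }
        }
        return last; // every parser reported failure
    }
}
```

Note that in this shape a parser that returns a "successful" result containing garbage stops the chain, which matches the concern raised in the issue title.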
